Wow, that still sounds really long for simply tokenizing a file? I worked on parsers a while ago, and for reference I benchmarked e.g. the Python parser at 300,000 LOC/second (for tokenization and building an AST) on my machine (an i7 laptop). Also, tokenization complexity shouldn't increase quadratically with the length of the file, should it?
You probably know what you're doing, just curious why these numbers are so far off from what I would expect. What approach did you use for tokenization, if I may ask?
I don't recall the exact numbers, to be honest. I know the original took many seconds, and in the end it was under 1.
As mentioned in another comment:
The Go extension is a thin wrapper around standard Go tooling; we weren't tokenizing ourselves, just converting between their tokens and ones we could process. A large part of that was converting from byte offsets to UTF-8 character offsets.
The quadratic behavior was a bug caused by reconverting segments over and over again, instead of converting only the deltas between previously converted subsegments.
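To illustrate the shape of that kind of fix (just a sketch under my own assumptions, not the actual extension code; the converter type and field names here are made up), the idea is to remember how far the previous conversion got, so each new token offset only requires scanning the bytes since the last one instead of rescanning from the start of the file:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// offsetConverter maps byte offsets in src to rune (character) offsets.
// It caches where the last conversion stopped, so successive increasing
// byte offsets only need to scan the delta since the previous call,
// rather than rescanning from offset 0 each time (which is what makes
// the naive approach quadratic over a whole file of tokens).
type offsetConverter struct {
	src       []byte
	lastByte  int // byte offset reached by the previous call
	lastRunes int // rune count up to lastByte
}

// runeOffset returns the rune offset corresponding to byteOff.
// Offsets are expected to be non-decreasing for the caching to help.
func (c *offsetConverter) runeOffset(byteOff int) int {
	if byteOff < c.lastByte {
		// Out-of-order request: fall back to a full rescan.
		c.lastByte, c.lastRunes = 0, 0
	}
	c.lastRunes += utf8.RuneCount(c.src[c.lastByte:byteOff])
	c.lastByte = byteOff
	return c.lastRunes
}

func main() {
	src := []byte("héllo wörld")
	conv := &offsetConverter{src: src}
	// Token byte offsets (e.g. as produced by go/scanner), converted incrementally.
	for _, b := range []int{0, 3, 7, 13} {
		fmt.Printf("byte %d -> rune %d\n", b, conv.runeOffset(b))
	}
}
```

With this approach, converting all the token offsets of a file is linear in the file length overall, whereas recomputing each offset from the beginning of the file makes it quadratic.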