To me it's fascinating that you indeed _can_ speed up what appears to be a non-parallelisable task like parsing HTML (or JSON to a lesser extent) using SIMD. And that the gains are so substantial too.
There's a surprising number of things that can be parsed in a parallel fashion if one tries hard enough, though keeping significant (or any) gains is trickier. The operation in the article, though, is only a rather tiny part of parsing HTML.
Can anyone recommend pointers to find out more about creating programming languages / markup languages that would be more parallel-friendly for parsing? Or maybe even embarrassingly parallel? How does that affect the grammar, etc.? Maybe you need a different formalism rather than BNF to describe it, and you don't use regex to tokenize, and there is no recursive descent or state machines in parsing, etc.
There's not really that much specific to change, I think; the general thing is just reducing the number of rules, as in a parallel setup you'll be applying every rule to every part of the input, versus only the rules allowed by the preceding context (there is the option of filtering which parts to process if a bunch of rules apply only to the same small subset of the input, but that requires such context).
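Loosely, the contrast looks like this (a toy C sketch of my own, names made up): classify every byte against every class independently, instead of advancing a single state machine whose next step depends on the previous byte:

    #include <stddef.h>
    #include <stdint.h>

    enum { C_OTHER = 0, C_LT = 1, C_GT = 2, C_SLASH = 3, C_SPACE = 4 };

    /* Every iteration is independent of the previous one, so the loop
       vectorizes/parallelizes trivially; a serial tokenizer would instead
       carry a state variable from byte to byte. */
    void classify(const char *buf, size_t len, uint8_t *cls) {
        for (size_t i = 0; i < len; i++) {
            char b = buf[i];
            cls[i] = (b == '<') ? C_LT
                   : (b == '>') ? C_GT
                   : (b == '/') ? C_SLASH
                   : (b == ' ' || b == '\n' || b == '\t') ? C_SPACE
                   : C_OTHER;
        }
    }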
Are you still operating on a single stream of characters? I was wondering about something more radical, like you start in the middle of the stream, one thread parses forward to the end, and the other parses backwards towards the beginning.
That's an option for a potential 2x boost, which I suppose could be good enough in certain contexts, but that's it. In particular, if you have a parser system able to utilize SIMD for all parsing, that'll get arbitrary parallelizability for "free" (at the cost of some log(n) increase in total operations across all threads).
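A rough toy sketch of why the dependency chain isn't fundamental (my own C example, simplified to parens rather than real tags): a running quantity like nesting depth is just an associative sum, so each chunk can be processed independently and the results stitched together afterwards.

    #include <stddef.h>

    /* Pass 1 (parallel per chunk): net depth change contributed by a chunk. */
    int chunk_delta(const char *buf, size_t len) {
        int d = 0;
        for (size_t i = 0; i < len; i++)
            d += (buf[i] == '(') - (buf[i] == ')');
        return d;
    }

    /* Pass 2 (cheap combine over chunks): exclusive prefix of the deltas
       gives each chunk's starting depth, after which absolute depths per
       byte can be filled in by any thread independently. */
    void chunk_offsets(const int *delta, size_t nchunks, int *start) {
        int acc = 0;
        for (size_t c = 0; c < nchunks; c++) { start[c] = acc; acc += delta[c]; }
    }

The combine step can also be done as a tree reduction, which is where the log(n) extra work comes from.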
The approach to these vectorized parsers is generally to find some part of the parsing task that doesn't have a dependency chain through every dang byte, like a naively written parser would, then do that first, then do some harder stuff after. I'm pretty used to it now so I lack perspective on whether this is weird, but in general finding long dependency chains and getting rid of them or replacing them with several shorter dependency chains will make software a lot faster.
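For example, something like this (toy C/SSE2, not the article's code) pulls every '<' position out of the input with no per-byte dependency chain; the trickier, stateful work then only has to walk those candidate positions:

    #include <emmintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Collect offsets of every '<' in buf, 16 bytes at a time.
       __builtin_ctz is GCC/Clang-specific. */
    size_t find_open_tags(const char *buf, size_t len, size_t *out) {
        size_t n = 0, i = 0;
        const __m128i needle = _mm_set1_epi8('<');
        for (; i + 16 <= len; i += 16) {
            __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
            uint32_t mask =
                (uint32_t)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
            while (mask) {                       /* iterate the set bits */
                out[n++] = i + (size_t)__builtin_ctz(mask);
                mask &= mask - 1;
            }
        }
        for (; i < len; i++)                     /* scalar tail */
            if (buf[i] == '<') out[n++] = i;
        return n;
    }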
As a very rough approximation, I suspect that html parsing (beyond the streaming of the data anyhow) isn't really all that fundamentally serial. After all, if you seek into an html document at some random point, you can generally interpret the html structure mostly locally, i.e. it's likely possible to subdivide the task of parsing an html document.
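E.g., a crude way to subdivide (hypothetical sketch, ignoring context-sensitive cases like script and comments): pick split points near even divisions of the buffer and snap each one forward to just past the next '>', so every chunk starts roughly at a tag boundary and can be tokenized independently before a serial fixup pass.

    #include <stddef.h>
    #include <string.h>

    /* Move a tentative split point forward to just after the next '>',
       or to the end of the buffer if there isn't one. */
    size_t snap_to_tag_end(const char *buf, size_t len, size_t pos) {
        const char *p = memchr(buf + pos, '>', len - pos);
        return p ? (size_t)(p - buf) + 1 : len;
    }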
Given how leniently html deals with errors, and that some of the parsing rules have context sensitivity (the various modes), actually exploiting large-scale parallelism is probably a very difficult bit of engineering. Additionally, there may be corner cases with much more context dependence than normal. Also, parsing is probably already not a huge part of a browser's runtime, and aggressive parallelism might be faster but more power-hungry for a very small latency win. But I can't think of a specific fundamental limitation that prevents the normal parsing case from being quite parallelizable; I believe it's "just" an engineering problem.