Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Recently I got a speedup that was literally 6-7 orders of magnitude by changing a few characters in a regex, which made it match its input in linear time instead of O(k^N) due to a crazy backtrack.

The speedup I measured could have been even greater if longer strings were being matched, but in practice the longest string I tested it with took over 14 hours to match before the change, and ~2 milliseconds after the change. This alone is a ~25 million times speedup.

It was such a fun problem to solve, and by far the largest speedup I had ever seen. It was also the first time I encountered this issue outside of compilation class in college, but the way this regex was constructed made the issue pretty obvious.



What was the actual change?


It was a long regex, made of multiple parts with two OR'd together that looked very similar except for the last few characters. These two parts were trying to match a structured format, think of something like a domain name for example; it's not just [a-z0-9.-]+ since you can't have the - or . at the start or at the end, and maybe the end part has a limited length (say you want to match only short TLDs). I can't really say what these strings were but they had these sorts of restrictions and a range that was a bit larger than what domains can use (like capital letters).

So they had something like this for this structured part:

    [a-zA-Z]+[a-zA-Z0-9-.]*[a-zA-Z0-9]?(\.[a-zA-Z]{2,3})?

Actually kind of like that, but with multiple ()? and (){N} wrapping the different layers.

Let's call the regex above STRUCTURED. Now how do you match either one of these structured strings followed by either `.foo` or `.bar`? They had it as:

    STRUCTURED\.foo|STRUCTURED\.bar
(again, where STRUCTURED is replaced with the whole long regex from above). Meaning that the regex engine, as it consumes the first characters that potentially match the structured string, can't tell whether it's in the left or right branch of this OR until it reaches either the `.foo` or `.bar` suffix.

The change I made was simply to match it with:

    STRUCTURED(\.foo|\.bar)
In this case the engine can consume each character of the input and make its way through STRUCTURED without having to maintain k branches per character (given as k=2 in the example above but it was much more than that).

I hope the general idea still comes through despite the simplification and required obfuscation.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: