Hacker News new | past | comments | ask | show | jobs | submit login

(This comment is pretty informal; if you want to know more, look up terms like "context-free grammar", "Parsing Expression Grammar", "Backus-Naur Form", "regular language", or even "formal language theory".)

The problem is not "Turing-complete" (roughly, "can express arbitrarily complex stuff"), but "context-free" (roughly, "you can parse without considering what you've seen so far").

For instance, Brainfuck is Turing-complete (in the "Turing tarpit" sense) but really easy to turn into tokens (in fact, an informal approach may not even distinguish between "+" and "the token 'increment'"). Even realistic languages like C can be parsed without using anything much more advanced than regexes (you need some ugly kludges to support typedefs, and you should pretty much ignore newlines; one would typically use something more like yacc, but that's still not a very sophisticated tool.)

On the other hand, XML or HTML (which are not Turing-complete, and, informally, "not that expressive") are pretty much impossible to usefully parse without extensively considering context - <a><b /><c /><d /></a> gives and <e><d /></e> are "very different <d />'s".

I don't know XScript, but regexes may be a completely viable approach. In fact, if your current approach works, it's likely good enough - it requires some theoretical background to express why you can't parse HTML with regexes, but you'll run into lots of trouble if you try (leading to stuff like http://stackoverflow.com/questions/1732348/regex-match-open-...). Of course, there's value in finding out your approach won't work before you've spent weeks of effort. ;-)




Thanks for the information! Looks like I've got some reading to do :)




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: