Hacker News new | past | comments | ask | show | jobs | submit login

Parsing JSON is indeed several orders of magnitude less complex than parsing HTML. In both cases there are excellent parsing libraries, but it would be very unwise to create your own HTML parser for production use, or to use any parser that hasn't seen some serious scrutiny.



JSON at least has the concept of "invalid JSON". That's a big step forward. A JSON parser, like an XML parser, can say "Syntax error - rejected." There's no such thing as "invalid HTML". For that reason, parsing HTML is a huge pain.

As someone who has a web crawler, I'm painfully aware of how much syntactically incorrect HTML is out there. HTML5 has a whole section which standardizes how to parse bad HTML. That's just the syntax needed to parse it into a tree, without considering the semantics at all.


There's no such thing as "invalid HTML". For that reason, parsing HTML is a huge pain.

Actually, as someone who has written an HTML parser by following the HTML5 spec, I see it as the opposite: because every string of bytes essentially corresponds to some HTML tree, there are no special "invalid" edge cases to consider and everything is fully specified. That's the best situation, since bugs tend to arise at the edge cases, the boundaries between valid and invalid. But the HTML5 spec has no edge cases: the effect of any byte in any state of the parser has been specified.

HTML5 has a whole section which standardizes how to parse bad HTML.

In some ways, I think it's a bit of a moot point what is "bad HTML" if all parsers that conform to the standard parse it in the same way. The spec does mention certain points as being parse errors, but are they really errors after all, if the behaviour is fully specified (and isn't "give up")? In fact, disregarding whether some states are parse errors actually simplifies the spec greatly because many of the states can be coalesced, and for something like a browser or crawler, it's completely irrelevant whether any of these "parse errors" actually occurred during the parsing. One example that comes to mind is spaces after quoted attribute values; a="b" c="d" and a="b"c="d" are parsed identically, except the latter is supposedly a parse error. Yet both are unambiguous.


I've written implementations of the HTML5 color algorithms. There are some sequences of bytes which, when given as a color value in HTML5, don't correspond to an RGB color, which makes things interesting.

(for the record, they are the empty string, and any string that is an ASCII case-insensitive match for the string "transparent")


It's worthwhile pointing out that HTML parsers are allowed to abort parsing the first time they hit each parse error, if they so choose. As such, not all implementations are guaranteed to parse content that contains parse errors, hence why it matters for authoring purposes.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: