How to parse HTML (perl.org)
68 points by ranit8 on Jan 7, 2012 | 32 comments



In the bad old days, parsing real-world HTML was a horrible task because every web-browser had a huge collection of undocumented corner-cases and hacks; some accidental, some the result of reverse-engineering other vendors' corner-cases and hacks. Most standalone HTML parsers could generate some document tree from a given input file; whether or not it would match the one generated by an actual browser was another matter.

These days, however, we have the HTML5 parsing algorithm, reverse-engineered from various vendors' web browsers but actually documented and implementable (still horribly complicated, but that's legacy content for you). Not only is the HTML5 parsing algorithm designed to be compatible with legacy browsers, but modern browsers are also replacing their old parsing code with new HTML5-compatible implementations, so parsing should become even more consistent (I know Firefox has switched to an HTML5 parser, and I think IE has made a bunch of noise about it too; I don't follow WebKit all that closely, but I'd be surprised if they haven't moved towards an HTML5 parser).


> I know Firefox has switched to an HTML5 parser

Yep, this was mainlined in Firefox 4 (with Gecko 2.0).

> I think IE has made a bunch of noise about it too

Support is being built, it's planned for IE10.

> I don't follow WebKit all that closely, but I'd be surprised if they haven't moved towards an HTML5 parser

The HTML5 parsing algorithm has been in WebKit since the second half of 2010.

And you haven't asked, but HTML5 parsing was officially released in Opera 11.6 last month.


> HTML5 parsing was officially released in Opera 11.6 last month.

I hope that's not related to the annoying freezes the community's been complaining about since that release...


WebKit has been using the new algorithm since mid 2010. And I'm pretty sure Chrome 7 (October 2010) was the first major browser to ship an HTML5-compliant parser.


Right, WebKit shipped code based on the spec first but the spec itself underwent subsequent revision as the Gecko / Presto implementors found site compatibility issues and bugs. I think the WebKit implementation was recently updated to the spec, so we are at, or at least very close to, having very interoperable HTML parsing in Opera/Firefox/Safari/Chrome. I also believe that Microsoft are aiming to implement the new algorithm in IE 10.

From an interoperability point of view the HTML parsing algorithm is the poster child for the success of the HTML effort; there is a test suite of several thousand tests [1] (also submitted to the W3C [2]) that has contributions from multiple browser vendors and a number of unaffiliated individuals. Although parsing isn't sexy in the way that, say, <canvas> is, getting interoperable parsing makes it much easier to create cross-browser content (at Opera we closed a huge number of site-compatibility bugs when we landed the new algorithm).

There are also a few open-source implementations that are not tied to browsers, e.g. for Python (and kind of also PHP) [3], for Java [4] (fun fact: the Gecko C++ implementation is generated from that Java implementation) and JavaScript [5]; a minimal usage sketch follows the links below. It would be great to see more conforming implementations for other languages, or to see libraries like libxml2 that have existing ad-hoc HTML parsers update their implementations to match the spec.

[1] http://code.google.com/p/html5lib/source/browse/#hg%2Ftestda...

[2] http://w3c-test.org/html/tests/submission/Opera/html5lib/

[3] http://code.google.com/p/html5lib/

[4] http://about.validator.nu/htmlparser/

[5] https://github.com/andreasgal/dom.js
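As a minimal sketch of what the Python implementation [3] looks like in use (API details may vary between html5lib versions):

    import html5lib

    # html5lib implements the HTML5 parsing algorithm, so it recovers
    # from broken markup the same way browsers do. By default parse()
    # returns an xml.etree.ElementTree element.
    doc = html5lib.parse("<p>unclosed <b>tags")

    for element in doc.iter():
        # tag names come back namespaced, e.g. {http://www.w3.org/1999/xhtml}p
        print(element.tag)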


There's a Perl implementation on CPAN as well, though I haven't made nontrivial use of it, so I'm not sure how fast/robust it is: http://search.cpan.org/~tobyink/Task-HTML5-0.103/lib/Task/HT...



Now that html5 defines how to parse all html fragments, there is really no reason not to use that algorithm.
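For instance, a sketch of fragment parsing with the Python html5lib package (assuming its parseFragment API; details may vary by version):

    import html5lib

    # Per the spec, fragment parsing is context-sensitive: the same
    # markup can produce different trees depending on the container.
    fragment = html5lib.parseFragment("<td>cell", container="tr")
    for element in fragment:
        print(element.tag)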


Could you provide some working examples of using an HTML5 parser for input sanitization?



I see that they mention a sanitizer and give an example of how to call it, but I can't find any real-life code doing sanitization. Am I missing something? (I'm curious about the level of complexity such a library would require in 'client' code.)


Bleach [1] is a sanitizer that uses html5lib on the backend. I think that Mozilla use it.

[1] http://pypi.python.org/pypi/bleach
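If it helps, a rough sketch of typical Bleach usage (the whitelists here are just illustrative):

    import bleach

    # Disallowed tags are escaped by default (strip=True removes them);
    # unsafe URL schemes such as javascript: are dropped from attributes.
    safe = bleach.clean(
        '<a href="javascript:alert(1)">hi</a><script>evil()</script>',
        tags=['a', 'p', 'strong'],            # allowed tags
        attributes={'a': ['href', 'title']},  # allowed attributes per tag
    )
    print(safe)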


You're assuming that web sites consist of compliant HTML, which is never the case.


The HTML5 parsing algorithm was designed to standardize parsing of real-world pages, including error recovery (for invalid and/or legacy markup); that's the whole bloody point of it.

Better, using an implementation of the HTML5 parsing algorithm means you're parsing pages the same way browsers do: Gecko (Firefox), WebKit (Chrome and Safari) and Presto (Opera) have all landed the HTML5 parsing algorithm, and Trident (IE) is in the process of getting it (the feature is planned for IE10's Trident 6.0).


The HTML5 parsing algorithm states how to parse invalid HTML.


This article should be called, "how to write a Marpa-based HTML parser", not "how to parse HTML". If you're a Perl programmer and want to parse HTML into an XML-style DOM, use XML::LibXML. If you can't handle the libxml2 dependency, use HTML::Parser.


The fact that browsers accept defective html is the most evil thing that happened to the web. Any library that tries to parse "real world" html just contributes to that evil. I am astonished that we tolerate this and still call ourselves (software) engineers.


"Be liberal in what you accept, and conservative in what you send." - http://en.wikipedia.org/wiki/Robustness_principle



Good read, thanks for that. I agree with the conclusion: there is no one-size-fits-all rule for interoperability.

The way I see it, it's ultimately about tradeoffs. I can only imagine what things would be like today if web browsers implemented a strict parsing of HTML and refused to render invalid pages. One possibility is hindered adoption of HTML by the masses. Another is that two vendors would disagree about the HTML spec and cause pages to be browser-specific. (Turns out this happened anyway :-))

Is the HTML5 spec better in terms of interop and compatibility than the previous ones? http://www.tbray.org/ongoing/When/201x/2010/02/15/HTML5


There are pages on the web which will never be updated because the author is dead. Browsers have to be able to render what is out there.

You could argue that we would have been better off now if all browsers from day one had only rendered valid html, but you need a time machine to fix that.


This problem can trivially be solved by introducing a new doctype. <!doctype newhtml> - strict parsing, otherwise sloppy parsing. I honestly don't understand why the web community doesn't adopt it.


Ummm, that already exists. That's how XHTML (delivered as XML) works. Make a syntax error in the page? Browser gives up, displays an error. It didn't catch on.

The various HTML strict modes turn off "quirks" mode as well.


It doesn't solve any problem, since the invalid html will still be in the wild and you still need to parse it. You just introduce a new parsing mode without graceful recovery.

Some authors might use the newhtml doctype (because they have read somewhere it is better) but only test in a browser which doesn't support newhtml mode, so they still don't discover that the html is invalid. So we are back to square one.


As engineers our job is to make it easy for people to do things. Being tolerant of ordinary people's mistakes makes it possible for non-engineers to make web pages, and that's a good thing.


Show me a non-engineer who creates web pages by writing raw html. And even if they did, wouldn't they be better off if the browser gave them helpful error messages to help them fix their html, rather than just silently rendering nonsense?


Since this seems to be aimed (among other things) towards input sanitization, here is a semi-relevant entry that might amuse someone.

https://gist.github.com/1575452

This is a sanitizing HTML "parser" done in roughly 100 lines of PHP code. It does tag and attribute whitelisting, checks protocols to prevent XSS, deals with unclosed and unopened tags, and does some other things. The biggest issue is that it's not well-factored. However, its shortness is appealing, because I understand how it works. I would have a hard time trusting a library with thousands of lines of code to do input validation.
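For flavor, here's a toy version of the same whitelisting idea in Python (not the gist's code; ALLOWED and SAFE_PROTOCOLS are made-up stand-ins, and unlike the gist this one doesn't repair unclosed tags):

    from html import escape
    from html.parser import HTMLParser

    ALLOWED = {'b': [], 'i': [], 'a': ['href']}      # hypothetical tag/attr whitelist
    SAFE_PROTOCOLS = ('http:', 'https:', 'mailto:')  # hypothetical protocol list

    class Sanitizer(HTMLParser):
        def __init__(self):
            super().__init__(convert_charrefs=True)
            self.out = []
            self.skip = 0

        def handle_starttag(self, tag, attrs):
            if tag not in ALLOWED:
                if tag in ('script', 'style'):
                    self.skip += 1                    # drop raw-text content too
                return                                # drop disallowed tags
            kept = [(k, v) for k, v in attrs
                    if k in ALLOWED[tag]
                    and (k != 'href' or (v or '').startswith(SAFE_PROTOCOLS))]
            self.out.append('<%s%s>' % (tag, ''.join(
                ' %s="%s"' % (k, escape(v or '', quote=True)) for k, v in kept)))

        def handle_endtag(self, tag):
            if tag in ('script', 'style'):
                self.skip = max(0, self.skip - 1)
            elif tag in ALLOWED:
                self.out.append('</%s>' % tag)

        def handle_data(self, data):
            if not self.skip:
                self.out.append(escape(data))         # escape all text content

    s = Sanitizer()
    s.feed('<a href="javascript:evil()">x</a><b>ok</b><script>bad()</script>')
    print(''.join(s.out))  # -> <a>x</a><b>ok</b>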


For you python users, the BeautifulSoup module has a prettify method which does the same thing.
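For example (a sketch with the bs4 package; note that BeautifulSoup repairs the tree while parsing, but does no whitelisting on its own):

    from bs4 import BeautifulSoup

    # The parser fixes up unclosed/unopened tags; prettify()
    # re-serializes the repaired tree with indentation.
    soup = BeautifulSoup('<p>unclosed <b>tags', 'html.parser')
    print(soup.prettify())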


Bleach [0] might be a better idea; it's based on html5lib.

[0] http://pypi.python.org/pypi/bleach


If you want to get serious about web crawling and/or web scraping (within legal boundaries of course), you want to use Node.js and appropriate modules (I don't remember the exact names right now). This is because Node.js, being based on the V8 JavaScript engine, can completely emulate a real web browser - it can load and parse the HTML, as well as the JavaScript. And many sites won't load properly without JavaScript.


What you're saying makes no sense whatsoever, at any level of resolution.

Chrome's rendering engine, and the library used to deal with parsing HTML and building a DOM tree, is WebKit's WebCore [0]. V8 and WebCore are not the same thing: V8 does not provide a DOM implementation (that's WebCore's job), nor does it handle any HTML parsing (that's also WebCore's job).

V8 is a JavaScript VM. That's it. It does not "emulate a real web browser" (let alone completely), and neither does Node.

[0] http://trac.webkit.org/browser/trunk/WebCore?rev=64712


That's why I said emulate. V8 (Node) with appropriate modules can emulate the browser - both parse the HTML into a DOM, and then run scripts on that DOM. PHP/Perl/etc. can't do that. Java could do that with Rhino, I assume, but I'd say V8 is much closer. I'm also not saying anything about emulating exactly Chrome. I wish I had time to dig up that module for Node now, but I don't (I don't remember the name).



