In the bad old days, parsing real-world HTML was a horrible task because every web-browser had a huge collection of undocumented corner-cases and hacks; some accidental, some the result of reverse-engineering other vendors' corner-cases and hacks. Most standalone HTML parsers could generate some document tree from a given input file; whether or not it would match the one generated by an actual browser was another matter.
These days, however, we have the HTML5 parsing algorithm, reverse-engineered from various vendors' web browsers but actually documented and implementable (still horribly complicated, but that's legacy content for you). Not only is the HTML5 parsing algorithm designed to be compatible with legacy browsers, modern browsers are replacing their old parsing code with new HTML5-compatible implementations, so parsing should be even more consistent (I know Firefox has switched to an HTML5 parser, I think IE has made a bunch of noise about it too; I don't follow WebKit all that closely, but I'd be surprised if they haven't moved towards an HTML5 parser).
WebKit has been using the new algorithm since mid 2010. And I'm pretty sure Chrome 7 (October 2010) was the first major browser to ship an HTML5 compliant parser.
Right, WebKit shipped code based on the spec first but the spec itself underwent subsequent revision as the Gecko / Presto implementors found site compatibility issues and bugs. I think the WebKit implementation was recently updated to the spec, so we are at, or at least very close to, having very interoperable HTML parsing in Opera/Firefox/Safari/Chrome. I also believe that Microsoft are aiming to implement the new algorithm in IE 10.
From an interoperability point of view the HTML parsing algorithm is the poster child for the success of the HTML effort; there is a testsuite of several thousand tests [1] (also submitted to the W3C [2]) that has contributions from multiple browser vendors and a number of unaffiliated individuals. Although parsing isn't sexy in the way that, say, <canvas> is, getting interoperable parsing makes it much easier to create cross-browser content (at Opera we closed a huge number of site-compatibility bugs when we landed the new algorithm).
There are also a few open-source implementations that are not tied to browsers, e.g. for Python (and kind of also PHP) [3], for Java [4] (fun fact: the Gecko C++ implementation is generated from that Java implementation) and JavaScript [5] (https://github.com/andreasgal/dom.js). It would be great to see more conforming implementations for other languages, or to see libraries like libxml2 that have existing ad-hoc HTML parsers update their implementations to match the spec.
I see that they mention a sanitizer and give an example of how to call it, but I can't find any real-life code doing sanitization. Am I missing something? (I'm curious about the level of complexity such a library would require in 'client' code.)
The HTML5 parsing algorithm was designed to standardize parsing of real-world pages, including error recovery (for invalid and/or legacy markup); that's the whole bloody point of it.
Better, using an implementation of the HTML5 parsing algorithm means you're parsing pages the same way browsers do: Gecko (Firefox), WebKit (Chrome and Safari) and Presto (Opera) have all landed the HTML5 parsing algorithm, and Trident (IE) is in the process of getting it (the feature is planned for IE10's Trident 6.0).
This article should be called, "how to write a Marpa-based HTML parser", not "how to parse HTML". If you're a Perl programmer and want to parse HTML into an XML-style DOM, use XML::LibXML. If you can't handle the libxml2 dependency, use HTML::Parser.
The fact that browsers accept defective html is the most evil thing that happened to the web. Any library that tries to parse "real world" html just contributes to that evil. I am astonished that we tolerate this and still call ourselves (software) engineers.
Good read, thanks for that. I agree with the conclusion: there is no one-size-fits-all rule for interoperability.
The way I see it, it's ultimately about tradeoffs. I can only imagine what things would be like today if web browsers implemented a strict parsing of HTML and refused to render invalid pages. One possibility is hindered adoption of HTML by the masses. Another is that two vendors would disagree about the HTML spec and cause pages to be browser-specific. (Turns out this happened anyway :-))
There are pages on the web which will never be updated because the author is dead. Browsers have to be able to render what is out there.
You could argue that we would have been better off now if all browsers from day one had only rendered valid html, but you need a time machine to fix that.
This problem can trivially be solved by introducing a new doctype. <!doctype newhtml> - strict parsing, otherwise sloppy parsing. I honestly don't understand why the web community doesn't adopt it.
Ummm, that already exists. That's how XHTML (delivered as XML) works. Make a syntax error in the page? Browser gives up, displays an error. It didn't catch on.
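To make the strict-versus-forgiving difference concrete, here is a minimal Python sketch using only the standard library. It feeds the same broken markup to an XML parser, which gives up much like a browser does with XHTML served as XML, and to the lenient `html.parser` tokenizer, which just keeps going. The names `BROKEN`, `strict_parse`, `TagCollector` and `lenient_parse` are made up for illustration, and `html.parser` is only a forgiving tokenizer, not an implementation of the HTML5 tree-building algorithm.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

BROKEN = "<p>Hello <b>world</p>"  # <b> is never closed


def strict_parse(markup):
    """Parse as XML: any well-formedness error aborts the whole parse."""
    try:
        ET.fromstring(markup)
        return "ok"
    except ET.ParseError as exc:
        return f"error: {exc}"


class TagCollector(HTMLParser):
    """Lenient tokenizer: records start tags and does not raise on bad nesting."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def lenient_parse(markup):
    """Return the start tags seen, in order, without validating nesting."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


if __name__ == "__main__":
    print(strict_parse(BROKEN))   # reports a parse error
    print(lenient_parse(BROKEN))  # lists the tags it saw anyway
```

The strict path hands the author an error and nothing else; the lenient path produces something usable from the same input, which is the tradeoff this thread is arguing about.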
The various HTML strict modes turn off "quirks" mode as well.
It doesn't solve any problem, since the invalid html will still be in the wild and you still need to parse it. You just introduce a new parsing mode without graceful recovery.
Some authors might use the newhtml doctype (because they have read somewhere it is better) but only test in a browser which doesn't support newhtml mode, so they still don't discover that the html is invalid. So we are back to square one.
As engineers our job is to make it easy for people to do things. Being tolerant of ordinary people's mistakes makes it possible for non-engineers to make web pages, and that's a good thing.
Show me a non-engineer who creates web-pages by writing raw html. And even if they did, wouldn't they be better off if the browser gave them helpful error messages to help them fix their html, rather than just silently rendering nonsense?
This is a sanitizing HTML "parser" done in roughly 100 lines of PHP code. It does tag and attribute whitelisting, checks for protocols to prevent XSS, deals with unclosed and unopened tags, and does some other things. The biggest issue is that it's not well-factored. However, its shortness is appealing, because I understand how it works. I would have a hard time trusting a library with thousands of lines of code to do input validation.
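For anyone wondering what those steps look like in practice, here is a rough Python sketch of the same idea. It is not the PHP code described above, just an illustration built on the standard library's `html.parser`. The whitelist contents and the names `ALLOWED_TAGS`, `ALLOWED_SCHEMES`, `Sanitizer` and `sanitize` are all assumptions made for this example, and it has not been audited, so treat it as a way to see the moving parts rather than something to put in front of untrusted input.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Hypothetical whitelist: tag name -> attribute names allowed on that tag.
ALLOWED_TAGS = {"a": {"href"}, "b": set(), "i": set(), "p": set()}
# Hypothetical scheme whitelist; "" means a relative URL with no scheme.
ALLOWED_SCHEMES = {"http", "https", "mailto", ""}


class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # sanitized output fragments
        self.open_tags = []  # stack of whitelisted tags currently open

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop the tag itself; its text still arrives via handle_data
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_TAGS[tag] or value is None:
                continue  # attribute whitelisting
            scheme = urlsplit(value.strip()).scheme.lower()
            if name == "href" and scheme not in ALLOWED_SCHEMES:
                continue  # protocol check, e.g. rejects javascript: URLs
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # unopened (or non-whitelisted) end tag: ignore it
        # Close anything left open inside this element, then the element itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))  # text is always escaped

    def close(self):
        super().close()  # flush any buffered text first
        while self.open_tags:  # then close tags the input never closed
            self.out.append(f"</{self.open_tags.pop()}>")


def sanitize(markup):
    sanitizer = Sanitizer()
    sanitizer.feed(markup)
    sanitizer.close()
    return "".join(sanitizer.out)
```

Calling `sanitize('<p>Hi <b>there</p>')` closes the dangling `<b>` before the `</p>`, and a `javascript:` href is dropped while the link text is kept. One design choice worth noting: text inside a rejected tag such as `<script>` is kept as escaped text rather than removed, which may or may not be what you want.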
If you want to get serious about web crawling and/or web scraping (within legal boundaries of course), you want to use Node.js and appropriate modules (don't remember the exact names right now). This is because Node.js, being based on the V8 JavaScript engine, can completely emulate a real web browser - it can load and parse the HTML, as well as JavaScript. And many sites won't load properly without JavaScript.
What you're saying makes no sense whatsoever, at any level of resolution.
Chrome's rendering engine, and the library used to deal with parsing HTML and building a DOM tree, is WebKit's WebCore[0]. V8 and WebCore are not the same thing, and V8 does not provide a DOM implementation (that's WebCore's job) nor does it handle any HTML parsing (that's also WebCore's job).
V8 is a javascript VM. That's it. It does not "emulate a real web browser" (let alone completely), and nor does Node.
That's why I said emulate. V8 (Node) with appropriate modules can emulate the browser - both parse the DOM, and then run scripts on that DOM. PHP/Perl/etc. can't do that. Java could do that with Rhino I assume, but I'd say V8 is much closer. I'm also not saying anything about emulating exactly Chrome. I wish I had time to dig up that module for Node now, but I don't (I don't remember the name).