> 1. lxml is way faster than BeautifulSoup - this may not matter if all you're w...

aumerle · on Oct 24, 2017

https://github.com/kovidgoyal/html5-parser

masklinn · on Oct 24, 2017

Nice. Seems to only do lxml tree building?

aumerle · on Oct 24, 2017

http://html5-parser.readthedocs.io/en/latest/#html5_parser.p...

masklinn · on Oct 24, 2017

Damn.

Are you the author? If so, well done, that looks like a great package.

dmn001 · on Oct 24, 2017

On the contrary, I have found lxml suitable for all of my scraping projects where the objective is to write some XPath to parse or extract some data from some element.

masklinn · on Oct 24, 2017

LXML itself is, the problem is that its HTML parser (libxml's really) is an ad-hoc "HTML4" parser which means the tree it builds routinely diverges from a proper HTML5 tree as you'd find in e.g. your browser's developer tools and the way it fixes (or whether it fixes it at all) markup is completely ad-hoc and hard to predict.

staticautomatic · on Oct 24, 2017

Are you talking about etree.HTML() being garbage? And what are your thoughts on parsing it as xml (e.g. etree.fromstring(), etree.parse() )?

masklinn · on Oct 24, 2017

> Are you talking about etree.HTML() being garbage?

Yes.

> And what are your thoughts on parsing it as xml (e.g. etree.fromstring(), etree.parse() )

No problem there. XML is much stricter and thus easier to "get right" so to speak. lxml's html parser is built upon libxml's HTML parser[0], which predates HTML5, has not been updated to handle it, and is as its documentation notes

> an HTML 4.0 non-verifying parser

This means it harks back to an era where every parser did its thing and tried its best on the garbage it was given without necessarily taking in account the neighbour.

[0] http://xmlsoft.org/html/libxml-HTMLparser.html