
> 1. lxml is way faster than BeautifulSoup - this may not matter if all you're waiting for is the network. But if you're parsing something on disk, this may be significant.

Caveat: lxml's HTML parser is garbage, and so is BS's. Both will parse pages in non-obvious ways which do not reflect what you see in your browser, because your browser follows the HTML5 tree-building algorithm.

html5lib fixes that (it can construct both lxml and bs trees, and both libraries have html5lib integration), but it's slow. I don't know of a native parser that is compatible: there are plenty of native HTML5 parsers, e.g. gumbo or html5ever, but I don't remember them being able to generate lxml or bs trees.
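
To make the difference concrete, here is a minimal sketch, assuming both `lxml` and `html5lib` are installed. The sample markup is made up for illustration. An HTML5 parser inserts an implicit `<tbody>` inside a table, as a browser does, while the libxml2-based parser typically does not:

```python
import html5lib
import lxml.html

# Made-up sample markup: a table without an explicit <tbody>.
markup = "<table><tr><td>cell</td></tr></table>"

# libxml2-based parser (what lxml.html uses by default).
legacy = lxml.html.fromstring(markup)

# html5lib with the lxml tree builder: HTML5 tree construction, lxml tree out.
# namespaceHTMLElements=False keeps tag names un-namespaced so plain XPath works.
spec = html5lib.parse(markup, treebuilder="lxml", namespaceHTMLElements=False)

legacy_tbody = legacy.xpath("//tbody")
spec_tbody = spec.xpath("//tbody")

# Compare how many <tbody> elements each parser produced.
print("libxml2 parser:", len(legacy_tbody), "| html5lib:", len(spec_tbody))
```

An XPath such as `//table/tbody/tr` copied from browser dev tools would therefore match the html5lib tree, and may silently match nothing in the default lxml tree.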

> 2. Don't forget to check the status code of r (r.status_code or less generally r.ok)

Alternatively (depending on use case) `r.raise_for_status()`. I'm still annoyed that there's no way to ask requests to just check it outright.
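
One workaround, sketched here under the assumption that a per-session setup is acceptable, is a response hook that calls `raise_for_status()` on every response. The names `raise_on_error` and `checked_session` are hypothetical helpers, not part of requests:

```python
import requests


def raise_on_error(response, *args, **kwargs):
    """Response hook: raise requests.HTTPError on 4xx/5xx responses."""
    response.raise_for_status()


def checked_session():
    """Return a Session that checks the status of every response it receives."""
    session = requests.Session()
    # Session.hooks is a dict of event name -> list of callables.
    session.hooks["response"].append(raise_on_error)
    return session


# Usage (not executed here, since it needs network access):
# session = checked_session()
# r = session.get("https://example.com/page")  # raises on an error status
```

This still has to be wired up once per session, so it does not remove the annoyance, it only centralises it.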

> Those with a background in coding might prefer the .cssselect method available in whatever object the parsed document results in. That's obviously a tad slower than find/findall/xpath, but it's oftentimes too convenient to pass upon.

FWIW cssselect simply translates CSS selectors to XPath, and while I don't know for sure, I'm guessing it has an expression cache, so it should not be noticeably slower than XPath (CSS selectors are not a hugely complex language anyway).





Nice. Seems to only do lxml tree building?



Damn.

Are you the author? If so, well done, that looks like a great package.


On the contrary, I have found lxml suitable for all of my scraping projects where the objective is to write some XPath to parse or extract some data from some element.


lxml itself is fine. The problem is that its HTML parser (libxml's, really) is an ad-hoc "HTML4" parser, which means the tree it builds routinely diverges from a proper HTML5 tree as you'd find in e.g. your browser's developer tools, and the way it fixes markup (or whether it fixes it at all) is hard to predict.


Are you talking about etree.HTML() being garbage? And what are your thoughts on parsing it as xml (e.g. etree.fromstring(), etree.parse() )?


> Are you talking about etree.HTML() being garbage?

Yes.

> And what are your thoughts on parsing it as xml (e.g. etree.fromstring(), etree.parse() )

No problem there. XML is much stricter and thus easier to "get right", so to speak. lxml's HTML parser is built upon libxml's HTML parser[0], which predates HTML5, has not been updated to handle it, and is, as its documentation notes,

> an HTML 4.0 non-verifying parser

This means it harks back to an era where every parser did its own thing and tried its best on the garbage it was given, without necessarily taking its neighbours into account.

[0] http://xmlsoft.org/html/libxml-HTMLparser.html
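
A short sketch of that contrast, assuming `lxml` is installed and using a made-up malformed snippet: the XML parser rejects the input outright, while the HTML parser repairs it in whatever way it sees fit.

```python
from lxml import etree

# Made-up snippet with a mis-nested <b> element.
broken = "<p>unclosed <b>bold</p>"

# Strict XML parsing: malformed input is an error, not something to guess at.
strict_error = None
try:
    etree.fromstring(broken)
except etree.XMLSyntaxError as exc:
    strict_error = exc

# Lenient HTML parsing: the parser recovers and returns some tree.
recovered = etree.HTML(broken)

print("XML parser error:", strict_error)
print("HTML parser output:", etree.tostring(recovered).decode())
```

Parsing as XML is only an option when the document really is well-formed, e.g. XHTML or an XML feed.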



