
> However, very often we do not have to deal with arbitrary HTML.

That really depends on who "we" are and what you mean by "very often".

I used to develop web crawlers, HTML parsers, document analysis infrastructure, and various other things that come into contact with "content" for web crawlers at various search engine companies. If you assume people can produce valid, or even halfway sane, HTML, you'll be disappointed. As for how you parse insane HTML: with difficulty.


