> However, very often we do not have to deal with arbitrary HTML.
That really depends on who "we" are and what you mean by "very often".
I used to develop web crawlers, HTML parsers, document analysis infrastructure, and various other things that come into contact with "content" for web crawlers at various search engine companies. If you assume people can produce valid, or even halfway sane, HTML, you'll be disappointed. As for how you parse insane HTML: with difficulty.