> Very often we have to deal with documents that only use a subset of HTML and they can be parsed by regular expressions just fine.
I agree, assuming the emitter of the document to be processed a) is correct (as in bug free, and as in knows the expected HTML subset and the corner cases of its ad hoc parsing method, and will never make a mistake nor assume the other end has an actual HTML parser) and b) has no ill intent.
In practice, it is neither.
> people thinking they are smarter than they really are
I genuinely think it is the quip of someone who has been burned so often by the practical side of this that, out of despair at people trying to be smart with regexes, they would rather laugh frantically at themselves as much as at the issue. They chose not to be smart at all, and to just use an HTML parser to parse HTML documents.
One of the situations I've seen is roughly:
System is up: returns XML or JSON or pretty-format-of-the-week
System is down: returns an HTML IIS error page.
In the second case, all you might want is to extract the content of the first <h1> tag out of that error page. That's a predictable enough task that a regex might be able to handle it, especially if at that point you've already given up on a full success and you're just salvaging a prettier error message than "system error".
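To make that concrete, here is a rough sketch of the salvage case in Python. The function name `salvage_error_title` and the fallback string are made up for illustration, and the pattern assumes the error page has an ordinary, non-nested <h1> with a closing tag. If the regex finds nothing, it just gives up and returns the generic message.

```python
import html
import re

# Matches the first <h1 ...>...</h1>, case-insensitively and across newlines.
# Assumption: the heading is not nested and has an explicit closing tag.
_H1 = re.compile(r"<h1\b[^>]*>(.*?)</h1\s*>", re.IGNORECASE | re.DOTALL)
_TAG = re.compile(r"<[^>]+>")


def salvage_error_title(body, fallback="system error"):
    """Best-effort: return the text of the first <h1>, else the fallback."""
    match = _H1.search(body)
    if match is None:
        return fallback
    text = _TAG.sub("", match.group(1))   # drop inline tags such as <b> or <i>
    text = html.unescape(text)            # turn &amp; and friends into characters
    text = " ".join(text.split())         # collapse whitespace and newlines
    return text or fallback
```

The important part is that a failed match costs nothing: you were already showing "system error", so the regex can only improve things. That is a very different situation from relying on it for the success path.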