Is the point a modernized "You can write Fortran in any language"?
I am getting a little tired of these types of responses to "you can't parse X with regular expressions." Someone always feels the need to come up with a tremendously complex solution that mostly works. Usually it relies on backreferences (which make the regular expression not so ... regular). Meanwhile, the author never seems to talk about the parse time, which can be exponential in a backtracking engine. See http://swtch.com/~rsc/regexp/ for all the gory details.
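To make the "not so regular" remark concrete, here is a minimal sketch in Python (my choice of language, not the article's). The pattern and the name is_mirror are made up for illustration. With one backreference the pattern recognizes strings of the form a^n b a^n, which is a textbook non-regular language, so whatever is doing the matching is no longer a finite automaton.

```python
import re

# Hypothetical illustration: a run of a's, one b, then the *same* run of a's.
# The \1 backreference forces both runs to be identical, so this matches
# a^n b a^n, which no true (finite-automaton) regular expression can do.
MIRROR = re.compile(r"(a*)b\1")


def is_mirror(s):
    """Return True if s is some number of a's, a b, then as many a's again."""
    return MIRROR.fullmatch(s) is not None


print(is_mirror("aabaa"))   # True: two a's on each side
print(is_mirror("aabaaa"))  # False: two a's, then three
```

Python's re module is a backtracking matcher, which is the kind of engine the linked article discusses.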
Specific points:
> I’m talking about parsing XML, not checking whether some input actually is XML. Correctness is a Boolean, after all: invalid XML is not XML.
However, from a particularly pedantic point of view: to parse is to check whether the input is in your language. If your parser accepts input that is not XML, it is not an XML parser. It is a parser that accepts some language which largely overlaps with XML, but is not XML.
I know it's an academic point, but it is an important one. When you have properly parsed your input, you should be sure that the input is what you were expecting. A parser that accepts a different language than the intended one can be highly misleading.
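A small sketch of what I mean, again in Python. The LOOSE pattern below is something I made up to stand in for a tag-matching regex; it is not the one from the article. It is compared against the standard library's xml.etree.ElementTree, which does check well-formedness.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical loose recognizer: any sequence of <...> chunks and plain
# characters. It never checks that tags are properly nested.
LOOSE = re.compile(r"(?:<[^<>]+>|[^<>])*")


def loose_accepts(text):
    """True if the text merely looks like a stream of tags and characters."""
    return LOOSE.fullmatch(text) is not None


def strict_accepts(text):
    """True only if a real XML parser considers the text well-formed."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


bad = "<a><b></a></b>"  # tags closed in the wrong order
print(loose_accepts(bad))   # True
print(strict_accepts(bad))  # False
```

Both functions accept "<a><b></b></a>", so the two languages overlap. They are still not the same language, and the difference only shows up on the input you did not expect.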
So I am left to wonder what is the point of it all. You never say in the article.
tl;dr
Read the dragon book. Learn what a Language actually is. Study some Chomsky. Write a recursive descent parser. Lie in the green green grass. Think about Kurt Gödel. Write a parsing framework for LALR grammars. Think. Then, I believe, the desire for using regular expressions for inappropriate purposes will have left you. Your other tools are just so much cooler.
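For anyone who has not written one, a recursive descent parser is less work than it sounds. Below is a rough Python sketch for a toy tag language. The grammar is my own assumption and is nowhere near real XML: an element is <name>, then any mix of text and child elements, then </name>, with no attributes, entities, comments, or self-closing tags. The names parse, parse_element, and ParseError are made up for this example.

```python
import re

# Tag names only; tokenizing names is a perfectly regular job.
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


class ParseError(Exception):
    """Raised when the input is not in the toy language."""


def parse_element(s, i):
    """Parse one element starting at s[i]; return (tree, next index)."""
    if not s.startswith("<", i) or s.startswith("</", i):
        raise ParseError(f"expected start tag at {i}")
    m = NAME.match(s, i + 1)
    if m is None or not s.startswith(">", m.end()):
        raise ParseError(f"malformed start tag at {i}")
    name, i = m.group(), m.end() + 1
    children = []
    while not s.startswith("</", i):
        if i >= len(s):
            raise ParseError(f"<{name}> is never closed")
        if s[i] == "<":
            # Nested element: this recursion is the part a regex cannot do.
            child, i = parse_element(s, i)
        else:
            # Plain text runs up to the next tag (or end of input).
            j = s.find("<", i)
            j = len(s) if j == -1 else j
            child, i = s[i:j], j
        children.append(child)
    close = f"</{name}>"
    if not s.startswith(close, i):
        raise ParseError(f"expected {close} at {i}")
    return (name, children), i + len(close)


def parse(s):
    """Parse a whole document: exactly one element and nothing after it."""
    tree, i = parse_element(s, 0)
    if i != len(s):
        raise ParseError(f"trailing input at {i}")
    return tree


print(parse("<a>hi<b></b></a>"))  # ('a', ['hi', ('b', [])])
```

Feed it "<a><b></a></b>" and it raises ParseError instead of handing back a tree, which is the whole point: it either gives you a structure or tells you the input is not in the language.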
I’ve actually done all of those things, and using regular expressions for inappropriate purposes remains entertaining, especially for the reactions. Perhaps you need to revisit the joy of making something work in completely the wrong way.
It's the job of the XML parser to answer the question "is this properly formed XML?" -- so you have indeed made something that "works" in completely the wrong way, by not working correctly.