Hacker News new | past | comments | ask | show | jobs | submit login
Regexes Parse XML Just Fine, Actually (evincarofautumn.blogspot.com)
12 points by evincarofautumn on Oct 8, 2011 | hide | past | favorite | 27 comments



You've already said, "Of course, they’re not regular expressions in the strict mathematical sense, but rather in the sense with which most of us are familiar—Perl regexen." And that's just it. Perl's "regular expressions" aren't regular at all! They match a superset of the regular languages.

What are you going to do with the output of this script? If you want to do anything useful with XML, you usually build a DOM out of it during the parse.

What do you mean by this? "I would call this a good ... practical solution to a problem that’s theoretically unsolvable." I'd say that parsing XML is definitely a solvable (and solved) problem.


The problem is “parsing XML with a regex”, which you can’t do unless you take “regex” to mean “Perl regex”. This isn’t a useful script; it only serves to illustrate a point.


Is the point a modernized "You can write Fortran in any language"?

I am getting a little tired of these types of responses to "you can't parse X with regular expressions." Someone always feels the need to come up with a tremendously complex solution that mostly works. Usually using backref matchings, (making the regular expression not so ... regular). Meanwhile, the author never seems to talk about the parse time. Which, will be exponential. See http://swtch.com/~rsc/regexp/ for all the glory details.

Specific points:

> I’m talking about parsing XML, not checking whether some input actually is XML. Correctness is a Boolean, after all: invalid XML is not XML.

However, from a particularly pedantic point of view: To parse is to check whether the input is in your language. If your parser excepts input that is not XML it is not an XML parser. It is a parser which accepts some language which largely overlaps with XML, but is not XML.

I know it's an academic point, but it is an important one. When you have properly parsed your input, you should be sure that the input is what you were expecting. A parser that excepts a different language than the intended one can be highly misleading.

So I am left to wonder what is the point of it all. You never say in the article.

tl;dr

Read the dragon book. Learn what a Language actually is. Study some Chomsky. Write a recursive descent parser. Lay in the green green grass. Think about Kurt Gödel. Write a parsing framework for LALR grammars. Think. Then, I believe, the desire for using regular expressions for in-appropriate purposes will have left you. Your other tools are just so much cooler.


I’ve actually done all of those things, and using regular expressions for inappropriate purposes remains entertaining, especially for the reactions. Perhaps you need to revisit the joy of making something work in completely the wrong way.


It's the job of the XML parser to answer the question "is this properly formed XML?" -- so you have indeed made something that "works" in completely the wrong way, by not working correctly.


...I don't think it illustrates the point you intended it to...


Yes, you can parse XML that way, but you shouldn't. It's more work than you originally thought.

Example: I've tried two parts of the xml syntax, <?xml version="1.0"?> and namespaces <foo:bar />, and it didn't seem to parse either. (At least if I understood the output right, it doesn't say "parsed" or "not parsed").

So it's the typical "author's idea of xml" parser, not xml parser.


It doesn’t handle self-closing tags, but that’s not the point. It would be trivial to add that, but again, that’s not the point.

The point is that you probably don’t want to do this with regular expressions, even though, contrary to popular opinion, you pretty much can. The article even calls the solution “slow and overspecified, but still kinda neat”.


Self closing tags are to the best of my knowledge not a part of XML.


The XML spec refers to them as "empty-element" tags.

http://www.w3.org/TR/REC-xml/#sec-starttags


Oh ofcourse, I thought evincarofautumn meant the SGML/HTML thing where you don't need to close certain tags.


Whenever this topic arises, we should never forget one of the best answers ever given concerning it.

http://stackoverflow.com/questions/1732348/regex-match-open-...


This is why I wrote the article.


No one claimed that it's impossible to use regexes in the process of parsing XML. It is however impossible to parse arbitrary XML with a single regex.

For example: write a regular expression that will return the contents of the first <div> that is found in the following XML:

For example: <div> whatever <div> more stoff </div> foo </div>

or

<div> whatever foo </div>

You can't because XML is not a regular language and regexes only match regular languages.


Every specific counterexample can be countered with a specifically tailored countersolution. If you only need to parse a certain subset of an XML document, then you don’t need to parse XML, and (ordinary) regular expressions will work.

Also, the article does say that you can parse well-formed XML with a single regex, you just can’t validate it that way.


By "just fine" do you mean "not quite"?

You missed the part about the first job of a parser being "check for correct syntax":

http://en.wikipedia.org/wiki/Parser

Parser

In computing, a parser is one of the components in an interpreter or compiler, which checks for correct syntax and builds a data structure (often some kind of parse tree, abstract syntax tree or other hierarchical structure) implicit in the input tokens. The parser often uses a separate lexical analyser to create tokens from the sequence of input characters. Parsers may be programmed by hand or may be (semi-)automatically generated (in some programming languages) by a tool.


We have literally tens-of-thousands of XML documents that had been delivered with one-off regexes and Perl for almost 20 years. Worked like a champ until we wanted to do an update to the look and feel of these pages, then spent the last six years regularizing the SGML and XML, as well as the transforms to just be able to manage these documents in some way.


Jeez, why is this on the home page. This drives me crazy. Please, stop.


I’ve been writing a lot lately, and I’ve gotten in the habit of submitting all of my blog posts to HN in order to get feedback and see what interests the people for whom I’m trying to write. It’s on the home page because enough people thought it was interesting that they upvoted it. If it drives you crazy, all you have to do to make it go away is refrain from upvoting.


Could you please stop submitting everything you write, just because you can? If everyone would do that, it'd be terrible!


There’s a huge difference between “everything I write” and “everything I write on that blog”, but I agree. The readers here have shown some interest in what I write, and for that reason alone I continue to submit it for your judgement. Hopefully my writing will inspire someone to make something cool, or change their teaching style, or design a better programming language. Besides, this is the best platform on which I can promote the work I’ve been doing lately, and I have nothing else to promote: most of the things I find online that are worthy of posting on HN are things that I found via HN. What else can I do but continue to throw links into the void and quietly ask that someone care?


Bolting a stack onto your regular language just makes a Frankenstein context free one, just as bolting a stack onto a DFA just makes a push down automata. This whole blog post is knocking down a straw man.


Can one really call this sort searching parsing? That's certainly somewhat subjective but this would never be called a "parser" in my world.

Everyone who thinks that http://www.w3.org/TR/xml/ can be parsed with a few lines of perl is just mislead. Example: Does the parser handle XML namespaces?

At best it's a trickier way of searching xml files.


All parsing is is discerning structure in flat input, so yes, it’s a parser.


I've had to do this a few times because PHP's XML parser can't parse Microsoft Office MSXML. Works good enough for stuff like uploading a list of hotels and restaurants to a tourism website when the office manager only knows how to use Excel. And the PCRE code is a lot shorter and simpler than the equivalent XML parser code.


> This isn’t a problem if you assume your input is well-formed. I’m talking about parsing XML, not checking whether some input actually is XML. Correctness is a Boolean, after all: invalid XML is not XML. ... [proceeds to parse HTML]

even validated HTML isn't XML, let alone tag soup.


That string does serve to show what you get when you throw some tag soup at the thing, though.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: