The fact that some people don't properly configure their parsers isn't an argume...

0x0 · on Nov 30, 2014

It kind of is, though: the format specifies a very non-obvious feature that may have serious security consequences if left enabled, that isn't easily discoverable for users starting with XML, and that, frankly, has little use. A better format would bring fewer surprises.

schoen · on Nov 30, 2014

Yeah, many users would actually not like some of these features. If you're using XML to serialize static tree-structured data in a fixed schema (like people commonly use JSON), would you expect that your parser would be vulnerable to this?

https://en.wikipedia.org/wiki/Billion_laughs
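For illustration, here is a minimal sketch of one way to refuse that kind of input in Python, assuming your data never legitimately needs a DTD. The names `parse_without_dtd` and `DoctypeForbidden` are made up for this example. It uses the standard library's expat binding to abort as soon as a DOCTYPE declaration starts, before any entity is declared or expanded, and only then hands the document to ElementTree. It is a sketch under those assumptions, not a complete hardening recipe.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


class DoctypeForbidden(ValueError):
    """Hypothetical error type: the document carries a DOCTYPE declaration."""


def _reject_doctype(name, system_id, public_id, has_internal_subset):
    # Called by expat when a DOCTYPE declaration begins, i.e. before any
    # entity definitions inside it have been read.
    raise DoctypeForbidden("DOCTYPE declarations are not accepted: %r" % name)


def parse_without_dtd(data: bytes) -> ET.Element:
    """Parse XML, but refuse any document that declares a DOCTYPE."""
    checker = expat.ParserCreate()
    checker.StartDoctypeDeclHandler = _reject_doctype
    # First pass: well-formedness check that aborts on the first DOCTYPE.
    checker.Parse(data, True)
    # Second pass: build the tree, now known to contain no DTD.
    return ET.fromstring(data)
```

Parsing twice is wasteful, but it keeps the sketch short; a plain static-tree payload with no DOCTYPE goes through unchanged.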

NelsonMinar · on Nov 30, 2014

"Some people" is a pretty big category with some good company. Google in April 2014, for instance. Or PostgreSQL in 2012. http://blog.detectify.com/post/82370846588/how-we-got-read-a... http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2012-3489

ssclafani · on Dec 1, 2014

Facebook as well: http://www.ubercomp.com/posts/2014-01-16_facebook_remote_cod... XXE to Remote Code Execution. They paid out their largest bounty ever for it, $33,500.

lmm · on Dec 1, 2014

When choosing a format to use for a real-world problem, the real-world parsing libraries and their real-world behaviour (including how they tend to be configured in practice) are important considerations.

rodgerd · on Nov 30, 2014

If making your parser spec-compliant makes it more vulnerable to security problems, that would tend to suggest something other than the parser is problematic.

See also: https://www.youtube.com/watch?v=PE9fXM7aOxo

mrweasel · on Dec 1, 2014

If people are using an XML parser they're still ahead of the majority of the software business.

We have one partner that discovered a bug in our XML feed (developed by a consultant); the bug only surfaces because they actually use an XML parser. They're the first of our customers to find this bug, in 10 years. The rest simply view the feed as plain text.

The consulting company that originally did the feed generation also has something against using XML parsers. The only code they have that doesn't just concatenate strings is a logging library (a library we asked them to stop using because their way of doing XML logs is useless with Splunk and pretty much any other tool).
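To make the difference concrete, here is a small sketch in Python (not the consultants' actual code; the `item` element, the `title` attribute, and both function names are invented for the example). The concatenating version emits whatever it is given, while the serializer escapes the special characters so a real parser can read the value back.

```python
import xml.etree.ElementTree as ET


def item_by_concatenation(title: str) -> str:
    # Hypothetical string-gluing generator: no escaping at all.
    return '<item title="' + title + '"/>'


def item_by_serializer(title: str) -> str:
    # Let the library build and serialize the element; it escapes
    # &, <, > and " inside the attribute value.
    element = ET.Element("item", {"title": title})
    return ET.tostring(element, encoding="unicode")


title = 'Tom & "Jerry" <live>'

try:
    ET.fromstring(item_by_concatenation(title))
except ET.ParseError as error:
    print("concatenated output is not well-formed:", error)

# The serialized output round-trips to the original value.
print(ET.fromstring(item_by_serializer(title)).get("title") == title)
```

Anyone who treats the feed as plain text never notices the first case, which is presumably how a bug like that can survive for years.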

im2w1l · on Dec 1, 2014

I once made a consumer for an xml api. I made a simple regex hack that worked fine. After reading some rants about "parsing" xml with regex I swapped it out for a real parser.

A few minutes later parsing started failing because there were unescaped <> characters in attributes. Reported the bug, got a wontfix back.

I reverted to regex and it has been working fine ever since.

bambax · on Dec 1, 2014

If there are unescaped <> chars in the "xml" then it's NOT xml and the API shouldn't be called "xml" but "plain text made to look a little like xml".

The people producing such garbage should be ashamed of themselves and should be publicly shamed.

Of course, as consumers of an API we often don't have any power over the producer and have to swallow what we're given as is; but even in that case the correct approach is to have a first step that cleans/corrects the xml (with something like Beautiful Soup, for example) and then feed the clean xml to a proper parser.
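A rough sketch of that two-step idea in Python, using only the standard library rather than Beautiful Soup. The function names are invented, and the repair step is deliberately naive: it assumes attribute values are double-quoted and that no stray double quotes appear in text content, comments, or CDATA. It tries the strict parser first and only falls back to the repair when that fails.

```python
import re
import xml.etree.ElementTree as ET

# Naive assumption: every ="..." run in the input is an attribute value.
_ATTR_VALUE = re.compile(r'="([^"]*)"')


def _escape_value(match: "re.Match[str]") -> str:
    # Escape the angle brackets that the producer left raw.
    value = match.group(1).replace("<", "&lt;").replace(">", "&gt;")
    return '="' + value + '"'


def repair_attributes(raw: str) -> str:
    """Hypothetical cleaning step: escape < and > inside attribute values."""
    return _ATTR_VALUE.sub(_escape_value, raw)


def parse_feed(raw: str) -> ET.Element:
    """Parse strictly; on failure, repair the attributes and parse again."""
    try:
        return ET.fromstring(raw)
    except ET.ParseError:
        return ET.fromstring(repair_attributes(raw))
```

For example, `parse_feed('<feed><item range="1 <> 2"/></feed>')` fails on the first attempt and succeeds after the repair, while input that is still broken after cleaning raises `ET.ParseError` instead of being silently accepted.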

ajkjk · on Dec 1, 2014

What!? Of course it is. It might not be an argument that you buy, but it's definitely an argument.

jimmaswell · on Dec 3, 2014

Invalid argument then.