XML for databases: a dead idea (lemire.me)
46 points by joeyespo on Jan 15, 2013 | 47 comments


I had a project where I used an xml column[1] in MSSQL2008, since we had variable-schema data and didn't want to store it vertically (one row per value), and definitely didn't want to alter the schema on the fly. While I wouldn't necessarily recommend doing what I did, I don't see anything wrong with it and it solved that specific problem rather elegantly. We got the data out just fine with XQuery.

Believe it or not it was extremely fast...but only after playing with the XML format for a while and making sure the indexes all fit nicely. You can, almost surprisingly, index specific XML fields inside of xml data-type columns[2].
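
Roughly what that looks like, for anyone curious (a minimal Python/pyodbc sketch; the table, XML column, paths, and connection string are all invented stand-ins, not the actual schema from that project):

    import pyodbc  # assuming an ODBC driver for SQL Server is installed

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=.;DATABASE=Sandbox;"
        "Trusted_Connection=yes"
    )
    cur = conn.cursor()

    # A primary XML index (plus a PATH secondary index) is what makes point
    # queries into the XML column fast. The table needs a clustered primary key.
    cur.execute("CREATE PRIMARY XML INDEX pxi_Props ON dbo.Items(Props)")
    cur.execute(
        "CREATE XML INDEX sxi_Props_Path ON dbo.Items(Props) "
        "USING XML INDEX pxi_Props FOR PATH"
    )
    conn.commit()

    # Pull a typed scalar back out with XQuery's value()/exist() methods.
    cur.execute(
        "SELECT Id, Props.value('(/props/color)[1]', 'varchar(20)') AS Color "
        "FROM dbo.Items "
        "WHERE Props.exist('/props[color=\"red\"]') = 1"
    )
    for row in cur.fetchall():
        print(row.Id, row.Color)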

I can almost hear you all cringing after reading that :)

[1] http://msdn.microsoft.com/en-us/library/ms190936(v=sql.90).a... [2] http://msdn.microsoft.com/en-us/library/ms191497(v=sql.90).a...


We had an entire system based on SQL Server's "FOR XML" output clause and XML data islands in IE. It was an unmitigated fucking disaster zone that took 5 years to get rid of.

That put us off the above pretty sharpish.


FOR XML and XML data type (e.g. XML column) are pretty much unrelated.


Yep but they share one feature: shitty idea.


I was on the dev team that put together the XML data type support in SQL Server. We had bigger dreams back then, and I still do. :)


I don't think you should feel bad about that.

Though if you were using Postgres instead of MSSQL, you would probably consider JSON (or HSTORE) instead of XML since Postgres 9.2 has a built-in JSON column type.


Postgres has had an XML type for a lot longer than that. There's also better built-in support for XML than JSON because it's been in there longer.


For better or worse it's only dead in the same sense that the "relational model" is dead - as a research topic. It's certainly alive in the real world where it's more properly used as an interchange format. E.g. The IRS's MeF[1]. It's also finding interesting adoption in other industries[2].

1- http://www.irs.gov/uac/Modernized-e-File-(MeF)-Program-Infor...

2- http://www.service-architecture.com/xml/articles/oil_and_gas...


Good. XML for data storage was always a bad idea and I (and others, http://www.dbdebunk.com/) have been ranting about it since the beginning. XML is verbose, hard to parse, and enforces a rigid hierarchy even if your data model doesn't need one.

Presumably XML is good for "interchange" but I've never found that to be the case. JSON seems to do just fine. :)


Don't forget that XML came around 5-10 years before JSON got anywhere.

XML changed things, because it shifted focus from syntax and parsing to the actual meaning of the document. It's easy now to say XML sucks, but having a standard format (that was self-documenting, too) that _everyone_ jumped on was a major leap.

Go look at RFCs for HTTP or SIP and see how idiotic it is to have to parse yet another custom format with all sorts of cutesy exceptions. XML eliminates that. Can it be done better? Now, sure.


Actually XML is a richer format than JSON. XML nodes have both attributes and values, whereas JSON has values only. And lots of databases are "document-oriented": http://en.wikipedia.org/wiki/Document-oriented_database and what matters in the comparison is parsing performance and size. JSON is just more compact, but not better in all aspects. I think XML is an excellent format, especially when size matters less than the "dimensions" of the presented information. I think it's a perfect format for RSS, for REST APIs, and for web pages.
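
A small stdlib-Python illustration of the attributes-vs-values point (the record is made up):

    import json
    import xml.etree.ElementTree as ET

    # XML keeps metadata (attributes) separate from content (text and children).
    book = ET.fromstring('<book id="42" lang="en"><title>Dune</title></book>')
    print(book.get("id"), book.get("lang"), book.findtext("title"))

    # JSON has no such distinction; metadata and data share one namespace,
    # so ad-hoc conventions like "@id" get invented per application.
    data = json.loads('{"@id": "42", "@lang": "en", "title": "Dune"}')
    print(data["@id"], data["@lang"], data["title"])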


XML seen in the wild rarely has good separation of attributes and values. It can be useful to distinguish data and metadata, but it’s more expressivity than perhaps most applications need.


> Presumably XML is good for "interchange" but I've never found that to be the case. JSON seems to do just fine. :)

It's not; there are also many rants about using it as a data interchange format, 'cause pretty much everything, from S-expressions to JSON, is better at this task.


I prefer S-expressions to JSON.


I don't know if this was your intent, but reading your comment there seems to be an implication that XML is no longer a good fit for anything, really.

But from the article:

"When XML was originally conceived, it was meant for document formats. And by that standard… boy! did it succeed! Virtually all word processing and e-book formats are in XML today."

Document formats are a large, important area of technology. I suppose each of us can decide if that means XML's glass is half-full or half-empty.


XML wasn't a good fit for databases or configuration files, but it was basically inevitable that it would wind up used in those capacities. Playing around with new technology and misusing it in fun ways like that is an important part of the growth of a new technology--that's how we learn what the strengths and weaknesses of a technology are.


For any non-flat configuration file, XML is far from a bad choice. It obviously has its issues, and often YAML or something less syntactically burdensome might be better, but I'd need to hear a convincing argument before I would say that it's always the "wrong" choice for configuration files.


It's not always wrong, but you have to weigh the advantages against the disadvantages. If you want mechanical editing or syntactic validation it's an advantage. I don't like hand-editing launchd configuration files, but it's helpful that software can ship with launchd configuration files to be installed and it all just works. On the other side of the coin, if there's a lot of scripting going on in the configuration file, the XML will either become burdensome to write or it will wind up carrying around big strings with their own syntax buried within the elements, circumventing the whole point of using XML.


> XML wasn't a good fit for databases or configuration files

Except where the database is used to index a collection of XML data. E.g. we store dependency graphs (the output of a natural language parser) in an XML database (Berkeley DB XML), which allows us to quickly select a subset of interesting parses using XPath queries.

Sure, we could use a different storage format and query language. But there are good XML databases and XPath is well-known and standardized. The actual XML is normally only touched by machines.
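
A sketch of that kind of selection, using lxml and an invented dependency format rather than Berkeley DB XML itself:

    from lxml import etree  # assuming lxml is available, for full XPath support

    # A toy dependency parse; real parser output is much richer.
    parse = etree.fromstring(
        '<sentence id="s1">'
        '  <node word="cat" rel="su" head="sleeps"/>'
        '  <node word="sleeps" rel="root" head=""/>'
        '</sentence>'
    )

    # Select nodes acting as subjects -- the kind of XPath query we run
    # against the whole collection inside the XML database.
    for n in parse.xpath('//node[@rel="su"]'):
        print(n.get("word"), "->", n.get("head"))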


I should be more clear: XML isn't a good fit for all databases, but it should be understood that if you're mainly storing and querying XML, an XML database ought to be a better choice.


How did the trend develop, especially for things like configuration data?

Some warped sense of ease from using Java and .NET libraries to read / write XML?


XML is still a good solution for human-readable data. JSON, for instance, doesn't allow comments because of some dogmatic idea of the author. The author of JSON even goes as far as suggesting that you use a non-JSON parser to strip out comments before parsing. In other words, you're not really using JSON for configuration files that need comments.[1]
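
That "strip the comments first" workaround ends up looking something like this (a naive Python sketch that only handles whole-line // comments; it would break on // inside string values):

    import json

    config_text = """
    // database settings
    {
        "host": "localhost",
        // bump this if the pool runs dry
        "pool_size": 10
    }
    """

    # Drop whole-line // comments, then hand the rest to the real JSON parser.
    stripped = "\n".join(
        line for line in config_text.splitlines()
        if not line.lstrip().startswith("//")
    )
    print(json.loads(stripped)["pool_size"])  # 10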

XML is also very ubiquitous in library support, having been around for over 15 years.

I think the most negative reaction to XML is the extreme overuse for things like component configuration, where there are no defaults and every property and class name must be specified. Also, the misguided tag closing (<tag>content</> would have sufficed) adds to the verbosity.

1: https://plus.google.com/118095276221607585885/posts/RK8qyGVa...


God knows something to ease the closing tag verbosity would make XML at least twice as useable. I wonder if there's anything in the grammar that would conflict with </> as a closing tag? I'm almost tempted to muck with libxml and add it for personal use.


A personal dialect of XML? Wouldn't you be better off making your own grammar from scratch?


If I installed it at the system level, then anything that linked against libxml/libxml2 would have the enhancement. Which is also the biggest reason I'm not going to do it...I'd invariably end up distributing XML with the sugarfied close tag.


> XML is still a good solution for human-readable data

Urgh... maybe the data I have had the misfortune of working with are not exercising the proper options. The ratio of metadata to data always seems too high.

Overuse is likely the cause of my dislike.


As someone who wrote an XML config file for a project way back when, I can provide a few answers.

For me, it was either use an off the shelf parser or write my own. The config files at the time were all like apache's config - various combinations of weird syntax.

XML came with parsers off the shelf and gave you some validation up front. If there were other tools at the time available, I probably hadn't found them, and XML had a lot of buzz so it was easy to find.

These days all the libraries have parsers for a variety of easy to use data formats, from JSON and YAML to Protobufs and ini files.


"Warped"? Java didn't exactly have built-in JSON libraries in 1999. The plethora of alternatives manifested largely as a response to the complexity and weight of using an XML parser. Prior to XML you mostly rolled your own parser by hand, use lex/yacc or (if you were using Lisp) used s-exps. Compared to the first two options XML still has a number of meaningful benefits: uniform parsing across languages, uniform validation, and proper handling of encodings. Those aren't anything to sneeze at. "Unicode everywhere" is a fairly new phenomenon (and XML is partly to thank for that), even if Unicode isn't. And different Lisps handle encodings differently as well.


I chose the term warped because at some point it seemed like XML was used for _everything_; in many use cases the problem was warped to fit an XML solution. For example, simple configuration data made less readable by a poor fit of XML, and the sheer number of such cases.

One of my favorite sayings is still "XML is like violence. If it does not solve the problem, you are not using enough."


My favorite is "XML is something you inflict on others, never yourself."


No. XML was the first serialization language with widely available drop-in libraries at all - it essentially pioneered that concept. As such, it was a freaking Godsend in the late 90's or so. Believe me, the ability to just link in expat and never have to worry about data serialization from C was so liberating you can't imagine it. (To me, anyway.)
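
That same expat is still a few lines away in Python's standard library today, which gives a feel for what "just link in a parser" bought you (the element names here are arbitrary):

    import xml.parsers.expat

    doc = '<order id="7"><item sku="ABC" qty="2"/><item sku="XYZ" qty="1"/></order>'

    def start_element(name, attrs):
        # expat calls this for every opening tag; attrs is a plain dict.
        print("start", name, attrs)

    def end_element(name):
        print("end", name)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.EndElementHandler = end_element
    parser.Parse(doc, True)  # True marks the final chunk of input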


What I am missing from this article is what replaces XML for databases.

An XML database is just a document store, and that idea has not died. You can s/XML/JSON/g and you'll see that the ideas and research are still very relevant today.

XML databases are decidedly 'NoSQL'. MongoDB is a good example of a straight port of the document store idea to JSON.

MongoDB would be a lot richer if it were to support XQuery in some form, rather than their awkward json querying system.
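
To make the comparison concrete, the same selection both ways (the XPath side runs with lxml; the MongoDB filter is shown as a comment and assumes one document per book):

    from lxml import etree  # assuming lxml is available, for full XPath support

    library = etree.fromstring(
        '<library>'
        '  <book year="1965"><author>Herbert</author><title>Dune</title></book>'
        '  <book year="1984"><author>Gibson</author><title>Neuromancer</title></book>'
        '</library>'
    )

    # XPath/XQuery style: structure-aware and standardized.
    print(library.xpath('//book[author="Herbert"]/title/text()'))  # ['Dune']

    # Roughly the equivalent MongoDB filter, assuming one document per book:
    #   db.books.find({"author": "Herbert"}, {"title": 1})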

I think there is a discrepancy between the academic world which likes the orderly nature of XML and the pragmatic programmers world who like the terseness of JSON.


Couchbase also stores documents in JSON (and uses JavaScript for "views"). And it's one of the best in performance.


I used an XML DB in a project back in the early 2000s. The database was populated by running an in-database XQuery that pulled data across the webs in XML. When people visited a web site, XML was queried back out, then XSLT translated it to (drum roll) XHTML. I even experimented with XForms to capture data from visitors, which had the benefit of directly using the DB schema to validate data. All in all, it was pretty nifty when you got all the pieces working together properly. Sort-of XRX without the R.
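
The query-then-transform step looks roughly like this with lxml (a toy document and stylesheet standing in for the real ones):

    from lxml import etree  # assuming lxml is available, for its XSLT support

    doc = etree.fromstring('<person><name>Ada</name></person>')

    xslt = etree.fromstring(
        '<xsl:stylesheet version="1.0" '
        '    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">'
        '  <xsl:template match="/person">'
        '    <html><body><h1><xsl:value-of select="name"/></h1></body></html>'
        '  </xsl:template>'
        '</xsl:stylesheet>'
    )

    transform = etree.XSLT(xslt)
    print(str(transform(doc)))  # serialized (X)HTML result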

As to reasons for their "death": DTDs and XSDs were a serious PITA. XML data types didn't map closely to standard relational DB types; this wasted space (which was costly at the time), and made it difficult to interact with a relational DB when that was necessary. The XQuery query planner was braindead. That was really the nail in the coffin.

All in all, they're just document stores, and those are more popular than ever now with the whole NoSQL thing. As the author mentions, XML seems to have won out as a document format... I wonder what sort of DB we could put all those documents into. One that could support storing those documents without a bunch of translating to other formats, and had some method of running queries over them (preferably in a standardized way), and output the data in a similar format. Too bad XML databases are dead, one of those would have worked pretty well.


I will say I don't like it for online storage of any kind, but it's pretty hard to beat XML as an offline interchange or archival format.

Case in point: we can dump an entire client's data from our system into a single gzipped XML blob straight from the ORM. This can also be restored straight back into the ORM as well with very little code.

This means we can switch database engines, do snapshots, backups, manage on site deployments and all sorts of nice things very easily.
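
A stripped-down version of that dump/restore path in Python's standard library (the element layout is invented; the real thing serializes straight from the ORM):

    import gzip
    import xml.etree.ElementTree as ET

    # Dump: serialize one client's data into a single gzipped XML blob.
    root = ET.Element("client", name="acme")
    ET.SubElement(root, "invoice", number="1001", total="99.50")
    ET.SubElement(root, "invoice", number="1002", total="12.00")

    with gzip.open("acme.xml.gz", "wb") as f:
        ET.ElementTree(root).write(f, encoding="utf-8", xml_declaration=True)

    # Restore: read it back and hand the records to whatever needs them.
    with gzip.open("acme.xml.gz", "rb") as f:
        restored = ET.parse(f).getroot()

    for inv in restored.findall("invoice"):
        print(inv.get("number"), inv.get("total"))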


Remember all the XML programming languages? I recall Water http://waterlanguage.org/, for example, getting some buzz in the Boston area during the buzzy time Daniel Lemire refers to...


Ugh I remember that. What an utter vomit-bag of a creation that was.

You can see how popular it was considering their web site was (c) 2001 - 2003...


Nice observation:

"I initially wanted to write an actual research article to examine why XML for databases failed [but the article would] be unpublishable because too many people will want to argue against the failure itself. This is probably a great defect of modern science: we are obsessed with success and we work to forget failure."




> Storing data in XML for long-term interoperability is an acceptable use of XML.

What?


Not sure it would be my first choice either, but that's not a crazy statement. In a couple of decades, I'm confident there will still be well-maintained XML libraries for every popular platform.


They just evolved into document databases like Mongo and CouchDB. Same thing is happening in the transport world. XML started the shift, JSON is continuing it.


It's bad for data storage, but it's great for passing hierarchical parameters to stored procedures. It's also great for retrieval of hierarchical data.

Storing the XML in a column, that's just dumb. If I did that I would only ever treat it as a binary blob, same as any file. When you start using the db's XML query support to query that column, that's when you get into trouble and need to rethink your design.


If we exclude stuff like OLAP and spatial databases, there are only 3 types of database: relational, pointer and logical.

Relational means something that accepts Codd's constraints, more-or-less. Pointer is something that doesn't, and which therefore easily allows hierarchies.

Logical is like relational, expanded to allow recursive queries, Prolog-style, which is a high-level way to have hierarchies and Codd's constraints, but you pay for it in terms of query performance.

The thing is, pointer databases have been reinvented again and again: first as the mainframe COBOL databases of the '70s, then as the filesystem, then as the Windows registry, then as OODBMSs, then as XML databases, and now as NoSQL.

The other thing is, Postgres is designed to cater for all 3, relational, pointer and logical.


XML is fine if used where it fits.

Use the best tool for the job and all that.


> Use the best tool for the job and all that.

I don't know of anyone advocating that something other than the best tool for the job should ever be used. The question is whether a particular tool is good for a particular job or not.



