XPaths are extremely useful. I actually enjoy writing them, much like I enjoy writing regular expressions. In fact, I consider both to return manifold the modest investment they require to learn well.
I feel the same way. I frequently use them, and find them a lot easier to use than regexes. My site even uses regexes a lot. [ theexceptioncatcher.com ]
xpath is awesome, especially once you understand what an axis is.
And that's what I've found most people who have trouble with it don't understand: what exactly following-sibling or child means.
I spent about 2 months writing my own xpath evaluator once and it gets so much easier (to implement too) when you understand this is just a tree-traversal with an iterator following the axis.
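A toy sketch of that insight (in Python; not the evaluator mentioned above, and `Node` plus the axis functions are illustrative names only): each axis is just an iterator that walks the tree in one direction from a context node, and evaluating a step means composing those iterators.

```python
class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

# Each axis is just a generator walking the tree in a particular
# direction from a context node.
def child(node):
    yield from node.children

def following_sibling(node):
    if node.parent is not None:
        siblings = node.parent.children
        yield from siblings[siblings.index(node) + 1:]

def descendant(node):
    for c in node.children:
        yield c
        yield from descendant(c)

# A <p> with a <br> followed by two siblings:
root = Node('p', [Node('br'), Node('b'), Node('i')])
first_after_br = next(following_sibling(root.children[0]))
```

Predicates and node tests then become filters applied to whatever the axis iterator yields.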
Unfortunately the axis syntax makes it very verbose to read.
W3C is full of terrible standards: the verbose DOM, the obtuse XML Schema, the crippled CSS (you can't have a variable), and others. XPath isn't one of them. It is the best way to query XML documents in a forward-compatible way. Maybe someday we will be able to use XPath in a CSS file instead of their crazy selectors.
If you think 11 lines of code is a lot, you're overly focused on concision at the expense of readability. I've never (read: never) worked on any Ruby code, yet I find the posted example more readable than the supposedly more valuable xpath.
At the very least, they're the same. If you're writing code in Ruby, 11 lines is nothing. If you're writing code in Ruby and xpath is used nowhere else in the project, that single line of super-compact xpath might as well be 1000 lines of Ruby -- it doesn't matter.
If you're trying to compact 11 lines of code you're probably doing it wrong.
I may or may not agree with you- but one could make the argument that xpath is likely heavily tested and proven, and will handle unexpected corner cases that arise in the future, which those 11 lines of ruby will not. In that case, using xpath lowers the amount of future headaches with that code.
After a great article on what looked to be a handy tool, this part disappointed me:
> for this particular task, XPath is actually considerably slower than the pure-Ruby implementation. Interestingly, that's not true if you take out the <br> part and only look for text at the beginning of paragraphs. My guess is that the following-sibling axis is the culprit, since it has to select all the following siblings of the br tags, and then filter them down to only the first sibling.
I was hoping selectors were lazy, in which case, selecting all the following siblings but then immediately filtering that selection down to the first would be cheap. Lazy or not, can there really be no efficient way to do the equivalent of jQuery next()?
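For what it's worth, the cheap equivalent of jQuery's next() is easy to write by hand. A sketch with Python's stdlib ElementTree (where elements don't carry parent pointers, so the parent is passed in explicitly; `next_sibling` is a hypothetical helper, not a library function):

```python
import xml.etree.ElementTree as ET

def next_sibling(parent, elem):
    # Look at only the one element that follows, instead of selecting
    # every following sibling and filtering them down afterwards.
    children = list(parent)
    i = children.index(elem)
    return children[i + 1] if i + 1 < len(children) else None

p = ET.fromstring('<p>one<br/><b>two</b><br/><i>three</i></p>')
first_br = p.find('br')
after_br = next_sibling(p, first_br)
```

Whether `following-sibling::*[1]` compiles down to something this cheap is entirely up to the XPath implementation.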
On their own, .NET's XML libraries are really only good for consuming XML documents, but even that is a rather painful experience, especially as they force a namespace on all documents, complicating the XPath expressions necessary to query them. Actually authoring documents is a nightmare. My XmlEdit project makes it almost as simple as key-value-pair config files.
Well, I wrote this off of the top of my head, and it has been several years since I've used the library heavily (though I have a project now that needs it, so I most likely will be dusting it off and fixing any hairy bits).
However, just take note that the main concept of the library is "make it work". The idea was that, given an XPath expression with several attribute selectors, it would fill in any necessary nodes to just make it happen. So you can technically chain a ton of editing commands together, by using an appropriately complex XPath expression.
It came out of a need to repair thousands of broken XML documents. It's probably not very complete. It was written for one project, and though I took time to generalize it, it never took on a key role in any other project; I just never again had the need to deal with XML documents on such a scale.
It's actually one of the first "big" things I wrote out of college. I'm not too happy with some of the design right now, but the functionality has held up over the years and it's not as shitty as some of the other code I wrote at the time. I guess I knew that a lot of the project was hinging on how easy it was to write XML documents, so I made sure I did a ton of testing to make it work.
> But it gets more interesting if the lyrics are stored as an HTML fragment.
Is there any reason to store the HTML version with <p>s and <br>s instead of a plain text and converting it to HTML with simple rules à la markdown? (single line break = <br>, double line break = <p>)
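Those two rules are simple enough to sketch in a few lines of Python (a hedged illustration of the conversion described above, not how the site actually stores or renders lyrics):

```python
import re

def lyrics_to_html(text):
    # A blank line separates paragraphs; a single newline becomes <br>.
    paragraphs = re.split(r'\n\s*\n', text.strip())
    return ''.join(
        '<p>' + p.replace('\n', '<br>') + '</p>' for p in paragraphs)

html = lyrics_to_html('line one\nline two\n\nnext verse')
```

(Real input would also need HTML-escaping before this step, which is omitted here.)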
yeah, this is super surprising considering their business of annotating individual lines; I'd be storing the lines in a list and caching generated html, not storing the tags in with them. it's kind of weird.
At least for me it was interesting when considering the aspect of finding/modifying stuff in data you cannot control yourself. For instance crawling, indexing or retrieving data in an unspecified format.
Well, at a minimum, it saves the processing time required to format the text, which lessens the server cost of each page hit. It's a small optimization, but when the vast majority of the users are just coming to the site to read text, I'd imagine it would save a lot of CPU time.
I also like XPath for some purposes, but I think it really suffers from (IIRC) having been designed before the xml namespaces, which it only integrates very awkwardly IMO and which ruins the simplicity of XPath. Or maybe XML namespaces spoil everything they affect to some degree :)
It's the latter (XML namespaces spoil everything).
In every single task I do that involves munching on XML with XPath, the largest timesink is figuring out how the namespaces need to be set up. (Looking at you, PHP SimpleXML.)
> In every single task I do that involves munching on XML with xpath
And more generally that's true of every single task involving munching on namespaced XML. Namespaces are a good idea implemented absolutely terribly.
XPath is a good idea well-implemented (no, XPath 2 does not exist, there is only one XPath). One of the few I've found in XML-land. I still hate that we have to use CSS selectors rather than XPath (although that's understandable considering CSS selectors predate XPath), most of the improvements since CSS1 were in XPath day 1, and the rest (pseudo-classes) could probably have been implemented using functions.
Also, that might have finally gotten us a non-eye-stabbing standard function for "match any item of a space-separated list in an attribute" (matching HTML classes in XPath without custom helpers is the worst)
Whereas xpath... does not. Which is a severe understatement, considering the equivalent of the CSS selector you wrote up (or of `.foo`) in xpath 1 is something along the lines of:
//*[contains(concat(' ', normalize-space(@class), ' '), ' foo ')]
The normalize-space can be dropped iff you're certain all spaces are already normalized; the spaces around the needle cannot.
xpath 2 does quite a bit better through `tokenize`:
//*[tokenize(@class, '\s+')='foo']
but still not great. And god forbid you need to match multiple classes in the same selector.
[0] or xpath 1 + exslt if your xpath implementation provides it. exslt actually does slightly better as the pattern is optional and defaults to whitespace characters
Over in .NET land we appear to be stuck on XPath 1.0 forever. A project I used to work on used it extensively, but I now use the HtmlAgilityPack (badly formatted HTML) or XDocument (XHTML strict or XML) where I have the choice.
> Over in .NET land we appear to be stuck on XPath 1.0 forever.
I don't think it's a bad idea, most of the improvements in XPath 2 are the new standard functions which depending on your XPath implementation may be available as extensions (e.g. tokenize comes from exslt, an xpath 1.0 library) but along with that it brings significantly higher complexity and I think the spec has gone from "difficult to read" to "meaningless word-salad".
I really like XPath, but I can't say I was impressed by XPath 2, it loses much of xpath's simplicity with little to show for the added complexity.
Yep, very true. Depending on what you are working on, an alternative approach is to strip out the namespaces before querying anything. Obviously this can get you in trouble, and while namespaces are there for a reason, they can be a PITA when working with XML.
I wonder why libraries make it harder to combine XPath and namespaces. For example, in Nokogiri (libxml2 bindings for Ruby, and probably the de facto XML library in Ruby), you have to do:
I always end up declaring the namespaces in a constant and passing it in, because there is no way to specify your mappings globally. It should have been something like:
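Python's stdlib ElementTree ends up with the same idiom: query methods take an explicit prefix-to-URI mapping, and since there's no global registry there either, the mapping usually lives in a module-level constant:

```python
import xml.etree.ElementTree as ET

# The prefix -> URI mapping has to be threaded through every query.
NS = {'dc': 'http://purl.org/dc/elements/1.1/'}

doc = ET.fromstring(
    '<root xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<dc:title>Hello</dc:title>'
    '</root>')

title = doc.find('dc:title', NS)
```

The prefix in the query ('dc' here) only has to match your mapping, not whatever prefix the document happens to use, which is one of the few mercies of the design.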
My impression when people try to traverse XML while ignoring its namespaces is that they often don't really understand why the namespaces are there in the first place, which indicates to me that they don't really understand the data they're processing, which leads me to believe that any effort they may undertake to extract specific pieces of their document is doomed from the start.
"This is a perfectly reasonable solution, but it's a whopping 11 lines of code. Further, it feels like we're using the wrong tool for the job: why are we using Ruby iterators and conditionals to get at DOM nodes?"
Is it really that bad to have 11 lines in Ruby?
Initially I didn't get the wrong tool part but after reading it all that did make more sense. I haven't used XPath more than a few times and they were pretty simple so can't complain. Just something I'll have to keep in mind.
One problem with XPath is that it can be a lot slower than native or JIT'd code, depending on the implementation. Interestingly enough, you can do XPath-like things in Scala with native code using pattern matching:
But as far as the OP, this seems like a case of worrying about the code instead of the data structure. This would be easier to address before the lines are transformed into HTML. Which I assume is not how they are stored.
Actually, they are! We host many document types on RG, not just song lyrics, so the “lyrics” field of a song is just a specific case of the general “body” field of a text. And texts can definitely have rich formatting via HTML markup.
XML gets a bad rap because of how much prettier JSON is, but there are a lot of cool tools associated with it. XPath is pretty awesome, I had to write a XPath parser/executor once (for a class) and it made me appreciate the value and simplicity.
Then there was XSLT, which was a pretty sweet way to turn a data format into a variety of "print" or "display" formats. Definitely been replaced by bigger and better things but it's a pretty awesome technology that does one thing really well.
XSLT is a turing-complete programming language. Really, the moment your source XML has a slightly different structure from the target output, XSLT files become monsters.
Programming in XML is never a good idea. It isn't in XSLT, it isn't in Spring, it isn't in Maven. Anything that's XML and has elements or attributes with names like "if", "else" or "while", something went horribly, horribly wrong somewhere. It's horribly verbose, you can't reasonably debug it, and there's virtually no engineering best practices, which results in near-impossible maintenance tasks.
Any modern programming language with a good, concise XML parsing library is a more effective tool than XSLT for transforming XML into something else.
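For instance, the classic "turn records into an HTML list" stylesheet is a few lines of ordinary code with Python's stdlib (the element and attribute names here are made up for illustration):

```python
import xml.etree.ElementTree as ET

src = ET.fromstring('<songs><song title="One"/><song title="Two"/></songs>')

# The kind of job often handed to an XSLT stylesheet: map each
# source element onto an output element.
ul = ET.Element('ul')
for song in src.findall('song'):
    li = ET.SubElement(ul, 'li')
    li.text = song.get('title')

html = ET.tostring(ul, encoding='unicode')
```

You get real variables, real debugging, and real tests for free, which is the heart of the argument above.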
Then where do you draw the line between XML as data vs. XML as code (which seems to be bad) and sexpr as data vs. sexpr as code (which seems to be good)?
Just curious; it's just syntax, after all, and I think that XSLT really has an advantage in transforming XML to other XML (or text) compared to a program in other languages.
Just curious: what are the "bigger and better things" that have replaced XSLT? From what I can tell, it's still being used quite a lot, especially in the world of structured documentation. For something like transforming massive amounts of XML with a great amount of structural variety to another format, XSLT would certainly be my first choice.
I'd say a comparable new piece of technology is AngularJS + HTML + CSS with JSON as the data format. In the end you're still transforming an easily-exchanged data format to a visual display, but with all of the new things that come with Web 2.0 and the thick client model.
It still can't turn a pile of XML into a pile of PDFs though, so XSLT is definitely king in some arenas.
I prefer to use HTML parsers for such problems, such as BeautifulSoup in Python. I used XPath in the past, but the resulting code wasn't that much shorter than a more verbose version based on BeautifulSoup. And, for someone looking at the code, the BeautifulSoup version makes so much more sense. XPath also feels like a big regex that magically works.
I'm not saying it's not useful. Actually, I believe that if you only have one use-case, then using xpath might be overkill because of all the added-complexity of maintaining a new library/technology/ideology. But if it's the sort of domain that xpath would be useful more than once, then sure use it.
Oh no, you've used the performance argument against me ;-) Obviously, when performance issues are on the line, you often need to trade simplicity and maintainability.
Or you could use lxml in Python, which allows you to mix XPath and CSS Selectors. E.g.:
from lxml import html
doc = html.fromstring('<html><body><p class="text"></p></body></html>')
doc.xpath('//p[@class = "text"]') == doc.cssselect('p.text')
# or
doc = html.fromstring('<html><body><p>1</p><p>2</p><p>3</p></body></html>')
doc.xpath('//p[2]')[0] == doc.cssselect('p')[1]
# Note: My only annoyance is that .xpath() always returns a list, even
# when you know that it will return only a single item.
I just started getting into xpaths pretty hardcore with my trivia generator for http://playhattrick.com ... I use it for identifying tables of data to scrape. It's not as fun as regex IMO but it is powerful.
Pro Tip: the chrome inspector lets you right-click on an element and get its xpath.
Pro Warning: sometimes the xpath generated by chrome doesn't work when scraping with Nokogiri. I'm not sure why yet, I've just learned not to rely on it.
There's not "an" XPath for an element, as XPath describes the route you take from the root of a document to the element in question. The correct route for your situation depends on your use case.
Describing an element as "the first child of the fifth child of the second child of the first child of the eighth child of the second child of HTML" is as much the right path to an element as if you described the way to your house as "Walk past the park then walk past the bus stop then walk past the hardware store then walk past the butchers then turn left then walk past the pizza shop then walk past the library"
And I'm not disagreeing with you; I'm only saying Chrome has this feature. I don't know what route they choose for you, but I know they don't always work in tools that parse (HT/X)ML.
A problem with XPath is that many tools don't actually support the latest spec, or even a spec, and just give you a bastardised syntax that looks like XPath and walks like XPath but will never swim like XPath. This is frustrating because you're never sure which features you can actually rely on.
Interesting, (and this could just be a made up problem to illustrate the blog post), but wouldn't it have been much easier to just store the lyrics in another format (not HTML)?
For example, you could use TEI XML (http://www.tei-c.org/index.xml), and then use stanzas and lines. Then when you go to render your lyrics, you can capitalize the first letters in your presentation code.
XPath is alright but as sixbrx noted it suffers from problems with namespaces.
I keep using this xslt transformation to remove the ns info. Then it works just fine: http://stackoverflow.com/a/413088/34022
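The same namespace-stripping trick works without XSLT. A small sketch with Python's stdlib ElementTree, which stores a namespaced tag as '{uri}localname' (the helper name is made up; same caveats as above about losing information):

```python
import xml.etree.ElementTree as ET

def strip_namespaces(root):
    for elem in root.iter():
        # Comments/PIs have a non-string tag; skip them. Note this only
        # strips element namespaces, not namespaced attributes.
        if isinstance(elem.tag, str) and '}' in elem.tag:
            elem.tag = elem.tag.split('}', 1)[1]
    return root

doc = strip_namespaces(ET.fromstring('<a xmlns="urn:example"><b/></a>'))
found = doc.find('b')  # a plain, un-namespaced query now works
```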
This is fine, I guess, and it's clearly something people want to do based on how often it gets asked on Stackoverflow, but there's a reason XML zealots get snarky when people ask how to do this. Some questions you might want to ask yourself:
- Why are the namespaces there in the first place?
- Do I really not care if the element is found in a namespace other than the one expected?
- Does my host environment have a way to specify the namespace of the element I want to find (hint: it probably does)?
- Is the reason that I want to remove the namespace that it's actually something I need to do or is it that I am ignorant of the method for specifying namespaces in my host environment?
The question is usually "is the author of this XML actually using namespaces in a reasonable manner?"
And the answer is usually "no".
I have seen SOAP responses with 20+ namespaces, all of them being essentially implementation details -- every different section of their internal API getting its own namespace. Inevitably, the elements are also prefixed in a way that makes them distinct, or wrapped in a distinguishing element (i.e. Contact/NameInfo/FirstName rather than FirstName xmlns="contact-name").
In situations like that, your best case scenario is that you do the grunt work of setting up aliases for all the namespaces, putting them into your XPaths, and you're done. The worst case scenario (which I've encountered) is when a version update of the API changes the URIs for half the namespaces, even though the structure of the data hasn't changed. In a case like that, you're actually penalized for doing the 'right' thing and not just stripping the damn things off.
CSS selectors are much easier to remember than XPath. Python's BeautifulSoup allows you to select elements with selectors and is very convenient. XPath is a bit more verbose and most people already are familiar with CSS syntax.
And indeed, any CSS selector can be converted to an equivalent XPath query, at least for selectors on XML and HTML. http://pythonhosted.org/cssselect/ is a Python implementation of such a conversion. (Note that there is no XPath-to-CSS-selector converter: XPath can express things CSS selectors cannot, since CSS selectors are designed so they can be matched by a streaming parser as soon as the first child of the element appears.)
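A few common equivalences, hand-written here rather than taken from cssselect's output, so treat the exact forms as illustrative:

```
p            ->  //p
div p        ->  //div//p
div > p      ->  //div/p
p[title]     ->  //p[@title]
p.text       ->  //p[contains(concat(' ', normalize-space(@class), ' '), ' text ')]
```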
I'd say more than a bit. When you have multiple namespaces in your xml it can become so verbose that it's hard to see the signal through the noise. But then, maybe there's a way to reduce that noise in a way that I don't understand.
Long comment short, I agree. CSS selectors are easier to understand and read.
Ah yeah, I understand what you mean now. XPath's `/` is something like a token that means "separate these two things". In CSS the separation between segments is the zero or more whitespace characters that live between the parts of the selector.
This is a great explanation and quick tutorial on XPath, but, like regex, don't think I'd ever use it in production code unless I absolutely had to.
I'm sure I'd have fun coming up with an XPath solution, but for me, the ultimate goal is maintainability. If I wasn't 90% sure that the next person to look at that code already knew XPath, then I'd go with the Ruby solution.
Dealing with 11 lines of code in a language you know is better than dealing with 1 line of code in a language you don't (which ends up forcing you to read 1000 lines of documentation and examples to understand it).
XPath : XML :: regex : text