XPaths are extremely useful. I actually enjoy writing them, much like I enjoy writing regular expressions. In fact, I consider both to return manifold the modest investment they require to learn well.
I feel the same way. I frequently use them, and find them a lot easier to use than regexes. My site even uses regexes a lot. [ theexceptioncatcher.com ]
xpath is awesome, especially once you understand what an axis is.
And that's what I've found most people who have trouble with it don't understand: what exactly following-sibling or child means.
I spent about 2 months writing my own xpath evaluator once and it gets so much easier (to implement too) when you understand this is just a tree-traversal with an iterator following the axis.
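A toy sketch of that insight (in Python; not the evaluator mentioned above, and `Node` plus the axis functions are illustrative names only): each axis is just an iterator that walks the tree in one direction from a context node, and evaluating a step means composing those iterators.

```python
class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

# Each axis is just a generator walking the tree in a particular
# direction from a context node.
def child(node):
    yield from node.children

def following_sibling(node):
    if node.parent is not None:
        siblings = node.parent.children
        yield from siblings[siblings.index(node) + 1:]

def descendant(node):
    for c in node.children:
        yield c
        yield from descendant(c)

# A <p> with a <br> followed by two siblings:
root = Node('p', [Node('br'), Node('b'), Node('i')])
first_after_br = next(following_sibling(root.children[0]))
```

Predicates and node tests then become filters applied to whatever the axis iterator yields.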
Unfortunately the axis syntax makes it very verbose to read.
W3C is full of terrible standards: the verbose DOM, the obtuse XML Schema, the crippled CSS (you can't have a variable), and others. XPath isn't one of them. It is the best way to query XML documents in a forward-compatible way. Maybe someday we will be able to use XPath in a CSS file instead of their crazy selectors.
If you think 11 lines of code is a lot, you're overly focused on concision at the expense of readability. I've never (read: never) worked on any Ruby code, yet I find the posted example more readable than the supposedly more valuable xpath.
At the very least, they're the same. If you're writing code in Ruby, 11 lines is nothing. If you're writing code in Ruby and xpath is used nowhere else in the project, that single line of super-compact xpath might as well be 1000 lines of Ruby -- it doesn't matter.
If you're trying to compact 11 lines of code you're probably doing it wrong.
I may or may not agree with you- but one could make the argument that xpath is likely heavily tested and proven, and will handle unexpected corner cases that arise in the future, which those 11 lines of ruby will not. In that case, using xpath lowers the amount of future headaches with that code.
After a great article on what looked to be a handy tool, this part disappointed me:
> for this particular task, XPath is actually considerably slower than the pure-Ruby implementation. Interestingly, that's not true if you take out the <br> part and only look for text at the beginning of paragraphs. My guess is that the following-sibling axis is the culprit, since it has to select all the following siblings of the br tags, and then filter them down to only the first sibling.
I was hoping selectors were lazy, in which case, selecting all the following siblings but then immediately filtering that selection down to the first would be cheap. Lazy or not, can there really be no efficient way to do the equivalent of jQuery next()?
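For what it's worth, the cheap equivalent of jQuery's next() is easy to write by hand. A sketch with Python's stdlib ElementTree (where elements don't carry parent pointers, so the parent is passed in explicitly; `next_sibling` is a hypothetical helper, not a library function):

```python
import xml.etree.ElementTree as ET

def next_sibling(parent, elem):
    # Look at only the one element that follows, instead of selecting
    # every following sibling and filtering them down afterwards.
    children = list(parent)
    i = children.index(elem)
    return children[i + 1] if i + 1 < len(children) else None

p = ET.fromstring('<p>one<br/><b>two</b><br/><i>three</i></p>')
first_br = p.find('br')
after_br = next_sibling(p, first_br)
```

Whether `following-sibling::*[1]` compiles down to something this cheap is entirely up to the XPath implementation.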
On their own, .NET's XML libraries are really only good for consuming XML documents, but even that is a rather painful experience, especially as they force a namespace on all documents, complicating the XPath expressions necessary to query them. Actually authoring documents is a nightmare. My XmlEdit project makes it almost as simple as key-value-pair config files.
Well, I wrote this off of the top of my head, and it has been several years since I've used the library heavily (though I have a project now that needs it, so I most likely will be dusting it off and fixing any hairy bits).
However, just take note that the main concept of the library is "make it work". The idea was that, given an XPath expression with several attribute selectors, it would fill in any necessary nodes to just make it happen. So you can technically chain a ton of editing commands together, by using an appropriately complex XPath expression.
It came out of a need to repair thousands of broken XML documents. It's probably not very complete. It was written for one project, and though I took time to generalize it, it never took on a key role in any other project; I just never again had the need to deal with XML documents on such a scale.
It's actually one of the first "big" things I wrote out of college. I'm not too happy with some of the design right now, but the functionality has held up over the years and it's not as shitty as some of the other code I wrote at the time. I guess I knew that a lot of the project was hinging on how easy it was to write XML documents, so I made sure I did a ton of testing to make it work.
> But it gets more interesting if the lyrics are stored as an HTML fragment.
Is there any reason to store the HTML version with <p>s and <br>s instead of a plain text and converting it to HTML with simple rules à la markdown? (single line break = <br>, double line break = <p>)
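Those two rules are simple enough to sketch in a few lines of Python (a hedged illustration of the conversion described above, not how the site actually stores or renders lyrics):

```python
import re

def lyrics_to_html(text):
    # A blank line separates paragraphs; a single newline becomes <br>.
    paragraphs = re.split(r'\n\s*\n', text.strip())
    return ''.join(
        '<p>' + p.replace('\n', '<br>') + '</p>' for p in paragraphs)

html = lyrics_to_html('line one\nline two\n\nnext verse')
```

(Real input would also need HTML-escaping before this step, which is omitted here.)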
yeah, this is super surprising considering their business of annotating individual lines; I'd be storing the lines in a list and caching generated html, not storing the tags in with them. it's kind of weird.
At least for me it was interesting when considering the aspect of finding/modifying stuff in data you cannot control yourself. For instance crawling, indexing or retrieving data in an unspecified format.
Well, at a minimum, it saves the processing time required to format the text, which lessens the server cost of each page hit. It's a small optimization, but when the vast majority of the users are just coming to the site to read text, I'd imagine it would save a lot of CPU time.
I also like XPath for some purposes, but I think it really suffers from (IIRC) having been designed before the xml namespaces, which it only integrates very awkwardly IMO and which ruins the simplicity of XPath. Or maybe XML namespaces spoil everything they affect to some degree :)
It's the latter (XML namespaces spoil everything).
In every single task I do that involves munching on XML with XPath, the largest timesink is figuring out how the namespaces need to be set up. (Looking at you, PHP SimpleXML.)
> In every single task I do that involves munching on XML with xpath
And more generally that's true of every single task involving munching on namespaced XML. Namespaces are a good idea implemented absolutely terribly.
XPath is a good idea well-implemented (no, XPath 2 does not exist, there is only one XPath). One of the few I've found in XML-land. I still hate that we have to use CSS selectors rather than XPath (although that's understandable considering CSS selectors predate XPath), most of the improvements since CSS1 were in XPath day 1, and the rest (pseudo-classes) could probably have been implemented using functions.
Also, that might have finally gotten us a non-eye-stabbing standard function for "match any item of a space-separated list in an attribute" (matching HTML classes in XPath without custom helpers is the worst)
Whereas xpath... does not. Which is a severe understatement, considering the equivalent of the CSS selector you wrote up (or of `.foo`) in xpath 1 is something along the lines of:
//*[contains(concat(' ', normalize-space(@class), ' '), ' foo ')]
The normalize-space can be dropped iff you're certain all spaces are already normalized; the spaces around the needle cannot.
xpath 2 does quite a bit better through `tokenize`:
//*[tokenize(@class, '\s+')='foo']
but still not great. And god forbid you need to match multiple classes in the same selector.
[0] or xpath 1 + exslt if your xpath implementation provides it. exslt actually does slightly better as the pattern is optional and defaults to whitespace characters
Over in .NET land we appear to be stuck on XPath 1.0 forever. A project I used to work on used it extensively, but I now use the HtmlAgilityPack (badly formatted HTML) or XDocument (XHTML strict or XML) where I have the choice.
> Over in .NET land we appear to be stuck on XPath 1.0 forever.
I don't think it's a bad idea, most of the improvements in XPath 2 are the new standard functions which depending on your XPath implementation may be available as extensions (e.g. tokenize comes from exslt, an xpath 1.0 library) but along with that it brings significantly higher complexity and I think the spec has gone from "difficult to read" to "meaningless word-salad".
I really like XPath, but I can't say I was impressed by XPath 2, it loses much of xpath's simplicity with little to show for the added complexity.
Yep, very true. Depending on what you are working on, an alternative approach is to strip out the namespaces before querying anything. Obviously this can get you in trouble, and while namespaces are there for a reason, they can be a PITA when working with XML.
I wonder why libraries make it harder to combine XPath and namespaces. For example, in Nokogiri (libxml2 bindings for Ruby, and probably the de facto XML library in Ruby), you have to do:
I always end up declaring the namespaces in a constant and passing it in, because there is no way to specify your mappings globally. It should have been something like:
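Python's stdlib ElementTree ends up with the same idiom: query methods take an explicit prefix-to-URI mapping, and since there's no global registry there either, the mapping usually lives in a module-level constant:

```python
import xml.etree.ElementTree as ET

# The prefix -> URI mapping has to be threaded through every query.
NS = {'dc': 'http://purl.org/dc/elements/1.1/'}

doc = ET.fromstring(
    '<root xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<dc:title>Hello</dc:title>'
    '</root>')

title = doc.find('dc:title', NS)
```

The prefix in the query ('dc' here) only has to match your mapping, not whatever prefix the document happens to use, which is one of the few mercies of the design.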
My impression when people try to traverse XML while ignoring its namespaces is that they often don't really understand why the namespaces are there in the first place, which indicates to me that they don't really understand the data they're processing, which leads me to believe that any effort they may undertake to extract specific pieces of their document is doomed from the start.
"This is a perfectly reasonable solution, but it's a whopping 11 lines of code. Further, it feels like we're using the wrong tool for the job: why are we using Ruby iterators and conditionals to get at DOM nodes?"
Is it really that bad to have 11 lines in Ruby?
Initially I didn't get the wrong tool part but after reading it all that did make more sense. I haven't used XPath more than a few times and they were pretty simple so can't complain. Just something I'll have to keep in mind.
One problem with XPath is that it can be a lot slower than native or JIT'd code, depending on the implementation. Interestingly enough, you can do XPath-like things in Scala with native code using pattern matching:
But as far as the OP, this seems like a case of worrying about the code instead of the data structure. This would be easier to address before the lines are transformed into HTML. Which I assume is not how they are stored.
Actually, they are! We host many document types on RG, not just song lyrics, so the “lyrics” field of a song is just a specific case of the general “body” field of a text. And texts can definitely have rich formatting via HTML markup.
XML gets a bad rap because of how much prettier JSON is, but there are a lot of cool tools associated with it. XPath is pretty awesome, I had to write a XPath parser/executor once (for a class) and it made me appreciate the value and simplicity.
Then there was XSLT, which was a pretty sweet way to turn a data format into a variety of "print" or "display" formats. Definitely been replaced by bigger and better things but it's a pretty awesome technology that does one thing really well.
XSLT is a turing-complete programming language. Really, the moment your source XML has a slightly different structure from the target output, XSLT files become monsters.
Programming in XML is never a good idea. It isn't in XSLT, it isn't in Spring, it isn't in Maven. Anything that's XML and has elements or attributes with names like "if", "else" or "while", something went horribly, horribly wrong somewhere. It's horribly verbose, you can't reasonably debug it, and there's virtually no engineering best practices, which results in near-impossible maintenance tasks.
Any modern programming language with a good, concise XML parsing library is a more effective tool than XSLT for transforming XML into something else.
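For instance, the classic "turn records into an HTML list" stylesheet is a few lines of ordinary code with Python's stdlib (the element and attribute names here are made up for illustration):

```python
import xml.etree.ElementTree as ET

src = ET.fromstring('<songs><song title="One"/><song title="Two"/></songs>')

# The kind of job often handed to an XSLT stylesheet: map each
# source element onto an output element.
ul = ET.Element('ul')
for song in src.findall('song'):
    li = ET.SubElement(ul, 'li')
    li.text = song.get('title')

html = ET.tostring(ul, encoding='unicode')
```

You get real variables, real debugging, and real tests for free, which is the heart of the argument above.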
Then where do you draw the line between XML as data vs. XML as code (which seems to be bad) and sexpr as data vs. sexpr as code (which seems to be good)?
Just curious; it's just syntax, after all, and I think that XSLT really has an advantage in transforming XML to other XML (or text) compared to a program in other languages.
Just curious: what are the "bigger and better things" that have replaced XSLT? From what I can tell, it's still being used quite a lot, especially in the world of structured documentation. For something like transforming massive amounts of XML with a great amount of structural variety to another format, XSLT would certainly be my first choice.
I'd say a comparable new piece of technology is AngularJS + HTML + CSS with JSON as the data format. In the end you're still transforming an easily-exchanged data format to a visual display, but with all of the new things that come with Web 2.0 and the thick client model.
It still can't turn a pile of XML into a pile of PDFs though, so XSLT is definitely king in some arenas.
I prefer to use HTML parsers for such problems, such as BeautifulSoup in Python. I used XPath in the past, but the resulting code wasn't that much shorter than a more verbose version based on BeautifulSoup. And, for someone looking at the code, the BeautifulSoup version makes so much more sense. XPath also feels like a big regex that magically works.
I'm not saying it's not useful. Actually, I believe that if you only have one use-case, then using xpath might be overkill because of all the added-complexity of maintaining a new library/technology/ideology. But if it's the sort of domain that xpath would be useful more than once, then sure use it.
Oh no, you've used the performance argument against me ;-) Obviously, when performance issues are on the line, you often need to trade simplicity and maintainability.
Or you could use lxml in Python, which allows you to mix XPath and CSS Selectors. E.g.:
from lxml import html
doc = html.fromstring('<html><body><p class="text"></p></body></html>')
doc.xpath('//p[@class = "text"]') == doc.cssselect('p.text')
# or
doc = html.fromstring('<html><body><p>1</p><p>2</p><p>3</p></body></html>')
doc.xpath('//p[2]')[0] == doc.cssselect('p')[1]
# Note: My only annoyance is that .xpath() always returns a list, even
# when you know that it will return only a single item.
I just started getting into xpaths pretty hardcore with my trivia generator for http://playhattrick.com ... I use it for identifying tables of data to scrape. It's not as fun as regex IMO but it is powerful.
Pro Tip: the chrome inspector lets you right-click on an element and get its xpath.
Pro Warning: sometimes the xpath generated by chrome doesn't work when scraping with Nokogiri. I'm not sure why yet, I've just learned not to rely on it.
There's not "an" XPath for an element, as XPath describes the route you take from the root of a document to the element in question. The correct route for your situation depends on your use case.
Describing an element as "the first child of the fifth child of the second child of the first child of the eighth child of the second child of HTML" is as much the right path to an element as if you described the way to your house as "Walk past the park then walk past the bus stop then walk past the hardware store then walk past the butchers then turn left then walk past the pizza shop then walk past the library"
And I'm not disagreeing with you; I'm only saying Chrome has this feature. I don't know what route they choose for you, but I know they don't always work in tools that parse (HT/X)ML.
A problem with XPath is that many tools don't actually support the latest spec, or even a spec, and just give you a bastardised syntax that looks like XPath and walks like XPath but will never swim like XPath. This is frustrating because you're never sure which features you can actually rely on.
Interesting, (and this could just be a made up problem to illustrate the blog post), but wouldn't it have been much easier to just store the lyrics in another format (not HTML)?
For example, you could use TEI XML (http://www.tei-c.org/index.xml), and then use stanzas and lines. Then when you go to render your lyrics, you can capitalize the first letters in your presentation code.
XPath is alright but as sixbrx noted it suffers from problems with namespaces.
I keep using this xslt transformation to remove the ns info. Then it works just fine: http://stackoverflow.com/a/413088/34022
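The same namespace-stripping trick works without XSLT. A small sketch with Python's stdlib ElementTree, which stores a namespaced tag as '{uri}localname' (the helper name is made up; same caveats as above about losing information):

```python
import xml.etree.ElementTree as ET

def strip_namespaces(root):
    for elem in root.iter():
        # Comments/PIs have a non-string tag; skip them. Note this only
        # strips element namespaces, not namespaced attributes.
        if isinstance(elem.tag, str) and '}' in elem.tag:
            elem.tag = elem.tag.split('}', 1)[1]
    return root

doc = strip_namespaces(ET.fromstring('<a xmlns="urn:example"><b/></a>'))
found = doc.find('b')  # a plain, un-namespaced query now works
```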
This is fine, I guess, and it's clearly something people want to do based on how often it gets asked on Stackoverflow, but there's a reason XML zealots get snarky when people ask how to do this. Some questions you might want to ask yourself:
- Why are the namespaces there in the first place?
- Do I really not care if the element is found in a namespace other than the one expected?
- Does my host environment have a way to specify the namespace of the element I want to find (hint: it probably does)?
- Is the reason that I want to remove the namespace that it's actually something I need to do or is it that I am ignorant of the method for specifying namespaces in my host environment?
The question is usually "is the author of this XML actually using namespaces in a reasonable manner?"
And the answer is usually "no".
I have seen SOAP responses with 20+ namespaces, all of them being essentially implementation details -- every different section of their internal API getting its own namespace. Inevitably, the elements are also prefixed in a way that makes them distinct, or wrapped in a distinguishing element (i.e. Contact/NameInfo/FirstName rather than FirstName xmlns="contact-name").
In situations like that, your best case scenario is that you do the grunt work of setting up aliases for all the namespaces, putting them into your XPaths, and you're done. The worst case scenario (which I've encountered) is when a version update of the API changes the URIs for half the namespaces, even though the structure of the data hasn't changed. In a case like that, you're actually penalized for doing the 'right' thing and not just stripping the damn things off.
CSS selectors are much easier to remember than XPath. Python's BeautifulSoup allows you to select elements with selectors and is very convenient. XPath is a bit more verbose and most people already are familiar with CSS syntax.
And indeed, any CSS selector can be converted to an equivalent XPath query, at least for selectors on XML and HTML. http://pythonhosted.org/cssselect/ is a Python implementation of such a conversion. (Note that there is no XPath-to-CSS-selector converter: XPath can express things CSS selectors cannot, since CSS selectors are designed so they can be matched by a streaming parser as soon as the first child of the element appears.)
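A few common equivalences, hand-written here rather than taken from cssselect's output, so treat the exact forms as illustrative:

```
p            ->  //p
div p        ->  //div//p
div > p      ->  //div/p
p[title]     ->  //p[@title]
p.text       ->  //p[contains(concat(' ', normalize-space(@class), ' '), ' text ')]
```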
I'd say more than a bit. When you have multiple namespaces in your xml it can become so verbose that it's hard to see the signal through the noise. But then, maybe there's a way to reduce that noise in a way that I don't understand.
Long comment short, I agree. CSS selectors are easier to understand and read.
Ah yeah, I understand what you mean now. XPath's `/` is something like a token that means "separate these two things". In CSS the separation between segments is the zero or more whitespace characters that live between the parts of the selector.
This is a great explanation and quick tutorial on XPath, but, like regex, don't think I'd ever use it in production code unless I absolutely had to.
I'm sure I'd have fun coming up with an XPath solution, but for me, the ultimate goal is maintainability. If I wasn't 90% sure that the next person to look at that code already knew XPath, then I'd go with the Ruby solution.
Dealing with 11 lines of code in a language you know is better than dealing with 1 line of code in a language you don't (which ends up forcing you to read 1000 lines of documentation and examples to understand it).
XPath : XML :: regex : text