URL Parsing in WebKit (webkit.org)
125 points by ash_gti on Dec 2, 2016 | 19 comments



For what it's worth, the URL file:///foo/bar means:

1. an empty hostname
2. path = '/foo/bar'

In a file:// URL, an empty hostname implies 'localhost'. As far as I know, file:// is unique in this regard.

When you parse URIs generically and encounter a URI like:

scheme:foo/bar

The meaning of 'foo/bar' is also a path. There are many URI schemes like this, such as mailto: and urn:.

So given that, file:foo/bar should probably throw an error, but if not, it should be canonicalized into file:///foo/bar (triple slash), because file://foo/bar refers to the 'bar' file on the 'foo' host, and file:///foo/bar refers to /foo/bar on localhost.
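You can watch a generic parser make exactly that distinction; for instance, with Python's urllib.parse (any RFC 3986 parser should report the same components):

    from urllib.parse import urlparse

    for url in ('file:foo/bar', 'file://foo/bar', 'file:///foo/bar'):
        p = urlparse(url)
        print(f'{url:18} host={p.netloc!r:7} path={p.path!r}')

    # file:foo/bar       host=''      path='foo/bar'    (no authority at all)
    # file://foo/bar     host='foo'   path='/bar'       ('foo' is the host!)
    # file:///foo/bar    host=''      path='/foo/bar'   (empty host = localhost)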


One annoying thing: web browsers parse URLs two different ways!

1. Typing "example.com" into your address bar takes you to "//example.com/" (root path of the example.com domain in the current/default scheme).

2. Clicking on a link with href="example.com" takes you to "./example.com" (that is: current domain, current scheme, relative path reference).

The URL parsers built into most runtimes use behavior #2, and I can see the usefulness of it in the sense that you can sort of treat a URL as an extended version of a filesystem Path object, where any string that forms a valid Path (relative or absolute) also forms a valid URL with equivalent semantics.

But most URLs people type, in the wild, implicitly assume behavior #1. If you write an unstructured-text "ingestor" that extracts URLs embedded in plaintext on the web or in print and tries to dereference them, only approach #1 will get you anywhere.

That said, I've never seen a single URL library that exposes any parsing API for type-1 URL fragments. It'd be extremely useful for parsing URLs entered by humans as responses to prompts.
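In practice you end up writing that helper yourself. A minimal sketch in Python (the function name, default scheme, and heuristic here are my own, not any library's API):

    from urllib.parse import urlparse, ParseResult

    def parse_type1(text: str, default_scheme: str = 'http') -> ParseResult:
        # Hypothetical type-1 parser: if there's no explicit scheme, assume
        # the first component is a host, the way address bars do.
        # (Caveat: inputs like 'host:8080/x' can fool the scheme sniffing.)
        text = text.strip()
        if not urlparse(text).scheme:
            text = default_scheme + '://' + text
        return urlparse(text)

    print(parse_type1('example.com/page').netloc)  # example.com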


That'd be because type-1 fragments can be parsed by prefixing them with http:// - or, more technically, humans treat the scheme as implicit and assume the first component is always the net location (a.k.a. the domain name).

Type-2 URLs are all considered relative URLs, which is why all URL libraries have a resolver concept for relative URLs. (RFC1808 has the gory details.)

The problem with type-1's is that they're not in the correct shape for relative address resolution, so you're always engaging in a bit of gymnastics, figuring out whether somebody typed only part of the URL and relied on the implicit convention. And if you were to treat them as relative, there's ambiguity, so they can't be allowed in HTML pages.

I.e. href="www.example.com" is not clear in type-1 - it could be a domain, or it could be a relative URL for the www.example.com file. In type-2, that's entirely clear - there's no scheme and no net location, so it must be a path.
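And the type-2 reading is fully mechanical; Python's urljoin, for instance (which implements the RFC 3986 successor to the RFC 1808 algorithm), shows all three shapes:

    from urllib.parse import urljoin

    base = 'https://current.example/dir/page.html'
    print(urljoin(base, 'www.example.com'))    # https://current.example/dir/www.example.com
    print(urljoin(base, '/www.example.com'))   # https://current.example/www.example.com
    print(urljoin(base, '//www.example.com'))  # https://www.example.com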

Conversely, we can't use type-2 URLs exclusively, because humans would quickly realize that all useful URLs start with a double slash and would want to start leaving it off, because "all URLs are like that". Most people oddly lack the ability to keep everything strictly clean according to a beautiful grammar, and instead think eliding common prefixes is fine, even when they span multiple distinct items. (There's a lesson for people who design grammars in there, but it's too late to learn it on the web.)

And here we are.

And don't even get me started on the fact that the address bar, on top of all of that, also needs to try to figure out if it's a URL (and which kind), or a query, or maybe a prefix for a custom search engine.

As usual, having users makes things really complicated ;)


Google around for a "public suffix list" library for your favorite language; it will aid you in guessing that 'example.com' could be a valid FQDN. As a crawler guy, I see public suffix list features in the kinds of URL libraries that I use or write.
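In Python, for example, tldextract is one such library (it fetches and caches Mozilla's list; the exact fields of its result type vary slightly by version):

    # pip install tldextract  (one of several public-suffix-list libraries)
    import tldextract

    ext = tldextract.extract('forums.example.co.uk')
    print(ext.subdomain, ext.domain, ext.suffix)  # forums example co.uk

    ext = tldextract.extract('example.notatld')
    print(repr(ext.suffix))  # '' -> no known public suffix, probably not a FQDN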


Don't forget to update your library whenever a new gTLD comes out


It appears that most libraries download the list from Mozilla, but yes, you do need to ensure that it's downloaded frequently.


> When you parse URIs generically and encounter a URI like:

> scheme:foo/bar

> The meaning of 'foo/bar' is also a path.

You can't count on that; a URI scheme can assign arbitrary semantics to what appears after the ':'. For instance, the "mid" URI scheme refers to a Message-Id (not a path), and the "cid" URI scheme refers to a MIME part of a message.
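Which is why a generic parser hands the scheme-specific part back essentially untouched; e.g., in Python (the Message-Id below is made up):

    from urllib.parse import urlparse

    # No '//' means no authority: the parser reports everything after ':'
    # as an opaque 'path'. Only mid:-aware code knows it's a Message-Id.
    p = urlparse('mid:1234abcd@mail.example.com')
    print(p.scheme)  # mid
    print(p.netloc)  # (empty string)
    print(p.path)    # 1234abcd@mail.example.com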


> In a file:// URL, an empty hostname implies 'localhost'.

It is quite obvious how the resource is to be retrieved in that case, but one wonders how it is to be retrieved when the hostname is not empty --- that's straightforward to figure out for http://, https://, and ftp://, but not for file://; in other words, what is its protocol?


I have an implementation, although it's currently closed source and is only available via API: http://0ut.ca/documentation

I believe it's the closest to the standard of anything I've found, and if it isn't, I would like to correct that.

There is a Strict parser, which will fail on any error, and a Loose parser, which will discard errors when possible and follow the de facto parsing implementations.

It should be able to handle any of the edge cases, such as partially percent-encoded Unicode, invalid characters, normalization, or octal/hex IPv4 addresses. The only thing from your linked unit tests that it will not handle is | and \ for Windows paths; they will be encoded.
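For readers unfamiliar with the octal/hex IPv4 case: classic inet_aton accepts each dotted component in decimal, octal (leading 0), or hex (leading 0x), so http://0x7f.0.0.01/ is localhost. A rough sketch of the four-part normalization in Python (my own helper, not the parser above; real inet_aton also accepts 1-, 2-, and 3-part forms):

    def normalize_ipv4(host):
        # Normalize inet_aton-style components to dotted decimal, or None.
        parts = host.split('.')
        if len(parts) != 4:
            return None
        octets = []
        for part in parts:
            if part.lower().startswith('0x'):
                base = 16
            elif part.startswith('0') and part != '0':
                base = 8
            else:
                base = 10
            try:
                value = int(part, base)
            except ValueError:
                return None
            if not 0 <= value <= 255:
                return None
            octets.append(value)
        return '.'.join(map(str, octets))

    print(normalize_ipv4('0x7f.0.0.01'))  # 127.0.0.1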

If anyone is interested in seeing how parsing is done, you can easily compare the expected output in your browser here: http://0ut.ca/api;v1.0/validate/uri/after?hTtPs://foo:%F0%9F... You can also try validating strange relative URIs: http://0ut.ca/api;v1.0/validate/uri/after?+invalid-scheme:/p...?

I would be happy to explain any of the reasoning behind the parsing if anyone is interested.


Wow, thanks!

Your tool helps me because it's like an EXAMPLES section of a man page.


The problem of inconsistent URL parsing doesn't just apply to browsers. This story prompted me to look for "What every web developer must know about URL encoding," which was posted a while back:

https://news.ycombinator.com/item?id=5930494

Unfortunately, the blog post now returns a 404. Here's the most recent capture from archive.org:

https://web.archive.org/web/20151229061347/http://blog.lunat...



This is a problem that's near and dear to my heart, and more progress on standardizing URL parsing would be lovely. I have a crawler and a data miner that both rely on URL parsing, and it's kind of a pain. There are about a hundred tests that the code needs to pass with each revision, yet that's only a fraction of the tests that are needed, and edge cases still turn up in the wild.

For instance, telling the difference between a web URI and a mailto URI without the benefit of a scheme at the beginning of the URI is total guesswork.

The parser's current approach is to return parsed URIs along with a confidence percentage, and the application logic then tries to make some additional guesses based on context.
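As a toy illustration of that shape of API (the rules and numbers below are invented for this comment, not the actual parser):

    import re

    def guess_scheme(text):
        # Invented heuristic: guess a scheme for a scheme-less token,
        # returning (scheme, confidence).
        if re.fullmatch(r'[^@/\s]+@[^@/\s]+\.[^@/\s]+', text):
            return 'mailto', 0.8   # user@host.tld with no path: likely email
        if '/' in text or text.lower().startswith('www.'):
            return 'http', 0.9     # a path or www. prefix: likely web
        if '@' in text:
            return 'mailto', 0.5   # an '@' but an odd shape: weak guess
        return 'http', 0.6         # bare hostname: default to web

    print(guess_scheme('user@example.com'))  # ('mailto', 0.8)
    print(guess_scheme('example.com/page'))  # ('http', 0.9)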

URI parsing is not my favorite thing.



> An ideal benchmark would measure the performance of parsing real URLs from popular websites, but publishing such benchmarks is problematic because URLs often contain personally identifiable information.

You could base your benchmark on URLs obtained by crawling the public web (without cookies or other state).


> For example, you might be trying to reduce your server’s bandwidth use by removing unnecessary characters in URLs.

Is this sarcasm? The savings can never be more than a few KB per page.


We standardised HTML parsing; is URL parsing really much trickier?

Maybe we need HTML6, with an opt-in doctype and every parser fully defined.


The standardization is "easy". However, we should not forget that the browser must still support the wrong behavior, because a bunch of applications expect it; otherwise, the implementation of the "standardised" URL parsing would break the web.


HTML parsing was actually slightly more uniform across browsers than URL parsing is. So in some ways URL parsing really is much trickier.



