In search of the perfect URL validation regex (mathiasbynens.be)
65 points by lgmspb on June 22, 2014 | 81 comments



If you're going to allow dotted IPs you should really allow 32-bit IPs too, e.g., http://0xadc229b7, http://2915183031 and http://025560424667. (The validity of this last one was news to me, I must admit.)
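
All three of those are the same 32-bit address (173.194.41.183) written in hex, decimal, and octal. A quick sketch with Python's ipaddress module, if you want to check the arithmetic yourself:

    import ipaddress

    # The dotted-quad form of the address used in the examples above.
    n = int(ipaddress.IPv4Address('173.194.41.183'))
    print(n)        # 2915183031    (decimal notation)
    print(hex(n))   # 0xadc229b7    (hex notation)
    print(oct(n))   # 0o25560424667 (i.e. 025560424667 in C-style octal)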



Since I just finished implementing a toy HTTP/1.1 server, I must throw in my newfound knowledge that RFC 2732 has been updated with Zone IDs:

http://tools.ietf.org/html/rfc6874


Yep.

I should mention that 99.9% of domains will fall into the standard form (handle.domain or ip.ip.ip.ip).

As such, you are far more likely to let a user enter a bad URL they did not intend (because it still validates) than to let an uncommon domain actually be used.

So a much simpler regex would likely 'make more people happy' than one that is 100% correct to the tech spec.


None of those is a URI, so a URI validator most certainly should not accept them. Just because browsers tend to understand them as a matter of a historical accident does not mean those are valid URIs, just as tag soup that browsers also tend to understand isn't valid HTML either.


Exactly. The goal was to come up with a good regular expression to validate URLs as user input. There’s no way I’d want to allow alternate IP address notations.


Yes... I've found the relevant RFC now. How disappointing! I like 32-bit hex IP addresses.


Why use a regex? It's much simpler to write a URL validator by hand, speaking as someone who wrote a URL parser,[1] and fixed a bug in PHP's.[2]

Or, you know, use a robust existing validator or parser. Like PHP's, for instance.

[1] https://github.com/TazeTSchnitzel/Faucet-HTTP-Extension - granted, this deliberately limits the space of URLs it can parse, but it's not difficult to cover all valid cases if you need to

[2] https://github.com/php/php-src/commit/36b88d77f2a9d0ac74692a...
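
To give a flavour of the hand-rolled approach (a toy sketch of my own, nowhere near RFC-complete, and not the parser from [1]):

    import string

    # Characters allowed in a scheme and in a (non-IPv6) hostname, roughly.
    SCHEME_CHARS = set(string.ascii_letters + string.digits + '+-.')
    HOST_CHARS   = set(string.ascii_letters + string.digits + '-.')

    def looks_like_http_url(s):
        scheme, sep, rest = s.partition('://')
        if not sep or not scheme or scheme[0] not in string.ascii_letters:
            return False
        if not set(scheme) <= SCHEME_CHARS:
            return False
        # Cut the authority off at the first path, query, or fragment delimiter,
        # then drop any userinfo and port.
        host = rest.split('/', 1)[0].split('?', 1)[0].split('#', 1)[0]
        host = host.rsplit('@', 1)[-1].rsplit(':', 1)[0]
        return bool(host) and set(host) <= HOST_CHARS

    print(looks_like_http_url('http://example.com/a?b=c'))  # True
    print(looks_like_http_url('http//missing-colon.com'))   # False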


Exactly. Isn't this as bad an idea as trying to parse HTML with regular expressions? [1]

[1]: http://stackoverflow.com/questions/1732348/regex-match-open-...


No, it's not: URIs are regular, so using regular expressions is perfectly fine.


Good point. You're right, it's not as bad an idea.

(I still think it's not a great idea. Being regular isn't necessarily the same as being parseable with a maintainable regex.)


> Why use regex? It's much simpler to write a URL validator by hand...

I actually have a use-case. I am firming up a feature right now that detects when a user types a url into a text field and replaces it on the fly with a footnote-style reference number (much like your comment above). This is done to (1) minimize input string length, (2) draw the benefits of a consistent interface, and (3) avoid screwing around with the fragility of url shortening nonsense.

I may regret this, but here's a link to my dev environment for this feature (please be gentle), to see it in action:

https://cloudcity.tenfourgood.com/cloudcity

...just start typing in the big text box and add in some urls.

It uses a fairly ugly-looking regular expression[0].

If you take out the Unicode mumbo-jumbo, it's not really THAT tricky of a pattern. It does fail on IP addresses and it may be a little over-aggressive on matching, but I wanted it to catch things like "abc.com" and "//xyz.com".

Edit: Formatting and clarity. Removed the explicit regular expression because it predictably got garbled.

[0] http://regexr.com/38vsq not exactly the same one I'm currently using, but it's pretty close. See source for most up-to-date version.
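
For anyone curious about the replacement half of this, here's a stripped-down sketch of the idea (my own toy version with a deliberately simplified pattern, not the code behind the link above):

    import re

    # Deliberately simple pattern; the real one also tries to catch "abc.com" etc.
    URL_ISH = re.compile(r'\bhttps?://\S+', re.IGNORECASE)

    def footnote_urls(text):
        refs = []
        def repl(match):
            refs.append(match.group(0))
            return '[%d]' % len(refs)
        return URL_ISH.sub(repl, text), refs

    body, refs = footnote_urls('see http://example.com and https://foo.bar/baz')
    print(body)   # see [1] and [2]
    print(refs)   # ['http://example.com', 'https://foo.bar/baz']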


I agree; the correct one (500+ chars) looks like a maintenance nightmare to me. I tried to build a real e-mail address validator that would also accept the more exotic forms, and there was no way in hell I would have done that with a regex.

I've also met very few people who actually understand regexes. It's a whole separate skill, and if I use it in my application I don't know whether the next person can pick it up.


This is why you should always write regexes (at least, anything beyond a couple of characters) using the `x / PCRE_EXTENDED` flag. It makes the engine ignore whitespace and anything following a #, so you can split your regex across multiple lines and comment it to explain what it's doing.
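
For example, Python's equivalent is re.VERBOSE; the pattern here is just a toy, but the layout is the point:

    import re

    # In verbose mode, whitespace and "#" comments inside the pattern are
    # ignored, so the regex can be split up and annotated line by line.
    SIMPLE_URL = re.compile(r"""
        ^https?://     # scheme: http or https only
        [^\s/?#]+      # host: everything up to a path, query, or fragment
        \S*$           # the rest of the URL, if any
    """, re.VERBOSE)

    print(bool(SIMPLE_URL.match('https://example.com/a?b=c')))  # True
    print(bool(SIMPLE_URL.match('not a url')))                  # False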


Here's the documented version: https://gist.github.com/dperini/729294

We should get a port to every language.


If you looked at the page before commenting you’d know that PHP’s built-in URL parser is one of the implementations that is being tested. You’d also see that one of the requirements was to not match scheme-relative URLs (e.g. `//foo.bar`) like the ones your commit fixes.

The regular expressions presented do not conform to the URL Standard or to any RFC, but rather to the list of requirements on that page.

I applaud your work in improving PHP’s built-in URL parser.


Why are http://www.foo.bar./ and http://a.b--c.de/ supposed to fail?

The @stephenhay one is just about perfect despite being the shortest. The subtleties of hyphen placement aren't very important, and this is a dumb place to filter out private IP addresses when a domain could always resolve to one. Checking whether an IP is valid should be a later step.


As for why the trailing dot is disallowed, see http://saynt2day.blogspot.com/2013/03/danger-of-trailing-dot....

The goal was to come up with a good regular expression to validate URLs as user input, and not to match any URL that browsers can handle (as per the URL Standard).

a.b--c.de is supposed to fail because `--` can only occur in Punycoded domain name labels, and those can only start with `xn--` (not `b--`).


Many of these regexes should not be used on user input, at least not if your regex library backtracks (uses NFAs), because of the risk of ReDoS: http://en.wikipedia.org/wiki/ReDoS

Trying to shoehorn NFAs into parsing stuff that isn't a regular language is generally a bad idea. (See: Langsec.)
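
A minimal illustration of what goes wrong (not one of the regexes from the article, just the classic nested-quantifier shape):

    import re
    import time

    evil = re.compile(r'^(a+)+$')   # nested quantifiers: the classic ReDoS shape
    s = 'a' * 26 + '!'              # almost matches; the trailing '!' forces failure

    start = time.time()
    evil.match(s)                   # backtracks through ~2^25 ways to split the a's
    print('%.1fs' % (time.time() - start))   # runtime roughly doubles per extra 'a'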


The first one I'm not sure about; it looks like a valid FQDN (equivalent to www.foo.bar, where the trailing dot is implicit). The second one, I guess, has to do with Punycode and international URI encoding?


You'd only have to filter out xn--, though.


The trailing dot makes it an invalid URL, which defines hostname as *[ domainlabel "." ] toplabel.

"b--c" seems to be a valid domainlabel, though, so I'm not sure why it's on there.


> The trailing dot makes it an invalid URL, which defines hostname as * [ domainlabel "." ] toplabel.

I disagree. But, this is a tricky one. The relevant specs are:

    Spec               |          | Validity | Definition
    URL      (RFC1738) | obsolete |  invalid | hostname = *[ domainlabel "." ] toplabel
    HTTP/1.0 (RFC1945) | current  |  invalid | host     = <A legal Internet host domain name
                       |          |          |             or IP address (in dotted-decimal form),
                       |          |          |             as defined by Section 2.1 of RFC 1123>
    HTTP/1.1 (RFC2068) | obsolete |  invalid | ; same as RFC1945
    HTTP/1.1 (RFC2616) | obsolete |    valid | hostname = *( domainlabel "." ) toplabel [ "." ]
    URI      (RFC3986) | current  |    valid | host     = IP-literal / IPv4address / reg-name
                       |          |          | reg-name = *( unreserved / pct-encoded / sub-delims )
    HTTP/1.1 (RFC7230) | current  |    valid | uri-host = <host, see [RFC3986], Section 3.2.2>
The only way that URL is invalid is if we are in a strict HTTP/1.0 context.

As a note about RFC1738 being obsolete: these days a URL is just a URI (1) whose scheme specifies it as a URL scheme, and (2) is valid according to the scheme specification.

As the given URL is a valid URI, and is valid according to the current http URL scheme specification (RFC7230), that URL is valid.


You forgot the most relevant spec, the URL Standard: http://url.spec.whatwg.org/


> The trailing dot makes it an invalid URL, which defines hostname as *[ domainlabel "." ] toplabel.

RFC2396§3.2.2 defines hostname as:

   hostname      = *( domainlabel "." ) toplabel [ "." ]
RFC3986, which obsoletes RFC2396, seems to make fewer claims about the authority section, and specifically says in Appendix D.2 that the toplabel rule has been removed.


At best this lets you conclude that a URL could be valid. Is that really useful? Is the goal here to catch typos? Because you'd still miss an awful lot of typos.

If you really want your URL shortener to reject bad URLs, then you need to actually test fetching each URL (and even then...)

As an aside, I'd instantly fail any library that validates against a list of known TLDs. That was a bad idea when people were doing it a decade ago. It's completely impractical now.


My exact use case was the following: the user clicks a bookmarklet that passes the current URL in the browser as a query string parameter to a URL shortener script. The validation is then performed before the URL is shortened.

In that scenario, and with the given requirements, I can’t think of a case where the validation fails. There’s no need to worry about protocol-relative URLs, etc.

(Keep in mind that this page is 4 years old — I very well may have missed something.)

> If you really want your URL shortener to reject bad URLs, then you need to actually test fetching each URL (and even then...)

I disagree. http://example.com/ might experience downtime at some point in time, but that doesn’t mean it’s suddenly an invalid URL.

> As an aside, I'd instantly fail any library that validates against a list of known TLDs. That was a bad idea when people were doing it a decade ago. It's completely impractical now.

Agreed.


I still don't quite follow the purpose of the validation. Is it against malicious use? In normal use, I would think that pretty much any URL that's good enough for the browser sending it would be good enough for the link shortener.


But then you might end up shortening things like `about:blank` by accident.


> At best this lets you conclude that a URL could be valid. Is that really useful?

It's useful to find and linkify URLs in text (e.g. in your HN comments, how do you think HN makes http://foo.com into a link?)


That's not the premise laid out on the linked page.


Another important dimension when evaluating these regexes is performance. The Gruber v2 regex has exponential (?) behavior on certain pathological inputs (at least in the Python re module).

There are some examples of these pathological inputs at https://github.com/tornadoweb/tornado/blob/master/tornado/te...


In node.js too. I found this out the hard way. I ended up modifying it so that it didn't work as well, but at least stopped DoSing my service:

https://github.com/PiPeep/NotVeryCleverBot/blob/coffee-rewri...

Note the commented out lines in the here-regex.


From experience, the Python re module does weird things sometimes. There is a better third-party regex module: https://pypi.python.org/pypi/regex.


Does it use an NFA?

http://swtch.com/~rsc/regexp/regexp1.html

Because the issue with the URL regex mentioned is with backtracking.


Use a standard URI parser to break this problem into smaller parts. Let a modern URI library worry about arcane details like spaces, fragments, userinfo, IPv6 hosts, etc.

   require 'uri'
   uri = URI.parse(target).normalize
   uri.absolute? or raise 'URI not absolute'
   %w[ http https ftp ].include?(uri.scheme) or raise 'Unsupported URI scheme'
   # Etc


That was not an option in this case, as the goal is to validate URLs entered as user input and blacklist certain URL constructs even though they’re technically valid.


So check if uri.host matches your blacklist.


It'd be interesting to look at URI.parse's source code.


It does match a regular expression internally.

https://github.com/ruby/ruby/blob/trunk/lib/uri/rfc2396_pars...


Why no IPv6 addresses in the test cases?


Why not put in some of the new TLDs as test cases... ;)


John Gruber (of daringfireball.net) came up with a regex for extracting URLs from text (Twitter-like) years ago, and has improved it since. The current version is found at https://gist.github.com/gruber/249502.

I haven't tested it myself, but it's worth looking at.

Original post: http://daringfireball.net/2009/11/liberal_regex_for_matching...

Updated version: http://daringfireball.net/2010/07/improved_regex_for_matchin...

Most recent announcement, which contained the Gist URL: http://daringfireball.net/linked/2014/02/08/improved-improve...


Isn't it the @gruber v2 column from the page? It looks to have no false negatives, but many false positives. The only one that does perfectly on the tested set is Diego Perini's: https://gist.github.com/dperini/729294


Hrm, you're right. I managed to miss seeing that while skimming the page.


Interestingly, it seems http://✪df.ws isn't actually valid, even though it exists. ✪ isn't a letter[1], so it isn't allowed in internationalized domain names. I was looking at the latest RFC from 2010 [2], so maybe it was allowed before that. The owner talks about all the compatibility trouble he had after he registered it [3]. The registrar that he used for it, Dynadot, won't let me register any name with that character, nor will Namecheap.

[1] http://www.fileformat.info/info/unicode/char/272a/index.htm

[2] http://tools.ietf.org/html/rfc5892

[3] http://daringfireball.net/2010/09/starstruck


It's the same way with http://💩.la/: using IDNA it gets turned into xn--ls8h.la.

That is a valid domain name and should be treated as such.
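
You can reproduce the mapping with Python's built-in punycode codec (a rough check only; this just does the Punycode step and skips IDNA's validity rules):

    label = '💩'.encode('punycode').decode('ascii')
    print('xn--' + label + '.la')        # xn--ls8h.la
    print(b'ls8h'.decode('punycode'))    # 💩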


I guess you could argue the definition of "valid". According to the RFC it's

DISALLOWED: Those that should clearly not be included in IDNs. Code points with this property value are not permitted in IDNs.


There is no perfect URL validation regex, because there are so many things you can do with URLs, and so many contexts to use them with. So, it might be perfect for the OP, but completely inappropriate for you.

That said, there is a regex in RFC3986, but that's for parsing a URI, not validating it.

I converted 3986's ABNF to regex here: https://gist.github.com/mnot/138549

However, some of the test cases in the original post (the list of URLs there isn't available separately any more :( ) are IRIs, not URIs, so they fail; they need to be converted to URIs first.

In the sense of the WHATWG's specs, what he's looking for are URLs, so this could be useful: http://url.spec.whatwg.org

However, I don't know of a regex that implements that, and there isn't any ABNF to convert from there.


This is a good lesson in why you want to avoid writing your own regexes. Even something as simple as an email address can be insane: http://ex-parrot.com/~pdw/Mail-RFC822-Address.html


This always comes up in these discussions and is a terrible counterexample. RFC 822 is the format for email messages (i.e. headers) and not a form you'll ever find email addresses in "in the wild" (e.g. on web forms).


What's wrong with IP-address URLs? If they are invalid because it says so in some RFC, this is still not the ultimate regex. If you redirect a browser to http://192.168.1.1 it works perfectly fine.

And why must the root period behind the domain be omitted from URLs? Not only does it work in a browser (and people end sentences with periods), the domain should actually end in a period all the time but it's usually omitted for ease of use. Only some DNS applications still require domains to end with root dots.


This is in the context of a URL shortener, and one of the goals seems to be to prohibit bad IPs, not all IPs. "Bad" seems to be defined here as those in the private IP ranges, those in multicast space, etc.


Okay, that makes sense!


According to the RFC, since domains are allowed to be entirely numeric, there is overlap between valid domains and valid IP addresses. The RFC says that if something could be a valid IP address, it is to be thought of as an IP address.


I've put the test cases into a refiddle: http://refiddle.com/refiddles/53a736c175622d2770a70400


I just validate with this regex '^http' :P



What's wrong with /([\w-]+:\/\/[^\s]+)/gi

It's not fancy, but it will essentially match any URL.


/./ will also match any URL. The point is to reject non-URLs.


Yes, to reject non-URLs and also some URLs that are technically valid but that I want to explicitly disallow anyway.


When you have a hammer, everything looks like a nail.


What do the red vs. green boxes mean?


Oh I get it now. Got confused between 1s and 0s.


What flavor of regex are we making this in?


WTF? When will people finally learn to read the spec and implement things based on the spec and test things based on the spec instead of just making up themselves what a URL is or what HTML is or what an email address is or what a MIME body is or ...

There are supposed URIs in that list that aren't actually URIs, there are supposed non-URIs in that list that are actually URIs, and most of the candidate regexes obviously must have come from some creative minds and not from people who should be writing software. If you just make shit up instead of referring to what the spec says, you urgently should find yourself a new profession, this kind of crap has been hurting us long enough.

(Also, I do not just mean the numeric RFC1918 IPv4 URIs, which obviously are valid URIs but have been rejected intentionally nonetheless - even though that's idiotic as well, of course, given that (a) nothing prevents anyone from putting those addresses in the DNS and (b) those are actually perfectly fine URIs that people use, and I don't see why people should not want to shorten some class of the URIs that they use.)

By the way, the grammar in the RFC is machine readable, and it's regular. So you can just write a script that transforms that grammar into a regex that is guaranteed to reflect exactly what the spec says.
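
To give a flavour of what that mechanical translation looks like (a hand-rolled sketch for a single rule, not a full ABNF-to-regex converter): RFC 3986's dec-octet and IPv4address rules turn directly into regex fragments that compose, e.g.:

    import re

    # dec-octet   = DIGIT / %x31-39 DIGIT / "1" 2DIGIT / "2" %x30-34 DIGIT / "25" %x30-35
    dec_octet   = r'(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])'
    # IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
    ipv4address = r'(?:' + dec_octet + r'\.){3}' + dec_octet

    print(bool(re.fullmatch(ipv4address, '192.168.1.1')))   # True
    print(bool(re.fullmatch(ipv4address, '999.168.1.1')))   # False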


This entire rant begins with the premise that the spec matches the real world implementation. Given that one of your examples is "what an email address is", I submit that expecting reality and the spec to match is a beautiful dream, from which a developer should awaken before trying to implement such a scheme.


Except there is no "the real world implementation". There are only lots of implementations that are incompatible with the spec as well as amongst each other. Inventing yet another variant of your own that also isn't going to be compatible with anything is not going to help anyone. Deviating from the formal spec because everyone practically agrees how to do things, albeit differently than in the formal spec, is something quite different from making shit up, and actually tends to be even harder than building things to spec, as there tends to be no easy reference to look things up in, but instead you might have to look into the guts of existing implementations and talk to people who have built them to figure out what to do - and you would normally start with an implementation according to spec anyhow, and only add special cases for non-normative conventions later on.

Also, what exactly is the problem with email addresses? There is a very unambiguous grammar of those in the RFC, and there are lots of implementations of exactly what the spec specifies. Just because some web kiddies have made up some shit about email addresses and use that for validation, doesn't mean that postfix, qmail, or exim are written by morons.


> Deviating from the formal spec because everyone practically agrees how to do things, albeit differently than in the formal spec, is something quite different from making shit up, and actually tends to be even harder than building things to spec, as there tends to be no easy reference to look things up in, but instead you might have to look into the guts of existing implementations and talk to people who have built them to figure out what to do - and you would normally start with an implementation according to spec anyhow, and only add special cases for non-normative conventions later on.

This is exactly what Anne van Kesteren has been doing with the URL Standard: http://url.spec.whatwg.org/


The goal was to come up with a good regular expression to validate URLs in user input, and not to match any URL that browsers can handle (as per the URL Standard). I am fully aware that this is not the same as what any spec says.

> By the way, the grammar in the RFC is machine readable, and it's regular.

The RFC does not reflect reality either (which, ironically, is what you seem to be complaining about). If you’re looking for a spec-compliant solution, the spec to follow is http://url.spec.whatwg.org/.

> If you just make shit up instead of referring to what the spec says, you urgently should find yourself a new profession, this kind of crap has been hurting us long enough.

I am aware of, and am a contributor to, the URL Standard: http://url.spec.whatwg.org/ That doesn’t mean there aren’t any situations in which I need/want to blacklist some technically valid URL constructs.


> The goal was to come up with a good regular expression to validate URLs in user input, and not to match any URL that browsers can handle (as per the URL Standard).

WTF? What is "validation" supposed to be good for if it doesn't actually validate what it claims to? Exactly this mentality of making up your own stuff instead of implementing standards is what causes all these interoperability nightmares! If you claim to accept URLs, then accept URLs, all URLs, and reject non-URLs, all non-URLs. There is no reason to do anything else, other than laziness maybe, and even then you are lying if you claim that you are validating URLs - you are not. If you say you accept a URL, and I paste a URL, your software is broken if it then rejects that URL as invalid.

This does not apply to intentionally selecting only a subset of URLs that are applicable in a given context, of course - if the URL is to be retrieved by an HTTP client, it's perfectly fine to reject non-HTTP URLs, of course, but any kind of "nobody is going to use that anyhow" is not a good reason. In particular, that kind of rejection most certainly is something that should not happen in the parser as that is likely to give inconsistent results as the parser usually works at the wrong level of abstraction.

> The RFC does not reflect reality either (which, ironically, is what you seem to be complaining about).

Well, or reality does not match the RFC?

> If you’re looking for a spec-compliant solution, the spec to follow is http://url.spec.whatwg.org/.

A spec for a formal language that doesn't contain a grammar? The world is getting crazier every day ...

> That doesn’t mean there aren’t any situations in which I need/want to blacklist some technically valid URL constructs.

Yeah, but blocking IPv4 literals of certain address ranges seems like a stupid idea nonetheless. Good software should accept any input that is meaningful to it and that is not a security problem. And as I said above, such rejection most certainly should not happen in the parser.


> > The RFC does not reflect reality either (which, ironically, is what you seem to be complaining about).

> Well, or reality does not match the RFC?

Doesn’t matter – if there’s a discrepancy between what a document says and what implementors do, that document is but a work of fiction.

> And as I said above, such rejection most certainly should not happen in the parser.

This is not a parser.


> Doesn’t matter – if there’s a discrepancy between what a document says and what implementors do, that document is but a work of fiction.

Yes and no. When there is a de-facto standard that just doesn't happen to match the published standard, yeah, sure. Otherwise, bug compatibility is a terrible idea and should be avoided as much as possible, many security problems have resulted from that.

> This is not a parser.

Well, even worse then. Manually integrating semantics from higher layers into parsing machinery (which it is, never mind the fact that you don't capture any of the syntactic elements within that parsing automaton) is both extremely error-prone and gives you terrible maintainability.

edit:

For the fun of it, I just had a look at the "winning entry" (diegoperini). Unsurprisingly, it's broken. It was trivial to find cases that it will reject that you most certainly don't intend to reject. For exactly the reasons pointed out above.


You do realize that RFC 3986 actually contains an official regular expression, right? http://tools.ietf.org/html/rfc3986#appendix-B


You do realize that RFC 3986 doesn’t actually match reality, right? http://url.spec.whatwg.org/#goals


That's for parsing a "well-formed" URI, not validating a URI as well-formed in the first place, however :)
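
For instance (a quick Python check; the pattern is copied from Appendix B): it happily splits a URI into components, but it also "matches" strings that are nowhere near being URIs.

    import re

    # The component-splitting regex from RFC 3986, Appendix B.
    APPENDIX_B = re.compile(
        r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

    m = APPENDIX_B.match('http://example.com/path?q=1#frag')
    print(m.group(2), m.group(4), m.group(5), m.group(7), m.group(9))
    # -> http example.com /path q=1 frag

    # Non-URIs still match; "%x" just ends up in the path component.
    print(bool(APPENDIX_B.match('%x')))        # True
    print(APPENDIX_B.match('%x').group(5))     # %x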


That doesn't work for validation.


Of course it does: if that regular expression matches, we have a valid URI. It's silly what gets downvoted on HN these days.


No, it doesn't. This regular expression matches strings that are not URIs, and that should be quite obvious if you compare the grammar in that same RFC to the regex.


Yes, it does. And it is quite obvious if you compare the grammar to the regular expression.


Please show how to produce the following string (which is matched by that regex) from the grammar (without the quotes):

"%x"

(edit: actually, feel free to do it with the quotes included if you like, that would still be matched by that regex)



