Regex 101

It's all fun and games until you find out that the meaning of a regex is implementation specific. I've found that a Python regex will accept an unspecified newline byte/character before the $, while std::regex in GNU C++11 will not.
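A minimal sketch of the Python side of this, using only the standard `re` module (the pattern `abc` is just an illustrative placeholder). It shows `$` accepting a single trailing newline, and two ways to opt out of that behaviour:

```python
import re

# In Python's re module, `$` (without re.MULTILINE) matches at the very end
# of the string and also just before a single trailing newline.
pattern = re.compile(r"^abc$")

matches_plain = pattern.match("abc") is not None
matches_trailing_newline = pattern.match("abc\n") is not None

# `\Z` anchors only at the true end of the string, and re.fullmatch
# requires the whole string to match, so both reject the trailing newline.
strict = re.compile(r"^abc\Z")
strict_rejects_newline = strict.match("abc\n") is None
fullmatch_rejects_newline = re.fullmatch(r"abc", "abc\n") is None

print(matches_plain, matches_trailing_newline)
print(strict_rejects_newline, fullmatch_rejects_newline)
```

If the input must not end in a newline, `\Z` or `re.fullmatch` is probably the safer choice in Python.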

robert_tweed · on Nov 16, 2016

This is very true and it always bugs me when some application or API supports regex, but the documentation doesn't specify the grammar, usually saying it's "standard regular expressions" - which standard? Posix? PCRE? Something else? Almost every flavour has some subtle differences and it's really, really important to know what you are dealing with.

I do feel like these subtle differences are the main cause of bugs in programs containing regexes; compounded by the fact they aren't super-readable, especially for people who don't use them frequently.

Tools like this are IMO really valuable for code reviews and it's nice to see that this one does have a flavour switch on the LHS with some of the main implementations.

OskarS · on Nov 16, 2016

I also can never remember which characters are magic and which are literal in different implementations. This is mostly Vim's fault, where the rules are impossible to remember and often conflict with other implementations. Given that it's the most common regex thing I use day-to-day, it's very annoying.

EvilTerran · on Nov 16, 2016

Yeah, I agree, vim's default regex syntax is infuriatingly inconsistent. I pretty much always start my REs with "/\v"[0], to the point it's in my muscle memory: I'll find myself looking at that in the cmdline before I've even drafted the pattern in my head. (Then sometimes I'll realise I just want to find an exact string full of symbols, and have to back up and capitalize the "v".)

... I should probably come up with some mappings so the \v is inserted automatically.

[0] http://vimdoc.sourceforge.net/htmldoc/pattern.html#/\v

BorisMelnik · on Nov 16, 2016

For those that don't regex: I originally learned regex just so I could do complex find/replace in Notepad++ (I work with large data sets). I can't begin to say how many hours of work it's saved me over the years, and how much money for that matter.

The best / easiest way that I've taught regex is to pop open a text editor and start searching for stuff. Start by searching for some letters, then a string with spaces, then a line break. From there I point people towards stuff like Apache config file conditionals / URL re-writes etc.

berntb · on Nov 16, 2016

Best sources about regexps?

I thought I knew regexps, then I read "Mastering regular expressions". After that I was embarrassed. :-)

BorisMelnik · on Nov 16, 2016

hmmm not sure, I'd also like a good resource because mine is not really the "mastering" type resource

iheartmemcache · on Nov 16, 2016

Mastering Regular Expressions really is the seminal text, but I highly suggest reading the O'Reilly Sed/awk book as well. It includes the full history behind the evolution of the original line editor ed -> sed & g/re/p (etymologically derived from g[lobal]/re[gular expression]/p[rint]) -> awk. The original vi also derived a lot of its commands from 'ex' (the underlying editor, based on the original ed).

All history aside, 1: Sed&awk 2: Mastering Regexp 3: The vim book has a great chapter on regular expressions (it's been ~10 years since I've touched it, but I remember being gifted a copy from a colleague of my father when I was in my teens and it helped me grok it)

The glibc programming manual is actually really comprehensive. kernel.org/docs/man-pages/ gives a pretty in depth analysis of POSIX.2 regex. I'm sure Perldocs also have PCRE's very well documented.

Zalastax · on Nov 16, 2016

Complement with https://www.debuggex.com to also get an interactive finite automaton (this can really help with understanding what's going on)

strictnein · on Nov 15, 2016

Many hours spent here, trying to create the ultimate email regex, while telling my PMs and managers that there was no such thing.

captaincrowbar · on Nov 15, 2016

Been there, done that. There's only one way to check email addresses: /.+@.+/

Anything else is going to give you a false negative sooner or later. Yeah, you can spend hours framing a complicated expression that carefully tests for exactly what the current bucket of RFCs specify, but you're guaranteed to run into an MTA somewhere that disagrees with you, and you can argue until the cows come home that you're right and they're wrong, but it's still you that's going to have to change to accommodate them in the end.

NelsonMinar · on Nov 16, 2016

Historically there have been many valid RFC 822 addresses without an @ in them. I don't know how many can still be delivered to though.

aroch · on Nov 15, 2016

I would think you'd use: /.+@.+\..+/

pmlnr · on Nov 15, 2016

nope. root@localhost is a valid mail address.
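A quick sketch of the two patterns from this subthread, checked against `root@localhost`. The names `LOOSE`, `DOTTED` and `passes` are made up for illustration, and anchoring with `fullmatch` is an assumption, since the comments above don't say how the patterns would be applied:

```python
import re

LOOSE = re.compile(r".+@.+")        # the bare sanity check: something, "@", something
DOTTED = re.compile(r".+@.+\..+")   # the variant that also requires a dot after the "@"


def passes(pattern, address):
    """Return True if the whole address matches the pattern."""
    return pattern.fullmatch(address) is not None


for address in ("root@localhost", "jdoe@a.com", "James Doe"):
    print(address, passes(LOOSE, address), passes(DOTTED, address))
```

The loose check only catches input with no `@` at all; the dotted variant additionally rejects dotless hosts such as `localhost`.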

aroch · on Nov 15, 2016

How often are you validating email addresses for localhost!? Those are pretty much never a valid email for >>>99% of applications

nothrabannosir · on Nov 16, 2016

That's the whole point of the original post: don't bother thinking about it, let the mta do that for you. The "@" is a courtesy to make sure people didn't mistakenly put their name in. Don't validate, just do a sanity check.

strictnein · on Nov 16, 2016

And people are just as likely to type a valid but incorrect email address as an invalid email address.

nkrisc · on Nov 16, 2016

That's why you also implement an email verification method if you actually care that it's a correct email as well as valid.

mason55 · on Nov 15, 2016

Right, so if you're not going to accept the full RFC then where do you draw the line? Your arbitrary line will assuredly be different from someone else's.

jandrese · on Nov 15, 2016

In the back of the Camel book there is an email matching regex. It takes the entire page. It probably doesn't handle some Unicode edge cases properly.

Validating emails with regexes is one of those things that seems like it should be fairly straightforward, but thanks to the language in the RFC it is actually a nightmare of rabbit holes filled with landmines.

EvilTerran · on Nov 15, 2016

That would be the one shown here:

http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html

I don't know about unicode, but I do note it says it doesn't handle comments (because, apparently, they can be nested - and perl's REs lack recursion).

justinator · on Nov 15, 2016

Perl has supported recursion in regex since around v5.10. The giant regex is not from the Camel book, but from O'Reilly's Mastering Regular Expressions, which does give the example in Perl but was printed before v5.10 of Perl was released. (The 3rd ed. of Mastering Regular Expressions was released in 2006; Perl v5.10 was released in 2007.) I believe the regex was published in the very first edition.

EvilTerran · on Nov 15, 2016

Ah yes, sorry, it would have been more correct to say "perl's REs lacked recursion at the time". I use PCRE pretty frequently, so knew that it's supported recursion for some time; but it's been a while since I've done anything non-trivial with perl itself, so couldn't remember if it too had that functionality or I was just getting their feature-sets confused. Evidently I guessed wrong - thanks for the correction.

77pt77 · on Nov 16, 2016

RFC 822 is the Internet's cancer.

Really. Email addresses should have never been allowed to have comments.

Since every new RFC maintains compatibility with this one, we'll probably be stuck with these poor choices for decades.

jandrese · on Nov 16, 2016

RFC 822 is what happens when you try to standardize a system in such a way that it is backwards compatible with every home grown system on the planet.

Have further RFCs not deprecated most of the braindamage from 822? If not, why? Are there really people who are still trying to use such horrible abominations for their email address?

Cyph0n · on Nov 16, 2016

Email addresses allow comments? Who the heck thought that was a smart idea!?

I guess this is what you're referring to: James Doe <jdoe@a.com>

The email - which is inside the <> - should be parsed correctly, while the rest should be treated as a "comment" [1].

[1]: https://www.cs.tut.fi/~jkorpela/rfc/822addr.html

77pt77 · on Nov 16, 2016

No No.

Stuff like:

* comments are allowed with parentheses at either end of the local-part; e.g. john.smith(comment)@example.com and (comment)john.smith@example.com are both equivalent to john.smith@example.com. [1]

* Also have a look at quotes

[1] https://en.wikipedia.org/wiki/Email_address#Local-part

77pt77 · on Nov 16, 2016

These are all valid email addresses!

    * "Abc\@def"@example.com

    * "Fred Bloggs"@example.com

    * "Joe\\Blow"@example.com

    * "Abc@def"@example.com

    * customer/department=shipping@example.com

    * $A12345@example.com

    * !def!xyz%abc@example.com

    * _somename@example.com

http://haacked.com/archive/2007/08/21/i-knew-how-to-validate...

Cyph0n · on Nov 16, 2016

what.

erelde · on Nov 16, 2016

Guess I'm in today 10000 who learned something today.

77pt77 · on Nov 16, 2016

I can't understand this sentence.

At first I thought you were a bot. Then that your English was not very good.

Looking at your history clearly both are false.

What did you mean?

strictnein · on Nov 16, 2016

They're referring to this: https://xkcd.com/1053/

Cyph0n · on Nov 16, 2016

Good catch, forgot about that one!

77pt77 · on Nov 16, 2016

Now it makes sense.

Thank you!

Cyph0n · on Nov 16, 2016

I think he meant: "I guess I'm among the 10000 who learned something today".

jandrese · on Nov 15, 2016

> Implementing validation with regular expressions somewhat pushes the limits of what it is sensible to do with regular expressions

(rest of the page is filled with line noise)

LOL

xiaoma · on Nov 16, 2016

And the irony is that regular expressions, by definition, can't recurse. Nobody's gonna rename them to turing expressions, though.

edblarney · on Nov 15, 2016

The issue is not regexing, it's lack of real standard in email format :)

ComputerGuru · on Nov 16, 2016

    ^(?("")(""[^""]+?""@)|(([0-9a-z]((\.(?!\.))|[-!#\$%&'\*\+/=\?\^`\{\}\|~\w])*)(?<=[0-9a-z])@))(?(\[)(\[(\d{1,3}\.){3}\d{1,3}\])|(([0-9a-z][-\w]*[0-9a-z]*\.)+[a-z0-9]{2,17}))$

This and other email validation here: https://github.com/neosmart/web/blob/master/Web%20Toolkit/Em...

bduerst · on Nov 16, 2016

Is that Chinese character compatible? (half-serious)

wodenokoto · on Nov 16, 2016

What I don't understand is, don't programs like thunderbird or mutt have a module or function for email sanity check, and why hasn't that been ported as a module for every language imaginable?

tonyedgecombe · on Nov 16, 2016

Being the maintainer of that would be the worst programming job possible.

foota · on Nov 16, 2016

I'm going to have to go ahead and agree with a comment in the javax internet address code here, if you insist on checking against the rfc, you should really just give up and use a lexer.
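To illustrate why a small hand-written scanner copes with nesting where a plain regex struggles, here is a minimal sketch that strips parenthesized comments with a depth counter. It is not the javax code and not an RFC-compliant lexer: `strip_comments` is a hypothetical helper, and it deliberately ignores quoted strings and backslash escapes.

```python
def strip_comments(text: str) -> str:
    """Remove (possibly nested) parenthesized comments using a depth counter.

    Simplifying assumption: quoted strings and backslash escapes are not
    handled, so this is only a sketch of the idea, not a real address parser.
    """
    out = []
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1          # entering a comment, or a nested one
        elif ch == ")" and depth > 0:
            depth -= 1          # leaving the innermost open comment
        elif depth == 0:
            out.append(ch)      # keep only characters outside comments
    return "".join(out)


print(strip_comments("john.smith(comment)@example.com"))
print(strip_comments("(a (nested) comment)john.smith@example.com"))
```

The counter is the piece a classic regular expression lacks: it tracks arbitrarily deep nesting with one integer.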

fasinfranco · on Nov 16, 2016

This one covers like 99.9% of the cases: [\w]\S+@\S+[a-z]+

manarth · on Nov 16, 2016

The pedant in me wants to point out that the TLD can be uppercase, as the domain part is case-insensitive.

In any case, it probably won't be too long before we get emoji TLDs, or UTF8 flags to replace the country-code :-D
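A rough check of the "99.9%" pattern against a few of the addresses quoted earlier in the thread. This assumes the pattern is applied to the whole string with `re.fullmatch`, which the original comment does not specify; `accepted` is a made-up helper name.

```python
import re

PATTERN = r"[\w]\S+@\S+[a-z]+"


def accepted(address, flags=0):
    """Return True if the whole address matches the pattern."""
    return re.fullmatch(PATTERN, address, flags) is not None


samples = [
    "john.smith@example.com",
    "_somename@example.com",
    "john.smith@EXAMPLE.COM",       # uppercase domain, as noted above
    '"Fred Bloggs"@example.com',    # quoted local-part containing a space
    "$A12345@example.com",          # local-part starting with a non-word character
]

for address in samples:
    print(address, accepted(address), accepted(address, re.IGNORECASE))
```

Under these assumptions, `re.IGNORECASE` fixes the uppercase case, but the quoted and `$`-prefixed addresses are still rejected because the first character must match `\w`.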

2T1Qka0rEiPr · on Nov 16, 2016

I love resources like this, and use them quite frequently when I need to write anything with moderate complexity. If you're using Python and writing long RegExs, I really recommend using a multi-line approach with re.VERBOSE to provide helpful comments to your future-self/co-workers.
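For anyone who hasn't seen it, a small sketch of what that looks like. The date pattern and the name `DATE` are just an illustrative example, not something from the site:

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored and `#` starts a
# comment, so each piece of the expression can be documented inline.
DATE = re.compile(
    r"""
    (?P<year>\d{4})   # four-digit year
    -
    (?P<month>\d{2})  # two-digit month
    -
    (?P<day>\d{2})    # two-digit day
    """,
    re.VERBOSE,
)

match = DATE.fullmatch("2016-11-16")
parts = match.groupdict()
print(parts)
```

One thing to watch for: in verbose mode a literal space or `#` in the pattern has to be escaped or put inside a character class.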

HCIdivision17 · on Nov 15, 2016

This is really slick. I've used it a few times just to get my capture groups tested right, where pattern matching is needed over parsing. The fact it color codes the Python named capture groups is just awesome: make a regex to match the stuff you need, then slam it into another string via % and you've just converted one string into another. There's tons of ways to do that, but when your patterns are fast and loose, it makes the intent a bit easier to show. (I also tend to use regex with comments, so they tend to be a bit more literate and long... well, at least not all line noise.)
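A minimal sketch of that named-group-plus-`%` trick, with a made-up log line format; `LOG` and `rewrite` are hypothetical names used only for illustration:

```python
import re

# Named groups capture the pieces we care about.
LOG = re.compile(r"(?P<user>\w+) logged in from (?P<host>\S+)")


def rewrite(line):
    """Reshape a matching line; return None when the line does not match."""
    m = LOG.search(line)
    if m is None:
        return None
    # groupdict() maps group names to captured text, which is exactly the
    # mapping that %-formatting with %(name)s placeholders expects.
    return "%(host)s <- %(user)s" % m.groupdict()


print(rewrite("alice logged in from 10.0.0.1"))
```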

glangdale · on Nov 15, 2016

This is really, really good. We work on regex ourselves (github.com/01org/hyperscan) and I will definitely point people towards this when they are trying to figure regex out.

The only regret I have is that many of the interesting regexes we see in the course of our work are not something which we can type into someone else's web site. Fortunately there are others who have looked at the idea of making an off-line version:

https://github.com/firasdib/Regex101/issues/76

spartanatreyu · on Nov 15, 2016

DevDocs.io handles this really well

Aldo_MX · on Nov 15, 2016

I have used this website a lot in the past and it has greatly helped me debug complex expressions. I would love to see an electron port in the future.

Already__Taken · on Nov 15, 2016

It would be nice to add a little 1-click test button to this that can look at some of my sample string and auto-fuzz the pattern and see if some unexpected results come up.

Regex is weird anyway, but it's often a surprise when some weird string that's close to what you thought you wanted comes up.

__krris · on Nov 16, 2016

Did anyone notice that the site is built using React? The UI is so interactive.

cableshaft · on Nov 16, 2016

I recently upped my Regex-fu significantly this year, and Regex101 is the main reason why. Also giving a presentation in the office on Regular Expressions helped too :).

metasyn · on Nov 15, 2016

I often teach people how to use regex w/in the context of my job and find this tool invaluable because it allows for PCRE in addition to the more common JS regex flavour.

bedros · on Nov 15, 2016

best way to test a regex is to use sublime search, and you'll see all matching strings inside the open text file interactively highlighted

_ao789 · on Nov 15, 2016

Weirdly enough I was actually playing around with this today instead of completing an even more boring jira task :) ..what a life

leommoore · on Nov 15, 2016

This is just the coolest way to test your regex; it is a great sandbox with lots of feedback to help you get it right.

pmlnr · on Nov 15, 2016

Donated once, will do more; the site saved the day too many times not to.