Regex 101 (regex101.com)
303 points by adamnemecek on Nov 15, 2016 | hide | past | favorite | 67 comments




This one is great too: http://regexr.com/


This is my go-to when trying to figure out why a regex isn't working how it should.


I like Explain Regular Expressions [0] too.

[0] http://rick.measham.id.au/paste/explain.pl


I've always liked http://perldoc.perl.org/perlre.html , obviously it's perl specific but most of it can be applied to other implementations.


In perl you can use:

    use re 'Debug';

or even better

    use re 'debugcolor';

and watch the state machine matching.

Very useful for debugging.

http://perldoc.perl.org/re.html#%27Debug%27-mode

Another great resource is:

http://search.cpan.org/dist/Regexp-Debugger/lib/Regexp/Debug...



Thanks to those who posted additional resources!


This is also nice for visualizing regular expressions. It updates the URL, so you can add it as a comment in your code:

https://regexper.com


It's all fun and games until you find out that the meaning of a regex is implementation specific. I've found that a Python regex will accept an unspecified newline byte/character before the $, while std::regex in GNU C++11 will not.
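The Python half of that is easy to reproduce. A minimal sketch, assuming the standard `re` module; the sample string is made up for illustration:

```python
import re

# In Python, `$` matches at the very end of the string AND just before
# a single trailing newline.
loose = re.compile(r"abc$")
# `\Z` matches only at the very end of the string.
strict = re.compile(r"abc\Z")

with_newline = "abc\n"

loose_hit = loose.search(with_newline) is not None    # `$` tolerates the "\n"
strict_hit = strict.search(with_newline) is not None  # `\Z` does not

print(loose_hit, strict_hit)  # True False
```

So if the trailing newline matters, `\Z` (or `re.fullmatch`) is the safer choice on the Python side.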


This is very true and it always bugs me when some application or API supports regex, but the documentation doesn't specify the grammar, usually saying it's "standard regular expressions" - which standard? Posix? PCRE? Something else? Almost every flavour has some subtle differences and it's really, really important to know what you are dealing with.

I do feel like these subtle differences are the main cause of bugs in programs containing regexes; compounded by the fact they aren't super-readable, especially for people who don't use them frequently.

Tools like this are IMO really valuable for code reviews and it's nice to see that this one does have a flavour switch on the LHS with some of the main implementations.


I also can never remember which characters are magic and which are literal in different implementations. This is mostly Vim's fault, where the rules are impossible to remember and often conflict with other implementations. Given that it's the most common regex thing I use day-to-day, it's very annoying.


Yeah, I agree, vim's default regex syntax is infuriatingly inconsistent. I pretty much always start my REs with "/\v"[0], to the point it's in my muscle memory: I'll find myself looking at that in the cmdline before I've even drafted the pattern in my head. (Then sometimes I'll realise I just want to find an exact string full of symbols, and have to back up and capitalize the "v".)

... I should probably come up with some mappings so the \v is inserted automatically.

[0] http://vimdoc.sourceforge.net/htmldoc/pattern.html#/\v


For those that don't regex: I originally learned regex just so I could do complex find/replace in Notepad++ (I work with large data sets). I can't begin to say how many hours of work it's saved me over the years, and how much money for that matter.

The best / easiest way that I've taught regex is to pop open a text editor and start searching for stuff. Start by searching for some letters, then a string with spaces, then a line break. From there I point people towards stuff like Apache config file conditionals / URL re-writes etc.


Best sources about regexps?

I thought I knew regexps, then I read "Mastering regular expressions". After that I was embarrassed. :-)


hmmm not sure, I'd also like a good resource because mine is not really the "mastering" type resource


Mastering Regular Expressions really is the seminal text, but I highly suggest reading the O'Reilly sed & awk book as well. They include the full history behind the evolution of the original line editor ed -> sed & g/re/p (etymologically derived from g[lobal]/re[gular expression]/p[rint]) -> awk. The original vi also derived a lot of its commands from 'ex' (the underlying editor, based on the original ed).

All history aside, 1: Sed&awk 2: Mastering Regexp 3: The vim book has a great chapter on regular expressions (it's been ~10 years since I've touched it, but I remember being gifted a copy from a colleague of my father when I was in my teens and it helped me grok it)

The glibc programming manual is actually really comprehensive. kernel.org/docs/man-pages/ gives a pretty in-depth analysis of POSIX.2 regex. I'm sure the Perl docs also have PCRE very well documented.


Complement with https://www.debuggex.com to also get an interactive finite automaton (this can really help with understanding what's going on)


Many hours spent here, trying to create the ultimate email regex, while telling my PMs and managers that there was no such thing.


Been there, done that. There's only one way to check email addresses: /.+@.+/

Anything else is going to give you a false negative sooner or later. Yeah, you can spend hours framing a complicated expression that carefully tests for exactly what the current bucket of RFCs specify, but you're guaranteed to run into an MTA somewhere that disagrees with you, and you can argue until the cows come home that you're right and they're wrong, but it's still you that's going to have to change to accommodate them in the end.


Historically there have been many valid RFC 822 addresses without an @ in them. I don't know how many can still be delivered to though.


I would think you'd use: /.+@.+\..+/


nope. root@localhost is a valid mail address.
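For anyone who wants to see the difference, here is a small sketch comparing the two patterns from this subthread. It assumes Python's `re` module and whole-string matching via `fullmatch`; the `passes` helper and the sample addresses other than root@localhost are made up:

```python
import re

# The two patterns proposed in the thread, applied to the whole string.
any_at = re.compile(r".+@.+")          # something, an @, something
needs_dot = re.compile(r".+@.+\..+")   # additionally demands a dot after the @

def passes(pattern: re.Pattern, address: str) -> bool:
    """Return True if the whole address matches the pattern."""
    return pattern.fullmatch(address) is not None

for address in ("jdoe@example.com", "root@localhost", "no-at-sign"):
    print(address, passes(any_at, address), passes(needs_dot, address))
```

Only the dotted variant turns root@localhost away; both reject a string with no @ at all.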


How often are you validating email addresses for localhost!? Those are pretty much never a valid email for >>>99% of applications


That's the whole point of the original post: don't bother thinking about it, let the mta do that for you. The "@" is a courtesy to make sure people didn't mistakenly put their name in. Don't validate, just do a sanity check.


And people are just as likely to type a valid but incorrect email address as an invalid email address.


That's why you also implement an email verification method if you actually care that it's a correct email as well as valid.


Right, so if you're not going to accept the full RFC then where do you draw the line? Your arbitrary line will assuredly be different from someone else's.


In the back of the Camel book there is an email matching regex. It takes the entire page. It probably doesn't handle some Unicode edge cases properly.

Validating emails with regexes is one of those things that seems like it should be fairly straightforward, but thanks to the language in the RFC it is actually a nightmare of rabbit holes filled with landmines.


That would be the one shown here:

http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html

I don't know about unicode, but I do note it says it doesn't handle comments (because, apparently, they can be nested - and perl's REs lack recursion).


Perl has supported recursion in regexes since v5.10. The giant regex is not from the Camel book but from O'Reilly's Mastering Regular Expressions, which does give the example in Perl but was printed before Perl v5.10 was released. (The 3rd edition of Mastering Regular Expressions was released in 2006; Perl v5.10 was released in 2007.) I believe the regex was published in the very first edition.


Ah yes, sorry, it would have been more correct to say "perl's REs lacked recursion at the time". I use PCRE pretty frequently, so knew that it's supported recursion for some time; but it's been a while since I've done anything non-trivial with perl itself, so couldn't remember if it too had that functionality or I was just getting their feature-sets confused. Evidently I guessed wrong - thanks for the correction.


RFC 822 is the Internet's cancer.

Really. Email addresses should have never been allowed to have comments.

Since every new RFC maintains compatibility with this one, we'll probably be stuck with these poor choices for decades.


RFC 822 is what happens when you try to standardize a system in such a way that it is backwards compatible with every home grown system on the planet.

Have further RFCs not deprecated most of the braindamage from 822? If not, why? Are there really people who are still trying to use such horrible abominations for their email address?


Email addresses allow comments? Who the heck thought that was a smart idea!?

I guess this is what you're referring to: James Doe <jdoe@a.com>

The email - which is inside the <> - should be parsed correctly, while the rest should be treated as a "comment" [1].

[1]: https://www.cs.tut.fi/~jkorpela/rfc/822addr.html


No No.

Stuff like:

* comments are allowed with parentheses at either end of the local-part; e.g. john.smith(comment)@example.com and (comment)john.smith@example.com are both equivalent to john.smith@example.com. [1]

* Also have a look at quotes

[1] https://en.wikipedia.org/wiki/Email_address#Local-part


These are all valid email addresses!

    * "Abc\@def"@example.com

    * "Fred Bloggs"@example.com

    * "Joe\\Blow"@example.com

    * "Abc@def"@example.com

    * customer/department=shipping@example.com

    * $A12345@example.com

    * !def!xyz%abc@example.com

    * _somename@example.com
http://haacked.com/archive/2007/08/21/i-knew-how-to-validate...
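To make that concrete, here is a rough sketch that runs the addresses listed above through two checks. `TYPICAL` is a hypothetical pattern of the kind often seen in sign-up forms (it is not from the linked article), and `LOOSE` is the `.+@.+` sanity check suggested elsewhere in the thread:

```python
import re

# The addresses from the list above, as raw strings so backslashes survive.
ADDRESSES = [
    r'"Abc\@def"@example.com',
    r'"Fred Bloggs"@example.com',
    r'"Joe\\Blow"@example.com',
    r'"Abc@def"@example.com',
    r'customer/department=shipping@example.com',
    r'$A12345@example.com',
    r'!def!xyz%abc@example.com',
    r'_somename@example.com',
]

# Hypothetical "typical" validation pattern (an assumption for illustration).
TYPICAL = re.compile(r"[A-Za-z0-9._+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+")
# The minimal sanity check: something, an @, something.
LOOSE = re.compile(r".+@.+")

rejected = [a for a in ADDRESSES if not TYPICAL.fullmatch(a)]
all_pass_loose = all(LOOSE.fullmatch(a) for a in ADDRESSES)

print(len(rejected), "of", len(ADDRESSES), "rejected by TYPICAL")
print("all pass LOOSE:", all_pass_loose)
```

Under these assumptions the stricter pattern turns away seven of the eight, while the loose check lets all of them through.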


what.


Guess I'm in today 10000 who learned something today.


I can't understand this sentence.

At first I thought you were a bot. Then that your English was not very good.

Looking at your history clearly both are false.

What did you mean?


They're referring to this: https://xkcd.com/1053/


Good catch, forgot about that one!


Now it makes sense.

Thank you!


I think he meant: "I guess I'm among the 10000 who learned something today".


> Implementing validation with regular expressions somewhat pushes the limits of what it is sensible to do with regular expressions

(rest of the page is filled with line noise)

LOL


And the irony is that regular expressions, by definition, can't recurse. Nobody's gonna rename them to turing expressions, though.
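The nested comments are the part that needs counting rather than pattern matching. A rough sketch of the counting approach, in Python; `strip_comments` is a hypothetical helper, and it deliberately ignores quoted strings and backslash escapes, which a real address parser would also have to handle:

```python
def strip_comments(text: str) -> str:
    """Remove (possibly nested) parenthesised comments from text.

    Simplification: quoted strings and backslash escapes are not handled.
    """
    depth = 0
    kept = []
    for ch in text:
        if ch == "(":
            depth += 1          # entering a comment, possibly a nested one
        elif ch == ")" and depth > 0:
            depth -= 1          # leaving the innermost open comment
        elif depth == 0:
            kept.append(ch)     # keep only characters outside comments
    return "".join(kept)

print(strip_comments("john.smith(comment (nested))@example.com"))
# john.smith@example.com
```

The `depth` counter is exactly the piece of state a regular expression in the textbook sense cannot keep for unbounded nesting.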


The issue is not regexing, it's the lack of a real standard for the email format :)


    ^(?("")(""[^""]+?""@)|(([0-9a-z]((\.(?!\.))|[-!#\$%&'\*\+/=\?\^`\{\}\|~\w])*)(?<=[0-9a-z])@))(?(\[)(\[(\d{1,3}\.){3}\d{1,3}\])|(([0-9a-z][-\w]*[0-9a-z]*\.)+[a-z0-9]{2,17}))$
This and other email validation here: https://github.com/neosmart/web/blob/master/Web%20Toolkit/Em...


Is that Chinese character compatible? (half-serious)


What I don't understand is, don't programs like thunderbird or mutt have a module or function for email sanity check, and why hasn't that been ported as a module for every language imaginable?


Being the maintainer of that would be the worst programming job possible.


I'm going to have to go ahead and agree with a comment in the javax internet address code here: if you insist on checking against the RFC, you should really just give up and use a lexer.


This one covers like 99.9% of the cases: [\w]\S+@\S+[a-z]+


The pedant in me wants to point out that the TLD can be uppercase, as the domain part is case-insensitive.

In any case, it probably won't be too long before we get emoji TLDs, or UTF8 flags to replace the country-code :-D
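The uppercase point is quick to check. A small sketch, assuming Python's `re` and whole-string matching with `fullmatch` (the original comment doesn't say how the pattern is anchored); the `accepts` helper and the sample addresses are made up:

```python
import re

PATTERN = r"[\w]\S+@\S+[a-z]+"   # the pattern quoted above, unchanged

def accepts(address: str, flags: int = 0) -> bool:
    """Return True if the whole address matches PATTERN."""
    return re.fullmatch(PATTERN, address, flags) is not None

print(accepts("jdoe@example.com"))                  # True: lowercase TLD
print(accepts("jdoe@EXAMPLE.COM"))                  # False: uppercase TLD
print(accepts("jdoe@EXAMPLE.COM", re.IGNORECASE))   # True: case-insensitive
print(accepts("a@b.com"))                           # False: one-char local part
```

The last line is a side effect of `[\w]\S+` requiring at least two characters before the @.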


I love resources like this, and use them quite frequently when I need to write anything with moderate complexity. If you're using Python and writing long RegExs, I really recommend using a multi-line approach with re.VERBOSE to provide helpful comments to your future-self/co-workers.
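For anyone who hasn't seen it, a minimal sketch of the `re.VERBOSE` style; the date pattern and group names are made up for illustration:

```python
import re

# A hypothetical pattern for ISO-style dates such as "2016-11-15".
# With re.VERBOSE, whitespace in the pattern is ignored and `#` starts a comment.
DATE = re.compile(
    r"""
    (?P<year>\d{4})    # four-digit year
    -
    (?P<month>\d{2})   # two-digit month
    -
    (?P<day>\d{2})     # two-digit day
    """,
    re.VERBOSE,
)

match = DATE.fullmatch("2016-11-15")
print(match.group("year"), match.group("month"), match.group("day"))
# 2016 11 15
```

One thing to keep in mind: a literal space or `#` in a verbose pattern has to be escaped or put in a character class.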


This is really slick. I've used it a few times just to get my capture groups tested right, where pattern matching is needed over parsing. The fact it color codes the Python named capture groups is just awesome: make a regex to match the stuff you need, then slam it into another string via % and you've just converted one string into another. There's tons of ways to do that, but when your patterns are fast and loose, it makes the intent a bit easier to show. (I also tend to use regex with comments, so they tend to be a bit more literate and long... well, at least not all line noise.)
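Roughly what that looks like in practice, as a sketch; the log line, pattern, and output format are all hypothetical:

```python
import re

# Hypothetical log line and pattern, purely for illustration.
LINE = "2016-11-15 ERROR disk full"
PATTERN = re.compile(r"(?P<date>\S+) (?P<level>[A-Z]+) (?P<message>.*)")

match = PATTERN.fullmatch(LINE)
# groupdict() maps group names to matched text, which %-formatting can consume.
converted = "[%(level)s] %(message)s (on %(date)s)" % match.groupdict()
print(converted)  # [ERROR] disk full (on 2016-11-15)
```

The named groups double as documentation: the format string says which captured piece goes where.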


This is really, really good. We work on regex ourselves (github.com/01org/hyperscan) and I will definitely point people towards this when they are trying to figure regex out.

The only regret I have is that many of the interesting regexes we see in the course of our work are not something which we can type into someone else's web site. Fortunately there are others who have looked at the idea of making an off-line version:

https://github.com/firasdib/Regex101/issues/76


DevDocs.io handles this really well


I have used this website a lot in the past, and it has greatly helped me debug complex expressions. I would love to see an Electron port in the future.


It would be nice to add a little 1-click test button to this that can look at some of my sample strings, auto-fuzz them against the pattern, and see if some unexpected results come up.

Regex is weird anyway, but it's often a surprise when some weird string that's close to what you thought you wanted comes up.
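A rough sketch of what such an auto-fuzz step might look like, assuming simple single-character mutation in Python; `fuzz_matches`, the sample string, and the deliberately buggy pattern are all hypothetical:

```python
import random
import re
import string

def fuzz_matches(pattern, sample, rounds=200, seed=0):
    """Mutate one character of `sample` per round and collect the
    mutants that the pattern still fully matches."""
    rng = random.Random(seed)              # seeded so runs are repeatable
    compiled = re.compile(pattern)
    alphabet = string.ascii_letters + string.digits + string.punctuation + " "
    survivors = set()
    for _ in range(rounds):
        pos = rng.randrange(len(sample))   # sample must be non-empty
        mutant = sample[:pos] + rng.choice(alphabet) + sample[pos + 1:]
        if mutant != sample and compiled.fullmatch(mutant):
            survivors.add(mutant)
    return sorted(survivors)

# Hypothetical pattern meant to match a version number like "1.2.3";
# the dots are unescaped by mistake, so they match any character.
surprises = fuzz_matches(r"\d.\d.\d", "1.2.3")
print(surprises[:5])
```

Any mutant that still matches is a candidate "close but not what I wanted" string worth a second look.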


Did anyone notice that the site is built using React? The UI is so interactive.


I upped my Regex-fu significantly this year, and Regex101 is the main reason why. Giving a presentation in the office on regular expressions helped too :).


I often teach people how to use regex w/in the context of my job and find this tool invaluable because it allows for PCRE in addition to the more common JS regex flavour.


The best way to test a regex is to use Sublime's search: you'll see all matching strings inside the open text file interactively highlighted.


Weirdly enough I was actually playing around with this today instead of completing an even more boring jira task :) ..what a life


This is just the coolest way to test your regex. It's a great sandbox with lots of feedback to help you get it right.


Donated once, will do more; the site saved the day too many times not to.



