It's all fun and games until you find out that the meaning of a regex is implementation specific. I've found that a Python regex will accept an unspecified newline byte/character before the $, while std::regex in GNU C++11 will not.
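For anyone who wants to see the Python side of this, here is a minimal sketch using only the standard `re` module. The pattern and test strings are made up for illustration; the point is that `$` (without `re.MULTILINE`) also matches just before a single trailing newline, while `\Z` and `re.fullmatch` do not.

```python
import re

# Python's `$` (without re.MULTILINE) matches at the very end of the string
# AND just before a single trailing newline.
pattern = re.compile(r"^abc$")

matches_plain = pattern.search("abc") is not None       # plain string
matches_newline = pattern.search("abc\n") is not None   # the surprising case
matches_two = pattern.search("abc\n\n") is not None     # two trailing newlines

# Stricter alternatives: \Z anchors at the absolute end of the string,
# and fullmatch requires the pattern to consume the whole string.
strict_z = re.search(r"^abc\Z", "abc\n") is not None
strict_full = re.fullmatch(r"abc", "abc\n") is not None

print(matches_plain, matches_newline, matches_two, strict_z, strict_full)
```

If the trailing newline matters to you, `\Z` or `fullmatch` is probably the safer default in Python.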
This is very true and it always bugs me when some application or API supports regex, but the documentation doesn't specify the grammar, usually saying it's "standard regular expressions" - which standard? POSIX? PCRE? Something else? Almost every flavour has some subtle differences and it's really, really important to know what you are dealing with.
I do feel like these subtle differences are the main cause of bugs in programs containing regexes; compounded by the fact they aren't super-readable, especially for people who don't use them frequently.
Tools like this are IMO really valuable for code reviews and it's nice to see that this one does have a flavour switch on the LHS with some of the main implementations.
I also can never remember which characters are magic and which are literal in different implementations. This is mostly Vim's fault, where the rules are impossible to remember and often conflict with other implementations. Given that it's the most common regex thing I use day-to-day, it's very annoying.
Yeah, I agree, vim's default regex syntax is infuriatingly inconsistent. I pretty much always start my REs with "/\v"[0], to the point it's in my muscle memory: I'll find myself looking at that in the cmdline before I've even drafted the pattern in my head. (Then sometimes I'll realise I just want to find an exact string full of symbols, and have to back up and capitalize the "v".)
... I should probably come up with some mappings so the \v is inserted automatically.
For those that don't regex: I originally learned regex just so I could do complex find/replace in Notepad++ (I work with large data sets). I can't begin to say how many hours of work it's saved me over the years, and how much money for that matter.
The best / easiest way that I've taught regex is to pop open a text editor and start searching for stuff. Start by searching for some letters, then a string with spaces, then a line break. From there I point people towards stuff like Apache config file conditionals / URL re-writes etc.
Mastering Regular Expressions really is the seminal text, but I highly suggest reading the O'Reilly Sed/awk book as well. They include the full history behind the evolution of the original line editor ed -> sed & g/re/p (etymologically derived from the ed command g[lobal]/re[gular expression]/p[rint]) -> awk. The original vi also derived a lot of its commands from 'ex' (the underlying editor, based on the original ed).
All history aside,
1: Sed&awk
2: Mastering Regexp
3: The vim book has a great chapter on regular expressions (it's been ~10 years since I've touched it, but I remember being gifted a copy by a colleague of my father when I was in my teens and it helped me grok it)
The glibc programming manual is actually really comprehensive. kernel.org/docs/man-pages/ gives a pretty in-depth analysis of POSIX.2 regex. I'm sure the Perl docs also have Perl-style regexes (the flavour PCRE follows) very well documented.
Been there, done that. There's only one way to check email addresses: /.+@.+/
Anything else is going to give you a false negative sooner or later. Yeah, you can spend hours framing a complicated expression that carefully tests for exactly what the current bucket of RFCs specify, but you're guaranteed to run into an MTA somewhere that disagrees with you, and you can argue until the cows come home that you're right and they're wrong, but it's still you that's going to have to change to accommodate them in the end.
That's the whole point of the original post: don't bother thinking about it, let the MTA do that for you. The "@" is a courtesy to make sure people didn't mistakenly put their name in. Don't validate, just do a sanity check.
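As a rough sketch of what that sanity check could look like in Python: the function name `looks_like_email` is hypothetical, and the only thing taken from the thread is the `.+@.+` pattern itself. Anything beyond that is left to the MTA.

```python
import re

# The thread's pattern: something, an "@", something. Nothing more.
SANITY = re.compile(r".+@.+")


def looks_like_email(text: str) -> bool:
    """Sanity check only; real validation means sending mail and seeing what happens."""
    return SANITY.search(text) is not None


print(looks_like_email("john.smith(comment)@example.com"))  # odd but legal address
print(looks_like_email("John Smith"))                       # a name, not an address
```

It deliberately accepts plenty of garbage; the goal is only to catch someone typing their name into the wrong box.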
Right, so if you're not going to accept the full RFC then where do you draw the line? Your arbitrary line will assuredly be different from someone else's.
In the back of the Camel book there is an email matching regex. It takes the entire page. It probably doesn't handle some Unicode edge cases properly.
Validating emails with regexes is one of those things that seems like it should be fairly straightforward, but thanks to the language in the RFC it is actually a nightmare of rabbit holes filled with landmines.
I don't know about unicode, but I do note it says it doesn't handle comments (because, apparently, they can be nested - and perl's REs lack recursion).
Perl has supported recursion in regex since around v5.10. The giant regex is not from the Camel book, but from O'Reilly's Mastering Regular Expressions, which does give the example in Perl, but was printed before v5.10 of Perl was released. (The 3rd ed. of Mastering Regular Expressions was released in 2006; Perl v5.10 was released in late 2007.) I believe the regex was published in the very first edition.
Ah yes, sorry, it would have been more correct to say "perl's REs lacked recursion at the time". I use PCRE pretty frequently, so knew that it's supported recursion for some time; but it's been a while since I've done anything non-trivial with perl itself, so couldn't remember if it too had that functionality or I was just getting their feature-sets confused. Evidently I guessed wrong - thanks for the correction.
RFC 822 is what happens when you try to standardize a system in such a way that it is backwards compatible with every home grown system on the planet.
Have further RFCs not deprecated most of the braindamage from 822? If not, why? Are there really people who are still trying to use such horrible abominations for their email address?
* comments are allowed with parentheses at either end of the local-part; e.g. john.smith(comment)@example.com and (comment)john.smith@example.com are both equivalent to john.smith@example.com. [1]
What I don't understand is, don't programs like thunderbird or mutt have a module or function for email sanity check, and why hasn't that been ported as a module for every language imaginable?
I'm going to have to go ahead and agree with a comment in the javax internet address code here: if you insist on checking against the RFC, you should really just give up and use a lexer.
I love resources like this, and use them quite frequently when I need to write anything with moderate complexity. If you're using Python and writing long RegExs, I really recommend using a multi-line approach with re.VERBOSE to provide helpful comments to your future-self/co-workers.
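A small sketch of the `re.VERBOSE` style, for anyone who hasn't seen it. The date pattern and the name `ISO_DATE` are invented for illustration; the flag itself just makes the engine ignore unescaped whitespace in the pattern and treat `#` as the start of a comment.

```python
import re

# Hypothetical example: an ISO-style date. With re.VERBOSE, whitespace in the
# pattern is ignored and "#" starts a comment, so the regex can be laid out
# one piece per line.
ISO_DATE = re.compile(
    r"""
    (?P<year>  \d{4} )   # four-digit year
    -
    (?P<month> \d{2} )   # two-digit month (range not checked)
    -
    (?P<day>   \d{2} )   # two-digit day (range not checked)
    """,
    re.VERBOSE,
)

m = ISO_DATE.fullmatch("2016-03-14")
if m:
    print(m.group("year"), m.group("month"), m.group("day"))
```

One thing to watch for: a literal space or `#` in the pattern has to be escaped or put in a character class once the flag is on.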
This is really slick. I've used it a few times just to get my capture groups tested right, where pattern matching is needed over parsing. The fact it color codes the Python named capture groups is just awesome: make a regex to match the stuff you need, then slam it into another string via % and you've just converted one string into another. There's tons of ways to do that, but when your patterns are fast and loose, it makes the intent a bit easier to show. (I also tend to use regex with comments, so they tend to be a bit more literate and long... well, at least not all line noise.)
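Roughly what that named-group-plus-`%` trick looks like, as a sketch. The log line format, the `LINE` pattern and the `rewrite` helper are all made up here; the only real machinery is `groupdict()` feeding `%(name)s` placeholders.

```python
import re

# Hypothetical log line format, invented for illustration.
LINE = re.compile(r"(?P<user>\w+) logged in from (?P<host>[\w.]+)")


def rewrite(line: str) -> str:
    """Turn one string shape into another via named capture groups."""
    m = LINE.search(line)
    if m is None:
        return line  # leave lines that don't match untouched
    # groupdict() maps group names to matched text, which is exactly
    # what the %(name)s placeholders expect.
    return "%(host)s: login by %(user)s" % m.groupdict()


print(rewrite("alice logged in from example.com"))
```

The template string doubles as documentation of what the pattern is supposed to pull out.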
This is really, really good. We work on regex ourselves (github.com/01org/hyperscan) and I will definitely point people towards this when they are trying to figure regex out.
The only regret I have is that many of the interesting regexes we see in the course of our work are not something which we can type into someone else's web site. Fortunately there are others who have looked at the idea of making an off-line version:
I have used this website a lot in the past, and it has greatly helped me debug complex expressions. I would love to see an Electron port in the future.
It would be nice to add a little 1-click test button to this that can look at some of my sample string and auto-fuzz the pattern and see if some unexpected results come up.
Regex is weird anyway, but it's often a surprise when some weird string that's close to what you thought you wanted comes up.
I upped my regex-fu significantly this year, and Regex101 is the main reason why. Giving a presentation in the office on regular expressions helped too :).
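To make the auto-fuzz idea a bit more concrete, here is a toy sketch in Python. It is not something the site offers; `near_miss_matches` is a hypothetical helper that tries every single-character deletion and duplication of a sample string and reports the variants the pattern still accepts.

```python
import re


def near_miss_matches(pattern: str, sample: str) -> list:
    """Return single-edit variants of `sample` that still fully match `pattern`.

    Deterministic toy fuzzer: delete each character, then double each character.
    """
    compiled = re.compile(pattern)
    variants = set()
    for i in range(len(sample)):
        variants.add(sample[:i] + sample[i + 1:])          # delete char i
        variants.add(sample[:i] + sample[i] + sample[i:])  # duplicate char i
    variants.discard(sample)  # the unmodified sample is not a near miss
    return sorted(v for v in variants if compiled.fullmatch(v))


# A loose pattern lets through strings you probably didn't intend.
for variant in near_miss_matches(r"[\w.]+@[\w.]+", "a@b.c"):
    print(variant)
```

A real version would want smarter mutations (swapped characters, other character classes, random insertions), but even this much tends to turn up a surprise or two.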
I often teach people how to use regex w/in the context of my job and find this tool invaluable because it allows for PCRE in addition to the more common JS regex flavour.
http://regexone.com/
http://www.rexegg.com/
http://eloquentjavascript.net/09_regexp.html
http://www.cheatography.com/davechild/cheat-sheets/regular-e...
http://www.smashingmagazine.com/2009/06/essential-guide-to-r...
http://www.smashingmagazine.com/2009/05/introduction-to-adva...