Stop validating email addresses with your complex regex (davidcelis.com)
184 points by davidcelis on Sept 6, 2012 | 148 comments



A couple years ago Evan Phoenix (of rubinius) and I collaborated (by which I mean he wrote the grammar and I did almost nothing) on a REAL RFC compliant email address validator using a PEG for parsing: https://github.com/larb/email_address_validator

I don't know that anyone uses it, or would even want to use it. It was a fun project, but I certainly wouldn't use it in an app (unless it was an MTA or MUA). What comprises a valid email address is way too broad.

I think there are reasonable regexes for email addresses FWIW. The RFC(s) are stupid and broken, and you can hit 99.99999% of addrs and provide useful errors back to the users with a regex.


Sure, I mean `me@yahoo` is an RFC-compliant email address, it refers to a local server 'yahoo'. However in your online app it's almost certainly an error if this email address turns up in a registration form.

Don't test for an RFC-compliant address if you don't want to accept all RFC-compliant addresses. Being able to send an email is a much better test because it matches what you're going to use the email address for in your app.


But your way, if the user makes a typo like "foo@@bar,com" then he will expect to receive an email but won't. Better UX in my opinion is to do validation before sending, both client-side and server-side (esp. good for REST JSON APIs), then if these are all good send the welcome email.


I wouldn't mind using a simple regex or validator to check the e-mail addr for validity.

It wouldn't be RFC-compliant, but it would catch 99% of typos.

Instead of being an error when the e-mail fails validation though, it would say something like: "your e-mail does not appear valid; please double check your entry. You will be sent an activation e-mail; click [Continue] if you're sure the address is valid."

Basically, if it fails the "99%" test, let the user decide whether their e-mail is in the 1% or not.


I see that the API you feel most people would want is validate_2822_addr (aka validate_addr), which validates an addr-spec (as people should really not be typing angle addresses with display names into forms asking for their e-mail address ;P).

However, that specification, and the implementation you provide, is really designed for parsing e-mail address headers (as you say: for an MTA/MUA), and so contains a bunch of properties specific to structured MIME fields that really have nothing to do with e-mail addresses.

Instead, if you are verifying an e-mail address that someone types into a form, you probably are looking for "the kind of e-mail address that SMTP would accept for delivery", and that is covered by a different standard with a different and unrelated grammar.

Specifically, you implemented RFC 2822, the successor to RFC 822 that has now been obsoleted by RFC 5322, the standard on "Internet Message Format" (in essence, MIME). The related RFC 2821, the successor to RFC 821 that is now obsoleted by RFC 5321, is for SMTP.

For an example of the kinds of differences this would cause, RFC 5322 (with errata) believes that ""@example.com is invalid (by errata), but hello(ignore)@example.com is valid (MIME comment); RFC 5321, on the other hand, believes the exact opposite.

(edit: When I realized that I should probably write a blog post about this, given how much time I've put into implementing this stuff recently, I realized that there was more to say on this general subject, and I'm including it below.)

That said, I will go even further: these formats are designed for escaping e-mail addresses in the context of a larger standard and protocol, one that might already have special characters. This is why they contain so much quoting support.

This is then why the grammar is often so highly restricted for things that don't need to be quoted: given that an @ cannot be found in a domain name, you really shouldn't need to quote anything to the left of the @ to get a valid e-mail address.

However, "(" is a special character in a MIME field (begins a comment), and thereby if you want to include it in the local part of an e-mail address, you will need to escape it somehow; the same is true of things like whitespace, commas, or angle brackets.

The user typing the e-mail address into the form, however, isn't dealing with these restrictions: asking him to escape special characters in his e-mail address seems silly: one might as well be asking people to HTML escape their username in the username field.

That said, there is then a separate RFC 3696 which talks about the semantics of contemporary e-mail addresses and how one might go about validating them, and it includes the idea of quoting in its implementation (so maybe it believes that RFC 5321 is king).


Best comment ever. :) I feel that's something that's always missed in these discussions. Users are not entering RFC compliant email header values into your form. Maybe my next web app will make people base64 encode their name, and submit it in =?utf-8?B?..?= format.


I'm the author of the Rails wrapper for the big chunk of regex code; you're correct, it is for use cases that are akin to an MUA/MTA.

http://github.com/sixarm/sixarm_ruby_email_address_validatio...

We use a combination of client-side JavaScript validation and server-side validation in Rails. Typical server-side validation is for REST JSON API calls by third-party apps, and also for parsing freeform text fields like "tell your friends about us".

Original code is by Tim Fletcher & Cal Henderson.


The biggest argument for not validating email addresses shouldn't be that "it's hard". It should be that email providers don't always follow RFC guidelines. At my old job, we had this really weird bug that, after a lot of work, we tracked down to users who legitimately had Greek letters in their email addresses.

If there are legitimate emails that don't follow RFC, you should absolutely allow users to enter non-RFC-compliant email addresses.


It's not hard, though. Anybody can go online and search for an email validating regex, but as some have pointed out, many are too strict and don't allow for, say, tagged email addresses (email+tag@example.com). There's email validation, and there's overkill. That's what I was trying to get at.


The difficulty of writing regexes for email validation wasn't the main point - the main point was that emails don't follow spec so any regex based on that is inherently wrong. However, it seems like saying "most email validation regexes are broken" is equivalent to saying "email validation regex is hard"


They do most likely follow the standard, just a different standard than the one you're thinking of. E.g. RFC6530.


This technique has been hinted at, and I don't claim it as my own creation. I think the easiest (maybe not the best) route for both devs and users is:

[submit]

does email address have a '@' and a '.' after that?

yes -> send a validation email.

no -> Hey there, [email] seems like an odd address, are you sure?

       yes -> send a validation email.

       no -> restart

I think this does a reasonable job of staying out of everyone's way, and it catches a large percentage of the actual typographical/user error email entries.

[formatting edit]
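
For anyone who wants the flow above as code, here is a minimal sketch in Python. The function name `next_step` and the `user_confirmed` flag are made up for illustration; the only check is the one described (an '@' followed somewhere by a '.'), and a failed check never blocks the user, it only asks for confirmation.

```python
def next_step(email: str, user_confirmed: bool = False) -> str:
    """Return "send" (send the validation email) or "confirm" (ask "are you sure?")."""
    # Split on the first '@'; `at` is empty when there is no '@' at all.
    _, at, domain = email.partition("@")
    looks_typical = at == "@" and "." in domain
    # An odd-looking address still goes through once the user says it is right.
    if looks_typical or user_confirmed:
        return "send"
    return "confirm"
```

With this, `john@dk` gets the "are you sure?" prompt once and is accepted when the form comes back with `user_confirmed=True`.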


There used to be someone who had the email address john@dk (maybe it wasn't john). MX records on TLDs are frowned upon but not technically invalid.


This is precisely the reason why you let the user bypass your check with a confirmation. It's likely that most of your users who enter something without a period made a mistake. If someone is using an address like 'john@dk', I think they'd expect a message saying 'your email is invalid'; and seeing a message that just says, 'are you sure?' with a 'yes' option would be perfectly acceptable.

Edit: sorry for the echo, bigiain - I saw your post right after adding the comment.


There is an MX record for ai, so the dot after the @ is strictly not required:

  ;; ANSWER SECTION:
  ai.			14400	IN	MX	10 mail.offshore.ai.

  ;; ADDITIONAL SECTION:
  mail.offshore.ai.	14400	IN	A	209.59.119.34


Yeah, good luck to whoever uses that address. I would expect many MTAs to have problems delivering to that, let alone web form validation.


Which MTAs do you suspect wouldn't deliver it? I've used internal zones of that form extensively without issue on all the common OSS MTAs (postfix, sendmail, exim).

I'm unaware of any common MTA software which wouldn't handle it.


Several of the hosted ESPs I used couldn't handle .aero and .mobi at the time they came out.


imperialWicket's approach would still work fine in that case when john@dk clicks the "Yes, that's really my email address" button - and I suspect he'd be pleasantly surprised to see it work, I'll bet he's got a very high expectation of that address failing completely on most websites...

I have a cow-orker whose first name is "G" - he's very used to poorly-validating websites claiming that his firstname is "wrong".


Been a long time since I last saw "cow-orker". Or USENET, for that matter...


Given by his parents?


Yup.

I've got another friend/acquaintance who's got a fully legal (self chosen) single name. No first/middle/last name, just a "mononym". He has the expected hilarious outcomes with things like Google's "real name policy".


A prominent example of an interesting name is Caterina Fake.


He? So it's not Cher?


Nope. Can't even claim "acquaintance" with Cher. Or Prince.


One of the guys that managed the Norwegian (no.) USENET hierarchy and used to be responsible for newsgroup creation for it used to go under the nickname Dr. No. Whether because of the no prefix on the hierarchy, or because people liked to complain about it whenever he rejected a proposed group, I don't know. But people repeatedly suggested campaigns to get the .no NIC to let him have the address dr@no. Never got anywhere though.


The dk zone still has an A record (as one of the few?), so http://dk./ is a working web site name (though it redirects to the longer official Danish NIC).


Doesn't work from here. Wonder if maybe the local DNS server isn't happy with it.


Chrome at least does not allow this. Curl may be a better test...


Works fine here in Chrome. Are you sure you're entering http://dk/ and not just dk? The latter will trigger a search instead.


No, no, no, no. Normal people don’t always use the email field properly. They might put the username in the email field and the email in the username. Just check for an @. There is no email in the world outside your server that you can send to without an @.


You've got it backwards-- the use case is to catch things like "grandma@aol" and "foo@@bar,com". If we just check for "@" then we'd send email into the void. Instead, we check for more specifics, and we prompt the user to correct the email address. This improves our response rate about 2% which for us is significant.


But grandma@aol is a valid email address (if aol were to become a proper TLD with an MX record).


Hahah yes you're correct. We do client-side auto-suggest on email validation, so we can ask "Did you mean to type grandma@aol.com"?

We have seen a real email address without any dot, and it routes successfully to a TLD MX exactly like you describe so yes, it does happen. :)


Just for fun: https://en.wikipedia.org/wiki/Bang_path#UUCP_for_mail_routin...

But yeah, that's effectively dead and you probably don't want to try sending mail that way.


He'd have to have bang routes set up on his mail server for that. =) So it's not just effectively dead, it's dead-dead.


I chose my phrasing for the very reason that it's merely incredibly unlikely and not at all impossible. I've even sent mail with bang-paths post-2000 just to confound the recipient, though it is a rare network that would accommodate that today.

Edit: punctuation.


Where can I still use bang-paths?!


Nowhere that I know of, at least not for certain. The company with the mail server I used has been subsumed by another company, however some other companies from that time may still be running compatible mailers.


Here's another thought, just off the top of my head: get people to sign up by sending an email to "subscribe@example.com". You can include that as a "mailto:" link and many browsers will deal with it correctly.

There are very good odds that the email they send will have their "From:" (or "Reply-To:") address correctly set. Then just have an email autoresponder which emails them back a link with a token in it; when they click on that, it'll take them to a page to create their account, with their email address already filled in by the token.
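
One way the "link with a token in it" could carry the address is a signed token, roughly like the Python sketch below. Everything here is an assumption for illustration: the names `make_token` and `read_token`, the placeholder `SECRET_KEY`, and the choice of HMAC-SHA256. It also leaves out expiry and one-time use, which a real signup link would probably want.

```python
import base64
import hashlib
import hmac
from typing import Optional

SECRET_KEY = b"change-me"  # hypothetical placeholder; keep the real key out of source control


def _sign(payload: str) -> str:
    """Hex HMAC-SHA256 of the payload under the server-side key."""
    return hmac.new(SECRET_KEY, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def make_token(email: str) -> str:
    """Encode the sender's address and append a signature, for use in the reply link."""
    payload = base64.urlsafe_b64encode(email.encode("utf-8")).decode("ascii")
    return f"{payload}.{_sign(payload)}"


def read_token(token: str) -> Optional[str]:
    """Return the address if the signature checks out, else None."""
    payload, sep, sig = token.rpartition(".")
    if not sep:
        return None
    # Constant-time comparison, done on bytes so odd input cannot raise.
    if not hmac.compare_digest(sig.encode("utf-8"), _sign(payload).encode("ascii")):
        return None
    return base64.urlsafe_b64decode(payload.encode("ascii")).decode("utf-8")
```

The account-creation page would call `read_token` on the value from the link and pre-fill the email field only when it returns an address.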


Sending people to their email client would probably decrease conversion too much.


Yeah, maybe. But you're going to have to send them there some time to do the "confirm email" thing, I was just thinking perhaps get that over and done with upfront.

Ideally, all this would be done away with by OpenID or client-side certificates or something along those lines. The whole password-and-email-in-case-you-forget-it paradigm is broken, really.


I did that for a time (which I mention in the article), but it's still a superfluous check on top of an activation email. If your users are typing the wrong values into your registration form, perhaps you need better labeling or placeholder text? Display an error that the activation email couldn't be sent. But why add superfluous checks?


While I appreciate the elegance of your solution from an engineering perspective, the user who accidentally enters an invalid address is definitely getting a worse user experience here. If you're using javascript to validate the email a user will know they've made a mistake immediately. In your scenario they'll see a page telling them that they should receive an email, and then what? They have to start again? If they're lucky they'll think of hitting the back button and fill it in correctly. More likely, they'll just think: hmm, this email's taking a while to come through, I'll check Facebook... With a bit of luck they'll remember they were trying to sign up for your service, otherwise, you just lost a signup.


Yes exactly. We validate client side using typical JavaScript, and we also validate server side.

In our case, accounts may be created by third-party apps using REST JSON APIs, so we want to let the third-party app know that the email address isn't RFC-valid.


Knowing whether the email was sent is not always something you can do synchronously.

A simple check for an "@" sign would go a long way to avoiding bounced mail notifications from usernames being entered in email address fields.


Yes, agreed (and I mention this at the end of the article as well) and I still use the /@/ regex often. But a good UI on the registration form can go a long way in alleviating the "switching the email address and username fields" problem.


>a good UI on the registration form can go a long way in alleviating //

But not all the way. And so a simple "you might have got this wrong" flag would be helpful, no?


"You might have it wrong", yes. "You are wrong, we won't let you submit the form" won't.


One additional check---see if there's an MX or an A record on the portion after the "@".


Often, your web server application won't know that the activation email can't be sent. It's common practice to throw outgoing emails into a job queue, so a very basic check isn't superfluous at all.


Yep, this is definitely most often the case. Perhaps I should have reworded that to reflect that I believe anything more than a simple /@/ check to be largely superfluous. I still use the /@/ regex often, which I mention at the end of the article.


I totally agree. The one other thing I might add is to check for commas, since they are never valid in an email address and "something,com" is a common typo.


By my reading of RFC 822, commas are technically valid, as long as they are properly quoted, as in: "Doe, John"@example.com

Of course, whether any email server actually accepts that is a different story.


At the risk of sounding like a hypocrite since I just advocated for minimal validation of emails... that is a nonissue. Nobody actually does that.

Anyway, the regexp I use is /.+@.+\..+/ which supports the address you describe, but (usually) catches the relatively common mistake of user@yahoo,com
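
To make the behaviour of that regexp concrete, here is a small Python sketch of the same pattern. The names `LOOSE_EMAIL` and `looks_like_email` are illustrative; the pattern is used unanchored via `search`, on the assumption that this matches how the original would be applied with a plain match operator.

```python
import re

# Same pattern as above: something, an '@', something, a literal dot, something.
LOOSE_EMAIL = re.compile(r".+@.+\..+")


def looks_like_email(value: str) -> bool:
    """True when the value contains an '@' with a dot somewhere after it."""
    return LOOSE_EMAIL.search(value) is not None
```

It accepts `"Doe, John"@example.com` and `email+tag@example.com`, and rejects `user@yahoo,com`. It also rejects dotless addresses such as `john@dk`, which is the trade-off discussed elsewhere in the thread.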


Commas can be valid and our servers manage them correctly.


That's not true either. I don't know how it's set up other than that we use Exchange, but on our internal network, you can send an email to someone's firstinitiallastname (the part that is before the @ for external senders) and it will go through.

So on our production servers, I need the @, but on our dev/test servers, I don't. And no, the domain is not appended to the address before sending. That's the part I actually know about.


It's common practice to allow unqualified email addresses (i.e. without the @domain.tld) from local sources since it's trivial to look up against your own user list. Depending on the MTA, the domain may or may not be appended, but often the local organisation's default domain is implied. However, this shouldn't work when routing mail via external MTAs that have no knowledge of your organisational structure.


12345 could be a valid postcode in Germany or Spain, but in the UK it would have to look like AB1 2CD, except when it doesn't (e.g. AB12 3CD). Arg!

I wish web forms had a "Tell us we are wrong" button next to validated boxes.


    I wish web forms had a "Tell us we are wrong" button next to validated boxes.

You know... that's a very good idea actually.


I've been thinking about a service that injects a few lines of JavaScript into client sites to give them a "fix this" button. It would send a warning email after a few messages about typos, but it could also be used to fix these kinds of things. Not sure how JS could handle selection and a button press simultaneously. Maybe copy and paste the part of the page with the fix. Maybe offer some reward for being a grammar nazi. Maybe some paid service for automatically checking whether corrections are accurate and auto-injecting corrections into sites.

What the heck, how can I take MS Word online?


There's also the problem of "validation rot" in which validation functions can be correct today, but might not get updated if the data format changes.


FWIW, UK postcodes have a very strict and well defined format:

https://wikipedia.org/wiki/Postcodes_in_the_United_Kingdom#F...

Also, something not many people know is that any postcode in the UK has a maximum of 100 addresses in it.

You can tell I've been writing a search algorithm for UK address data recently ;)


I'd like to think that's not rare enough for websites to have assumed they were always six characters - half of London lives in postcodes like "N1 2BC". The bigger problem I've found with postcodes was sites insisting on five digits, or even insisting on postcodes at all - not every country uses them!


If your service tells me that myname+servicexyz@gmail.com is an invalid address, you have to live without me. I can't count how many mails I sent to websites incorrectly validating email fields. However, password validation is mostly even worse. Just stop over-validating already!


Hushmail has a fantastic pseudonym service (among other fantastic services) for this use case.

  real email: name@hushmail.com
  servicexyz: name.servicexyz@nym.hushmail.com
  serviceabc: whatever-isnt-already-taken@nym.hushmail.com

https://www.hushmail.com/ (no affiliation)


No, you forcing me to include a number does not make my password more secure.

Weak password: correcthorsebatterystaplefoobarbaz

Strong password: password1

The worst is when they do that and enforce a maximum password of 8 or 12 characters (I'm looking at you, every bank in the US ever).


Is the top password not stronger than the bottom one in your example?


That was his point. Web services incorrectly classifying the strength of user supplied passwords.


The labels ("strong", "weak") are ironic.


Agreed that password validation is the worst. (Github, I'm looking at you.)

Even worse than excessive validation is when they make you change your password often, for no apparent reason.

On some sites that do both (ahem Apple), the only way I can login if I haven't been there in awhile is the security questions or the password reset mechanism.


Wow, I never noticed that Github enforces password rules, as well (Must contain one lowercase letter, one number, and be at least 7 characters long.). My passwords are usually HMAC-based, so they'll most likely validate.


Even validation of first & last names is usually awful. For instance many sites tell you that having a dash "-" in your first name is invalid... yet a lot of French (first or last) names contain that character.


Those who happen to own a domain name and a small hosting plan with cPanel can easily redirect all mail for that domain to a specific email address. This way you can replace the plus sign with a dot, which works everywhere.

And I've had my problems with short addresses before with Microsoft. http://answers.microsoft.com/en-us/windowslive/forum/liveid-...


Using or not using a regex to validate email addresses misses the point. You should simply be delegating this task to the library that you’re using to send mail in the first place. If the mail library can’t deal with a particular address, then it’s not worth accepting because you’re not going to be able to send anything to it anyways.

If you have a Rails app you’re most likely using the Mail gem to send mail, so that’s why I wrote this: https://github.com/codyrobbins/active-model-email-validator. It lets the gem worry about whether an address is valid. Since the mail library is actively maintained, in my opinion it’s a safe bet to trust that it is properly parsing and validating addresses insofar as is possible.



I was half-expecting the classic "zalgo" html regex parsing rant here ( http://stackoverflow.com/questions/1732348/regex-match-open-... ) - but your link actually contained a lot of helpful responses :)


That is beautiful. And grandparent does indeed link to the best regex ever, http://ex-parrot.com/~pdw/Mail-RFC822-Address.html


Whoops, you're right. That was an article left over that I meant to just replace with a bang. I've fixed that. Thank you!


It doesn't hurt to validate the domain part, and make sure it has at least one MX or A record. With a bit of AJAX you can usually validate that before the user even submits the form. Not forgetting to take punycode into account.
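
A rough server-side sketch of that check in Python, with the punycode step included. The function name `domain_resolves` and the injectable `lookup` parameter are assumptions made so the example can be exercised without a network. Note that the standard library only gives an address (A record) lookup here; checking MX records as well would need a DNS library, which this sketch does not attempt.

```python
import socket
from typing import Callable


def domain_resolves(email: str, lookup: Callable[[str], object] = socket.gethostbyname) -> bool:
    """Best-effort check that the part after the last '@' resolves to an address."""
    _, at, domain = email.rpartition("@")
    if not at or not domain:
        return False
    try:
        # Convert an internationalised domain to its ASCII (punycode) form first.
        ascii_domain = domain.encode("idna").decode("ascii")
    except UnicodeError:
        return False
    try:
        lookup(ascii_domain)
    except OSError:  # socket.gaierror is a subclass of OSError
        return False
    return True
```

Whether a `False` here should block the form or only trigger a warning is a separate decision; a slow or failing resolver looks the same as a nonexistent domain.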


You can even validate that it's a valid mailbox:

http://www.webdigi.co.uk/blog/2009/how-to-check-if-an-email-...

Or use a service that will do it for you:

http://www.freshaddress.com/demo/


The trouble with validating using SMTP is that it is fairly likely to introduce a delay. I still think it is better to just check the DNS, with a very low timeout and then send the email as long as there wasn't a negative response.

You can let the user immediately know that the email was sent, but then you can also push updates to the user whilst they're still on the site if there is a rejection/bounce.

I wrote https://emailprivacytester.com/, which does the DNS checks that I mentioned when you enter your email address, and then keeps you informed of the status of the message delivery as it happens, including any SMTP rejection messages.


Those methods are far from foolproof. Lots of sites that do greylisting for example, will give a temporary failure before the message even reaches the server that actually knows what addresses on the domain are valid.

Also none of the e-mail systems I've operated in the last 15 years or so will let on whether or not the user actually exists until at earliest when you have committed to sending a message, and many of them wouldn't even then (instead accepting the message and sending a bounce) to reduce the spam harvesting.


I had a server IP address banned by Hotmail for doing that, because this is what spammers do to check their email lists.


Guys, I have to interject. This is a terrible idea:

This is horrible practice from the perspective of a mail server. Too many illegitimate email addresses and you will start getting more "soft bounces" (a polite way to say, "We're not delivering this") when attempting to send mail. ISPs keep score for how accurate a mail server is when delivering mail, attempting to sift out spammers. When your score dips to a certain level, ISPs stop cooperating with you and basically label you as a "dubious/bad actor".

For a small-scale operation, perhaps this is okay. For anything larger or more vital, I wouldn't trust this method of operation, as it will at some point result in emails to legitimate addresses not being delivered. That's simply unacceptable for many applications.


If you get an incorrect address, it's because the user made a mistake. Chances are, most of those mistakes will still be RFC-compliant addresses (wrong or missing letters, for example). I find it hard to believe that people inputting addresses without @s or with unicode characters would become a real problem.


Frankly, your lack of belief is likely due to a lack of experience with user submitted forms with email addresses. It's VERY common for users to simply type the wrong data into a particular textarea. If you do no validation you will get things like the person's name, street address, or other confused mixups.

Anyone who deals with forms of this nature will have seen this firsthand, and with enough frequency to cause trouble with mail relay as the person above has described. It's a real problem.


Is /@/ really going to catch significantly fewer invalid emails than whatever other almost but not quite validation you can apply?

What's more common, someone accidentally typing "foo:ar@gmail.com" or "foonar@gmail.com"? Both would bounce and cause mail server issues.

If someone was being malicious, they can do it with an unregistered address that passes any validation you throw at it.


Although I agree with the meta-point, there's still value in doing some basic validation. If there's no @ sign, or the domain name looks like it was mistyped (gmali, yahhoo), then a better user experience for your user is to present them with a warning and ask them to retype their email. You may not want it to be a blocking warning (that is you can still submit the form), but some validation can be a huge value add to users.

Still, sending an email beats regex validation any day for determining it's a real, working address.


Yeah, skip all that too. Who says yahhoo.com isn't a valid domain?

For example, I regularly receive email for an individual who has a nearly identical email address to my own, but at ymail.com instead of gmail.com. Any system that tries to guess at domain misspellings is going to catch ymail.com and think "Ah ha, they meant to type gmail.com, I'll correct that for them!" Voilà, their email is sent to me. Again, this is not a theory, it happens to this poor guy all the time.


Just having a simple warning should not be a problem though, as long as you never change the user's input.
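
A hedged sketch of such a warning-only check in Python, using the standard library's `difflib`. The `KNOWN_DOMAINS` list, the `suggest` name and the 0.8 similarity cutoff are all illustrative assumptions. The function only returns a suggestion for display and never rewrites what the user typed; ymail.com is listed as a known domain precisely so it is not "corrected" to gmail.com.

```python
import difflib
from typing import Optional

# Hypothetical list of popular domains; a real one would be much longer.
KNOWN_DOMAINS = ["gmail.com", "yahoo.com", "hotmail.com", "ymail.com", "aol.com"]


def suggest(email: str) -> Optional[str]:
    """Return a "did you mean ...?" candidate, or None when there is nothing to suggest."""
    local, at, domain = email.rpartition("@")
    domain = domain.lower()
    # No '@' at all, or a domain we already know: stay quiet.
    if not at or domain in KNOWN_DOMAINS:
        return None
    close = difflib.get_close_matches(domain, KNOWN_DOMAINS, n=1, cutoff=0.8)
    return f"{local}@{close[0]}" if close else None
```

The form would show the returned string next to the field as a hint and submit the original value unless the user edits it.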


Like mailcheck.js I mentioned in other part of this thread:

http://news.ycombinator.com/item?id=4486341


At work I've introduced mailcheck.js[1] (which I found on HN, BTW) and we use it in our registration form to give hints about mistyped popular Polish and international domains. We do also have some validation code, but it's simple and it handles the + sign correctly. It's important not only for users, but essential for developers as well - right now I have 56 unique accounts on the development server using the same e-mail but different things before the + sign. It helps tremendously.

[1] https://github.com/Kicksend/mailcheck


Perhaps it's gotten better, but when I tried it out I found it a little too noisy (e.g. flagging user@hotmail.es)


Even the author of this blog post gets it wrong. Technically speaking, there is NO need to have a dot in the domain name. These are valid email addresses:

  user@ua  (.ua = Ukraine)
  user@km  (.km = Comoros)
  user@as  (.as = American Samoa)
  (and many more)

Because these ccTLDs have MX or A records at the top level, pointing to real MTAs. (RFCs say you should not have MX records at the top level, but many ccTLDs do it.)


This guy doesn't "get" UI. You are supposed to protect the user from additional steps/mistakes, and not protect your service from the user when you use regex to validate the email. Otherwise users could perfectly fool your best system using billgates@microsoft.com


I think saying that I don't "get" UI is being somewhat presumptuous. Good UI on the registration form can go a long way in alleviating the "switching the email address and username fields" problem, but anything more than a check for /@/ (or, if you're feeling ambitious, /.+@.+\..+/) is just overkill. If I enter a valid email address that's just typo'd, the result is the same: the activation email bounces.


/.+@.+\..+/ would be a bit too ambitious as the dot is not necessarily part of a valid email - an MX record can appear on a TLD's DNS records[1]. Considering ICANN's new licensing of hundreds of TLDs this could be a very real concern soon.

1: http://blog.nerdchic.net/archives/191/


>If I enter a valid email address that's just typo'd, the result is the same: the activation email bounces. //

No, you're excluding the set of emails that are entered incorrectly and thus are not valid. The result for those is not the same as if the UI included a simple test (such as your "ambitious" example).

1) Without UI validation:

- 1.1) Email address entered correctly -> activation email sent

- 1.2) Email address entered incorrectly but forms a valid address -> activation email sent to wrong address

- 1.3) Email address entered incorrectly but doesn't form a valid address -> activation email not sent

2) With UI validation:

- 2.1) Email address entered correctly -> activation email sent

- 2.2) Email address entered incorrectly but forms a valid address -> activation email sent to wrong address

- 2.3) Email address entered incorrectly but doesn't form a valid address -> user warned

-- 2.3.1) Email address re-entered correctly -> activation email sent

-- 2.3.2) other states

In the 1.3 case all of the activation emails fail to be sent. In the 2.3.1 case activation emails are sent that otherwise wouldn't be.


You are completely missing the one case the whole argument circles around:

2.1a) Email address entered correctly -> validation fails, no mail sent

This prevents activation mails that would be sent without validation (or validation against a regex like [^@]+@[^@]+).


Sorry, yes, this wasn't supposed to be exhaustive, just to illustrate the point that not providing a warning on apparently erroneous email address entries was somewhat pathological.


The problem is that the "protection" invariably prevents someone from using their odd but valid address.


I think the point is that in reality the user should be protected from the service -- which may be trying to validate their input and telling them their perfectly valid email address (or address, postal code, name, password) cannot possibly exist.


The real problem is that lots of developers conflate validating that an email address "looks" correct with it actually being a valid, functioning address for that user.

There is no technical way to verify that an email address works other than sending it a message.

And if someone doesn't want to give you a real email address, they're just going to enter bogus@fake.com to get past the validation.


I'm amazed that we're still having this discussion. It's not that hard:

1. Use regexes for client-side validation to catch typos and warn the user against potential problems without having to round-trip to the server

2. Check DNS records on the server side and send a confirmation mail

The client-side regular expression can be as simple as /@/, but something more complex like

  /^("(\\"|[^"])*"|[^@\s]+)@([A-Za-z0-9-]+\.)*[A-Za-z0-9-]+$/
is fine - even if you mess up the regex, that's not a big deal as long as you allow the user to send the form anyway, probably after asking if he really knows what he's doing...
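
For what it's worth, here is a minimal Python sketch of the "warn, don't block" idea, reusing the regex above unchanged. The helper name `email_warning` and the message text are made up for illustration; the only point is that a failed match produces a warning for the user, never a hard rejection.

```python
import re
from typing import Optional

# The pattern quoted above: a quoted or unquoted local part, then a
# dot-separated host name (no IDNs, no IP address literals).
EMAIL_RE = re.compile(r'^("(\\"|[^"])*"|[^@\s]+)@([A-Za-z0-9-]+\.)*[A-Za-z0-9-]+$')


def email_warning(address: str) -> Optional[str]:
    """Return a warning to show the user, or None if the address looks fine.

    Hypothetical helper: it never rejects anything. The caller is expected
    to let the user submit the form anyway after confirming.
    """
    if EMAIL_RE.match(address):
        return None
    return f"'{address}' looks unusual. Submit anyway if you are sure it is right."


for candidate in ["user@example.com", "foo@@bar,com", "me@[1.2.3.4]"]:
    print(candidate, "->", email_warning(candidate) or "ok")
```

The last sample is a perfectly valid domain literal that the pattern does not cover, which is exactly why the result should be a question to the user rather than an error.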


Your client side regular expression doesn't take IDNs into account.


That's because it's pre-IDN (and it fails for IP address literals as well).

This actually strengthens one of the points I was trying to make: the need to fail gracefully. The application I took the snippet from (which has been retired some years ago) would have accepted IDNs after asking the user for confirmation.


Article is spot on for all sorts of reasons. Never mind people typing in domain literals like me@[1.2.3.4].

Internationalised Domain Names are going to REALLY screw over some regexes, given how poorly understood Unicode is. Does Ruby even have Unicode regex support yet? I don't program in it so I'm unfamiliar with the state of the art.

On my pet topic of Unicode, I especially enjoy the use of hidden form fields to reverse-engineer the character encoding certain browsers ACTUALLY send on submission rather than what your code hoped for...

Good luck convincing management to Do The Right Thing here.


Is there some rule that says you must use a regex to validate an email? Just validate it with a more complex parser.

Alternate option: Validate it with a simple regex. If it fails ask the user "Are you sure this email is correct?" If they say yes, then allow it even if it fails validation.


Email validation is a problem created out of nowhere. Sending an email, if anything, is so cheap that it's utterly idiotic for every website to validate the addresses instead of just sending the email to whatever the user happened to type in the box. Either the email will be delivered or it'll bounce at some of the hops. Think how just using exceptions instead of explicitly validating array indexes is Pythonic.


It's more that if I sign up and don't receive the email, you've lost a user/customer.


The "send them a confirmation email" works well for registration, but there are legitimate use cases in which you don't want to do that.

When you just want to store the email address without acting on it (e.g., think a landing page for an app that hasn't been released), the best you can do is validate using a regex since sending them a confirmation email would provide no value to the user.


Due to a huge failure on my part, I edited the title to more accurately reflect my opinion on this. I did not mean to say that the regex validations themselves are a waste; my point was that the complex regular expressions are often too strict and almost always overkill.


How about "stop using weird email addresses"? Just because you can doesn't mean you have to.

How do you know you can trust a random email validator you found via Google? Especially if apparently the rationale is to use a googled one because they are so complicated nobody can really understand them?

That advice seems bad to me. Perhaps it is not necessary to validate and just emailing is sufficient (as the article advises). In that case perhaps downgrading the validation to a warning might be helpful, though ("the email you entered looks weird, please double check").


I saw something similar to this once, I think it was on eventbrite. I entered my @google.com work address for an event, and it asked me "are you sure this is your email address" presumably because people sometimes confuse gmail.com and google.com. I thought that was quite good.
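
I don't know how that site implemented it, but a rough sketch of a "did you mean" check can be done with Python's standard `difflib`. The domain list, the 0.8 cutoff and the function name are all assumptions picked for the example. Note that this only catches near-miss typos such as gmial.com; the google.com vs gmail.com case is not a spelling slip and would likely need an explicit list of confusable domains instead.

```python
import difflib
from typing import Optional

# Assumption: a tiny hand-picked list. A real service would more likely
# build this from the domains its own users actually sign up with.
COMMON_DOMAINS = ["gmail.com", "yahoo.com", "hotmail.com", "outlook.com"]


def suggest_address(address: str) -> Optional[str]:
    """Return a 'did you mean ...' suggestion, or None if there is none."""
    local, at, domain = address.rpartition("@")
    if not at or domain.lower() in COMMON_DOMAINS:
        return None  # no '@' at all, or already a well-known domain
    close = difflib.get_close_matches(
        domain.lower(), COMMON_DOMAINS, n=1, cutoff=0.8
    )
    return f"{local}@{close[0]}" if close else None


print(suggest_address("user@gmial.com"))
```

As with the prompt described above, the suggestion is only something to ask the user about, not something to apply silently.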


I think most sites would lose far more sign-ups to a required activation email (going unnoticed in a spam folder or sent to a typo address and assumed delayed and forgotten) than to the occasional rejection of a legitimate email with peculiar formatting.

Oddball email addresses probably don't last long anyway as their owners quickly realize they can't use them to sign up for things and then switch to something more conventional.


>Some people, when confronted with a problem, think, “I know, I’ll use regular expressions.” Now they have two problems.

It was a dumb thing to say when he said it, and it still is. Granted, REs aren't appropriate for everything, but for some problems they're the right solution.

>So eschew your fancy regular expressions already. If you really want to do checking of email addresses right on the signup page, include a confirmation field so they have to type it twice.

No. Just no. I hate you and everyone else who thinks this is a good idea. Don't make me type things twice - the point of computers is to take some of the grunt work out of life, not to add more.

There's nothing wrong with checking an email address for validity, and there's nothing wrong with using a RE provided it's correct. You'll miss a whole bunch of typos, but since you're going to send out a "click to activate" mail anyway it doesn't matter, and the ones you catch will save the user a bit of time.


This is like trying to tell people to stop typing

   cat file|program1|program2
instead of

   program1 file|program2
It is futile. There are so many examples of programmers just doing mindlessly stupid things, often because "everyone else is doing it" or they read some "howto" they found somewhere, or they are using some library written by someone else.

How many times do you think people use extended regular expressions and backtracking when it's entirely unnecessary? They often have no idea that there is even a simpler way that will work (in some cases it might be faster). They think more complexity is actually making things "easier". Must have PCRE. Why? "Because I can't get basic regex to do what I want."

Let 'em enjoy their complex regex. Until there's a problem and they have to try to decipher what the heck it's actually doing.

PEG is fine. Lua has a good PEG library.

Still, a good handle on basic regex will take you a long way.


Your example is quite telling. Catting a file may be 'useless' in a pure functional sense, but the command has extremely small overhead and aids readability, as you can read the input/output flow from left to right. This is consistent with the rest of the pipeline.

It's also easier to change the first command in the pipeline without having to step over the input argument, again more consistent with the rest of the pipeline.

To avoid the cat, one can write

    < file program1 | program2
but in practice, cat adds no noticeable overhead.

As for regexes, I personally find using POSIX regular expressions to be a bit like using vi after becoming familiar with vim. You can get by, but it's crap and there's a reason why people came up with something better. Of course, using complicated features of any language or toolkit without understanding how they work is dumb, but that's not a reason to go back to the 1980s.


Holy goddamned gray-on-gray readability disaster, Batman!

http://contrastrebellion.com/

http://www.useit.com/alertbox/designmistakes.html

Stylish rewrite FTFW.


as long as I can still use something@mailinator.com, I am happy.


A stronger reason: the point of validating email addresses is to prevent user error, such as putting their name in the email field. It is not to ensure emails are RFC compliant. In fact you probably want to allow non-RFC compliant email addresses because there is a chance it still may work - not all servers are going to be RFC compliant and as a product it is not your job to enforce obscure Internet rules.

Personally I test for @ and . with any characters surrounding.


As pointed out elsewhere, an email address doesn't necessarily feature a '.'

Also, I learned that the local part of the address (the name) can contain pretty much anything, including '@'

So, how would your validation handle my hypothetical, valid email address "@foo"@bar?


Any email address collected by a web application would have a '.'.

As I said, we are not looking for RFC compliance, but rather user error. Missing a dot is user error in 100% of cases in a web application, unless you are installed in and sending mail in an intranet.

As unlikely as an email with @ in the username is, the regex would still match (something like /.+@.+\..+/).
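
A quick Python sketch of what that loose pattern accepts and rejects. The function name and the sample inputs are just for illustration.

```python
import re

# The loose "something @ something . something" check described above.
LOOSE_RE = re.compile(r".+@.+\..+")


def looks_like_email(address: str) -> bool:
    """True if the whole string has an '@' followed later by a '.'."""
    return LOOSE_RE.fullmatch(address) is not None


samples = ['"@foo"@bar.com', '"@foo"@bar', "user+tag@gmail.com", "John Smith"]
for sample in samples:
    print(f"{sample!r}: {looks_like_email(sample)}")
```

The '@' inside the quoted local part is no problem; the dotless `"@foo"@bar` is the only one of the address-shaped samples that gets flagged, which is the trade-off being argued about here.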


Smartass with the wacky TLD MX record and '@' user name may want to take advantage of your service... so it's down to their monthly subscription vs taking the time to "fix" your validation.

I'm still not sure which approach I prefer, but having been thwarted by zealous validation in the past I lean towards this double-check-on-weird-shit-then-send-mail system.

I will never have enough domain specific knowledge to reject a given email address with absolute certainty. That is how much fun that RFC is.


Nowadays browsers should do the validation that provides immediate feedback to the users (using <input type=email ...>), so the article rightly claims that just sending an e-mail should be sufficient for the server side code. Most of the stuff that passes the browser's input filtering will be nonexistent rather than malformed addresses.

OTOH, most languages have proven, stable libraries for validating e-mail addresses (e.g. Mail::RFC822::Address for Perl).


there is a standard for validating emails. it is described in RFC 3696 - http://tools.ietf.org/html/rfc3696

one implementation is in lepl - http://www.acooke.org/lepl/rfc3696.html - but that package is no longer maintained (i know, because i wrote it).

i don't know of any other implementation. but that's the right way to do it. imho.


What about the security implications? You should validate input to check it isn't malicious.

Also, no one has mentioned using DNS. For example, extract stuff before and after the @. Check the domain looks like one and does it have an MX record? Is the local name malicious? Send activation email. Large services should use some machine learning for common mistakes and warn the user. (grandma@aol may be one common error.)


I don't think this will work. If you follow the RFC completely an e-mail "text space\@moretext"@host.com is valid, as well.
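
If you do want the two halves, splitting at the last '@' rather than the first copes with an address like that. A minimal Python sketch, assuming the domain is a plain host name or IP literal (so it contains no '@' itself) and that there are no RFC-style comments in the address; `split_address` is a made-up helper name:

```python
from typing import Tuple


def split_address(address: str) -> Tuple[str, str]:
    """Split an address into (local part, domain) at the LAST '@'.

    Assumption: the domain itself contains no '@', so any earlier '@'
    characters belong to the (possibly quoted) local part.
    """
    local, at, domain = address.rpartition("@")
    if not at:
        raise ValueError(f"no '@' in {address!r}")
    return local, domain


# The backslash is doubled only because this is a Python string literal.
print(split_address('"text space\\@moretext"@host.com'))
```

This only separates the parts; it says nothing about whether either half is valid.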

Some services have two input fields for an e-mail address. The second is to verify for typos. After that, just send the e-mail, already. If it fails you can delete the user entry from your database and print out something in the likes of "Who types their e-mail wrong two times?".


I can't stand when I try registering for a website with my email (which has ".com" in it... BEFORE the @ symbol) as invalid. I know it's a funky email to have, but it's valid and people often do these crazy checks.

I can see both sides though, on one hand you don't want someone accidentally entering an invalid email and then never getting their email confirmation...


A few years ago I would have agreed, and I still do in general. But now I'm not sure that Perl grammars aren't up to the task and still as 'regular' as PCRE.

However you still shouldn't be writing your own unless you're writing an email validation module of some kind. Laziness is a virtue after all.


A better way is to check that the email the user entered matches a typical email format, and if the email doesn't match, then warn the user that he/she might've entered the wrong email address (for example writing foobar@gmail, or foobar@gmail.com.).


I think foobar@gmail.com. should actually work. IIRC, that last dot is the root of the DNS hierarchy.

Try http://news.ycombinator.com./ ;-)


Oh come on, does this look complex to you:

  (^[-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*|^""([\001-\010\013\014\016-\037!#-\[\]-\177]|\\[\001-011\013\014\016-\177])*"")@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}$


And yet it's not enough. According to the standard,

"my@scary$doublequoted.address"@example.com

is a valid address.


I remember back in the day when email addresses like hello@[8.8.8.8] were valid..


I've taken to checking if the domain of the email has valid MX records. That, and a very rudimentary regular expression (not quite as simple as the OP's though) seem to do the trick well.
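
Roughly, in Python, with the DNS lookup passed in as a function so the sketch stays self-contained. Everything named here (`MxLookup`, `plausible_address`, `fake_lookup`) is hypothetical; in real code the lookup would come from a DNS library (dnspython is one option), and the rudimentary check below is only a stand-in, not the regex I actually use.

```python
from typing import Callable, Iterable

# Hypothetical resolver type: takes a domain and returns its MX host names,
# or an empty iterable when the domain has none.
MxLookup = Callable[[str], Iterable[str]]


def plausible_address(address: str, lookup_mx: MxLookup) -> bool:
    """Rudimentary shape check, then ask DNS whether the domain takes mail."""
    local, at, domain = address.rpartition("@")
    if not (local and at and domain):
        return False  # missing '@', or nothing on one side of it
    # At least one MX record means some server claims to accept mail.
    return any(True for _ in lookup_mx(domain))


def fake_lookup(domain: str) -> Iterable[str]:
    """Stand-in for a real DNS query, with canned data for the example."""
    records = {"example.com": ["mail.example.com"]}
    return records.get(domain, [])


print(plausible_address("user@example.com", fake_lookup))
print(plausible_address("user@nowhere.invalid", fake_lookup))
```

One caveat: SMTP falls back to the domain's address record when there is no MX, so a stricter version of this check should probably not treat "no MX" as a definite failure.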


PHP coders can use the simple built-in filter_var function: http://phpbestpractices.org/#validating-emails


Yeah, and there's a Perl Email::Valid http://search.cpan.org/~rjbs/Email-Valid-0.190/ which checks for well-formedness and the existence of the domain.



The solution the OP proposes (validating email addresses through a clicked link in a "welcome" email) is pretty user hostile and unnecessary for most services.


True. It would be fair enough if I were signing up for a mailing list, but it's rarely clear to me as a user why any web site wants my email address. Is it to send me emails reminding me to check the site? Is it to help me "keep in touch" (with my new friend do-not-reply@morons.com)? Is it just standard practice taught in web-dev school? Am I being overly cynical to think someone is trying to turn my inbox into a billboard, that we both know I don't want that, and that the confirmation link is used to stop my little brain from outsmarting him with a bogus address? What do those of you on the other side of this issue think about it?


A good reason to verify ownership of an email address is, for example, PayPal, where money is actually attached to an email address.


This misses the point entirely. The point is rarely to avoid fake email addresses, but to make sure they don't contain spelling errors.


Nobody does a domain lookup to test validity?


no thank you. i'll stay with ^[\w.]++@[\w.]+$


That doesn't validate user+tag@gmail.com, which is one of the very root points that he was trying to make in the first place.
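
Easy enough to check in Python. The sketch below drops the possessive `++` because Python's `re` only accepts that syntax from 3.11 on; since '@' is not in the character class, the plain `+` accepts exactly the same strings. The constant name and sample addresses are just for the example.

```python
import re

# The parent comment's pattern with "++" relaxed to "+" for portability.
SIMPLE_RE = re.compile(r"^[\w.]+@[\w.]+$")

samples = ["user@gmail.com", "user+tag@gmail.com", "user@my-site.com"]
for sample in samples:
    verdict = "accepted" if SIMPLE_RE.match(sample) else "rejected"
    print(f"{sample}: {verdict}")
```

Besides the plus-tagged address, it also rejects any domain with a hyphen in it, since `\w` does not include '-'.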


Here is an interesting list of test cases, some of which parse as valid email addresses and some of which do not. If not, the page says why:

http://isemail.info/_system/is_email/test/?all

For example: !#$%&`*+/=?^`{|}~@iana.org is valid.

Here's an essay on the subject:

http://isemail.info/about



