How not to validate email addresses (mdswanson.com)
62 points by swanson on Oct 14, 2013 | 76 comments


I've found that the best ways for validating email addresses, in order, are: checking for the '@' sign, resolving the hostname to the right of the '@' sign to see if it has MX records (or an A record, since the specification technically also allows sending mail to the server at the A record), and sending an email to the address that includes a verification link that the user must click.

The first two can be done without requiring any additional work for the user, but people are so used to clicking verification links that they don't really mind that either.
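
A minimal sketch of the first two checks, using only the Python standard library. The standard library has no MX lookup, so the second function only asks whether the domain resolves to any address (roughly the A-record fallback mentioned above); a true MX query would need a DNS library, which is assumed here and not shown. The function names and the injectable `resolver` parameter are illustrative, not from the article.

```python
import socket


def has_at_sign(address):
    """Step 1: a non-empty local part and domain around the last '@'."""
    local, sep, domain = address.rpartition("@")
    return bool(sep and local and domain)


def domain_resolves(address, resolver=socket.getaddrinfo):
    """Step 2 (approximation): does the domain resolve to any address?

    This looks at address records only, not MX records.
    """
    domain = address.rpartition("@")[2]
    try:
        return len(resolver(domain, 25)) > 0
    except (socket.gaierror, UnicodeError):
        return False
```

Passing the resolver in makes the check easy to exercise without touching the network. Step 3, the verification email, is still the only check that proves the mailbox belongs to the person signing up.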


And the advice here, which I agree with, is to skip your steps 2 & 3.


Eh, I don't think having a site that allows people to sign up on behalf of other people easily is a good thing.


As someone who owns {commonFirstName}{commonLastName}@gmail.com, I _hate_ services that don't require you to validate the e-mail address you sign up with. And there's a special place in hell reserved for people who write services which don't validate the e-mail you sign up with, but do require you to sign in with your credentials in order to unsubscribe.


lol@gmail.com whoever owns this, hates their life.


Unfortunately no one can own that :) Google doesn't allow email id with less than 5 chars (to mitigate random spamming)


Thanks for the info. I don't feel bad anymore.


how about asdf@gmail.com?


Surely you mean aoeu@gmail.com?


And if you skip 3 I will report any automatically sent mail as spam wherever I can in the hopes of getting you on as many blacklists as possible.


You should only report spam that is actually spam.


It is spam if you allow X to sign Y up for automated mails without confirmation. In Germany we even have laws against that.


Interestingly the email subscription box on the page goes against the validation recommendations made in the post. Tried to subscribe with "Matt Swanson <matt@mdswanson.com>" and was told it was an invalid email.

(I do understand that it is mailchimp and not the author)


I was wondering who signed me up for my own list :)

I like the "push the boundaries" thinking though! There is no reason why I couldn't add the JS library to the form at the bottom.


I will admit to also trying @hotnail and @gmial in the newsletter subscription box and chuckling to myself.


> People use Gmail's tag-syntax (i.e. matt+whatever@gmail.com) to sign up for stuff all the time. Are you allowing those?

Minor nit: this is not anything Gmail invented. This is RFC 5233 -- subaddressing:

http://tools.ietf.org/html/rfc5233
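
To make the term concrete, here is a small sketch of how a subaddressed local part breaks down into a user and a tag. The `+` separator is the common convention, but mail systems may be configured differently, so it is a parameter here; `split_subaddress` is a hypothetical helper name.

```python
def split_subaddress(address, separator="+"):
    """Return (base_address, tag); tag is None when there is no separator."""
    local, _, domain = address.rpartition("@")
    user, sep, tag = local.partition(separator)
    return f"{user}@{domain}", (tag if sep else None)
```

This is only meant to show the structure. A sign-up form should still store and mail the address exactly as the user typed it.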


Having gone up and down this problem a number of times it is my opinion that the only way to truly evaluate email address validity is with a fairly elaborate state-machine based approach that provides you with feedback as to what is wrong in order to decide how to deal with it (or not). Here's one example:

https://github.com/dominicsayers/isemail

The regexes floating around out there are horrible.

Validating email addresses doesn't necessarily mean that you affect the user's experience. I think of it as an opportunity to avoid losing a potential customer due to a silly mistake. One such example would be a one page sign-up site where you are trying to collect the email addresses of those interested in your offering. In this context it is important to try and catch errors. You have a visitor who wants to keep in touch with you. He or she mistypes the email address. If you don't detect it you might lose them forever.

Granted, all errors are not detectable. If someone types jeo@example.com vs. joe@example.com there's precious little you can do about it in terms of automated detection.

You can accept obviously bad email addresses, store them in your database and simply tag them as such. This is where ML or human intervention might be able to fix the problem or choose to discard it. Email list pollution can be dealt with in other ways, for example, if you use this list to reach out to prospective customers bad emails will simply bounce.

In the end what is important is to avoid losing real potential customers as much as possible. I think a little software-based verification along with giving the user the opportunity to catch the mistake is enough. All the junk easily falls through the cracks of a multi-stage filter after the fact.


Someone should make this guy King of the Internets.

Swanson: I need you to take a look at address forms. I don't want to enter my city and state any more after I've given my zip code.

PS Yes, I know that not everyone lives in the US.

PPS Yes, I've heard about some places where a single zip code serves two cities. Edge cases, there will always be one or two.


Just enter your easy-to-remember latitude/longitude pair into this form field - no more typing your city ever again ;)



I'd hate to use this for anything important - my entire regional area is reported to be 300km (~185 miles) away from its actual location :)

On the other hand I'd be quite happy for sites to calculate shipping based off this since the reported location is the capital city.


I don't know about the US, but in Australia post codes (our zip codes) regularly service more than one town (or principality) when you're rural, and if you're even vaguely familiar with Australia we have a lot of rural :)

I've seen this handled quite well by a number of sites - you enter your post code and they'll just drop down a list of all the places it matches. Choose your town/suburb and you're done!

The ridiculous part is that they'll often ask you to enter your state as well, which you can derive from the first digit of the post code :/


Re: deriving state from first digit of post code, that's true 90% of the time but the exceptions -- some of which are pretty big, eg. the entire Australian Capital Territory -- will kill you.

https://en.wikipedia.org/wiki/Postcodes_in_Australia#Austral...


Oh interesting, not having lived in the NT or ACT I didn't know about that one.


I find it hilarious that Australia has just 4 digits in its post code. The US has 5 digits with 4 optional digits. Taiwan (a much smaller place) has 3 digits with 2 optional specifiers.

Singapore has 6.


Singapore's post codes are 6 digits because they're granular enough to specify exact buildings -- it's possible to receive mail addressed to "John Tan, S(123456)".

Australia's post codes, on the other hand, are really wide. The entirety of central Sydney is all "2000" and mine covers around 6 city suburbs.


You know what I want? Just the opposite. I know my address, it just isn't in the autofill databases and doesn't fit your standard format (e.g. I don't have a state). Just give me a plain text area I can type my address into.

And there's a special place in hell for webdevs who force me to use their fancy javascript date picker rather than typing in a date.


Asking for zip code first or instead of city/state is so rare that I find it jarring. I've rarely, if ever, thought that entering my city and state was overly burdensome and I strongly doubt there's any conversion improvement.


But wouldn't you be happier if all sites asked for zip first? User's time is precious even if they're not angry at you for wasting it.

If you are saying positioning the zip field first actually wastes more user time because it's so jarring, there are so many possible improvements:

- Don't change layout but populate city & state if zip is entered first. I know, hard to justify the effort.

- Populate zip from address.

- Offer city, state completions when you start typing city.

- Offer full-address completions from street address & geoip.

- Just statically populate city, state (and possibly zip?) from geoip.


These days, zip code solves a lot of our location issues. In another product, we simply elected to request location data from the user's phone as a secondary option.


It's worth noting that dealing with email addresses like this affects other parts of the site. For example, I was signed up with Zappos with a blah+zappos@example.com email. Everything worked perfectly other than some of the links in the email, which didn't escape the '+', meaning that it was interpreted as a space. E.g.

  "https://example.com/unsubscribe?email=me+example@example.com"
vs.

  "https://example.com/unsubscribe?email=me%2Bexample@example.com"
[ On the plus side, Zappos was really responsive, and fixed the issue when I reported it. ]
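
The difference between the two links can be reproduced with Python's standard `urllib.parse`. This is a sketch of the general mechanism, not of what Zappos actually ran: in a query string an unescaped `+` decodes to a space, so the address has to be percent-encoded when the link is built.

```python
from urllib.parse import parse_qs, urlencode

address = "me+example@example.com"

# Naive string concatenation leaves '+' unescaped, and '+' means "space"
# in a query string, so the server decodes a different address.
naive_query = "email=" + address
decoded_naive = parse_qs(naive_query)["email"][0]

# urlencode percent-encodes '+' as %2B (and '@' as %40).
safe_query = urlencode({"email": address})
decoded_safe = parse_qs(safe_query)["email"][0]
```

Here `decoded_naive` comes back with a space where the plus sign was, while `decoded_safe` round-trips to the original address.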


i think validation should be done via sending an email to the email address on hand and then requiring the end user to click a link in the email to activate. for one, this validates that it's the actual user's account and then this also validates the address form without having to be pre-emptive at the start.
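
A rough sketch of that click-to-activate flow, assuming an in-memory store and a hypothetical `https://example.com/verify` endpoint; a real service would persist the tokens, expire them, and actually send the email.

```python
import secrets

# Hypothetical in-memory store; a real service would persist this.
pending = {}


def start_verification(email, base_url="https://example.com/verify"):
    """Create a one-time token and return the link to put in the email."""
    token = secrets.token_urlsafe(32)
    pending[token] = email
    return f"{base_url}?token={token}"


def confirm(token):
    """Return the confirmed address, or None for an unknown or reused token."""
    return pending.pop(token, None)
```

Because the token is random and only delivered to the mailbox, a successful `confirm` shows both that the address works and that whoever clicked can read mail sent to it.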


Yeah it is kind of like authentication vs authorization when managing users. Just because a user is correctly logged in, does not mean that they should have access to something in the system. Just because a valid email is entered, does not mean that the person entering it is the owner of said email. Though these two concerns are often conflated!


And this could be abused. I sign up to a popular service with someone else's email address. I could be a nuisance and block them from signing up as themselves, even if it is temporary.


And all the companies ignoring that are really annoying me. I have a very common name and my email used to be firstname.lastname@gmail.com (not my primary anymore but I still forward the mail). Now the personal email I get is okay, I actually ask where the guy they sent the mail to is coming from. But newsletters without verification? Give me a break. I do report all of those as spam.


Unless that's actually necessary (for example, in PayPal's case), that's a major conversion obstacle for very little gain.


I am wondering about the following scenario when you allow anything to be entered, but do require validation: what if I entered "myaddress@mysite.org, someone-else@elsewhere.org". The mail will be sent to both addresses, and if I'm faster than the other guy (probably, because I'm expecting the email), I can basically sign someone else up for whatever it is. Okay, it's trivially easy to find out that I did it (but I could use a throwaway address), but you may want to prevent this scenario anyway to prevent harassment.
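
One way to guard against this scenario, sketched with Python's standard `email.utils.getaddresses`: parse the input as an address list and refuse anything that does not come back as exactly one address. `is_single_address` is a hypothetical helper, and the line-break check is an extra precaution against header injection rather than something raised in the comment above.

```python
from email.utils import getaddresses


def is_single_address(value):
    """True only if the input parses to exactly one address containing '@'."""
    if "\r" in value or "\n" in value:
        return False  # never let line breaks reach a mail header
    parsed = getaddresses([value])
    return len(parsed) == 1 and "@" in parsed[0][1]
```

This stays permissive about what a single address looks like while closing the "two recipients in one field" hole.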


The current email validation approach in MediaWiki allows for email addresses like "foo@bar", instead of "foo@bar.com". This ended up breaking unit tests at Wikia when we upgraded versions last year.

I originally thought this was a bug. But if you think about it, MediaWiki is capable of being deployed on an internal network. An internal wiki could actually interact with email addresses only available on a given intranet server, and not reachable from a given TLD.


some NICs or registrars (don't remember exactly) use <user>@<tld> as e-mail. That's valid, so you can get your scenario even on the wild internet :)


With the revelation that the NSA collects email addresses by the millions, you'd think they could offer a service where they validate the existence of addresses.

http://www.washingtonpost.com/world/national-security/nsa-co...


Failing to figure out how to send an email to an individual when they sign up may be a problem.

In the case of validation I tend to look at it as "What is the minimum I can check for to ensure that I can get the data that I need out of this form?"

Too much and it becomes arduous to sign up, too little and the app ends up trying to send an order confirmation to "Matt Swanson" instead of "matt@mdswanson.com"


I'd check the TLD as well.

Going through the error logs on our mail server, there are a lot of people out there that get their email address wrong even if you have them type it last. .cmo instead of .com, transposing letters in their name, spelling their company name wrong...


Yep - that JS library will take care of that as well: https://github.com/Kicksend/mailcheck/blob/master/src/mailch...

"hotnail.con" -> "hotmail.com"


Isn't this pointless with the wholesale of TLDs?


Reading the article, the main use for the TLD check is to see if you have a typo, and if so suggest a correction - rather than automatically correcting the typo for you. Which I agree could be useful for the big email providers. (e.g you type gmial and it suggests a correction of gmail).
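
A sketch of the suggest-don't-correct idea in Python. This is not how the mailcheck library works internally; it swaps in the standard library's `difflib` similarity matching as a stand-in, and the domain list and the 0.8 cutoff are arbitrary illustrative choices.

```python
from difflib import get_close_matches

# Illustrative list only; a real deployment would use a much longer one.
KNOWN_DOMAINS = ["gmail.com", "hotmail.com", "yahoo.com", "outlook.com"]


def suggest_correction(address, cutoff=0.8):
    """Return a suggested address, or None if nothing looks like a typo."""
    local, sep, domain = address.rpartition("@")
    domain = domain.lower()
    if not sep or domain in KNOWN_DOMAINS:
        return None
    matches = get_close_matches(domain, KNOWN_DOMAINS, n=1, cutoff=cutoff)
    return f"{local}@{matches[0]}" if matches else None
```

The result is meant to be shown as a "did you mean ...?" prompt. Domains that are merely unfamiliar return `None` and are accepted as typed.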


Also sometimes you don't even need the TLD:

user@myhostname, is a valid email address, and yet it's rejected by a lot of libraries.


It's certainly a valid address, but why would you want to input that as an email address into a public-facing service that you do not operate?


Perhaps I do operate it. Can be very useful on say a dev box.


Just because it's a valid address doesn't mean a real user is going to sign up for an account using a domain without a TLD.


Perhaps users are aware of relative domain names and addressing. You even see this on a service like gmail's login. A user with the address example@gmail.com doesn't have to enter '@gmail.com' when logging in - just 'example'. But actually either will do. Further, it's not totally clear to a user what to enter here: is a username/id the same thing as an email address or not?

My aunt swears blind that an email address without the name in double quotes and the domainy bit is not a correct email address. She types the lot out.


For those who still want email verification, I believe Mailgun offers a pretty good API.

http://blog.mailgun.com/post/free-email-validation-api-for-w...


Sometimes it's important to only allow emails that work with your other services. I once signed up for an EA account using an email with a + symbol, and it was a horrible experience because some parts of their system couldn't handle + symbols.


My favourite has been signing up for a couple of games - I think CubeWorld was one and some generic MMO I was trying out was another, where the account creation process handled a '+' email fine but the game client didn't handle it and gave a login error.

The best part of that was that the MMO account didn't allow me to change my email address.


To this day, netflix keeps asking me to confirm my email address and it keeps failing because it has a "." in it.

I've had the netflix account for years.


I actually always use this regex [1] for checking emails. It is supposed to follow the RFC 5322 standard. Why is it wrong to run emails through this?

[1].

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])


I just had my email address rejected by a validator, I expect because it was a user@thing.thing2.domain format - and yes, that company lost a customer, because it punted them into 'meh, can't be bothered'.


To add to the conversation. There is a way to check if a mailbox exists using SMTP. It works on Gmail and several other servers.

Python/PHP Code and explanation is here http://www.webdigi.co.uk/blog/2009/how-to-check-if-an-email-....

It was eye opening to understand the underlying SMTP protocol. There are some pitfalls too as mentioned in the article.
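
The core of that technique can be sketched in a few lines of Python. This is not the linked article's code: it assumes you already hold a connected session that behaves like `smtplib.SMTP` (finding and connecting to the domain's MX host is not shown), and `mailbox_accepted` is a hypothetical name.

```python
def mailbox_accepted(smtp, sender, recipient):
    """Ask an already-connected SMTP session whether it accepts a recipient.

    `smtp` is expected to behave like smtplib.SMTP: helo(), mail() and
    rcpt() each return a (code, message) tuple. No message is sent.
    """
    smtp.helo()
    smtp.mail(sender)
    code, _message = smtp.rcpt(recipient)
    # 250/251 mean "accepted"; many servers accept everything, so a
    # positive answer is only a hint.
    return code in (250, 251)
```

A rejection is a fairly strong signal, but servers that accept every recipient at this stage make a positive answer much weaker, which is one of the pitfalls with this approach.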


This works almost everywhere, except Microsoft Exchange.


Why exactly is an RFC 2822 parser wrong? It is by definition the right way to validate an email address.

The RFC specs emails as a CF grammar, not a regular grammar, which is why validating all possible emails with regexes is so hard. Use a parser and call it done.

If your goal is "don't prevent a user from signing up" then why validate emails at all? Why not just accept anything, and whinge at them after you've already captured them in your system?


This topic comes up again and again on here.

Like others have said most popular web app languages have some email validation built in (like php has the FILTER_VALIDATE_EMAIL filter).


I remember looking at the PHP built-in option a year or so ago, there are bugs filed against that because it's too strict.


Who are these people who have whack email addresses trying to sign up anyway? They must find the web a pretty difficult place to navigate. Has anyone actually come across a real world scenario where a user has an email address on the cutting edge of RFC specification? Does anyone capture these email addresses? I've never come across one before in the wild. Is this a problem that actually exists?



sub-addressing with +, = or any other valid character is a desirable feature and it is entirely unreasonable to block it.


Also, screw the "Confirm your password" cargo-cult. Email and password that's it. Send a verification email and you're registered.

Another one, using the stupid asterisk character hiding password input field. It's user-hostile, especially on mobile. No one is looking over my shoulder, and if they are I'll take care of it myself, thank you very much.


Your two points are related: because the password is masked and you can't see what you're typing to spell-check, it's worth asking the user to type the password again to make sure they spelled it correctly the first time.

That being said, I completely agree with you – I don't think there's much validity in masking the password field except maybe when it's auto-filled by the browser.

We have tested turning off the masking on various sites we've developed and in general users tend to freak out and think the site is insecure as a result.


Yup, makes perfect sense to me why that design doesn't go over for a public accustomed to seeing the password mask. It would be neat if that could be a preference set in the OS for power users, but I can see that being abused and I see it as in no way compatible with legacy sites and applications. You couldn't make it the default, because it drops a degree of security for all users of the OS. And it wouldn't get adopted widely enough for it to be worth the effort, being more expensive to support for developers.


If you designated the input type as email, at least the browser could then take over the responsibility, not necessarily of the validation, but possibly of the suggestion that you might have mistyped or have an invalid email address. Surely that's preferable to every site writing their own code. Browsers should be web helpers!


All the points you make are excellent. I've always leaned to the 'greedy' regex, but never thought to articulate exactly why. http://www.radiumcrm.com


Recently I had a site reject my email address. The only thing I can imagine is that they used a really old whitelist of TLDs and .me wasn't included yet…


Some languages also have useful functions in their standard libraries for this. I know Python has email.utils.parseaddr().
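
For example, `parseaddr` copes with the "Name <address>" form that the subscription box mentioned earlier in the thread rejected. Note that it is a lenient parser and not a validator, so it is better suited to extracting the address than to deciding whether to accept it.

```python
from email.utils import parseaddr

# Split a display-name form into its name and address parts.
display_name, address = parseaddr("Matt Swanson <matt@mdswanson.com>")
```

After this call `display_name` holds the name and `address` holds the bare address that can be used for sending.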


Wow, when it comes time to validate user input, don't! What a clever solution!


What's your solution?


Let's keep validating user input. Even if it's "hard" i.e. uses a regex...

Sure there are better ways to build a regex than to hard code it into each method, but never mind that, let's just accept whatever comes through the pipe into our barely tested (in production) and highly insecure frameworks, as long as it contains an '@'.

No matter what solution I propose, it's better than this "nonsolution" because it's a solution.


You are missing the point. Perhaps it will help if I frame the problem for you differently: scrubbing bad data, and developing policies for minimizing bad data in your system.

Not all validation needs to happen at the time of data entry.

Your "hard" regex may reject international email addresses. In fact, if your regex's input isn't converted to Punycode first, you are a fool for even attempting to use regex, because now your regex will likely fail on all IDNA inputs.
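
As a sketch of that conversion step in Python: the built-in `idna` codec turns an internationalized domain into its ASCII (Punycode) form. It implements the older IDNA 2003 rules, and `to_ascii_address` is a hypothetical helper that leaves the local part alone.

```python
def to_ascii_address(address):
    """Convert the domain part to its ASCII (Punycode) form.

    Uses Python's built-in "idna" codec, which implements the older
    IDNA 2003 rules; the local part is left untouched.
    """
    local, _, domain = address.rpartition("@")
    ascii_domain = domain.encode("idna").decode("ascii")
    return f"{local}@{ascii_domain}"
```

An ASCII-only pattern applied before this step would reject such an address outright; applied afterwards, it at least sees the form the mail system will use for the domain.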

And what is your test suite going to be?

And what did you "validate" exactly? That you matched your regex? What if the e-mail isn't active, or the mailbox is full? Outlook 2013 actually has this really cool feature called MailTips that provides more advanced mailing list and e-mail address validation and warnings: http://blogs.technet.com/b/exchange/archive/2009/04/28/34073...

Suppose when you first signed up the user, they validated their email address, but now the account seems to be inactive. How do you handle that scenario? Continuous validation.

And how generally useful is your regex? What are you going to do if the email came from OCR software output, or screen scraping output? Your ERP may have the original document it was scanned from. Are you going to not store the bad e-mail address simply because you wrote some "hard" regex that rejected it? Not a straightforward question to answer, as it depends on your data model for storing addresses. You might have a column IsConfirmed.

Here is another example of "continuous validation". Validating mailing addresses. Most major e-commerce sites allow very liberal input, but scrub the data in real time or near real time, because the postal service gives discounts to companies that print "correct" address labels. "Correct" here could mean "One Post Office Square, Boston, MA, 02109" instead of "1 Post Office Sq, Boston, MA, 02109". This process is called Address Standardization, and in areas of the world with rapidly growing economies, often times Address Standardization vendors are behind, because some "streets" don't have addresses yet and aren't known to exist in any GPS system. This is common in many parts of China.

Here is another example of "continuous validation". How Google does spell checking, as compared to the "fixed validation" in Microsoft Word's spell checker.
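
The IsConfirmed idea mentioned above can be sketched with an in-memory SQLite table. The table and column names are made up for illustration; the point is only that the raw input is always stored, and validity and confirmation are recorded as separate facts that can be updated later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contacts ("
    " raw_email TEXT NOT NULL,"                  # stored exactly as entered
    " looks_valid INTEGER NOT NULL,"             # result of a cheap check
    " is_confirmed INTEGER NOT NULL DEFAULT 0)"  # set after verification
)


def store(raw_email):
    """Keep the input even when it fails the cheap check; just tag it."""
    looks_valid = int("@" in raw_email)
    conn.execute(
        "INSERT INTO contacts (raw_email, looks_valid) VALUES (?, ?)",
        (raw_email, looks_valid),
    )


def mark_confirmed(raw_email):
    """Record that the owner proved control of the address."""
    conn.execute(
        "UPDATE contacts SET is_confirmed = 1 WHERE raw_email = ?",
        (raw_email,),
    )
```

A later scrubbing pass can then revisit rows where `looks_valid` or `is_confirmed` is 0 instead of having lost them at entry time.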


All these other questions are dependent on context and out of scope of the argument.

The article's argument in a nutshell is that validating email is hard, so don't bother; in fact, let users submit whatever they want, including JavaScript. Then just check for @ and send it off to your next parser in the chain; in fact, get lots of 3rd party parsers for misc features and send data to them first. Spend effort fixing autocomplete so users can more easily enter data that you will automatically accept. I'm sure this can only improve data quality...

I can imagine that wanting to know all the stupid shit your users submit as an email is the correct solution in certain contexts, but for a majority of cases, this article is wrong in everything that it suggests. Admittedly, there is very little context given.

Perhaps the context is "I don't care about security of my users or my services, and I will run whatever 3P code on my backend that appears to do the job of making a webpage look spiffy and easy to use. Once I have 10 Million (unverified) users, I sell my spaghetti factory and it's no longer my problem."

After all that, he recommends not letting people use software without a validated email address. Too bad he never bothers saying how he would get to that point, only how he would avoid doing the work.



