It's weird how often email address validation comes up on HN. I think it must be bike shedding for the web app generation. It's the simplest piece of a common web app for which there could possibly be any discussion or disagreement.
Yes, you can have a crazy looking email address. But that doesn't mean you should. And you have no right to expect a web form will let you enter your address with nested comments.
On the other hand, a lot of developers seem to confuse validating an email address against the RFC with confirming that it is the user's true and correct address. This is not possible without sending the address a message. Regardless of how good your regex is, it will let many typos through and it will fail to stop "fake@fake.com". I'd suggest spending time elsewhere.
> And you have no right to expect a web form will let you enter your address with nested comments.
While nested comments are a bit extreme, I'm not a fan of the attitude that a user has "no right" to use features that are a documented part of the spec. Just because a feature is uncommon or doesn't seem important to you doesn't mean it's not important to some small subset of your users.
For example, that's not far from saying that a user has "no right" to put +tag in their email address (after all barely anyone uses that), but some people find this extremely valuable.
No, nested comments are not part of "the" spec (in a way that would imply "the e-mail spec"); sure, they are a feature of "a" spec, but that spec is actually somewhat unrelated to what an e-mail address actually is: they are just a feature of the MIME specification for header field values.
If you look at the SMTP specification (which, given that it defines the protocol in charge of using e-mail addresses for actual delivery, has a much better claim to being "the" spec), you will note that you aren't allowed to use an e-mail address with nested comments in that context, as they have no meaning to SMTP.
However, you will also find that the rules for what characters are allowed and which have to be escaped are different, as that's what these specifications are actually discussing: how to escape an e-mail address for use with specific transport protocols.
An actual e-mail address? It seems to support pretty much anything followed by an @ followed by a domain name. It is just that in MIME, if you want to have a space character you will need to put it in quotes, or if you want a quote you will need to use a backslash.
The user of your web form, of course, is not typing MIME: there is a box that they can just type their e-mail address into, and it should probably support the raw syntax of their actual e-mail address, not a randomly chosen format required for escaping.
To make this more clear, one has to ask: why MIME escaping? Why not require the user to use HTML attribute escape sequences? That way, if their e-mail address contains a special character, instead of using quotation marks and backslash escaping, they'd use entities, like &quot;.
Honestly, that makes about as much (if not more) sense. Meanwhile, of course, the user's username and password fields should also be escaped similarly, and if the user attempts to use a bare < or > they should get a validation error "please escape your password using RFC1866 (HTML)".
Previous, more detailed versions of this same complaint:
Thanks for the clarification, I did not realize that there are multiple parallel RFC tracks that define differing syntax and semantics of email addresses. Your claim then, is that all of the complicated syntax defined for email addresses in RFC2822 and RFC5322 is for the sole purpose of escaping characters that are significant to MIME? What about "+" -- is it just convention that most email hosts ignore everything to the right of that, or is that actually specified somewhere?
Yes: that is just convention. In fact, RFC5233 defines an extension to Sieve (a purposely-not Turing-complete language for filtering e-mail that is implemented as part of many mail systems) that parses those + addresses; this is the only e-mail-related standard I've so far come across that mentions this common feature (and I've read through numerous at this point ;P).
However, it does not define the syntax for + addresses (even so far as to define the "+"), as + is only a convention (as is the entire concept of having detailed/sub-addressing at all): it even has various examples, such as "5551212#123@example.com", that use alternate characters.
> NOTE: Because the encoding of detailed addresses are site and/or implementation specific, using the subaddress extension on foreign addresses (such as the envelope "from" address or originator header fields) may lead to inconsistent or incorrect results.
> Implementations MUST make sure that the encoding method used for detailed addresses matches that which is used and/or allowed by the encompassing mail system, otherwise unexpected results might occur. Note that the mechanisms used to define and/or query the encoding method used by the mail system are outside the scope of this document.
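To illustrate why the separator has to be treated as site-specific, here is a minimal Python sketch. The function name `split_subaddress` and its `separator` parameter are made up for this example; they are not taken from RFC5233 or from any mail system.

```python
from typing import Optional, Tuple


def split_subaddress(local_part: str, separator: str = "+") -> Tuple[str, Optional[str]]:
    """Split a local part into (user, detail) at the first separator.

    The separator is only a convention of the receiving mail system, so it is
    a parameter here rather than a hard-coded "+".
    """
    user, sep, detail = local_part.partition(separator)
    if not sep:
        # No separator present: the whole local part is the user.
        return local_part, None
    return user, detail


print(split_subaddress("mikec+farmville"))   # ('mikec', 'farmville')
print(split_subaddress("5551212#123", "#"))  # ('5551212', '123')
print(split_subaddress("5551212#123"))       # ('5551212#123', None)
```

The last call shows the problem the NOTE describes: guessing "+" for someone else's mail system silently gives a different answer than the one that system would give.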
Also, yes: RFC5322 defines a ton of syntax, and all of that syntax is related to MIME headers; a "structured header" has particular rules related to whitespace and is allowed to contain comments, so e-mail addresses included as part of the address lists used in headers like To and From are going to be adapted to follow those rules.
FWIW, RFC5322 actually has a SHOULD NOT on the things that make it un-similar to the SMTP specification. The two specifications really do attempt to use fairly similar syntax. You thereby are allowed to have comments and crazy whitespace in weird places in MIME, but "please don't" ;P.
> Comments and folding white space SHOULD NOT be used around the "@" in the addr-spec.
The goal really did seem to be, I will happily admit, to have the two protocols be largely compatible to the extent that they could: the same list of reserved characters is used by both (as a key example, SMTP also doesn't allow the ()'s despite not supporting MIME comments). There are some weird differences, like RFC5321 allowing empty double-quotes as the local part; although, RFC821 did not seem to have that corner case, so I'm starting to think this is a bug introduced in RFC2821 (I had read mailing list posts about this issue a while back, but somehow it wasn't clear from those that it is a mistake).
I maintain, though, that it is very weird to be forcing this particular escape sequence set everywhere: when you lift e-mail addresses out of angle addresses and lists you don't need it anymore, as you can parse the address from the right unambiguously once you hit the @. Regardless, I do need to emphasize the statement in one of the earlier versions of my comment that RFC3696 has recommendations for e-mail address validation, and it includes the MIME escaping. I thereby doubt that my opinion, to be explicit, is shared by some of the people who worked on these specifications.
(That said, RFC3696 is weird... it mentions, for example, a limit of 64 characters on a username, but in fact that was just a "minimum maximum" from SMTP, and SMTP was quite clear that "TO THE MAXIMUM EXTENT POSSIBLE, IMPLEMENTATION TECHNIQUES WHICH IMPOSE NO LIMITS ON THE LENGTH OF THESE OBJECTS SHOULD BE USED", while at the same time saying that you must not send such things; I guess "welcome to Postel" ;P.)
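For what it's worth, the "parse the address from the right once you hit the @" idea can be sketched in a few lines of Python. This is only an illustration of that one claim, assuming the input is a raw, unescaped address from a form field; `split_address` is a hypothetical helper, and it does not validate the domain at all.

```python
from typing import Tuple


def split_address(raw: str) -> Tuple[str, str]:
    """Split a raw (unescaped) address into (local part, domain).

    A domain name cannot contain "@", so the LAST "@" is always the
    delimiter, whatever characters the local part happens to contain.
    """
    local, sep, domain = raw.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError("not an address: %r" % raw)
    return local, domain


print(split_address("user@example.org"))            # ('user', 'example.org')
print(split_address("odd local@part@example.org"))  # ('odd local@part', 'example.org')
```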
Great. Now I'm confused. Isn't "no right to expect a web form will let you enter your address with nested comments" the exact opposite of "liberal in what you accept from others"?
When damn near every application today requires the user to click a link sent to their email, I can't envision any reason for actually validating the email at form submission time (except len <= 254). If they click the activation link, it's a legitimate user. End of story.
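A rough Python sketch of that flow, assuming an in-memory dict stands in for the database and that the caller actually sends the email. The names `start_registration`, `confirm`, and `_pending` are invented for this example.

```python
import secrets
from typing import Dict, Optional

MAX_ADDRESS_LENGTH = 254  # the only form-time check suggested above

# Hypothetical in-memory store mapping token -> unconfirmed address.
_pending: Dict[str, str] = {}


def start_registration(address: str) -> str:
    """Record an unconfirmed address and return the token to email to it."""
    if len(address) > MAX_ADDRESS_LENGTH:
        raise ValueError("address too long")
    token = secrets.token_urlsafe(32)
    _pending[token] = address
    return token  # the caller puts this token in the activation link


def confirm(token: str) -> Optional[str]:
    """Return the confirmed address, or None for an unknown or reused token."""
    return _pending.pop(token, None)
```

Nothing here inspects the characters of the address: the activation link is the validation.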
If you're trying to optimize a registration process, you want to minimize the number of people you lose at each step. If someone gives you a bad email address, and then you try to send the activation link there, you've lost that registration. If you catch it when they submit the form then you've got a better shot at getting a complete registration from that user.
But there are several businesses that tried to do exactly that and lost me as a customer since they didn't accept my perfectly valid address with a + in it.
The key is that your validation is not about "accepting addresses you know are good" but "rejecting addresses you know are bad." If I enter foo@gmail@com as my email address, it's bad user experience to make me sit around refreshing my inbox and wondering where the confirmation email is.
There's a reason why most of those valid email addresses are not allowed by most email providers. As an example, from 1997 to the early 2000s, underscores were very popular among internet users and most of them had an underscore in their e-mail addresses. While applying for a job, they would fill out a form on the employer's website or in some form application software, and the data would later be tabulated.
Once tabulated, the entire email address would be underlined by many popular software packages of the time, since it was essentially treated as a link. Then, when recruiters copied and pasted these applicants' e-mail IDs into their email clients, the underscore would be lost and a space would appear in its place (the recruiters had no idea why), so the applicants never received these emails. Hence, after receiving many such reports, many email providers stopped allowing people to register an underscore (or any complex character), just to avoid these hassles.
I don't care for overaggressive email validators myself, but if you are registering with my service using an email of ""()<>[]:,;@\\\"!#$%&'*+-/=?^_`{}| ~ ? ^_`{}|~.a"@example.org" I'll probably want you to enter something more reasonable.
Anyway just because it is valid according to the RFC, doesn't mean that it's actually a valid user's email address.
I guess I won't be using your service then! Quite frequently I find that services will not accept my perfectly valid email address as valid. This is on the whole their loss, since if they can't even properly validate an email address there are probably more loose ends and I'd rather not find out which.
I run into this a lot using a "+" in my Gmail address. I like to use "username+site_name@gmail.com" for easy filtering, but there are still quite a large number of sites that do not accept this as valid, and it is extremely annoying.
Years ago, I must have signed up or given my email address to Urban Outfitters with a +uo. They happily continue to send me emails, but their unsubscribe form does not accept it as a valid address! I've tried reporting the bug, but now just send the messages straight to spam.
On the whole, it's probably not a loss. It's probably a gain. On the whole, they validated almost all of the email addresses. They spent their corner case money instead on making the product awesome, and instead of poor []:,;@\\\"!#$%&'*+-/=?^_`{}| ~ ? ^_`{}|~.a"@example.org getting a copy, everyone else got something better.
I don't understand. Won't a correct-to-spec validation procedure take about as long to implement as an ad hoc procedure? I'm guessing a maximum of two developer-hours, and that's if you're extremely careful (i.e. unit tests).
It's not the worst way, but it's far from being the best way.
Sometimes you don't want the user to stop what they are doing, wait for an email to arrive, hope it doesn't land in spam, read it, click the link, and then continue. This is usually the case if you want them to buy something.
Secondly, what happens when they don't get the email? We had lots of problems where people were signing up from the following domains: gmail.cm, gmail.co, gmail.con, tmail.com, gmail.oom, gamil.com, hotamail.com, homtail.com, etc.
You want to firstly do some simple checks to see if it looks roughly like an email address. I like to see if it matches \S+@\S+\.\S+ (and ignore the few people with a top-level domain). Then you want to validate the actual domain to see if it's a misspelling of a popular address.
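Those two checks could look something like this in Python. It is a sketch, not what we actually ran: the `KNOWN_DOMAINS` list and the 0.8 similarity cutoff are assumptions picked for the example, and `difflib` is just one convenient way to find near-misses.

```python
import difflib
import re
from typing import Optional

# The loose shape check from above: something@something.something
LOOKS_LIKE_EMAIL = re.compile(r"\S+@\S+\.\S+")

# Hypothetical list of popular domains worth checking typos against.
KNOWN_DOMAINS = ["gmail.com", "hotmail.com", "yahoo.com"]


def looks_like_email(address: str) -> bool:
    """True if the whole string roughly has the shape of an email address."""
    return LOOKS_LIKE_EMAIL.fullmatch(address) is not None


def suggest_domain(address: str) -> Optional[str]:
    """Return a popular domain the user probably meant, or None."""
    domain = address.rpartition("@")[2].lower()
    if domain in KNOWN_DOMAINS:
        return None  # already spelled correctly
    matches = difflib.get_close_matches(domain, KNOWN_DOMAINS, n=1, cutoff=0.8)
    return matches[0] if matches else None


print(looks_like_email("foo@gmail@com"))  # False
print(suggest_domain("me@gmail.cm"))      # gmail.com
print(suggest_domain("me@hotamail.com"))  # hotmail.com
```

The suggestion is best shown as a "did you mean ...?" prompt rather than a hard rejection, since a legitimate domain can sit very close to a popular one.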
Why not use a library that validates emails? You save time, and you actually accept valid emails.
I agree that the example with a lot of symbols is over the top, but when a website doesn't accept foo+bar@host.com, I assume the product will be sub-par quality-wise. If the author did not follow a rigorous process for something as simple as email validation, I doubt he'll be more rigorous in other parts of his project.
I've given up on using + addresses. As other people have pointed out, the worst thing about them is that sometimes you will find that it works initially but will break other parts of the application now or later. For example, account creation works fine, but in two years, you can no longer log in.
Accepting "()<>[]:,;@\\\"!#$%&'*+-/=?^_`{}| ~ ? ^_`{}|~.a"@example.org versus accepting j@ww.com
Hardly a valid comparison. Yes, my theoretical nonexistent service would validate addresses like yours. That is vastly different from the one I mentioned.
I once had an issue with a ticket purchasing website which accepted my "name+SERVICE" email address when I signed up, but it refused it when I tried to sign in later, or possibly just when I was trying to use the "forgot password" functionality - it's been a while. Either way, I was effectively locked out of a website that was storing my credit card information.
Since I was in a hurry to buy the tickets (a last minute gift), I signed up for a second account without the "+SERVICE" section, bought the tickets, and sent them an email asking them to either merge the accounts, delete the old one, or otherwise allow me to access the other account so I could purge my credit card data.
What happened next was a bit shocking. They sent me an email back a few days later saying that they had fixed the issue with my account by removing the '+' symbol.
The risk of exploit was relatively low due to the extreme obscurity, but it was technically possible for a short while for a person to go to my email provider and register the username "nameSERVICE" and access my account on the ticketing website, including my credit card information.
The sites that let you sign up using the + and then don't work later tend to be using URL encoding/decoding and converting that + into a space in some cases but not in others.
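The mismatch is easy to reproduce with Python's standard `urllib.parse`. This is only a demonstration of the failure mode using the "name+SERVICE" style of address mentioned in this thread, not a reconstruction of any particular site's code.

```python
from urllib.parse import quote, unquote, unquote_plus

address = "name+SERVICE@example.com"

# Form-style decoding treats "+" as an encoded space and corrupts the address.
print(unquote_plus(address))  # name SERVICE@example.com

# Plain percent-decoding leaves the "+" alone.
print(unquote(address))       # name+SERVICE@example.com

# Percent-encoding the "+" before it goes into a URL makes the round trip safe
# under either decoder.
encoded = quote(address, safe="")
print(encoded)                # name%2BSERVICE%40example.com
print(unquote_plus(encoded))  # name+SERVICE@example.com
```

If signup goes through one decoding path and login or password reset goes through the other, the stored address and the looked-up address no longer match.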
That address is not "common" even if it is technically valid. If you want to talk about your own email address, you should provide a sample address that is similar so we know what you're talking about.
[edit: I see your address in your profile. It is nothing like the example given.]
If it's the one that you have on some of your social media sites, nice. I can suspect why they reject it but is it primarily because of the short length?
At a guess the problem is probably a code snippet that checks for minimum length. You could have an even shorter one for a single letter domain in a two letter tld, for instance x@y.nu
Just because myname@xyz.com is valid according to the RFC does not mean xyz.com is an existing domain name with a mail server set up and an account 'myname', let alone that the user who entered it actually has access to that account. Just send a validation email.
You are confusing "valid email address" with "a valid email address, functioning mail server and user with appropriate credentials to access the system that processes email for the supplied email address."
What you quoted was "valid user's email address", not "valid email address". While "valid email address" is arguable, a "valid user"'s email address is more clearly about whether it accepts mail. :)
I had no idea that email addresses could be so complex. ""()<>[]:,;@\\\"!#$%&'*+-/=?^_`{}| ~ ? ^_`{}|~.a"@example.org" is valid?
It won't be a popular view around here, but that SHOULDN'T be valid. The spec needs to change. I won't be making changes to any of the large sites that my company manages to accept weird characters like brackets, semicolons and quotation marks as a valid email address. That's just asking for an XSS or SQL injection attack and other trouble.
You sound like a lazy developer who doesn't want to learn the proper way to protect against XSS and SQL injection. If you have this mentality, your code probably has similar vulnerabilities in pieces of data other than the email address.
This is probably not a popular view around here either, but if you're examining the contents of the string to avoid XSS or SQL injection attacks, you're doing it wrong.
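A minimal Python sketch of the alternative, assuming SQLite for storage and plain string concatenation for the HTML: the address is passed to the database as a bound parameter and escaped only at output time. The table name and the hostile-looking address are made up for the example.

```python
import html
import sqlite3

# A deliberately hostile-looking address; its contents are never inspected.
address = "\"<script>'; DROP TABLE users;--\"@example.org"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT)")

# Parameter binding: the driver passes the value separately from the SQL text,
# so quotes and semicolons in the address are just data.
conn.execute("INSERT INTO users (email) VALUES (?)", (address,))
stored = conn.execute("SELECT email FROM users").fetchone()[0]

# Escape at the point of output, for the context being written into (HTML).
rendered = "<p>" + html.escape(stored) + "</p>"

print(stored == address)  # True
print(rendered)
```

The stored value is byte-for-byte what the user typed, and the page shows it as text rather than markup, without any character blacklist on the input.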
It helps you remember to whom you gave the address. mikec+farmville@digitalsushi.com is a nice comment that tells me "I used this for farmville; who is now emailing me with it?"
No, that's not a comment. That plus sign is part of the email address. It's up to your mail server how to figure out how to parse it and stuff it into the correct inbox.
The RFC allows for actual nested comments to appear within the address. Though nobody actually uses this and really the RFC is talking about formatting for email messages in transit, not how you should or shouldn't record your address on a form.
A malicious sender could just strip out your +whatever, and then you are where you started, unless you already filter all mail without a + part and give all your friends a salted email address.