
"punycode" does seem to break a few things in userland, but it obviously should exist so the web is more usable in more languages.

I've delved into the subject a bit and as someone else has mentioned, "homograph attacks" are a thing.

Looking deeper, it seems each TLD has its own set of dictionaries and rules for which combinations of characters are acceptable and which aren't (to avoid homograph attacks). As I understand it, the solution is per TLD: different dictionaries and rules for each one.

For third parties like LinkedIn, I can understand why they may be averse to embracing IDN domains, but a company of their size and reach really should be looking after their wider user base.

For working with IDN domains, I've found the easiest way to store and process them is in their ASCII (Punycode) form, converting to Unicode only for human consumption.
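
A minimal sketch of that approach, assuming Python and its built-in "idna" codec (which follows the older IDNA 2003 rules, so some labels may come out differently under IDNA 2008). The helper names and the example domain are made up for illustration:

```python
def to_ascii(domain: str) -> str:
    """Return the ASCII (Punycode, xn--) form, suitable for storage and lookups."""
    # The stdlib "idna" codec converts each dot-separated label separately.
    return domain.encode("idna").decode("ascii")


def to_unicode(stored: str) -> str:
    """Convert a stored ASCII domain back to Unicode for display only."""
    return stored.encode("ascii").decode("idna")


stored = to_ascii("b\u00fccher.example")  # "bücher.example"
print(stored)              # store and compare this form
print(to_unicode(stored))  # show this form to people
```

Plain ASCII names pass through unchanged, so the same column can hold both kinds of domain.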

I tried hunting for a C library that deals with the dictionary issues, and was left not entirely sure whether libidn supports them.




Why worry about the dictionaries? No legitimate user will accidentally enter a Greek Alpha rather than an A, for example. If a malicious user does this it won't resolve anyway because the domain is banned by the rules and can't be registered.


Yeah, perhaps less important for LinkedIn but more pertinent for other parties like domain registrars and TLS cert issuers. Apparently the 'rules', à la the registry's DNS, are an evolving thing.

https://www.theregister.com/2020/03/04/homograph_attacks_sti...

If you ignore the specific characters and have no local frame of reference, checking a name would require at least a DNS lookup.


This kind of trick is used for phishing: users are sent mail containing legitimate-looking links.


Am I misunderstanding or are you looking for Unicode normalization?

http://site.icu-project.org/design/normalization/custom
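
For what it's worth, a quick sketch with Python's unicodedata module (standing in for ICU here, which is an assumption on my part) suggests plain normalization would not cover the Greek Alpha case mentioned upthread. NFKC folds compatibility variants such as the fullwidth A onto Latin A, but it leaves cross-script lookalikes alone:

```python
import unicodedata

latin_a = "A"            # U+0041 LATIN CAPITAL LETTER A
greek_alpha = "\u0391"   # U+0391 GREEK CAPITAL LETTER ALPHA
fullwidth_a = "\uff21"   # U+FF21 FULLWIDTH LATIN CAPITAL LETTER A

for ch in (latin_a, greek_alpha, fullwidth_a):
    normalized = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> U+{ord(normalized):04X}")

# Compatibility variants are folded onto the Latin letter...
folds_fullwidth = unicodedata.normalize("NFKC", fullwidth_a) == latin_a
# ...but a lookalike from another script stays a distinct code point.
folds_greek = unicodedata.normalize("NFKC", greek_alpha) == latin_a
```

So normalization is a useful first step, but catching homographs still needs some kind of per-script or per-TLD character table.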


I maintained a domain database and normalised to ASCII with libidn. Sometimes the input data was not from zone files, and I would've preferred to be able to double-check the characters used in a potential domain to ensure it's something that's registerable, without any network access required. That was my motivation for looking into the topic originally.
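
Roughly what I had in mind, as a sketch in Python rather than C. The ALLOWED table and the function name are hypothetical, and the character sets are placeholders, not any registry's actual policy; a real version would load each TLD's published IDN table instead:

```python
import unicodedata

ASCII_LDH = set("abcdefghijklmnopqrstuvwxyz0123456789-")

# Hypothetical per-TLD character tables (placeholders, NOT real registry rules).
ALLOWED = {
    "com": ASCII_LDH,
    "de": ASCII_LDH | set("\u00e4\u00f6\u00fc\u00df"),  # ä ö ü ß
}


def is_plausibly_registerable(domain: str) -> bool:
    """Offline check: every character must be in the TLD's local table."""
    domain = unicodedata.normalize("NFC", domain.lower())
    name, _, tld = domain.rpartition(".")
    allowed = ALLOWED.get(tld)
    if allowed is None or not name:
        return False  # unknown TLD or nothing left of the dot
    for label in name.split("."):
        if not label or label.startswith("-") or label.endswith("-"):
            return False
        if not set(label) <= allowed:
            return False  # some character is not in this TLD's table
    return True
```

With these placeholder tables, a name containing ü passes for "de" but not for "com", and a Greek alpha is rejected under both, all without touching the network.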



