Domain hacks with unusual Unicode characters

jfk13 · on Nov 2, 2018

Under "How does this work?", the post refers to the text in RFC 5895:

> map characters to the "Simple_Lowercase_Mapping" property (the fourteenth column) in <http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>, if any.

as if that were responsible for turning ℡ into TEL. But the fourteenth column in

> 2121;TELEPHONE SIGN;So;0;ON;<compat> 0054 0045 004C;;;;N;T E L SYMBOL;;;;

is empty!

What we're actually seeing is a Compatibility Decomposition, used when Unicode normalisation form NFKC is applied to the text.

Whether it's appropriate for browsers to be applying NFKC may be questionable. RFC 5895 calls for the use of NFC (which would not apply mappings like this), but it also says that

> These form a minimal set of mappings that an application should strongly consider doing. Of course, there are many others that might be done.

which leaves things rather open.
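
For anyone who wants to check this locally, here is a minimal sketch using Python's standard unicodedata module. That module is my choice for illustration only; browsers use their own IDNA code, and nothing here claims to reproduce it. It relies only on the UnicodeData.txt entry quoted above.

```python
import unicodedata

TEL = "\u2121"  # TELEPHONE SIGN

# There is no simple lowercase mapping, so lower() leaves the sign untouched.
lowered = TEL.lower()

# The decomposition field (sixth column of UnicodeData.txt) holds the mapping.
decomposition = unicodedata.decomposition(TEL)

# NFC keeps the sign as-is; NFKC applies the compatibility decomposition.
nfc = unicodedata.normalize("NFC", TEL)
nfkc = unicodedata.normalize("NFKC", TEL)

print(decomposition)   # <compat> 0054 0045 004C
print(nfc == TEL)      # True
print(nfkc)            # TEL
```

Lowercasing the NFKC result is then what yields "tel".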

kam · on Nov 2, 2018

I was hoping this would work on the "Regional Indicator Symbols" used for flag emoji, since they encode the same country codes used in ccTLDs.

🇩🇪 is actually two Unicode code points 🇩 (U+1F1E9 'REGIONAL INDICATOR SYMBOL LETTER D') 🇪 (U+1F1EA 'REGIONAL INDICATOR SYMBOL LETTER E') that fonts display as a single flag grapheme. But google.🇩🇪 becomes the Punycode google.xn--h77hc and fails to resolve rather than mapping to ASCII as these other characters do.
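
A quick way to see why, sketched with Python's unicodedata module (an illustration, not what any browser runs): the regional indicators carry no decomposition mapping, so NFKC has nothing to fold them into.

```python
import unicodedata

flag = "\U0001F1E9\U0001F1EA"  # REGIONAL INDICATOR SYMBOL LETTER D + LETTER E

names = [unicodedata.name(ch) for ch in flag]
decompositions = [unicodedata.decomposition(ch) for ch in flag]

# With no decomposition available, NFKC returns the input unchanged.
unchanged = unicodedata.normalize("NFKC", flag) == flag

print(names)
print(decompositions)  # ['', '']
print(unchanged)       # True
```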

edent · on Nov 2, 2018

Yes, I found it rather curious to see which symbols converted back to "pure" ASCII.

For example, ℡ goes to TEL, but ℻ goes to punycode!

tyingq · on Nov 2, 2018

  http://℻zero.com works for me.

Apparently just doesn't work as a TLD (chrome/linux).

gondo · on Nov 2, 2018

most likely because .fax is not a valid TLD but .tel is
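
A rough sketch of that idea in Python: both signs fold to plain letters under NFKC, and the only difference is whether the result is a real TLD. KNOWN_TLDS and folded_label are hypothetical names for this example, and the set holds just two TLDs mentioned in this thread, not the actual root zone.

```python
import unicodedata

# Illustrative only: two real TLDs from this thread, not the full root zone.
KNOWN_TLDS = {"tel", "com"}


def folded_label(label: str) -> str:
    """NFKC, then lowercase: a rough stand-in for the browser's mapping."""
    return unicodedata.normalize("NFKC", label).lower()


for sign in ("\u2121", "\u213b"):  # TELEPHONE SIGN, FACSIMILE SIGN
    label = folded_label(sign)
    status = "in set" if label in KNOWN_TLDS else "not in set"
    print(f"U+{ord(sign):04X} -> {label} ({status})")
```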

dec0dedab0de · on Nov 2, 2018

This kind of stuff is fun, reminds me of decimal IP addresses.

like http://3520653040 should take you to hn (at least to the IP I'm resolving for hn right now)

Also, I know this is off topic, but that python example really bothered me.

original:

  python -c 'import sys;print sys.argv[1].decode("utf-8").encode("idna")' "℡"

Should have been

  python -c 'print "℡".decode("utf-8").encode("idna")'

Or in python3

  python3 -c 'print("℡".encode("idna"))'

mdewinter · on Nov 2, 2018

I once wrote on this topic, numeric IP addresses:

https://raymii.org/s/articles/IPv4_Address_Conversion_Tricks...

This "feature" is documented in the inet_aton(3) manpage.

edent · on Nov 2, 2018

Pull Requests welcome :-)

More seriously, I just copied that code from somewhere. Why is the other way better?

half-kh-hacker · on Nov 2, 2018

Brevity - You don't need to `import sys` if you just use the literal character instead of reading from argv

emilfihlman · on Nov 2, 2018

>Why is the other way better?

Shorter? More readable? Simpler? Surely you said that in jest.

gondo · on Nov 2, 2018

How does that IP URL work? For me HN resolves to 209.216.230.240. I know about the http://209.216.230.240 format, but how did you get to 3520653040?

vanni · on Nov 2, 2018

An IP is just an unsigned 32 bit integer:

3520653040 = 209 * 256^3 + 216 * 256^2 + 230 * 256^1 + 240 * 256^0

The 209.216.230.240 representation is the sequence of octets (256 = 2^8) from MSB to LSB expressed as decimal values.
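
The same arithmetic as a short Python sketch, cross-checked against the standard ipaddress module (variable names are just for this example):

```python
import ipaddress

dotted = "209.216.230.240"
octets = [int(part) for part in dotted.split(".")]

# Horner-style evaluation of 209*256^3 + 216*256^2 + 230*256^1 + 240*256^0.
value = 0
for octet in octets:
    value = value * 256 + octet

# The standard library agrees in both directions.
assert value == int(ipaddress.IPv4Address(dotted))
assert str(ipaddress.IPv4Address(value)) == dotted

print(value)  # 3520653040
```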

dane-pgp · on Nov 2, 2018

This makes me wonder if there are new ways to consider the question of "What is the shortest possible domain name?".

An amusing approach is the one taken here:

https://www.namepros.com/threads/worlds-shortest-domain-name...

leading to the domain used for this URL shortener:

https://l.tl/

However, the ccTLD for São Tomé and Príncipe allows single-letter second-level domains, so perhaps this is a contender:

l.ﬅ

kmm · on Nov 2, 2018

A few TLDs have A records, like ai. or dk.

http://ai. http://dk.

hultner · on Nov 2, 2018

How come, they don't seem to resolve to anything? I tried both via my browser and curl.

I can see the records, and if I curl the IP address for the dk records I only get an nginx 301 redirect loop to the HTTPS version, which serves a certificate for https://eksempel.dk.

Similar experience with ai: curling the IP seems to point to a http://offshore.ai page.

Are the top-level A records used for some other protocol? Do they serve any purpose?

foepys · on Nov 2, 2018

I tried both in Chrome in Android and they are working fine for me.

merb · on Nov 2, 2018

you need to add the dot at the end. (for your browser)

captn3m0 · on Nov 2, 2018

I run a scan on these once every few months: https://captnemo.in/blog/2018/06/02/google-tld-no-more-a-rec...

krallja · on Nov 2, 2018

@edent did you submit this to HN as https://xn--69f31l4t57c0mag4b613h.xn--7uh4898msjaso/🆆🆃🅵/ or the ASCII variant?

edit: whoa, HN autoconverted it to Punycode. 🅂𝖍𝐤ₛᵖ𝒓.ⓜ𝕠𝒃𝓲/🆆🆃🅵/

bluejekyll · on Nov 2, 2018

It’s a little sad that we ended up with punycode, given that utf8 is so elegant as a forward compatible character set with ascii.

DNS’ concern over backward compatibility is a bit of a pain sometimes. And now we even have two competing standards, where multicast DNS, mDNS, allows utf8, but “standard” DNS does not.

drewmate · on Nov 2, 2018

An important benefit of punycode is that it provides some protection against homograph attacks [0]. There are so many similar-looking characters in Unicode that it seems reasonable to trim the allowed characters to a subset. Of course it's a compromise and ASCII's not perfect, but it's a lot easier to spot g00gle.com compared to gооgle.com.

[0] https://en.wikipedia.org/wiki/IDN_homograph_attack
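
To illustrate the difference, here is a crude Python sketch that flags labels mixing scripts. letter_scripts is a hypothetical helper, and taking the first word of each character's Unicode name is only a heuristic, not how browsers actually decide when to show punycode.

```python
import unicodedata


def letter_scripts(label: str) -> set:
    """Crude heuristic: the first word of each letter's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}


ascii_lookalike = "g00gle"         # digit zeros, visible on close inspection
mixed_script = "g\u043e\u043egle"  # CYRILLIC SMALL LETTER O, twice

print(sorted(letter_scripts(ascii_lookalike)))  # ['LATIN']
print(sorted(letter_scripts(mixed_script)))     # ['CYRILLIC', 'LATIN']
```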

kuschku · on Nov 2, 2018

At the same time there are sites like flüge.de which are not reachable under any domain except the Unicode one, and while ü could be written as ue, fluege.de is already owned by a competitor.

Over time, punycode is going to cause more phishing problems in non-ascii countries than it's going to solve, because users aren't going to see a difference between xn-blabla.de and xn-blablu.de if all domains are unreadable to them.
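
For reference, a small Python sketch of what the encoded form looks like. Real encoded labels start with "xn--" (two hyphens); the xn-blabla names above are just placeholders. Python's built-in "idna" codec follows the older IDNA 2003 rules, so treat this as an approximation of what a browser does.

```python
name = "fl\u00fcge.de"  # flüge.de

ace = name.encode("idna")        # ASCII-compatible "xn--..." form, as bytes
round_trip = ace.decode("idna")  # back to the Unicode form

print(ace.decode("ascii"))
print(round_trip == name)
```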

btown · on Nov 2, 2018

I feel like this is a browser UX problem, right? A browser designed to prevent phishing of readers of both ASCII and non-ASCII languages might display both the punycode and unicode versions of a website, and if a heuristic is detected that a homograph is used that would otherwise result in an Alexa Top 100k site, display a dialog to warn against a phishing attack. (Your flüge.de example shouldn't trigger that warning, for instance.)

https://github.com/phishai/phish-protect is an attempt to do this, but I think there's a better middle ground for international users that doesn't simply block-by-default all punycode domains.

tinus_hn · on Nov 2, 2018

It’s so you can’t use punycode to create similarly looking but different domains. Sure there are some defenses but they are weak.

Volt · on Nov 2, 2018

Punycode has its benefits, when people try to phish with phony domains that look like legitimate ones.

LaikaF · on Nov 2, 2018

I've found it interesting which things do this vs don't.

Slack doesn't convert unicode urls, discord does for example.

edent · on Nov 2, 2018

The ASCII one. There appears to be a bug in some URL parsers that converts things to punycode incorrectly.

ckd123abd · on Nov 2, 2018

that's interesting

_kxbd · on Nov 2, 2018

I recommend setting network.IDN_show_punycode to true in Firefox via about:config. This will help keep you safe from this phishing vector.

Ndymium · on Nov 2, 2018

Vivaldi shows domains in punycode by default. I believe this is the only reasonable solution, otherwise browser makers will always be playing catch-up with exploiters.

bluejekyll · on Nov 2, 2018

At that point, how much value is there in supporting Unicode? By only using ascii (punycode), it pretty much eliminates the reason it exists: To allow software to show a domain in someone’s native language.

Should we perhaps instead be restricting the domain characters to the glyphs of ascii (rfc1035 compliant) and those glyphs that appear in their locale? Otherwise revert to punycode when the glyphs fall outside those ranges.

lmm · on Nov 2, 2018

> At that point, how much value is there in supporting Unicode? By only using ascii (punycode), it pretty much eliminates the reason it exists: To allow software to show a domain in someone’s native language.

Allowing a user to enter a domain in their native language is very much worthwhile, I'd say, even if we revert to ascii for display.

tomsmeding · on Nov 2, 2018

That would still leave e.g. users in a locale with a Cyrillic alphabet open to homograph attacks.

paulpauper · on Nov 2, 2018

this is fascinating. given how many pixels can fit in a character, the possibilities are in the tens of thousands at least.

It's almost possible to write a simple math paper using unicode instead of LaTeX

these characters can also be used to evade username restrictions and other spam filters when character substitution does not work.

you can even use these codes for bitcoin wallets

lelf · on Nov 2, 2018

Also: wordpreß.com
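
A sketch of why that one behaves differently, in Python: ß has no compatibility decomposition, so NFKC leaves it alone, while Unicode full case folding turns it into "ss". As far as I understand, IDNA 2003-style mapping folds it in a similar way and IDNA 2008 keeps ß as a distinct letter, so where this lands depends on the client.

```python
import unicodedata

name = "wordpre\u00df.com"  # U+00DF LATIN SMALL LETTER SHARP S

nfkc = unicodedata.normalize("NFKC", name)  # unchanged: no decomposition
folded = name.casefold()                    # full case folding: sharp s -> "ss"

print(nfkc == name)  # True
print(folded)        # wordpress.com
```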