Domain hacks with unusual Unicode characters (shkspr.mobi)
105 points by edent on Nov 1, 2018 | 35 comments



Under "How does this work?", the post refers to the text in RFC 5895:

> map characters to the "Simple_Lowercase_Mapping" property (the fourteenth column) in <http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>, if any.

as if that were responsible for turning ℡ into TEL. But the fourteenth column in

> 2121;TELEPHONE SIGN;So;0;ON;<compat> 0054 0045 004C;;;;N;T E L SYMBOL;;;;

is empty!

What we're actually seeing is a Compatibility Decomposition, used when Unicode normalisation form NFKC is applied to the text.

Whether it's appropriate for browsers to be applying NFKC may be questionable. RFC 5895 calls for the use of NFC (which would not apply mappings like this), but it also says that

> These form a minimal set of mappings that an application should strongly consider doing. Of course, there are many others that might be done.

which leaves things rather open.
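
A minimal sketch of the distinction, using Python's standard unicodedata module (the variable names are mine, not from the post). It checks that U+2121 has no lowercase mapping, that NFC leaves it alone, and that only NFKC turns it into TEL:

```python
import unicodedata

tel = "\u2121"  # TELEPHONE SIGN

# Simple lowercase mapping: there is none, so the character is unchanged.
lowered = tel.lower()

# NFC only applies canonical mappings, so it also leaves the sign alone.
nfc = unicodedata.normalize("NFC", tel)

# NFKC applies the <compat> decomposition listed in UnicodeData.txt.
nfkc = unicodedata.normalize("NFKC", tel)

print(unicodedata.decomposition(tel))  # the decomposition field for U+2121
print(repr(lowered), repr(nfc), repr(nfkc))
```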


I was hoping this would work on the "Regional Indicator Symbols" used for flag emoji, since they encode the same country codes used in ccTLDs.

🇩🇪 is actually two Unicode code points 🇩 (U+1F1E9 'REGIONAL INDICATOR SYMBOL LETTER D') 🇪 (U+1F1EA 'REGIONAL INDICATOR SYMBOL LETTER E') that fonts display as a single flag grapheme. But google.🇩🇪 becomes the Punycode google.xn--h77hc and fails to resolve rather than mapping to ASCII as these other characters do.
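
A rough way to see why, assuming the browser's mapping is NFKC as suggested elsewhere in this thread: the regional indicators carry no decomposition at all, so normalisation has nothing to map and the label falls through to Punycode. This sketch uses Python's standard unicodedata module and its built-in punycode codec; the variable names are illustrative.

```python
import unicodedata

flag = "\U0001F1E9\U0001F1EA"  # REGIONAL INDICATOR SYMBOL LETTER D + E

# Neither code point carries a decomposition, so NFKC has nothing to map.
decompositions = [unicodedata.decomposition(ch) for ch in flag]
nfkc = unicodedata.normalize("NFKC", flag)

# With no ASCII mapping available, the label has to be Punycode-encoded.
label = "xn--" + flag.encode("punycode").decode("ascii")

print(decompositions, nfkc == flag, label)
```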


Yes, I found it rather curious to see which symbols converted back to "pure" ASCII.

For example, ℡ goes to TEL, but ℻ goes to punycode!


  http://℻zero.com works for me.
Apparently just doesn't work as a TLD (chrome/linux).


most likely because .fax is not a valid TLD but .tel is


This kind of stuff is fun, reminds me of decimal IP addresses.

like http://3520653040 should take you to HN (at least to the IP I'm resolving for HN right now)

Also, I know this is off topic, but that python example really bothered me.

original:

  python -c 'import sys;print sys.argv[1].decode("utf-8").encode("idna")' "℡"

Should have been

  python -c 'print "℡".decode("utf-8").encode("idna")'

Or in python3

  python3 -c 'print("℡".encode("idna"))'


I once wrote on this topic, numeric IP addresses:

https://raymii.org/s/articles/IPv4_Address_Conversion_Tricks...

This "feature" is documented in the inet_aton(3) manpage.


Pull Requests welcome :-)

More seriously, I just copied that code from somewhere. Why is the other way better?


Brevity - You don't need to `import sys` if you just use the literal character instead of reading from argv


>Why is the other way better?

Shorter? More readable? Simpler? Surely you said that in jest.


How does that IP URL work? For me HN resolves to 209.216.230.240. I know about the http://209.216.230.240 format, but how did you get to 3520653040?


An IP is just an unsigned 32 bit integer:

3520653040 = 209 * 256^3 + 216 * 256^2 + 230 * 256^1 + 240 * 256^0

The 209.216.230.240 representation is the sequence of octets (256 = 2^8) from MSB to LSB expressed as decimal values.
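
The same conversion as a short sketch with Python's standard ipaddress module; the variable names are illustrative, and the hand-rolled loop is only there to mirror the base-256 arithmetic.

```python
import ipaddress

addr = ipaddress.IPv4Address("209.216.230.240")

# Integer form: each octet is one base-256 digit, most significant first.
as_int = int(addr)

# The same value built by hand from the four octets.
by_hand = 0
for octet in addr.packed:
    by_hand = by_hand * 256 + octet

# And back from the integer to dotted-quad notation.
round_trip = str(ipaddress.IPv4Address(as_int))

print(as_int, by_hand, round_trip)
```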


This makes me wonder whether there are new ways to consider the question of "What is the shortest possible domain name?".

An amusing approach is the one taken here:

https://www.namepros.com/threads/worlds-shortest-domain-name...

leading to the domain used for this URL shortener:

https://l.tl/

However, the ccTLD for São Tomé and Príncipe allows single-letter second-level domains, so perhaps this is a contender:

l.ſt
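
A quick check of that idea, assuming the client applies NFKC as discussed earlier in the thread: the ligature U+FB05 is a single code point that normalises to the two ASCII letters of the .st ccTLD. This only shows the normalisation step, not that any particular browser resolves it.

```python
import unicodedata

ligature = "\uFB05"  # LATIN SMALL LIGATURE LONG S T
host = "l." + ligature

# Compatibility decomposition is applied recursively: ligature -> long s + t -> s + t.
normalized = unicodedata.normalize("NFKC", host)

print(len(host), normalized)
```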


A few TLDs have A records, like ai. or dk.

http://ai. http://dk.


How come they don't seem to resolve to anything? I tried both via my browser and curl.

I can see the records, and if I curl the IP address for the dk records I only get an nginx 301 redirect loop to the HTTPS version, which serves a certificate for https://eksempel.dk.

Similar experience with ai: curling the IP seems to point to a http://offshore.ai page.

Are the top-level A records used for some other protocol? Do they serve any purpose?


I tried both in Chrome in Android and they are working fine for me.


you need to add the dot at the end. (for your browser)


I run a scan on these once every few months : https://captnemo.in/blog/2018/06/02/google-tld-no-more-a-rec...


@edent did you submit this to HN as https://xn--69f31l4t57c0mag4b613h.xn--7uh4898msjaso/🆆🆃🅵/ or the ASCII variant?

edit: whoa, HN autoconverted it to Punycode. 🅂𝖍𝐤ₛᵖ𝒓.ⓜ𝕠𝒃𝓲/🆆🆃🅵/


It’s a little sad that we ended up with punycode, given that utf8 is so elegant as a forward compatible character set with ascii.

DNS’ concern over backward compatibility is a bit of a pain sometimes. And now we even have two competing standards, where multicast DNS, mDNS, allows utf8, but “standard” DNS does not.


An important benefit of punycode is that it provides some protection against homograph attacks [0]. There are so many similar-looking characters in Unicode that it seems reasonable to trim the allowed characters to a subset. Of course it's a compromise and ASCII's not perfect, but it's a lot easier to spot g00gle.com compared to gΠΎΠΎgle.com.

[0] https://en.wikipedia.org/wiki/IDN_homograph_attack
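
To make the comparison concrete, here is a crude sketch that flags labels mixing scripts. It takes the first word of each letter's Unicode name as a stand-in for the script, which is an assumption of mine and not the real Unicode Script property; `letter_scripts` is a hypothetical helper, not part of any library.

```python
import unicodedata

def letter_scripts(label):
    """Return the first word of the Unicode name of every letter in label."""
    return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}

ascii_lookalike = "g00gle"                # digits zero, all letters Latin
cyrillic_lookalike = "g\u043e\u043egle"   # two CYRILLIC SMALL LETTER O

for label in (ascii_lookalike, cyrillic_lookalike):
    scripts = letter_scripts(label)
    verdict = "mixed scripts" if len(scripts) > 1 else "single script"
    print(ascii(label), sorted(scripts), verdict)
```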


At the same time there's sites like flüge.de which is not reachable under any domain except the unicode domain, and while ü could be written as ue, fluege.de is already owned by a competitor.

Over time, punycode is going to cause more phishing problems in non-ascii countries than it's going to solve, because users aren't going to see a difference between xn-blabla.de and xn-blablu.de if all domains are unreadable to them.


I feel like this is a browser UX problem, right? A browser designed to prevent phishing of readers of both ASCII and non-ASCII languages might display both the punycode and unicode versions of a website, and if a heuristic is detected that a homograph is used that would otherwise result in an Alexa Top 100k site, display a dialog to warn against a phishing attack. (Your flüge.de example shouldn't trigger that warning, for instance.)

https://github.com/phishai/phish-protect is an attempt to do this, but I think there's a better middle ground for international users that doesn't simply block-by-default all punycode domains.


It’s so you can’t use punycode to create similarly looking but different domains. Sure there are some defenses but they are weak.


Punycode has its benefits, when people try to phish with phony domains that look like legitimate ones.


I've found it interesting which things do this vs don't.

Slack doesn't convert unicode urls, discord does for example.


The ASCII one. There appears to be a bug in some URL parsers that converts things to punycode incorrectly.


that's interesting


I recommend setting network.IDN_show_punycode to true in Firefox via about:config. This will help keep you safe from this phishing vector.


Vivaldi shows domains in punycode by default. I believe this is the only reasonable solution, otherwise browser makers will always be playing catch-up with exploiters.


At that point, how much value is there in supporting Unicode? By only using ascii (punycode), it pretty much eliminates the reason it exists: To allow software to show a domain in someone’s native language.

Should we perhaps instead be restricting the domain characters to the glyphs of ascii (rfc1035 compliant) and those glyphs that appear in their locale? Otherwise revert to punycode when the glyphs fall outside those ranges.


> At that point, how much value is there in supporting Unicode? By only using ascii (punycode), it pretty much eliminates the reason it exists: To allow software to show a domain in someone’s native language.

Allowing a user to enter a domain in their native language is very much worthwhile, I'd say, even if we revert to ascii for display.


That would still leave e.g. users in a locale with a Cyrillic alphabet open to homograph attacks.


this is fascinating. given how many pixels can fit in a character, the possibilities are in the tens of thousands at least.

It's almost possible to write a simple math paper using unicode instead of Latex

these characters can also be used to evade username restrictions and other spam filters when character substitution does not work.

you can even use these codes for bitcoin wallets


Also: wordpreß.com
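
This one is a case-folding mapping rather than a compatibility decomposition. A small sketch, assuming Python's built-in idna codec, which follows the older IDNA 2003 rules; IDNA 2008 treats ß differently, so other software may not agree.

```python
domain = "wordpre\u00df.com"  # U+00DF LATIN SMALL LETTER SHARP S

# Full Unicode case folding maps sharp s to "ss".
folded = domain.casefold()

# Python's built-in codec (IDNA 2003 nameprep) applies the same mapping.
encoded = domain.encode("idna")

print(folded, encoded)
```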



