Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I was not able to understand why these code points are bad. The post states that they are bad, but why? Any examples? Any actual situations and PoC that might help me understand how will that break "my code"?


Sometimes it's not just "your code". Strings are often interchanged and sent to many other parties.

And some of the codepoints, such as the surrogate codepoints (which MUST come in pairs in properly encoded UTF-16), may not break your code but break poorly-written spaghetti-ridden UTF-16-based hellholes that do not expect unpaired surrogates.

Something like:

1. You send a UTF-8 string containing normal characters and an unpaired surrogate: "Hello /uDEADworld" to FooApp.

2. FooApp converts the UTF-8 string to UTF-16 and saves it in a file. All without validation, so no crashes will actually occur; worst case scenario, the unpaired surrogate is rendered by the frontend as "�".

3. Next time, when it reads the file again, this time it is expecting normal UTF-16, and it crashes because of the unpaired surrogate.

(A more fatal failure mode of (3) is out-of-bounds memory read if the unpaired surrogate happens at the end of string)


I had a github action with a phrase 'filter: \directory\u02filename.txt' or something close to this and the the filename got interpreted as a utf-8 character rather than a string literal causing the application to throw an error about invalid utf 8 in the path. Had to go about setting it up to quote the strings differently, but you get to see a lot of these issues in the wild.


Suppose, when you were registering your username `develatio`, you decided to put U+202E RIGHT-TO-LEFT OVERRIDE in there as well. Now when somebody is reading this page and their browser gets to your username, it switches the text direction to render it right-to-left.


and "that's it"? I mean, it does sound like it might introduce unexpected UI behaviour, but are there any other more serious / dangerous consequences?


One of my pet peeves is when UIs don't clearly constrain and delineate the extent of user-controlled text. Plenty of phishing attacks have relied on having attacker-controlled input seem authoritative, e.g. getting gmail to repeat back something to the victim.


Making any page that mentions you – including admin pages that might be used to disable your account – become unreadable is bad enough.

Another comment linked to this:

https://trojansource.codes


RTL lets you obfuscate file extensions.

E.g. Annexe.txt (that you might assume would be safely opened by a text editor) could actually be Ann\u202Etxt.exe, a dangerous executable.


Yes, dangerous consequences of unexpected UI behaviour: imagine writing a URL backwards with a right-to-left override, and it clearly says www.yourbank.example but it goes to www.evilsite.example/example.yourbank.www




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: