A very similar problem to the one described here started my exodus from Google services.
I also have non-Latin characters in my name; however, I knew it was always an issue, so I never used them in paths etc.
At some point, a long time ago, I was tasked with doing some maintenance on a Google Cloud service (can't remember the name of the service now) which was doable only through a Python CLI utility, and it failed with a very similar Python error.
What I found out rather quickly is that the utility took my name from my Google+ profile, which did include those non-Latin characters. No biggie, I thought, and fired off an e-mail to support (yeah, it was those times when it was still that easy). A few hours passed and I received information that this wouldn't be fixed anytime soon and the best course of action would be to change my name.
Of course, the support person probably meant removing the diacritics from my Google+ profile, but it still left an unpleasant aftertaste for years to come.
A Polish relative of mine used to just give an arbitrary substitute name (e.g. "Dave Smith") for restaurant reservations, because even if they could write his last name, they wouldn't be able to pronounce it.
My sibling has a name that has an accent, and just enters it with the plain letter most of the time. The name was once rare and "ethnic", but became popular a generation later so people know how to pronounce it regardless.
Our parents gave us two middle names, wanting to preserve our grandmothers' surnames, but also in the spirit of "Bobby Tables", having ambivalent feelings about the computerization of society tending towards inflexibility.
...misunderstanding of naming customs in the US has actually led to significant consequences due to last names not matching on legal documents.
I remember reading a story about how there are people in China whose names incorporate a character obscure enough that the authorities are trying to eliminate it and get them to change their name. If I recall correctly, Chinese has a particular problem with characters that are part of names that have been around forever, but are no longer used for ordinary writing.
I understand your troubles, I'm from Spain so I have two family names and my given name has an accent. Now that I'm living in Japan, it's an endless source of fun.
Regarding Chinese names and uncommon characters, Japan has the same problem. It's especially problematic for place names, with some kanji used to write the name of a single place in the whole country! I used to live in a place with such an obscure kanji in its name that I couldn't type it on my Linux PC.
It's also difficult for people moving to the other country, with some characters existing in one country but not in the other. I have a Chinese coworker who needs to write his name in katakana because his characters don't exist in Japanese.
Because nearly everything in the US shoehorns things into three boxes, virtually every place my name is recorded on something important is different.
I could've been consistent in using two or three out of four, but when I was younger I was intimidated by forms that say you must enter your "full legal name", so I would, and they would mangle it unpredictably.
Checking account, credit card, drivers license, and property deed, each one different.
Well, in fact, my social security card and my birth certificate don't match, so I was doomed from the start.
It gives me some sympathy for places that try to regulate names to avoid parents doing something too goofy.
I sometimes wonder if there will come a day when all the databases will stop allowing discrepancies, and it won't matter to the powers that be, because it's such a tiny percentage of the population that becomes "unpersons".
a friend of mine has that problem. his company is owned by his wife because his name can't appear in legal documents and he refuses to change it. his approach was that he petitioned unicode to include that character.
i don't think the authorities are actively trying to eliminate those characters; they simply don't want to go through the effort of tracking them down and having them added to the standard. the process takes years and in the meantime you have to live with the inconvenience. also, most of the people faced with the problem would not even know how.
maybe has the benefit that he also can't receive speeding tickets?
> his approach was that he petitioned unicode to include that character.
Sounds like the right approach (although not the easiest). If Unicode can include thousands of smileys, the least they can do is include actual characters used in people's names.
> the best course of action would be to change my name
As someone who has been told this, for other reasons, I empathize. My reaction has always been - "Your system can't even handle names, you need to fix it".
Edit: I wish there was a library / service that helped you handle all sorts of edge cases in names, so that you don't have to worry about it. Just use a user-id, and set / get a name from a lib / service that can actually handle it.
These days everything should be stored as bare UTF-8 data (or utf8mb4 if you're MySQL) and presented without anything else. Don't parse it, don't slice-and-dice it, don't prepend or append titles or honorifics or suffixes, don't make assumptions about length or content beyond "must be > 0 as a whole" and DEFINITELY don't use it as an identifier. Treat it as a non-unique opaque token and you'll be fine greater than 99% of the time.
There are people with no last name. There are people with two or three or twelve middle names. There are people with a number for a last name. There are people with a symbol for their entire name.
Take what they give you and use it and be done with it.
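A minimal sketch of that opaque-token approach in Python (the function name and the single non-empty check are my own illustration, not anyone's actual API):

    def accept_display_name(name: str) -> str:
        # No parsing, no first/last split, no length cap, no character whitelist.
        # The only assumption is "non-empty as a whole".
        if not name:
            raise ValueError("display name must not be empty")
        return name  # store verbatim as UTF-8 and never use it as an identifier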
It looks like there's no general solution possible with Han unification. If you have any two of ZH and JA and KO and VI in a page, you will fail to display one of them correctly for certain characters unless (as in that wiki page) you add a LANG attribute for each element they are contained within.
Personally, I would use the browser's language or user locale to set the page language and give up. Then in Japan the local (Japanese) names would look fine, and same for China, Korea and Vietnam. Local consistency versus complicated perfection (tracking the input language as well as tokens and using them everywhere), and I could blame the browser for doing poorly at its impossible job.
One possible "perfect" fix would be to store the token and <span lang=...>$token</span> as well. The only place the non-wrapped version would be used is plaintext email or SMS, either of which are beyond lost causes for other reasons. Doing it with an embedded SPAN tag presents its own problems with sanitization, as well as guaranteeing it's always wrong if the input language was specified incorrectly when the token was populated, versus as above where it would be corrected to the local-optimal version if the user locale overrides it.
I pray that a more complicated solution is not needed, but when I was living abroad I would always encounter issues with sites where they thought "since the IP is from x, or since the browser is requesting lang y, then we should think that this American passport holder is a Spanish citizen and thus we can make assumptions about him."
The ultimate source of this issue is that we are taking names and official IDs too seriously, but I doubt that problem will go away for "serious business". Funnily enough though, it already has for things like restaurant table reservations where all info provided is quite literally just a string for a human to do something with. No need to validate if the user's phone country code matches the country in which they are reserving a table...
Variation selectors are getting a good workout/testing technically in emoji at least (a lot of emoji are "just" "old" Unicode codepoints with a ZWJ and the variation selector known as the emoji variation selector to tell systems to always show it in "emoji styles"). I can't speak for how well it works in practice for CJK languages as I don't know them (more reason I appreciate emoji for letting me test compatibility with hard parts of UTF-8 in ways that I can read and most users want), but I do appreciate that there's at least the idea for/part of a fix in "recent" Unicode.
I'm also imagining it is not a fun thing to implement in practice, as Unicode at this point maintains a massive database just for it: https://www.unicode.org/ivd/
"Mangle" is an exaggeration. Japanese names will look correct after Han unification both to Japanese people on a Japanese computer and to Chinese people on a Chinese computer. (All other combinations fail.)
> Japanese names will look correct after Han unification both to Japanese people on a Japanese computer and to Chinese people on a Chinese computer.
Only if you are displaying them in a way that respects the computer's preferences (most websites and programs, especially American websites and programs, don't) and those preferences are set correctly. And certainly if you have text blocks that contain both Chinese and Japanese names you will always mangle at least one of them.
American honorifics get me every time. What's the point in teaching a computer to use honorifics? It's a heap of semiconductors that stirs a heap of bits. And on top of that you teach it yourself.
Here's a relevant recent EU court case of a person arguing with their bank that their name should be represented properly including the accented 'é', as the GDPR asserts a right to have mistakes of personal data corrected. The bank argued that it's impossible due to a legacy system using EBCDIC encoding and would be expensive to change. The appeals court affirmed that the customer has the right to get mistakes in their personal data corrected, and it is the duty of the bank to do so even if it is expensive.
I once did that: encoded UTF-8 text inside a legacy text encoding because the system interface didn't support Unicode, and decoded it on the other end. I used a "•" prefix as a marker for encoded strings... now that I think about it, it could have been just the good old BOM. The MIME standard also has extensive experience with packing arbitrary data into 7-bit clean text.
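A rough sketch of that kind of trick (the marker string and the base64 scheme are my assumptions for illustration; the original system just used a "•" prefix):

    import base64

    MARKER = "u8:"  # any prefix the legacy channel passes through untouched

    def pack(s: str) -> str:
        # Smuggle arbitrary UTF-8 through a channel that only accepts "safe" text.
        return MARKER + base64.b64encode(s.encode("utf-8")).decode("ascii")

    def unpack(s: str) -> str:
        if s.startswith(MARKER):
            return base64.b64decode(s[len(MARKER):]).decode("utf-8")
        return s  # not marked, pass through unchanged

    assert unpack(pack("Mikołaj")) == "Mikołaj"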
This is an insanely nightmarish precedent. We need to take computer systems less seriously. In a human-interface world, these issues are avoided thanks to simple human intuition.
This is exactly why I hate the way Python3 handles Unicode.
EVERY language should _try_ to handle Unicode such that if a data sequence were valid before it remains valid after. NONE should ever FORCE validation, since sometimes, like in the article's case, the correct answer is GIGO. Just pass it through and hope it continues to work. Sometimes the error is trying to enforce that validation.
How is this Python's fault? It's not like the `docker-compose` file would have worked any better if it silently replaced one of the volumes with an inaccessible file. Instead, you'd just get a failure from the Windows filesystem API when you tried to access or create a file at "C:\\Users\\Miko�aj\\AppData\\Local\\JetBrains\\Rider2021.2\\log\\DebuggerWorker\\\", right?
Python 3 usually handles this correctly, and I'm a little bit confused what's going on in the article, exactly.
For UNIX path names (and other OS data like environment variables), Python uses the "surrogateescape" error handling method, which does exactly what you ask. Any byte sequence can be converted to a string. If it decodes as valid UTF-8, it will do that. If it hits a byte that does not decode as valid UTF-8 (necessarily a byte >= 128), it will map it to code points U+DC80 through U+DCFF. These are in a reserved range of code points ("surrogates", which make it possible to represent code points > 0xFFFF in UTF-16), and they can't show up in actual Unicode text (i.e., there is no UTF-8 encoding of them, strictly speaking, and if you applied the UTF-8 encoding algorithm to a code point in the U+D800 to U+DFFF range, you would get bytes that aren't valid UTF-8).
On the way out, this is reversed. So you get the results you expect if your filenames are in UTF-8, but since UNIX has no requirement that filenames are indeed UTF-8 (the only constraint is they can't contain NUL or ASCII-forward-slash), the bytes are preserved in a funky-looking format in Python and you get the exact same output on the other end.
See https://www.python.org/dev/peps/pep-0383/ for more on what's going on. The tl;dr for users of Python is that if you want to interact with, say, subprocess output as mostly-normal strings (instead of bytes) but you want to be robust to non-UTF-8 bytes, you should do something like
You don't need to do this for APIs that directly interact with pathnames, because they do it already. You just need to do it for things like subprocess output and file contents that Python doesn't know you want to handle in this way.
...
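(The original snippet is omitted above; the following is my own illustration of the surrogateescape pattern described, assuming a UNIX system with an `ls` binary, not the author's code.)

    import subprocess

    # Decode arbitrary bytes as text; bytes that aren't valid UTF-8 survive
    # as lone surrogates (U+DC80..U+DCFF) instead of raising an error.
    raw = subprocess.run(["ls"], capture_output=True).stdout
    text = raw.decode("utf-8", errors="surrogateescape")

    # ... work with `text` as a normal str ...

    # Re-encoding with the same handler restores the original bytes exactly.
    assert text.encode("utf-8", errors="surrogateescape") == raw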
On Windows, however, path names must be valid Unicode and are stored in UTF-16. So the idea of a "ł" that doesn't decode properly shouldn't even happen! Mikołaj's home directory ought to be a very boring (and valid) 004d 0069 006b 006f 0142 0061 006a on disk.
Windows doesn't enforce that file paths are valid UTF-16 though (specifically, the surrogate code points are only supposed to show up in a certain way, but nothing enforces that and you can have random surrogates on disk), and hence Rust, which internally represents all strings in UTF-8, has a solution ("WTF-8") that's basically the inverse of surrogateescape - it uses extrapolated-UTF-8-encoding-of-surrogates to handle unpaired surrogates. http://simonsapin.github.io/wtf-8/ But it seems very odd to me that the directory C:\Users\Mikołaj would actually contain any of those, and if it doesn't, I would expect it to very easily turn into a Python Unicode string.
Maybe this is from a Python version before https://www.python.org/dev/peps/pep-0529/ , which is claimed to "fail to round-trip characters outside of the user's active code page"? Maybe this is from a Python version after that change and it's wrong?
The incorrect docker-compose file was generated by Java (Jetbrains) but consumed by Python (docker-compose). The GP comment was complaining about Python's strict Unicode consumption, not Java's invalid Unicode generation.
The Docker compose file is YAML. My reading of YAML's standard is that it must be in one of the Unicode encodings, and the smell I get from the article is that it is probably in windows-1250 (the CP Windows would use for Polish; Mikołaj is a Polish name, 0xb3, the octet in the error, is the Windows-1250 encoding of "ł"); thus, it isn't valid YAML.
I'm not sure what sane behavior Python could have here besides erroring.
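A quick repro of that diagnosis (my own example, not from the article): the byte 0xb3 is not valid UTF-8 on its own, but it is "ł" in windows-1250.

    >>> b"Miko\xb3aj".decode("utf-8")
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 4: invalid start byte
    >>> b"Miko\xb3aj".decode("windows-1250")
    'Mikołaj'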
> EVERY language should _try_ to handle Unicode such that if a data sequence were valid before it remains valid after.
This sequence was never valid, and never will be.
> in the article's case, the correct answer is GIGO. Just pass it through and hope it continues to work.
Dear God, no; emit a diagnostic and abort. Countless decades of existing code have shown time and again that "plow forward with some hot garbage" is not a good idea. But that ignores that this isn't how any of this works anyway; the YAML parser is going to want to emit strings, which the incoming data isn't.
Oh, I see. But if it was UTF-8 it would have worked... I guess the problem is that JetBrains is generating the file in (e.g.) Windows-1252, and Python needs to be told that?
Does it work if you set the environment variable PYTHONENCODING to cp1252?
(I suppose I should either contact the author, or try it myself...)
JetBrains is generating an invalid YAML file; YAML files are supposed to be UTF-8. If they were using a decent YAML library, it would have crashed at that point and firmly pointed the finger at the real bug: reading raw bytes from the environment or a .properties file parser and assuming they are valid UTF-8.
And this is why you always validate your data when you slurp it in, or else you pass crap down several layers where it crashes or mostly works with the potential for security holes or catastrophic behavior, and a pain in the arse to track down since the actual bug is nowhere near where you are looking.
I'd expect a decent YAML library to have functions taking UTF-8 and not wasting time verifying that the data passed is actually UTF-8 in release builds.
You generally don't verify on output, because you verified on input (especially with languages where text strings are Unicode or UTF-8 byte strings like Python3 or Rust). But it would also be a premature optimization when it does make sense to check. For expected YAML use cases I doubt it would be a measurable difference in runtime. And it has to inspect the strings in any case to correctly quote things and deal with indentation if there are newlines.
Normally I'd agree, a windows-1252 misencode should be one's default guess when mojibake is afoot. Unfortunately, the errant byte in the error is 0xb3, which is "³" in windows-1252.
If you Google, "Mikołaj",
> Mikołaj is the Polish cognate of given name Nicholas
Then Google, "windows character encoding polish"
> Windows-1250 - Wikipedia
And 0xb3 is "ł" in that encoding.¹
> Does it work if you set the environment variable PYTHONENCODING (sic) to cp1252?
I don't know if setting PYTHONIOENCODING would work here; I don't think it should affect this. Really, fixing the YAML file is the fix. (And fixing the thing that generated it.)
¹it is queries like this that really make me love the search engines of today. This would have been hell in the days of Alta Vista.
Fun fact: If you have the exclamation mark (!) in your Windows username, Java will think it's the jar separator and `getResourceAsStream` will refuse to work. This broke many people's Minecraft installation over the years.
The bug in question [0] was reported in 2001 and remains unsolved 20 years later.
It should not cause issues. For regular users, it should be an expectation that, in well-written software, it won't cause issues. But in our world of power+ users... It will cause issues. It just will. Maybe in some number of decades that won't be the case, but it's irrelevant right now.
With your programmer hat on, sure. But as a user and annoyed programmer, it's a disgrace that the contents of your username is treated as anything but a totally opaque string. It's like saying that of course putting semicolons into web forms will break them: not novel, still horrifying.
The Nama language has a letter that is often replaced with an exclamation mark in common typography. The lead actor in the film The Gods Must Be Crazy was named N!xau ǂToma.
When you start a Windows computer for the first time, Windows gives you a text box and asks you to choose a user name. It does not explain in detail what this means or what it is for.
Does it really surprise you that people are going to type just about anything into that box that it will allow?
i recommend reading this: "Falsehoods Programmers Believe About Names". the world is a crazy place if you expect some system in how people name each other and write it down.
It can be used to indicate a click sound in some languages and the IPA. Probably not the most common reason to come across this bug but it's a case worth considering.
! is actually used as a regular letter with a phonetic value to write quite a few languages in Africa, usually representing a click sound. It's even in the name of some of them: https://en.m.wikipedia.org/wiki/%C7%83Kung_languages
Really? If someone asked you if exclamation marks in usernames would cause problems before you were aware of this, would you have said yes? It's a very common special character, it's not like it's a control character or some obscure unicode thing. Besides, it's a common thing to add to usernames outside of services where you use your real name.
Absolutely, I would have said yes to the question "do exclamation marks have the potential to cause problems in some piece of software or other, the likelihood going way up with the number and the obscurity of software you use on this machine".
Very handy. My previous simple test case was just a selection from this well-known text file, which is a collection of somewhat uncommon Unicode characters, usually used for rendering tests.
But this set of strings is specifically designed to cause edge-case errors.
Also don't forget Spolsky's seminal "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".
I have seen a windows app with a text field whose max character count was somehow determined by system font size - probably a crude way to make sure the entered text fits the hard-coded field size.
The problem was that this field was used to enter a 10-digit code, and as it turns out, on default Windows10 system, the fonts are set up so that this field only fit 8 of them. Oops! :)
I'd like to see how that app would work with me sitting here with fonts cranked up to 175%. I've never heard of a setup like that though - it sounds like it'd be surprisingly intricate to actually configure.
Around the time of AOL3 or early AOL4 someone found a user name exploit.
When making a new account, on the client side use winapi's EM_LIMITTEXT to bypass the max character limit on the input textbox. Enter one or two letters, a bunch of spaces, then some more letters.
The server side would truncate to the original length, leaving you with a one or two letter username, working around the 3+ requirement.
Discover (discover.com) currently has a similar bug where it'll allow me to login with my password, but will not accept the same password in the 'Change password' workflow as the old password, complaining about it being invalid. (shrug)
You're lucky they weren't different lengths in the backend. I've been bitten by that surprise one too many times (which is any number higher than zero)
The most ridiculous thing is the UI for setting the password even said "X-Y characters long, must include at least one..." but the login page could not support Y characters.
For finding bugs caused by unexpected inputs I also find property based testing very useful. For Python there is the excellent hypothesis library for doing that: https://hypothesis.readthedocs.io/en/latest/
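For example, a property-based test along these lines (round_trip here is a hypothetical stand-in for whatever code is under test) will throw arbitrary Unicode strings, including nasty edge cases, at your code:

    from hypothesis import given, strategies as st

    def round_trip(name: str) -> str:
        # hypothetical function under test
        return name.encode("utf-8", "surrogateescape").decode("utf-8", "surrogateescape")

    @given(st.text())  # generates arbitrary Unicode strings
    def test_round_trip_preserves_name(name):
        assert round_trip(name) == name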
Great resource! I usually use pangrams (holoalphabetic sentences like "The quick brown fox jumps over the lazy dog") to ensure that my code can handle all the alphabet characters for the languages that should be supported at the very minimum.
Certainly informative if you haven't seen it before.
My takeaway from it was: design your system to try to accommodate as much as possible, but it would basically be impossible to accommodate them all, so aim for your target audience.
Using non-ascii characters in file paths, toolchain config files, and other non-display contexts is just asking for trouble, even if it is your name...
Unfortunately, it's true: most toolchains are stuck in the past and don't deal with non-ASCII characters or even spaces very well. In fact, I just learned after a long debugging session that spaces in .desktop file values can cause trouble.
But it's a shame.
In Europe, we do have a lot of non-ASCII characters everywhere. Ubuntu puts a "Vidéo" and a "Téléchargements" directory in my $HOME because I'm French. If I were to use my name as my username I would have even more trouble.
I'm careful with not using special chars in names for work, but it feels like I'm a girl trying to not dress sexy in the wrong part of town: necessary, but I shouldn't have to do this, and it's definitely the others to blame.
All in all, I thank the Gods of encoding for Python 3 unicode handling. Having a scripting language that does the right thing out of the box is wonderful on this side of the pond.
"The right thing" for filesystem entries is transparently copy, do not evaluate. A file path is a mem-copied, length value sized block of identifier you don't ever mangle. If you must mangle it, touch only the necessary areas as directed. (E.G. join with os.pathsep and do not normalize anything).
Want to offer Unicode validation? Sure having that as an OPTION is fine. Forcing it means I can't rely on that tool to handle real world data which happens to not be valid but is still a valid file-system address.
One thing I've noticed is that ext, xfs, btrfs and zfs all explicitly store a length field alongside the filename. There's nothing inherent in the disk layout of these filesystems preventing them from supporting filenames with embedded slash and nul characters - those limitations are imposed by the kernel's VFS implementation. It would be nice to have a special version of open, exec etc. where one could specify a filepath as a length-prefixed array of length-prefixed strings.
ODBC (ISO 9075-3) got it right 30 years ago: all strings are accompanied by the length argument, which also accepts the sentinel value NTS, when you mean a null-terminated string.
- if you want to treat paths like unicode strings, you can. Which is great for simple scripts where you don't want to deal with complexity. And 99% of the time, it's enough with modern OSes.
- if you want to treat paths as bags of raw bytes, you can. Which is necessary to transparently copy without evaluating, as you said, for covering edge cases.
- if you need to actually deal with those as strings but don't want to lose data in edge cases - a mix of the 2 above - you can use surrogate escapes (see the sketch below)
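A small sketch of that third option (the paths are made up; assumes a UNIX system with a UTF-8 locale):

    import os

    # fsdecode maps the non-UTF-8 byte 0xb3 to the lone surrogate U+DCB3
    # instead of raising, and fsencode reverses it exactly.
    p = os.fsdecode(b"/home/Miko\xb3aj/x.txt")
    assert p == "/home/Miko\udcb3aj/x.txt"
    assert os.fsencode(p) == b"/home/Miko\xb3aj/x.txt"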
Where does that filepath come from? A config file; are you going to do your text processing, and interaction, with other modules in byte[] arrays in Python 3+?
Python 2's unicode model was _closer_ to correct, the trivial coercion between byte[] and Unicode.
Conversion also shouldn't imply, force, or check Validation nor Normalization. Labeling a bytestream with an Encoding and validating / normalizing that encoding should be options. Operations on bytestreams with encoding related attributes should set them to either 'unknown' result or to a proper output type if they're aware the manipulations will still yield a valid encoding.
Normalization is more complex, since Unicode strings can be normalized in different ways, then combined, and still be a valid string but no longer uniformly normalized.
Especially if those products are developed by a company from Russia, where Cyrillic is used. For me, a Russian myself, this situation is honestly ridiculous.
There are a few computer languages that have non-English keywords though! And among them it looks like there was a version of Algol with Russian keywords, as well as a bunch of others in the list. Scanning it, it would seem that Logo and BASIC get translated a lot, which makes sense for teaching young learners who haven't learned English yet.
True. But these actively supported, paid products build upon layers and layers of no-longer-supported, free/opensource products. Good luck fixing them.
Not saying that this is OK, just explaining why using non-ascii characters, in this day and age, is still asking for trouble.
Windows 2000 is when the OS changed to UTF-16 by default. Before that Windows NT was UCS-2, IIRC only the DOS-based Windows versions were Windows-1252 internally, starting from Windows 1.0. So while ł wasn't supported in Windows 1, characters like ñ were. Windows has literally NEVER been an ASCII-based OS.
Sure, but having used a lot of the windows system apis (admittedly - a lot of years ago) it was a complete hodgepodge of which api would take a char vs a wchar, and then they tried to hide the whole thing behind tchar, which just made it even harder to keep track of.
Basically - I agree: This shouldn't be a problem, and 7 months is a long time to wait for a basic fix. But there are a lot of footguns hanging around in windows code with respect to character encodings.
Which takes a long pointer to tchar string (LPTSTR) - so this behavior is dependent on the unicode settings of the project at compile time, even today.
> Which takes a long pointer to tchar string (LPTSTR) - so this behavior is dependent on the unicode settings of the project at compile time, even today.
The documentation is simply wrong: GetUserProfileDirectoryA, which you linked, always takes an LPSTR (always "ANSI"), while GetUserProfileDirectoryW always takes an LPWSTR (always WTF-16). This is reflected in the function prototype at the top. Only the define GetUserProfileDirectory switches between these two. The define is a compatibility hack and arguably was a mistake, but you can always use the W-suffixed function no matter what the project settings are.
But WTF-16 paths can be converted to WTF-8 just fine. You can even use the same algorithm: only pair surrogates if they match, and otherwise interpret them as UCS-2 values and encode those normally to "UTF-8".
A lack of support for spaces at this point is unacceptable. I, personally, despise spaces in paths but on windows a whole bunch of default system paths already have spaces embedded in them in major ways... and let's not forget parens as well - thanks "Program Files (x86)"
Using non-ascii characters in file paths, toolchain config files and other non-display contexts is something every development team should explicitly, intentionally do in order to catch such bugs.
"Asking for trouble" is a key part of testing. My suggestion would be for a QA person to have their username (and root folder of the testable project) to start with a space, and be followed by an accented letter, tab-symbol, apostrophe, an emoji, followed by an unicode RTL control character and some Arabic text.
No it's not. It's not even remotely victim blaming. At no point did anyone even remotely hint it was their fault.
People have taken this to a ridiculous level. Giving practical advice about how to avoid problems is not blaming the victim. Telling my child to look both ways before crossing the street is not blaming him if he gets hit. It's not wanting to see him get hurt when the person actually at fault fucks up. Telling someone to avoid non-ASCII characters because a program can't handle it is not blaming them....
Your comment was unhelpful. Great! Not their fault! What now? It's also part of a larger trend that will lead to people being hurt.
The computer is supposed to make our lives better, we should not be required to make the computer’s (or programmers’) life better. Your attitude reminds me of the people in the 60s who included punch cards in the utility bills, marking them “DO NOT FOLD, SPINDLE, OR MUTILATE” (it became a meme). Or the people in Spain who changed their alphabet to make it easier for computers to sort.
My name appears differently in my passport, on plane tickets (not always using the same modification), and my green card. And for the latter two I left out the part of my name that can’t properly be represented at all in ASCII. And you are saying that somehow I am at fault?
> The computer is supposed to make our lives better, we should not be required to make the computer’s (or programmers’) life better
Yes... The computer is at fault for not supporting proper names... That's literally what everyone has said. Nobody blamed the user....
> Your attitude reminds me of the people in the 60s who included punch cards in the utility bills, marking them “DO NOT FOLD, SPLINDLE, OR MUTILATE” (it became a meme).
The problem should be fixed, but it's not victim blaming to tell someone how to still submit their bill.
> And you are saying that somehow I am at fault?
No... I'm literally saying the opposite.... I think you need to reread your comment, then my comment.
Telling you to omit those characters so you can still travel internationally is not blaming you in any way shape or form... Your criticism of practical advice being victim blaming is harmful and unhelpful.
As much as I wish we lived in a better world where name characters were better handled, using anything outside of [a-zA-Z]{1,12} as a username is a world of hurt.
Some people just realize it later than others.
So yes, you shouldn't think of your handlename as your name, it's just another identifier, and choosing simple handle names is a life skill at this point.
Some of the other attempts are a little subtle, this one is a pretty blatant attempt to rile up the folks that are already angry about rust for whatever reason. Please stop.
Many years ago I could not access the apple developer panel because of the umlaut in my last name. It was eventually fixed but I was quite surprised that such a large company would run into such a basic issue.
My last name has an apostrophe in it, which Apple apparently loves to embed directly into their JavaScript unescaped. For a long time neither I nor Apple could look up the AppleCare status on my stuff, as it was all linked to my Apple ID. The portal would thus require me to log in, but then would just show a partially rendered page, as my last name was causing a JS syntax error.
Hmm, it sure sounds like John <script>alert(1);</script>Doe (Bobby Tables' distant cousin) should sign up for an Apple account. An XSS attack which could target the AppleCare reps' machines could be catastrophically bad...
You'd think the apostrophe would be common enough they'd know it could happen, but no.
I love to enter it and see what each vendor and website's backend does with it.
The Staples Canada website, for example, returns it as &#39; (HTML escaped).
A couple of times I've logged in, it seems to have escaped it again. I'm currently up to &amp;#39;
Haha yeah I'm fairly used to seeing HTML escaping in my name.
The weirdest case I've had with that is the Six Flags mobile app. To add a season pass you need to provide your card number and last name. For the life of me I couldn't get it to validate, but I saw they showed the HTML-escaped version in their e-mails to me. Turns out I had to type out "&#39;" into their input box for my last name, as that's apparently what they put in their database.
>such a large company would run into such a basic issue
Every large company is just a conglomeration of smaller departments. Each department has individual contributors. Some individual contributor in that department wrote the code, and if nobody else in their department caught it, nobody else at the large company would have caught it, since they have their own work to consider and don't have time to look at other people's stuff.
But that's not how these things work. It would be nice if every department had unlimited QA resources, but most likely they have at most 1 QA person, and might be sharing that person with other departments. So if that person misses it then...
If you look at many of the responses here it's sadly unsurprising: small-minded provincialism or outright xenophobia are no less common amongst programmers than the general population.
When I first installed Windows 7, like ten years ago, I entered my Russian name in Cyrillic. When I saw that the system created a directory with exactly that name under `C:\Users\`, I immediately scanned the internet for a way to rename it and did just that. I don't even want to know how much mess like the one in this story I successfully escaped that way.
you're getting downvoted, but between tchar hiding wchar vs char... this literally could be someone toggling off the "UNICODE" checkbox in visual studio somewhere.
Given the frequency with which Windows-12* mojibake occurs, either there are a number of holdouts still using Windows 98 SE, or there are a good number of paths in Windows that still use the non-Unicode encodings.
Windows still supports the Windows 98 API, and it's more natural to use for some languages like C++. No change is planned there. The Windows 98 API is also closer to the Unix API, which can incentivize the programmer to use the same approach on Windows and Unix.
All Windows needed to do is support setting that API to UTF-8. It's not like it doesn't already support multi-byte encodings. It's not like they even needed to assign an ID for UTF-8 or implement the conversions - those existed already. All they needed to do is allow programs to set their codepage to UTF-8. This finally became possible two years ago. Better late than never, I guess.
It's somewhat common to see videogames issue a patch shortly after release where they fix crashes due to non-ASCII Windows usernames or non-English locales. I'm not sure what the root cause of the confusion is, other than text strings being hard in general.
It's easy to think the answer is "just UTF-8 everything" but unfortunately the long and twisty history of filesystems means that's not the correct answer, and the "correct answer" is really hard to write down quickly.
If you never display the filename, the answer is to treat existing filenames as bags of bytes, but that breaks down as soon as you need to display them, or if you need to manipulate them by appending unicode to them, in which case you have to decide on an encoding.
Unicode encodings tend to mangle non-Unicode values because they're specified to replace whatever they can't understand with a particular Unicode character, usually represented as a diamond with an inverted ? inside of it.
There's some obscure solutions to this problem, like https://simonsapin.github.io/wtf-8/ (which includes discussion of the 16 bit encodings you need for Windows), but I haven't seen broad support for them. You need a deliberately "noncompliant" encoding/decoding system that doesn't replace unknown characters with replacement characters. Fortunately, compliant systems are becoming more and more popular and available. Unfortunately, that can make file name handling harder than when you had a non-Unicode-compliant handling system for your strings.
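As a concrete example of that mangling (my own illustration): a compliant decoder swaps the unknown byte for U+FFFD, and the original byte is gone for good.

    broken = b"Miko\xb3aj"                        # 0xb3 is not valid UTF-8
    shown = broken.decode("utf-8", errors="replace")
    assert shown == "Miko\ufffdaj"                # rendered with the � replacement character
    assert shown.encode("utf-8") != broken        # the round trip no longer matches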
Rust uses WTF-8 on Windows for OsStr[ing] and Path[Buf]. It's zero-overhead to cast from &str to &OsStr/&Path to &[u8] (though converting WTF-8 to UTF-16 costs an extra operation when performing a Win32 function call). However this doesn't solve the inability to round-trip "possibly-valid UTF-8/16" to "Unicode text" and back (though Python's surrogateescape might be one viable approach).
Other libraries handle this even worse than Rust. On Linux (filenames are bytes), Qt is unable to open files with invalid UTF-8 names, while GTK can open them (but shows an "invalid encoding" message instead of the original filename), which I think is a good-enough approach.
> If you never display the filename, the answer is to treat existing filenames as bags of bytes, but that breaks down as soon as you need to display them, or if you need to manipulate them by appending unicode to them, in which case you have to decide on an encoding.
No you don't. On Windows you treat paths as '\'- and/or '/'-separated sequences of uint16_t. On Unix it's a '/'-separated sequence of bytes. If you want to display, you need to decode, but for display only - so errors should use replacement characters as a graceful failure. For appending, you encode your string and then append the bytes. Never do you decode externally provided paths for the purpose of manipulation.
> There's some obscure solutions to this problem, like https://simonsapin.github.io/wtf-8/ (which includes discussion of the 16 bit encodings you need for Windows)
It's relatively new, but has wide enough adoption considering - e.g. it's what Rust uses for Windows paths. It's also straightforward: just encode the unpaired surrogates as if they were the corresponding reserved Unicode code points, using the normal UTF-8 algorithm.
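Python's "surrogatepass" error handler does roughly this for lone surrogates, which makes for a quick illustration (my example, not Rust's actual code path):

    lone = "\ud800"                                  # an unpaired surrogate code point
    wtf8ish = lone.encode("utf-8", "surrogatepass")  # apply the UTF-8 algorithm anyway
    assert wtf8ish == b"\xed\xa0\x80"                # not valid UTF-8, but well-defined
    assert wtf8ish.decode("utf-8", "surrogatepass") == lone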
Part of the problem is legacy Windows cruft. For a long time, to properly handle Unicode characters you needed to explicitly use the wide-char UTF-16 functions. The legacy narrow encoding is a system-wide setting and couldn't be set to UTF-8, thus only a subset of characters would be represented correctly. Only recently did they introduce the ability to set the narrow encoding for an application to UTF-8 with setlocale, which is a lot saner.
I've been bitten on a few small releases by forgetting that C# localises number->string conversion by default (which makes sense. But if you forget, and you're writing floats to csv files and the decimal points become decimal commas....).
I disagree that having localization for number formatting based on a system setting by default makes sense. Formatted numbers are needed for both human and machine consumption and only one of those can deal with unexpected formatting.
Maybe the galaxy-brain design principle is: if you're designing an API, make sure that where possible bugs occur in an area where programmers care about fixing them (data I/O) rather than somewhere that they neglect (user interface localisation). Voila: better software!
Except programmers test with their own locale and everything works there. Then the user gets an obscure error that the programmer is not able to reproduce because on their system a number from some internal config file was parsed incorrectly.
In the case of home-grown code, it could simply be a question of programmer awareness. There are still many outdated and/or unfinished tutorials that use WinAPI without any concern for enabling Unicode and wide-char support.
If we are talking about ready-made game engines like Unity and Unreal... it is probably a naive assumption about input being 1 byte wide, and things getting lost because of that in some gamedev-made script.
The amount of random encoding problems that still exist are so bizarre. I recently left a UK job after already leaving the country more than a year ago, and in their attempt to mail P45 form to my new address (in Moscow) the only bits that survived are the string "c/o" and the postal code.
I, too, have the Ł letter in my name, and yes, it is a sick joke that so many things even in a supposedly modern systems make an assumption that the world runs on ASCII.
In the case of the Windows operating system, the worst fact is that every single part of it behaves differently. Some parts display the path with a wrong encoding, but handle it correctly. A third-party app can display it correctly, but fails while trying to access any file. From what I remember, even the built-in PATH variable editor/manager goes through some arcane steps to display the letters in a wrong way, but getting them to work sometimes.
I can only imagine how much more pain it is for someone using any of the less widely-used writing systems or those with more advanced features compared to ASCII (Hebrew’s RTL, Arabic scripts mid- and final forms, etcetera).
Nope. Neither can ź, ć, ś, ą or ę. You can, and people do, write them as z, c, s, a and e when writing in a restricted character set, but that is not 'correct' and is not a bijection, i.e. „półka” and „polka” mean two different things.
There's also the case of technically-same-sounding-especially-recently ż/rz and ó/u (whose replacement would let you get rid of two 'non standard' characters), but for historical reasons these are not interchangeable.
I do find this sort of stuff fascinating and also faintly frustrating but of course my mother tongue is (in)famous for being a bit loose at first sight.
According to one of my employees (Polish) Ł sounds roughly like w as in win or water but not as in what. A quick read of this: https://en.wikipedia.org/wiki/%C5%81 doesn't help too much.
Does enforcing Ł instead of say w cause your written language to fail in some way? I don't want to cause offense, I want to understand the causes of difference.
'W' in Polish is already used, but for a different sound - it's pronounced like the English 'v'. 'V' in turn is not present in the Polish alphabet (in the sense of it not being present in words of Polish origin).
If you wanna change that, you might as well change the entire writing system of the language, eg. to be more in line with some other, more common writing system (ie. other latin alphabets or the cyrillic alphabet which would probably make the most sense phonetically). But no-one's gonna go for that any time soon.
I think we have found the disconnect: you quite happily use a word like "wanna", which is nonsense in English. It's allowed because it is understandable. Wanna is "want to".
Ooh, "gonna": That'll be "going to".
What's gonna to you is l bar for me or vice versa or something 8)
I can very much relate to this but also have very little sympathy here.
I have a special character in my name, an apostrophe, and it causes trouble regularly online and with tooling. A number of years ago I decided just to never use it when it came to anything to do with technical work be it email, logins or usernames.
Unicode characters are a pain to deal with and I have suffered from it first hand trying to handle them. At the end of the day it is much easier just to not use the special characters and move on with your life rather than be battling the constant frustration.
I'm sure these tools have lots of open issues, and you would be surprised at the amount of time, effort and testing that would be required to provide full Unicode support. Most people would see it as a very small positive and not worth the effort. I find it hard to disagree.
My legal last name is "Sirén". When I was younger, I almost always used "Siren", because it was easier to type. Then, ~15 years ago, I started noticing that American websites sometimes rejected it, because they considered it inappropriate. Sometimes "Sirén" would work, sometimes it worked but caused minor annoyances, and sometimes it would not work for technical reasons.
Both versions work most of the time these days, but I still run into trouble once in a while no matter which name I use.
Totally agree with the sentiment. It has gotten a lot better in the last 10 years. Very frustrating to have your name blacklisted like that. It does seem most systems have a very US-focused design.
I still find it funny that even in my home country you can't use a lot of local special characters in names. Also most airlines won't accept it so technically I'm not giving them my true name!
Well, in this case they were explicitly allowed; it just caused problems down the line when other systems attempted to consume them.
Strings come up again and again as a hard issue to deal with, especially once you start looking at Unicode. I think it would be very reasonable to assume only ASCII works, and even then it doesn't always work!
Unicode really wasn't practical at all back then. Unless your entire system end-to-end was built internally, you'd have to interact with some non-unicode software. There was also no agreement on a common UTF-8 encoding, and other unicode encodings were all broken anyway.
Names have been spoken and hand-written since forever yet somehow computers aren't good at that so we all tolerate converting them to printed-looking text. Nobody cares, it doesn't matter.
ASCII only is not appropriate in some locales, as the keyboards don't have a-z. This is why in Thailand people tend to use their mobile phone number as their password, because it can be typed on all the common keyboard layouts they will encounter.
Also, with Windows 10 users will often not even choose their username. It gets generated from their given name + surname (which is a whole different issue for people without one or t'other).
Since identifiers like usernames are seen by people they are susceptible to homograph attack and _do_ deserve to be treated a bit more carefully. Also you probably dont want usernames like ń̸̡͍̲̲̫̰̦̔͛̋̉͊̔̈̈̈́̀͑͘i̶̜̔̐̅̔̑̈̕͝͝g̶̢̭̮̲͕͉͔͙̳̥͖̉̏̇̎̊̈́̊̆̃̎̑͆̿͠ͅh̶̡̛̪͔̯̯͈̼̿͊̂̍͐͒͐͐̆̽͛̄̽͝t̸̛͔̮̆͊̋́̑̓̅̀̆͋̕ͅf̸̤̗̺̣̤̝̟̱͎̦̀͒̽̓̋̏͌͋̇͛ͅḷ̶̭̓̿́y̵͍̦̫̫̠͆͛͋̓͑͑͋̔͑́̔̽̚̚
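One partial defence, as a minimal sketch (my own example; nowhere near a complete confusables check): compatibility-decompose the handle and drop combining marks before comparing.

    import unicodedata

    def skeleton(handle: str) -> str:
        # strip combining marks after NFKD compatibility decomposition
        decomposed = unicodedata.normalize("NFKD", handle)
        return "".join(c for c in decomposed if not unicodedata.combining(c))

    assert skeleton("ńightfly") == "nightfly"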
I would have to do research on whether the list of valid code points depends on the Unicode version. For example, can regional indicator code points (https://en.wikipedia.org/wiki/Regional_indicator_symbol) appear in isolation? If not, is that different in Unicode < 6, where those code points weren’t assigned yet?
“Because of this complicated history and confusing changes of wording in the standard over the years regarding what are now known as noncharacters, there is still considerable disagreement about their use and whether they should be considered "illegal" or "invalid" in various contexts”
Edit: also, testing all code points likely is overkill and using code points in isolation likely isn’t enough. Most tests are better of with something like the big list of naughty strings (https://github.com/minimaxir/big-list-of-naughty-strings)
It's a pretty good test case. Similarly, we found a number of bugs in a Django application and its path handling, because I happened to be using Windows for six months while the rest of the team was on Linux and Mac.
I think the whole problem is keeping the character encoding consistent in the applications and their dependencies. Programmers often forget this because they avoid non-ASCII characters in their code.
Sometimes even "regular" ASCII surnames cause problems.
When written in the Latin alphabet, my surname is one letter.
I've had an amazing amount of problems with this not just due to technical limitations (like various forms marking the entry as invalid), but--much more aggravatingly--human limitations.
One particularly infuriating anecdote: at a past job many years ago, the email structure was lastname@company.com. I dutifully sent the IT person in charge of creating emails my desired email. The IT person wrote back an amazingly condescending email that as per the policy, emails had to be last names, not individual letters. I then had to go find a bunch of random websites which explained single-letter names and forwarded them to the IT person. They then obliged, but did not apologize for insulting me. That is not right that I had to put up with that.
> One particularly infuriating anecdote: at a past job many years ago, the email structure was lastname@company.com. I dutifully sent the IT person in charge of creating emails my desired email. The IT person wrote back an amazingly condescending email that as per the policy, emails had to be last names, not individual letters. I then had to go find a bunch of random websites which explained single-letter names and forwarded them to the IT person. They then obliged, but did not apologize for insulting me. That is not right that I had to put up with that.
Except single-letter last names are less common than people not following policy and/or abbreviating the name. It could simply be an honest mistake, and the email is just their standard response since they have other things to get to. Did you try simply pointing out that the letter was in fact your last name, instead of getting passive-aggressive?
The article offers a solution of idea.system.path=${root.dir}/JetBrains/Rider/system but doesn't mention the C:\JetBrains directory permissions. Directory permissions under %LOCALAPPDATA% (the location that works for people without a Polish character) should restrict write access to one user. With the Windows default behavior, creating C:\JetBrains would inherit permissions from C:\ - and wouldn't restrict write access to one user. Maybe 99% of the time this is irrelevant (i.e., there's no realistic threat from malicious actors who control unprivileged user accounts on your own development machine).
Still, it's a potential downside of the solution, and more motivation for the vendor to fix their code so that Polish characters can be used under %LOCALAPPDATA%.
If you are on a multi-user system, the path "C:\JetBrains" isn’t really ideal (what if other users also need Rider and have non-ASCII usernames?). That said, you can easily change file permissions on Windows if the default ones don’t work for you.
I'm not saying it's a good idea (even though I stupidly do it), I'm merely pointing out that there are reasons characters above U+0080 may end up in a username, other than someone intentionally putting them there.
As for the benefits, which is completely off-topic, Windows Store is actually pretty awesome if you completely avoid search (and you need to do the Microsoft account thing for it AFAIK). Windows has needed a system to update 3rd-party software, to compete with Linux package managers, and the store is a really good effort (there are still annoying warts that Aur, Deb, RPM do not have). If you're willing to be a bit dumb, there is convenience.
This is exactly why I don't do that initially - I don't mind my account being linked - but I've been bitten by the home path bugs multiple times, so I unplug my PC during setup.
Somewhat surprising that this is an issue with JetBrains, given that they are based in Eastern Europe, and would probably have more direct experience of these sorts of problems than US or UK based companies. OTOH maybe it's just a scale thing - bigger companies have more resources to handle these sort of cases, regardless where they're based (not that they always do...)
Character encoding is in a special class of problems. Like time handling.
If you pick up a halfway non-ancient framework in a somewhat common language with a somewhat non-terrible persistence like postgres, you just don't have problems. Just don't care, and it just works.
But it's super easy to derail that fragile correctness with something like MySQLs utf8-ish handling, or some OS's path handling, or 'efficiency', or a user or frontend dev submitting data in a wrong encoding. And then it gets mangled. And then the user is unhappy.
At that point, it becomes very hard to argue why one of the two things is wrong, and the other is not. While the user argues the other way around. Because both look correct, if you look from the right angle. And the only reason why I am right is because of some standard, while the customer is right because of money.
And yes, it is very 'surprising' that our software now functions correctly for Russian or Greek customers.
That it's a special class of problems doesn't mean it shouldn't be solved by now. Time handling should be solved too; amazing that an iOS app can't get current correct GMT.
It's not bizarre at all. Character encodings are a sort of language in themselves, and end up with all the problems that regular old languages have – there's a lot of variety, people can't agree on one particular solution, and there's not a lot of money in taking care of the edge cases.
It would be bizarre if we were at the point where we had perfect translations for everything, but still struggled with character encodings specifically.
for self-driving cars, the ISS and digital cameras, everything you do is blurry in a sense; a "good enough" approximation is actually good enough, while character encoding and transformations have to be done perfectly and precisely, and have a surprisingly big number of edge cases.
Sadly, there is even still software which fails to build or even fails to run when there is a space in a filename (as is super common in Windows file paths, as well as in autogenerated CI build folders). It's ridiculous to no end that software cannot handle paths correctly.
Oh, it's not common knowledge that you should not use UTF-8 in a Windows username? That has been the case since the Windows 95 days. Only recently has it supposedly improved, after Microsoft Account login became semi-mandatory.
A lot of adults today weren't even alive in '95. Also, the assumption that people are familiar with Windows vs other operating systems is becoming less and less valid. And as the world gets more globalised and remote, it can no longer be assumed that all technical people are of an Anglo-American culture.
I don't think this bug is anything to do with Windows, rather it is due to the way the paths are handled in the IDE's codebase. Presumably the same problem exists when using these IDEs in conjunction with a path containing non-ascii characters in the Linux or macOS world.
> Presumably the same problem exists when using these IDEs in conjunction with a path containing non-ascii characters in the Linux or macOS world.
Why would you presume that when the problem seems to be that one tool uses the system's native 8-bit encoding while another tool expects UTF-8 - under sane systems these are the same.
Isn't it some compilation option issue in the native part? I thought it's a line in the .sln, or an include/library in a C++ source, or something that has to be explicitly specified when building a Win32 binary.
On the contrary, the first bug happens because docker-compose tries to decode the path as UTF-8, but it is not UTF-8-encoded. ("'utf-8' codec can't decode byte")
The solution to this is extremely simple: don't validate usernames, period.
The rationale is from an article someone linked here ("Falsehoods Programmer's Believe About Names"):
> Anything someone tells you is their name is—by definition—an appropriate identifier for them.
If you try to validate by checking for profanity, knowing full well that people can have names that contain profane substrings, I have a tongue-in-cheek message for you - you are a fucking asshole.
Some years ago I used the + feature in my Gmail address, e.g. myname+ycombinator@gmail.com, to track down which services were giving away my email address. It happened more than once that I could not log in anymore at some point because they started to disallow the + character in email addresses. I also got phone calls from some companies complaining that I had misspelled my email address because their company name was in it.
Hehe, I did the same, although not with +, but using the catch-all feature of my provider. I still get a lot of spam and phishing attempts at my "dropbox@<mydomain>" address. I faintly remember they (Dropbox) had a breach some time in the past.
I've also had issues putting in my full name as my username. Lots of programs do not expect spaces in the path, and I experience a lot of errors which are resolved by changing the path to not contain a space.
> My username contains a "ł" character and because of it, this file cannot be processed properly.
What is so curious there? Some names consist entirely of non-Latin characters, and some software doesn't work with non-ASCII symbols. I just cannot understand why it is interesting.
>When I found out that the bug was in the Rider itself, I reported it to technical support. I also found a similar report for PyCharm. Unfortunately, things haven’t moved forward since then.
Similar to this, Node and NPM get very temperamental when you have a User folder with a space in it. I gave up on the community workarounds and just created a new account and copied my files over to fix it.
In CS, most algorithms assume an ASCII character set. I wonder if there are any string-related algorithms that completely break (functionally or complexity-wise) when given UTF-16 or UTF-8 character sets.
Asymptotic complexity can't change based on the character set, since you can just reuse the same algorithm with larger opaque datums. (The exception being algorithms with O(n^8) or O(n^256) complexity, but no one uses those anyway.)
A variable-width encoding can cause issues in principle, but useful algorithms already have to deal with strings that have variable-length physical representations anyway (e.g. "yes" vs "no"), so it tends not to be a problem in practice.
> In CS, most algorithms assume an ASCII character set.
They most certainly do not. E.g., a Turing machine assumes an alphabet Γ which is a set of some characters and is defined no further, as any exact definition is meaningless to the theory. (I.e., the algorithm is generic over any alphabet.) The alphabet need not even be text; e.g., for a Turing machine, the set of all octets suffices.
Even for something like Levenshtein distance, the only real requirement of the algorithm is that the abstract "characters" implement equality testing. For Unicode text, I'd start with graphemes, and then look for counterexamples.
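As a sketch of that genericity, here is a standard dynamic-programming Levenshtein that only ever compares elements for equality, so it works identically over bytes, code points, or pre-segmented grapheme clusters (the grapheme segmentation itself is assumed to happen elsewhere, e.g. in a Unicode library):

```python
def levenshtein(a, b):
    """Edit distance over any two sequences whose elements support ==."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

# The same function works on strings of code points and on lists of "graphemes".
print(levenshtein("łódź", "lodz"))                               # 3
print(levenshtein(["ł", "ó", "d", "ź"], ["l", "o", "d", "z"]))   # 3
```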
I guess they won't break correctness, but I do remember many algorithms (e.g. tries) assume you have constant-time random access to characters, which AFAIK is not possible in UTF-8.
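A small illustration, assuming the text is stored as raw UTF-8 bytes: byte indexing is O(1), but getting to the n-th character means scanning from the start and skipping continuation bytes.

```python
def nth_char(utf8: bytes, n: int) -> str:
    """Return the n-th code point of UTF-8 bytes; requires a linear scan."""
    count, start = 0, 0
    for i, byte in enumerate(utf8):
        if byte & 0b1100_0000 != 0b1000_0000:  # a lead byte starts a new code point
            if count == n:
                start = i
            elif count == n + 1:
                return utf8[start:i].decode("utf-8")
            count += 1
    return utf8[start:].decode("utf-8")

data = "Łódź".encode("utf-8")                            # 7 bytes for 4 characters
print(len(data), nth_char(data, 0), nth_char(data, 3))   # 7 Ł ź
```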
One way of working around such issues is to use subst (e.g. subst P: C:\Users\<username>\projects). That way the application thinks your project directory is actually located on P:\ or something like that.
I think it is a Java-related issue. A relevant issue occurs in Jaspersoft Report: you cannot install Jaspersoft Report on Turkish Windows no matter what.
it was 30 years ago when i discovered that it doesn’t really matter what my name is. the system i’m interacting with expects my name to be “john” or something like that. so i let it be.
30 years later and i completely dropped all non-latin chars from my name in any and all forms. from airplane tickets to passport to you name it.
and you know what? no one cared about non-latin. not even the government. i loled when i actually realised.
i’ve encountered zero issues ever since.
and it’s been the same for lots of my friends. they just adopted some western name. case closed, no more issues.
it all depends on how much importance you attribute to your name. for me it’s always been a random variable. for others it’s a matter of pride. but to the “system” it will be a “random list of chars”, sometimes latin, other times utf.
It's not strange to localize your name. In ASL for example, you could sign your English name letter-by-letter, but it's much more common to have a totally new sign for your name - usually a word combined with the first letter of your name. Taking part in a different system often means taking on another name.
That's the harsh way to put it. A more diplomatic way is that computing is not unique in having deeply ingrained artifacts of the language and culture that birthed it and developed many of the paradigms.
Take anything having to do with seamanship. There are many terms that date back to early modern English that simply don't make sense anymore yet are accepted and universal because the British Empire had a large and enduring influence on maritime matters and happened to be at the forefront of most modern developments until about 70 years ago.
In some cases this is actually built into laws and industry practice. Pilots speak English. That's the rules. Don't like it? Invent the time machine and beat Wilbur and Orville. For much the same reason, science speaks Latin.
This technical debt is difficult if not impossible to overcome, especially in regards to computers because we still haven't cracked general purpose AI. Software will only accommodate what it was written to accommodate.
Recognizing the problem and working to fix it is all well and good. But it's wise to understand that this won't be solved any time soon, so in the meantime it is pragmatic to operate in a way that maximizes compatibility.
After all, I still have to call it a Foc'sle even if I think that's dumb or isn't inclusive of my culture.
There's also the practical consideration that English, due to having a) an alphabet, b) letter shapes that aren't affected by surrounding letters, and c) no diacritics, is the easiest major language to store and display on a computer. Even if Silicon Valley had ended up in a country with a logographic writing system, I'd bet that the first character set would still have been Latin-based.
My name contains non-Latin characters (apparently strange, as we use a Latin-based language), but over 40 years of working with computers I have learned to avoid using the original form and to always convert to ASCII; yes, it is not my name, but my pride and sense of entitlement are not hurt at all.
Sometimes it is better to avoid being hit by the bus even if you are right.
The first idea was to change the username to one that does not contain Polish characters. It turned out that Windows does not rename the user's folder when changing the username, and manually renaming the folder was not an option: that way I could corrupt my profile in the system.
The end of the article is about how to change the directory where temporary files go to one that is not under the user folder.
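A sketch of that kind of workaround (the article's exact steps may differ): redirect the standard Windows TEMP/TMP variables to an ASCII-only directory, at least for the tools that choke on the profile path.

```python
import os
import subprocess

# Hypothetical ASCII-only scratch directory; adjust to taste.
safe_tmp = r"C:\Temp"
os.makedirs(safe_tmp, exist_ok=True)

# Point the standard Windows temp variables away from
# C:\Users\<name-with-diacritics>\AppData\Local\Temp, for this process and its children.
env = dict(os.environ, TEMP=safe_tmp, TMP=safe_tmp)

# Any tool launched with this environment (docker-compose is just an example here)
# will create its temporary files under the ASCII-only path.
subprocess.run(["docker-compose", "version"], env=env)
```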
Yep. For example, the name of the third-largest city in Poland is "Łódź", which might look like it's pronounced "lods", but is actually pronounced more like "wootch".
This is a pretty frequent thing to encounter, actually. Just a few years ago many websites still preferred ISO/Windows codepages to save space over multibyte Unicode, adding HTML entities to represent everything that wasn't in basic ASCII or their primary language's alphabet.
Fun fact: I was looking for an e-mail solution for a small company about a decade ago and found Zarafa. It seemed nice and I deployed it happily, only to find out it supported nothing but a hardcoded Western European ISO codepage. I hope they have switched to UTF-8 since then.
The fact that most speakers of the language have switched to a different pronunciation (I'm curious why, by the way) doesn't make it a fundamentally different letter. The same letters and even whole words are vocalized differently in the US and the UK (and in different regional accents within both countries), and nobody thinks of them as different letters or words. Ł is still the same letter, a direct counterpart to L in English, German, almost all Latin-based Slavic alphabets and pretty much everywhere else. I bet almost every Mikołaj happily drops the slash (some probably even change the whole name to Nicholas, the English counterpart) when they get a passport from an anglophone country.
Nevertheless I find it absurd it's 2021 and they still have to. It's almost 30 years since the introduction of Unicode in Windows NT and NTFS, probably also close to that in Java. Pretty much every serious programming language or database supports Unicode by default today.
I believe it's a bug in some app in the toolchain, as the Windows file system API is perfectly capable of handling non-ASCII symbols. I have always taken care to avoid non-ASCII symbols and spaces in my paths (including always installing almost everything to a custom directory outside "Program Files"), but c'mon, how many decades do we need to develop reliable handling of these?
I would also consider Windows' inability to (optionally) change the actual home directory name, or to distinguish between the user's full (display) name and their "username" (two distinct properties on Linux), to be "feature-bugs".
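For what it's worth, a quick sanity check supports the point that the filesystem layer itself copes fine with such paths (Python 3 on a modern Windows, macOS or Linux with a UTF-8 locale; the directory names are made up); what breaks is individual tools re-decoding those paths with the wrong codec:

```python
import os
import tempfile

# Hypothetical non-ASCII directory names, purely for the demonstration.
base = tempfile.mkdtemp()
path = os.path.join(base, "Michał", "zażółć gęślą jaźń")

os.makedirs(path, exist_ok=True)
with open(os.path.join(path, "próba.txt"), "w", encoding="utf-8") as f:
    f.write("it just works\n")

print(os.listdir(os.path.join(base, "Michał")))   # ['zażółć gęślą jaźń']
```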
> The fact most of the speakers of the language have switched to a different accent (by the way, I feel curious why)
It was a long process of L-vocalization [0] that started around the XVI century. The first segment of the population to be affected by it were peasants (which is also one of the reasons why the original sound quality of «ł» survived relatively long among Polish artists into the early XX century as a sign of professionalism, similarly to English's Mid-Atlantic accent [1]).
I suspect that L-vocalization’s proliferation was aided by multiple wars, partitions and occupations that followed, which caused many waves of both natural and forced internal migration and disappearance of most dialectal differences.
According to [2](PL) the pronunciation of «ł» as /w/ was codified as the standard around XIX/XX in both informal and formal settings. There are still remaining populations using the old sound quality, but they are mostly confined to the areas in proximity to other Slavic languages.
I actually take a hint of offense at "Ł still is the same letter, a direct counterpart to L". By what measure? It was always a separate phoneme. Just because it shifted to /w/ doesn't mean it wasn't distinct to Poles to begin with; it's just now clearly distinct to everyone, since the sound has become even more distinct! Pointing to the ways we Poles live with the situation doesn't mean it's ideal.
It was not "always a separate phoneme". English-L-like hard pronunciation of Ł still is valid Polish, although rare to encounter in real life outside certain regions in the east. In fact it is even considered sort of more literate (conservative/standard) in theory.
Would you also say Greek lambda has nothing to do with the English L? Or, slightly more relevant and complex example, actual Polish L with Slovak Ľ and Serbian Љ?
Can you provide an argument that a letter was created for Polish for something that was not a phoneme? It has always co-existed with the plain L, and while it has sometimes sounded close to L, I don't think one could say standard Polish ever merged the two.
You probably misunderstood me; I never meant to say that Polish Ł and Polish L are the same and have no reason to be distinguished. I meant that Polish Ł is a direct counterpart to English L and only differs significantly because of a regional/historical shift in pronunciation.
The problem is that some software just has problems with non-English alphabets because, roughly speaking, software was historically written to process only English text, and much of it still has not been fixed. Users of non-Latin alphabets have grown accustomed to this and have no problem writing "Иван" as "Ivan" (even though it normally reads rather differently in English; a more accurate phonetic transliteration would be "Eevan"). Heck, they even spell "Семён" (~"Semyon", the Russian counterpart of English "Simon") as "Semen" X-). But users of diacriticized Latin somehow get surprised by this.
If I could travel back in time to when ASCII was designed and give the engineers a hint, I would ask them to add first-class diacritics to the design, so anybody could add the slash for Ł, or the umlaut for Ü and Ö, using an extra byte. Sadly, even today we mostly encode Ü as a letter entirely distinct from U rather than as a combination of the latter with the umlaut, even though Unicode allows the latter, AFAIK.
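Unicode does indeed allow both spellings; a small Python illustration of the precomposed form versus the combining-mark form, and how normalization maps between them:

```python
import unicodedata

precomposed = "\u00dc"     # 'Ü' as a single code point
combining   = "U\u0308"    # 'U' followed by COMBINING DIAERESIS

print(precomposed == combining)                                 # False: different code points
print(unicodedata.normalize("NFD", precomposed) == combining)   # True: decomposed form
print(unicodedata.normalize("NFC", combining) == precomposed)   # True: composed form

# Ł has no such decomposition, though: its stroke is not a combining mark in Unicode.
print(unicodedata.normalize("NFD", "Ł"))                        # still 'Ł'
```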
When you have non-standard characters in your name, you quickly learn never to use them on computers, since even though most systems work fine, some don't. And you can't fix all the thousands of systems your name has to interact with.
I even had trouble booking flight tickets, since their security system couldn't parse my name, and I then had to go through a special security check because it returned errors. After that, never again. Not sure how they managed it, but they had some basic rules along the lines of "no real name can look like this, this is a fake person!" and just kicked it out.
From a programmer's perspective: the characters in my name are standard where I come from, but they are not standard to the international air travel security systems, likely developed by Americans.
Edit: You know how air travel security always transforms your name into letters from the English alphabet before parsing it? Yeah, it transformed my name, and the resulting string looked so bad that the system rejected it. The original name doesn't look bad, but after the transformations it did...
That is exactly what I meant. My name doesn't have non-standard characters from the perspective of my home country either; it is just normal letters of the alphabet, only not the English alphabet.
For a user, changing their account (probably creating a new user, since renaming apparently doesn't change the directory) is something they can do.
Changing all software to respect their perfectly valid name isn't something they can do.
They shouldn't need to change their name, but if they do, they can ignore all the broken software and go about their day.
This particular user is more capable than most, and found a workaround for this particular problem, which is good... But this is not likely to be the last of the problems.
Of course it would be better if all code was bug-free. But that's impossible. As a user, avoiding Unicode is a pretty easy way to avoid bugs like this; it's the rational thing to do.
Polish may be close enough that an approximation is available in English, but there's an awful lot of languages that don't have a large overlap with English characters.
In the Asian case above, if someone with that name did try to "convert to English" they are ironically just as likely to end up with Akihito Abe as the ASCII, which will be just as broken!
Assuming that hypothetical guy is an average Japanese male (leaning somewhat to the right), he'd just turn the IME off. Japanese input on the desktop consists of the following three states:
- IME On state. The IME captures keypresses and interprets them, generating the corresponding Kana/Kanji text.
- IME Off state. The IME passes keypresses through as engraved on the keytops.
- Direct Input state. The IME becomes dormant.
In the IME Off state, the keyboard behaves as a plain JP106 (or ANSI, if that's what it is) keyboard, like the one I'm typing on right now. The case where you would use conversion with the IME on for an English word is when you have reasons for the word to be in "full width" (usually typesetting reasons).
I don't think people should 'just know' that when Windows asks for their name at install time, they ought to use 7-bit clean ASCII for everything, no matter where they are in the world or how much they know about other languages. When Windows says "What is your name?", they ought to be able to use their name without things breaking.
I'm sure a computer savvy speaker of a fully-non-Latin language may still guess this is a good idea, but "computer savvy" doesn't cover everyone... and they shouldn't have to.
"Just use 7-bit-clean ASCII English" is not a solution to this problem.
Usernames and passwords are always 7-bit ASCII, at least in the Japanosphere, to the point that it would look odd for someone to log into a computer using their own legal name. To use a computer for any useful purpose as a Japanese person, you have to understand what "English" or "alphanumeric" or "half-width" means (non-tech terms for `char str[]`), and be able to constantly and quickly switch the IME between ASCII mode and multibyte mode with an at-most-two-key combination.
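The "full width" vs "half width" distinction mentioned above is itself a Unicode matter: full-width Latin letters and digits are separate code points, and NFKC normalization folds them back to plain ASCII (shown here purely as an illustration, not as something any particular system in this thread does):

```python
import unicodedata

fullwidth = "Ｒｉｄｅｒ２０２１"                      # full-width forms, common in Japanese text
halfwidth = unicodedata.normalize("NFKC", fullwidth)

print(halfwidth)                           # Rider2021
print(fullwidth == halfwidth)              # False: visually similar, different code points
print(len(fullwidth), len(halfwidth))      # 9 9 (same length, different characters)
```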
It might be a "you're holding it wrong" situation, but everyone has already learned how to hold it "correctly"; it would be a disruptive change to default to the "natural" hold you suggest.
They could use a different name as their Windows username (do people use their real names as their usernames? I never do). Or they would have to go through the pain of finding a real solution, like the author did.
Considering JetBrains seems unwilling to fix this bug, maybe the best solution of all is to switch to an IDE that works.
The problem is the technology, not the user using it in a reasonable way. Ł is older than computers, and the only reason computers struggle with it is a lack of foresight, or a choice by some of the people involved early on to make things harder for most of the world.
Obviously the IDE is at fault here. Rider has a bug with Unicode.
BUT, there is an easy workaround to avoid all Unicode related bugs: don't use Unicode. If that's morally objectionable for you, then you can keep fighting this fight.
I think it's reasonable to find that morally objectionable: English is the only language* that can be fully represented in ASCII, so pretending that ASCII is all you need excludes a large part of the world.
* yes, by and large. Many languages make do, but even the European languages that use the same script as English cannot be fully represented:
- Pretty much all mainland European languages use accents (simple example, in Spanish el and él are different words)
It's naïve of you to maintain the façade that English can be fully represented in ASCII. We've just had longer than other languages to adapt to that particular encoding technology, and the good luck to have a code set built to represent our language become the lingua franca of computer technology.
Avoiding Unicode, or anything beyond 7-bit ASCII, is like chiseling text into stone instead of using pen and paper because the pen might break. Fix the pen! Or replace it with a computer (and we're back full circle)!
Avoiding it is not morally objectionable, it's just stupid.