A very similar problem to the one described here started my exodus from Google services.
I also have non-Latin characters in my name; however, I knew it was always an issue, so I never used them in paths etc.
At some point, a long time ago, I was tasked with doing some maintenance on a Google Cloud service (can't remember the name of the service now) which was doable only through a Python CLI utility, and it failed with a very similar Python error.
What I found out rather quickly is that the utility took my name from my Google+ profile, which did include those non-Latin characters. No biggie, I thought, and fired off an e-mail to support (yeah, it was those times when it was still that easy). A few hours passed and I received information that this wouldn't be fixed anytime soon and the best course of action would be to change my name.
Of course, the support person probably meant removing the diacritics from my Google+ profile, but it still left an unpleasant aftertaste for years to come.
A Polish relative of mine used to just give an arbitrary substitute name (e.g. "Dave Smith") for restaurant reservations, because even if they could write his last name, they wouldn't be able to pronounce it.
My sibling has a name that has an accent, and just enters it with the plain letter most of the time. The name was once rare and "ethnic", but became popular a generation later so people know how to pronounce it regardless.
Our parents gave us two middle names, wanting to preserve our grandmothers' surnames, but also in the spirit of "Bobby Tables", having ambivalent feelings about the computerization of society tending towards inflexibility.
...misunderstanding of naming customs in the US has actually led to significant consequences due to last names not matching on legal documents.
I remember reading a story about how there are people in China whose names incorporate a character obscure enough that the authorities are trying to eliminate it and get them to change their name. If I recall correctly, Chinese has a particular problem with characters that are part of names that have been around forever, but are no longer used for ordinary writing.
I understand your troubles, I'm from Spain so I have two family names and my given name has an accent. Now that I'm living in Japan, it's an endless source of fun.
Regarding Chinese names and uncommon characters, Japan has the same problem. It's especially problematic for place names, with some kanji used to write the name of a single place in the whole country! I used to live in a place with such an obscure kanji in its name that I couldn't type it on my Linux PC.
It's also difficult for people moving to the other country, with some characters existing in one country but not in the other. I have a Chinese coworker who needs to write his name in katakana because his characters don't exist in Japanese.
Because nearly everything in the US shoehorns things into three boxes, virtually every place my name is recorded on something important is different.
I could've been consistent in using two or three out of four, but when I was younger I was intimidated by forms that say you must enter your "full legal name", so I would, and they would mangle it unpredictably.
Checking account, credit card, drivers license, and property deed, each one different.
Well, in fact, my social security card and my birth certificate don't match, so I was doomed from the start.
It gives me some sympathy for places that try to regulate names to avoid parents doing something too goofy.
I sometimes wonder if there will come a day when all the databases will stop allowing discrepancies, and it won't matter to the powers that be, because it's such a tiny percentage of the population that becomes "unpersons".
a friend of mine has that problem. his company is owned by his wife because his name can't appear in legal documents and he refuses to change it. his approach was that he petitioned unicode to include that character.
i don't think the authorities are actively trying to eliminate those characters; they simply don't want to go through the effort of tracking them down and having them added to the standard. the process takes years and in the meantime you have to live with the inconvenience. also, most of the people faced with the problem would not even know how.
maybe has the benefit that he also can't receive speeding tickets?
> his approach was that he petitioned unicode to include that character.
Sounds like the right approach (although not the easiest). If Unicode can include thousands of smileys, the least they can do is include actual characters used in people's names.
> the best course of action would be to change my name
As someone who has been told this, for other reasons, I empathize. My reaction has always been - "Your system can't even handle names, you need to fix it".
Edit: I wish there was a library / service that helped you handle all sorts of edge cases in names, so that you don't have to worry about it. Just use a user-id, and set / get a name from a lib / service that can actually handle it.
These days everything should be stored as bare UTF-8 data (or utf8mb4 if you're MySQL) and presented without anything else. Don't parse it, don't slice-and-dice it, don't prepend or append titles or honorifics or suffixes, don't make assumptions about length or content beyond "must be > 0 as a whole" and DEFINITELY don't use it as an identifier. Treat it as a non-unique opaque token and you'll be fine greater than 99% of the time.
There are people with no last name. There are people with two or three or twelve middle names. There are people with a number for a last name. There are people with a symbol for their entire name.
Take what they give you and use it and be done with it.
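A minimal sketch of that opaque-token approach in Python (the function name and the single non-empty check are my own illustration, not anyone's actual API):

    def accept_display_name(name: str) -> str:
        # No parsing, no first/last split, no length cap, no character whitelist.
        # The only assumption is "non-empty as a whole".
        if not name:
            raise ValueError("display name must not be empty")
        return name  # store verbatim as UTF-8 and never use it as an identifier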
It looks like there's no general solution possible with Han unification. If you have any two of ZH and JA and KO and VI in a page, you will fail to display one of them correctly for certain characters unless (as in that wiki page) you add a LANG attribute for each element they are contained within.
Personally, I would use the browser's language or user locale to set the page language and give up. Then in Japan the local (Japanese) names would look fine, and same for China, Korea and Vietnam. Local consistency versus complicated perfection (tracking the input language as well as tokens and using them everywhere), and I could blame the browser for doing poorly at its impossible job.
One possible "perfect" fix would be to store the token and <span lang=...>$token</span> as well. The only place the non-wrapped version would be used is plaintext email or SMS, either of which are beyond lost causes for other reasons. Doing it with an embedded SPAN tag presents its own problems with sanitization, as well as guaranteeing it's always wrong if the input language was specified incorrectly when the token was populated, versus as above where it would be corrected to the local-optimal version if the user locale overrides it.
I pray that a more complicated solution is not needed, but when I was living abroad I would always encounter issues with sites where they thought "since the IP is from x, or since the browser is requesting lang y, then we should think that this American passport holder is a Spanish citizen and thus we can make assumptions about him."
The ultimate source of this issue is that we are taking names and official IDs too seriously, but I doubt that problem will go away for "serious business". Funnily enough though, it already has for things like restaurant table reservations where all info provided is quite literally just a string for a human to do something with. No need to validate if the user's phone country code matches the country in which they are reserving a table...
Variation selectors are getting a good workout/testing technically in emoji at least (a lot of emoji are "just" "old" Unicode codepoints with a ZWJ and the variation selector known as the emoji variation selector to tell systems to always show it in "emoji styles"). I can't speak for how well it works in practice for CJK languages as I don't know them (more reason I appreciate emoji for letting me test compatibility with hard parts of UTF-8 in ways that I can read and most users want), but I do appreciate that there's at least the idea for/part of a fix in "recent" Unicode.
I'm also imagining it is not a fun thing to implement in practice, as Unicode at this point maintains a massive database just for it: https://www.unicode.org/ivd/
"Mangle" is an exaggeration. Japanese names will look correct after Han unification both to Japanese people on a Japanese computer and to Chinese people on a Chinese computer. (All other combinations fail.)
> Japanese names will look correct after Han unification both to Japanese people on a Japanese computer and to Chinese people on a Chinese computer.
Only if you are displaying them in a way that respects the computer's preferences (most websites and programs, especially American websites and programs, don't) and those preferences are set correctly. And certainly if you have text blocks that contain both Chinese and Japanese names you will always mangle at least one of them.
American honorifics get me every time. What's the point in teaching a computer to use honorifics? It's a heap of semiconductors that stirs a heap of bits. And on top of that you teach it yourself.
Here's a relevant recent EU court case of a person arguing with their bank that their name should be represented properly including the accented 'é', as the GDPR asserts a right to have mistakes of personal data corrected. The bank argued that it's impossible due to a legacy system using EBCDIC encoding and would be expensive to change. The appeals court affirmed that the customer has the right to get mistakes in their personal data corrected, and it is the duty of the bank to do so even if it is expensive.
I once did that: encoded UTF-8 text inside a legacy text encoding because the system interface didn't support Unicode, and decoded it on the other end. I used a "•" prefix as a marker for encoded strings... now that I think about it, it could have been just the good old BOM. The MIME standard also has extensive experience with packing arbitrary data into 7-bit clean text.
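A rough sketch of that kind of trick (the marker string and the base64 scheme are my assumptions for illustration; the original system just used a "•" prefix):

    import base64

    MARKER = "u8:"  # any prefix the legacy channel passes through untouched

    def pack(s: str) -> str:
        # Smuggle arbitrary UTF-8 through a channel that only accepts "safe" text.
        return MARKER + base64.b64encode(s.encode("utf-8")).decode("ascii")

    def unpack(s: str) -> str:
        if s.startswith(MARKER):
            return base64.b64decode(s[len(MARKER):]).decode("utf-8")
        return s  # not marked, pass through unchanged

    assert unpack(pack("Mikołaj")) == "Mikołaj"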
This is an insanely nightmarish precedent. We need to take computer systems less seriously. In a human-interface world, these issues are avoided thanks to simple human intuition.
This is exactly why I hate the way Python3 handles Unicode.
EVERY language should _try_ to handle Unicode such that if a data sequence were valid before it remains valid after. NONE should ever FORCE validation, since sometimes, like in the article's case, the correct answer is GIGO. Just pass it through and hope it continues to work. Sometimes the error is trying to enforce that validation.
How is this Python's fault? It's not like the `docker-compose` file would have worked any better if it silently replaced one of the volumes with an inaccessible file. Instead, you'd just get a failure from the Windows filesystem API when you tried to access or create a file at "C:\\Users\\Miko�aj\\AppData\\Local\\JetBrains\\Rider2021.2\\log\\DebuggerWorker\\\", right?
Python 3 usually handles this correctly, and I'm a little bit confused what's going on in the article, exactly.
For UNIX path names (and other OS data like environment variables), Python uses the "surrogateescape" error handling method, which does exactly what you ask. Any byte sequence can be converted to a string. If it decodes as valid UTF-8, it will do that. If it hits a byte that does not decode as valid UTF-8 (necessarily a byte >= 128), it will map it to code points U+DC80 through U+DCFF. These are in a reserved range of code points ("surrogates", which make it possible to represent code points > 0xFFFF in UTF-16), and they can't show up in actual Unicode text (i.e., there is no UTF-8 encoding of them, strictly speaking, and if you applied the UTF-8 encoding algorithm to a code point in the U+D800 to U+DFFF range, you would get bytes that aren't valid UTF-8).
On the way out, this is reversed. So you get the results you expect if your filenames are in UTF-8, but since UNIX has no requirement that filenames are indeed UTF-8 (the only constraint is they can't contain NUL or ASCII-forward-slash), the bytes are preserved in a funky-looking format in Python and you get the exact same output on the other end.
See https://www.python.org/dev/peps/pep-0383/ for more on what's going on. The tl;dr for users of Python is that if you want to interact with, say, subprocess output as mostly-normal strings (instead of bytes) but you want to be robust to non-UTF-8 bytes, you should do something like
You don't need to do this for APIs that directly interact with pathnames, because they do it already. You just need to do it for things like subprocess output and file contents that Python doesn't know you want to handle in this way.
...
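(The original snippet is omitted above; the following is my own illustration of the surrogateescape pattern described, assuming a UNIX system with an `ls` binary, not the author's code.)

    import subprocess

    # Decode arbitrary bytes as text; bytes that aren't valid UTF-8 survive
    # as lone surrogates (U+DC80..U+DCFF) instead of raising an error.
    raw = subprocess.run(["ls"], capture_output=True).stdout
    text = raw.decode("utf-8", errors="surrogateescape")

    # ... work with `text` as a normal str ...

    # Re-encoding with the same handler restores the original bytes exactly.
    assert text.encode("utf-8", errors="surrogateescape") == raw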
On Windows, however, path names must be valid Unicode and are stored in UTF-16. So the idea of a "ł" that doesn't decode properly shouldn't even happen! Mikołaj's home directory ought to be a very boring (and valid) 004d 0069 006b 006f 0142 0061 006a on disk.
Windows doesn't enforce that file paths are valid UTF-16 though (specifically, the surrogate code points are only supposed to show up in a certain way, but nothing enforces that and you can have random surrogates on disk), and hence Rust, which internally represents all strings in UTF-8, has a solution ("WTF-8") that's basically the inverse of surrogateescape - it uses extrapolated-UTF-8-encoding-of-surrogates to handle unpaired surrogates. http://simonsapin.github.io/wtf-8/ But it seems very odd to me that the directory C:\Users\Mikołaj would actually contain any of those, and if it doesn't, I would expect it to very easily turn into a Python Unicode string.
Maybe this is from a Python version before https://www.python.org/dev/peps/pep-0529/ , which is claimed to "fail to round-trip characters outside of the user's active code page"? Maybe this is from a Python version after that change and it's wrong?
The incorrect docker-compose file was generated by Java (Jetbrains) but consumed by Python (docker-compose). The GP comment was complaining about Python's strict Unicode consumption, not Java's invalid Unicode generation.
The Docker compose file is YAML. My reading of YAML's standard is that it must be in one of the Unicode encodings, and the smell I get from the article is that it is probably in windows-1250 (the CP Windows would use for Polish; Mikołaj is a Polish name, 0xb3, the octet in the error, is the Windows-1250 encoding of "ł"); thus, it isn't valid YAML.
I'm not sure what sane behavior Python could have here besides erroring.
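A quick repro of that diagnosis (my own example, not from the article): the byte 0xb3 is not valid UTF-8 on its own, but it is "ł" in windows-1250.

    >>> b"Miko\xb3aj".decode("utf-8")
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 4: invalid start byte
    >>> b"Miko\xb3aj".decode("windows-1250")
    'Mikołaj'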
> EVERY language should _try_ to handle Unicode such that if a data sequence were valid before it remains valid after.
This sequence was never valid, and never will be.
> in the article's case, the correct answer is GIGO. Just pass it through and hope it continues to work.
Dear God, no; emit a diagnostic and abort. Countless decades of existing code have shown time and again that "plow forward with some hot garbage" is not a good idea. But that ignores that this isn't how any of this works anyway; the YAML parser is going to want to emit strings, which the incoming data isn't.
Oh, I see. But if it was UTF-8 it would have worked... I guess the problem is that JetBrains is generating the file in (e.g.) Windows-1252, and Python needs to be told that?
Does it work if you set the environment variable PYTHONENCODING to cp1252?
(I suppose I should either contact the author, or try it myself...)
JetBrains is generating an invalid YAML file; YAML files are supposed to be UTF-8. If they were using a decent YAML library, it would have crashed at that point and firmly pointed the finger at the real bug: reading raw bytes from the environment or a .properties file parser and assuming they are valid UTF-8.
And this is why you always validate your data when you slurp it in, or else you pass crap down several layers where it crashes or mostly works with the potential for security holes or catastrophic behavior, and a pain in the arse to track down since the actual bug is nowhere near where you are looking.
I'd expect a decent YAML library to have functions taking UTF-8 and not wasting time verifying that the data passed is actually UTF-8 in release builds.
You generally don't verify on output, because you verified on input (especially with languages where text strings are Unicode or UTF-8 byte strings like Python3 or Rust). But it would also be a premature optimization when it does make sense to check. For expected YAML use cases I doubt it would be a measurable difference in runtime. And it has to inspect the strings in any case to correctly quote things and deal with indentation if there are newlines.
Normally I'd agree, a windows-1252 misencode should be one's default guess when mojibake is afoot. Unfortunately, the errant byte in the error is 0xb3, which is "³" in windows-1252.
If you Google, "Mikołaj",
> Mikołaj is the Polish cognate of given name Nicholas
Then Google, "windows character encoding polish"
> Windows-1250 - Wikipedia
And 0xb3 is "ł" in that encoding.¹
> Does it work if you set the environment variable PYTHONENCODING (sic) to cp1252?
I don't know if setting PYTHONIOENCODING would work here; I don't think it should affect this. Really, fixing the YAML file is the fix. (And fixing the thing that generated it.)
¹it is queries like this that really make me love the search engines of today. This would have been hell in the days of Alta Vista.
Fun fact: If you have the exclamation mark (!) in your Windows username, Java will think it's the jar separator and `getResourceAsStream` will refuse to work. This broke many people's Minecraft installation over the years.
The bug in question [0] was reported in 2001 and remains unsolved 20 years later.
It should not cause issues. For regular users, it should be an expectation that, in well-written software, it won't cause issues. But in our world of power+ users... It will cause issues. It just will. Maybe in some number of decades that won't be the case, but it's irrelevant right now.
With your programmer hat on, sure. But as a user and annoyed programmer, it's a disgrace that the contents of your username is treated as anything but a totally opaque string. It's like saying that of course putting semicolons into web forms will break them: not novel, still horrifying.
The Nama language has a letter that is often replaced with an exclamation mark in common typography. The lead actor in the film The Gods Must Be Crazy was named N!xau ǂToma.
When you start a Windows computer for the first time, Windows gives you a text box and asks you to choose a user name. It does not explain in detail what this means or what it is for.
Does it really surprise you that people are going to type just about anything into that box that it will allow?
i recommend reading this: "Falsehoods Programmers Believe About Names". the world is a crazy place if you expect some system in how people name each other and write it down.
It can be used to indicate a click sound in some languages and the IPA. Probably not the most common reason to come across this bug but it's a case worth considering.
! is actually used as a regular letter with a phonetic value to write quite a few languages in Africa, usually representing a click sound. It's even in the name of some of them: https://en.m.wikipedia.org/wiki/%C7%83Kung_languages
Really? If someone asked you if exclamation marks in usernames would cause problems before you were aware of this, would you have said yes? It's a very common special character, it's not like it's a control character or some obscure unicode thing. Besides, it's a common thing to add to usernames outside of services where you use your real name.
Absolutely, I would have said yes to the question "do exclamation marks have the potential to cause problems in some piece of software or other, the likelihood going way up with the number and the obscurity of software you use on this machine".
Very handy. My previous simple test case was just a selection from this well-known text file, which is a collection of somewhat uncommon Unicode characters, usually used for rendering tests.
But this set of strings is specifically designed to cause edge-case errors.
Also don't forget Spolsky's seminal "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".
I have seen a windows app with a text field whose max character count was somehow determined by system font size - probably a crude way to make sure the entered text fits the hard-coded field size.
The problem was that this field was used to enter a 10-digit code, and as it turns out, on default Windows10 system, the fonts are set up so that this field only fit 8 of them. Oops! :)
I'd like to see how that app would work with me sitting here with fonts cranked up to 175%. I've never heard of a setup like that though - it sounds like it'd be surprisingly intricate to actually configure.
Around the time of AOL3 or early AOL4 someone found a user name exploit.
When making a new account, on the client side use winapi's EM_LIMITTEXT to bypass the max character limit on the input textbox. Enter one or two letters, a bunch of spaces, then some more letters.
The server side would truncate to the original length, leaving you with a one or two letter username, working around the 3+ requirement.
Discover (discover.com) currently has a similar bug where it'll allow me to login with my password, but will not accept the same password in the 'Change password' workflow as the old password, complaining about it being invalid. (shrug)
You're lucky they weren't different lengths in the backend. I've been bitten by that surprise one too many times (which is any number higher than zero)
The most ridiculous thing is the UI for setting the password even said "X-Y characters long, must include at least one..." but the login page could not support Y characters.
For finding bugs caused by unexpected inputs I also find property based testing very useful. For Python there is the excellent hypothesis library for doing that: https://hypothesis.readthedocs.io/en/latest/
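For example, a property-based test along these lines (round_trip here is a hypothetical stand-in for whatever code is under test) will throw arbitrary Unicode strings, including nasty edge cases, at your code:

    from hypothesis import given, strategies as st

    def round_trip(name: str) -> str:
        # hypothetical function under test
        return name.encode("utf-8", "surrogateescape").decode("utf-8", "surrogateescape")

    @given(st.text())  # generates arbitrary Unicode strings
    def test_round_trip_preserves_name(name):
        assert round_trip(name) == name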
Great resource! I usually use pangrams (holoalphabetic sentences like "The quick brown fox jumps over the lazy dog") to ensure that my code can handle all the alphabet characters for the languages that should be supported at the very minimum.
Certainly informative if you haven't seen it before.
My takeaway from it was: design your system to try to accommodate as much as possible, but it would basically be impossible to accommodate them all, so aim for your target audience.
Using non-ascii characters in file paths, toolchain config files, and other non-display contexts is just asking for trouble, even if it is your name...
Unfortunately, it's true: most toolchains are stuck in the past and don't deal with non-ASCII characters or even spaces very well. In fact, I just learned after a long debugging session that spaces in .desktop file values can cause trouble.
But it's a shame.
In Europe, we do have a lot of non-ASCII characters everywhere. Ubuntu puts a "Vidéo" and a "Téléchargements" directory in my $HOME because I'm French. If I were to use my name as my username I would have even more trouble.
I'm careful with not using special chars in names for work, but it feels like I'm a girl trying to not dress sexy in the wrong part of town: necessary, but I shouldn't have to do this, and it's definitely the others to blame.
All in all, I thank the Gods of encoding for Python 3 unicode handling. Having a scripting language that does the right thing out of the box is wonderful on this side of the pond.
"The right thing" for filesystem entries is transparently copy, do not evaluate. A file path is a mem-copied, length value sized block of identifier you don't ever mangle. If you must mangle it, touch only the necessary areas as directed. (E.G. join with os.pathsep and do not normalize anything).
Want to offer Unicode validation? Sure having that as an OPTION is fine. Forcing it means I can't rely on that tool to handle real world data which happens to not be valid but is still a valid file-system address.
One thing I've noticed is that ext, xfs, btrfs and zfs all explicitly store a length field alongside the filename. There's nothing inherent in the disk layout of these filesystems preventing them from supporting filenames with embedded slash and nul characters - those limitations are imposed by the kernel's VFS implementation. It would be nice to have a special version of open, exec etc. where one could specify a filepath as a length-prefixed array of length-prefixed strings.
ODBC (ISO 9075-3) got it right 30 years ago: all strings are accompanied by the length argument, which also accepts the sentinel value NTS, when you mean a null-terminated string.
- if you want to treat paths like unicode strings, you can. Which is great for simple scripts where you don't want to deal with complexity. And 99% of the time, it's enough with modern OSes.
- if you want to treat paths as bags of raw bytes, you can. Which is necessary to transparently copy without evaluating, as you said, for covering edge cases.
- if you need to actually deal with those as strings but don't want to lose data in edge cases - a mix of the 2 above - you can use surrogate escapes (see the sketch below)
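A small sketch of that third option (the paths are made up; assumes a UNIX system with a UTF-8 locale):

    import os

    # fsdecode maps the non-UTF-8 byte 0xb3 to the lone surrogate U+DCB3
    # instead of raising, and fsencode reverses it exactly.
    p = os.fsdecode(b"/home/Miko\xb3aj/x.txt")
    assert p == "/home/Miko\udcb3aj/x.txt"
    assert os.fsencode(p) == b"/home/Miko\xb3aj/x.txt"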
Where does that filepath come from? A config file; are you going to do your text processing, and interaction, with other modules in byte[] arrays in Python 3+?
Python 2's unicode model was _closer_ to correct, the trivial coercion between byte[] and Unicode.
Conversion also shouldn't imply, force, or check Validation nor Normalization. Labeling a bytestream with an Encoding and validating / normalizing that encoding should be options. Operations on bytestreams with encoding related attributes should set them to either 'unknown' result or to a proper output type if they're aware the manipulations will still yield a valid encoding.
Normalization is more complex, since Unicode strings can be normalized in different ways, then combined, and still be a valid string but no longer uniformly normalized.
Especially if those products are developed by a company from Russia, where Cyrillic is used. For me, a Russian myself, this situation is honestly ridiculous.
There are a few computer languages that have non-English keywords though! And among them it looks like there was a version of Algol with Russian keywords, as well as a bunch of others in the list. Scanning it, it would seem that Logo and BASIC get translated a lot, which makes sense for teaching young learners who haven't learned English yet.
True. But these actively supported, paid products build upon layers and layers of no-longer-supported, free/opensource products. Good luck fixing them.
Not saying that this is OK, just explaining why using non-ascii characters, in this day and age, is still asking for trouble.
Windows 2000 is when the OS changed to UTF-16 by default. Before that Windows NT was UCS-2, IIRC only the DOS-based Windows versions were Windows-1252 internally, starting from Windows 1.0. So while ł wasn't supported in Windows 1, characters like ñ were. Windows has literally NEVER been an ASCII-based OS.
Sure, but having used a lot of the windows system apis (admittedly - a lot of years ago) it was a complete hodgepodge of which api would take a char vs a wchar, and then they tried to hide the whole thing behind tchar, which just made it even harder to keep track of.
Basically - I agree: This shouldn't be a problem, and 7 months is a long time to wait for a basic fix. But there are a lot of footguns hanging around in windows code with respect to character encodings.
Which takes a long pointer to tchar string (LPTSTR) - so this behavior is dependent on the unicode settings of the project at compile time, even today.
> Which takes a long pointer to tchar string (LPTSTR) - so this behavior is dependent on the unicode settings of the project at compile time, even today.
The documentation is simply wrong: GetUserProfileDirectoryA, which you linked, always takes an LPSTR (always "ANSI"), while GetUserProfileDirectoryW always takes an LPWSTR (always WTF-16). This is reflected in the function prototype at the top. Only the define GetUserProfileDirectory switches between these two. The define is a compatibility hack and arguably was a mistake, but you can always use the W-suffixed function no matter what the project settings are.
But WTF-16 paths can be converted to WTF-8 just fine. You can even use the same algorithm: only pair surrogates if they match, and otherwise interpret them as UCS-2 values and encode those normally to "UTF-8".
A lack of support for spaces at this point is unacceptable. I, personally, despise spaces in paths but on windows a whole bunch of default system paths already have spaces embedded in them in major ways... and let's not forget parens as well - thanks "Program Files (x86)"
Using non-ascii characters in file paths, toolchain config files and other non-display contexts is something every development team should explicitly, intentionally do in order to catch such bugs.
"Asking for trouble" is a key part of testing. My suggestion would be for a QA person to have their username (and root folder of the testable project) to start with a space, and be followed by an accented letter, tab-symbol, apostrophe, an emoji, followed by an unicode RTL control character and some Arabic text.
No it's not. It's not even remotely victim blaming. At no point did anyone even remotely hint it was their fault.
People have taken this to a ridiculous level. Giving practical advice about how to avoid problems is not blaming the victim. Telling my child to look both ways before crossing the street is not blaming him if he gets hit. It's not wanting to see him get hurt when the person actually at fault fucks up. Telling someone to avoid non-ASCII characters because a program can't handle it is not blaming them....
Your comment was unhelpful. Great! Not their fault! What now? It's also part of a larger trend that will lead to people being hurt.
The computer is supposed to make our lives better, we should not be required to make the computer’s (or programmers’) life better. Your attitude reminds me of the people in the 60s who included punch cards in the utility bills, marking them “DO NOT FOLD, SPINDLE, OR MUTILATE” (it became a meme). Or the people in Spain who changed their alphabet to make it easier for computers to sort.
My name appears differently in my passport, on plane tickets (not always using the same modification), and my green card. And for the latter two I left out the part of my name that can’t properly be represented at all in ASCII. And you are saying that somehow I am at fault?
> The computer is supposed to make our lives better, we should not be required to make the computer’s (or programmers’) life better
Yes... The computer is at fault for not supporting proper names... That's literally what everyone has said. Nobody blamed the user....
> Your attitude reminds me of the people in the 60s who included punch cards in the utility bills, marking them “DO NOT FOLD, SPLINDLE, OR MUTILATE” (it became a meme).
The problem should be fixed, but it's not victim blaming to tell someone how to still submit their bill.
> And you are saying that somehow I am at fault?
No... I'm literally saying the opposite.... I think you need to reread your comment, then my comment.
Telling you to omit those characters so you can still travel internationally is not blaming you in any way shape or form... Your criticism of practical advice being victim blaming is harmful and unhelpful.
As much as I wish we lived in a better world where name characters were better handled, using anything outside of [a-zA-Z]{1,12} as a username is a world of hurt.
Some people just realize it later than others.
So yes, you shouldn't think of your handlename as your name, it's just another identifier, and choosing simple handle names is a life skill at this point.
Some of the other attempts are a little subtle, this one is a pretty blatant attempt to rile up the folks that are already angry about rust for whatever reason. Please stop.
Many years ago I could not access the apple developer panel because of the umlaut in my last name. It was eventually fixed but I was quite surprised that such a large company would run into such a basic issue.
My last name has an apostrophe in it, which Apple apparently loves to embed directly into their JavaScript unescaped. For a long time neither I nor Apple could look up the AppleCare status on my stuff, as it was all linked to my Apple ID. The portal would thus require me to log in, but then would just show a partially rendered page, as my last name was causing a JS syntax error.
Hmm, it sure sounds like John <script>alert(1);</script>Doe (Bobby Tables' distant cousin) should sign up for an Apple account. An XSS attack which could target the AppleCare reps' machines could be catastrophically bad...
You'd think the apostrophe would be common enough they'd know it could happen, but no.
I love to enter it and see what each vendor and website's backend does with it.
The Staples Canada website, for example, returns it as &#39; (HTML escaped).
A couple of times I've logged in, it seems to have escaped it again. I'm currently up to &amp;#39;
Haha yeah I'm fairly used to seeing HTML escaping in my name.
The weirdest case I've had with that is the Six Flags mobile app. To add a season pass you need to provide your card number and last name. For the life of me I couldn't get it to validate, but I saw they showed the HTML-escaped version in their e-mails to me. Turns out I had to type out "&#39;" into their input box for my last name, as that's apparently what they put in their database.
>such a large company would run into such a basic issue
Every large company is just a conglomeration of smaller departments. Each department has individual contributors. Some individual contributor in that department wrote the code, and if nobody else in their department caught it, nobody else at the large company would have caught it, since they have their own work to consider and don't have time to look at other people's stuff.
But that's not how these things work. It would be nice if every department had unlimited QA resources, but most likely they have at most 1 QA person, and might be sharing that person with other departments. So if that person misses it then...
If you look at many of the responses here it's sadly unsurprising: small-minded provincialism or outright xenophobia are no less common amongst programmers than the general population.
When I first installed Windows 7, like ten years ago, I entered my Russian name in Cyrillic. When I saw that the system created a directory with exactly that name under `C:\Users\`, I immediately scanned the internet for a way to rename it and did just that. I don't even want to know how much mess like the one in this story I successfully escaped that way.
you're getting downvoted, but between tchar hiding wchar vs char... this literally could be someone toggling off the "UNICODE" checkbox in visual studio somewhere.
Given the frequency with which Windows-12* mojibake occurs, either there are a number of holdouts still using Windows 98 SE, or there are a good number of paths in Windows that still use the non-Unicode encodings.
Windows still supports the Windows 98 API, and it's more natural to use for some languages like C++. No change is planned there. The Windows 98 API is also closer to the Unix API, which can incentivize the programmer to use the same approach on Windows and Unix.
All Windows needed to do is support setting that API to UTF-8. It's not like it doesn't already support multi-byte encodings. It's not like they even needed to assign an ID for UTF-8 or implement the conversions - those existed already. All they needed to do is allow programs to set their codepage to UTF-8. This finally became possible two years ago. Better late than never, I guess.
It's somewhat common to see videogames issue a patch shortly after release where they fix crashes due to non-ASCII Windows usernames or non-English locales. I'm not sure what the root cause of the confusion is, other than text strings being hard in general.
It's easy to think the answer is "just UTF-8 everything" but unfortunately the long and twisty history of filesystems means that's not the correct answer, and the "correct answer" is really hard to write down quickly.
If you never display the filename, the answer is to treat existing filenames as bags of bytes, but that breaks down as soon as you need to display them, or if you need to manipulate them by appending unicode to them, in which case you have to decide on an encoding.
Unicode encodings tend to mangle non-Unicode values because they're specified to replace whatever they can't understand with a particular Unicode character, usually represented as a diamond with an inverted ? inside of it.
There's some obscure solutions to this problem, like https://simonsapin.github.io/wtf-8/ (which includes discussion of the 16 bit encodings you need for Windows), but I haven't seen broad support for them. You need a deliberately "noncompliant" encoding/decoding system that doesn't replace unknown characters with replacement characters. Fortunately, compliant systems are becoming more and more popular and available. Unfortunately, that can make file name handling harder than when you had a non-Unicode-compliant handling system for your strings.
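As a concrete example of that mangling (my own illustration): a compliant decoder swaps the unknown byte for U+FFFD, and the original byte is gone for good.

    broken = b"Miko\xb3aj"                        # 0xb3 is not valid UTF-8
    shown = broken.decode("utf-8", errors="replace")
    assert shown == "Miko\ufffdaj"                # rendered with the � replacement character
    assert shown.encode("utf-8") != broken        # the round trip no longer matches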
Rust uses WTF-8 on Windows for OsStr[ing] and Path[Buf]. It's zero-overhead to cast from &str to &OsStr/&Path to &[u8] (though converting WTF-8 to UTF-16 costs an extra operation when performing a Win32 function call). However this doesn't solve the inability to round-trip "possibly-valid UTF-8/16" to "Unicode text" and back (though Python's surrogateescape might be one viable approach).
Other libraries handle this even worse than Rust. On Linux (filenames are bytes), Qt is unable to open files with invalid UTF-8 names, while GTK can open them (but shows an "invalid encoding" message instead of the original filename), which I think is a good-enough approach.
> If you never display the filename, the answer is to treat existing filenames as bags of bytes, but that breaks down as soon as you need to display them, or if you need to manipulate them by appending unicode to them, in which case you have to decide on an encoding.
No you don't. On Windows you treat paths as '\'- and/or '/'-separated sequences of uint16_t. On Unix it's a '/'-separated sequence of bytes. If you want to display, you need to decode, but for display only - so errors should use replacement characters as a graceful failure. For appending, you encode your string and then append the bytes. Never do you decode externally provided paths for the purpose of manipulation.
> There's some obscure solutions to this problem, like https://simonsapin.github.io/wtf-8/ (which includes discussion of the 16 bit encodings you need for Windows)
It's relatively new, but has wide enough adoption considering - e.g. it's what Rust uses for Windows paths. It's also straightforward: just encode the unpaired surrogates as if they were the corresponding reserved Unicode code points, using the normal UTF-8 algorithm.
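Python's "surrogatepass" error handler does roughly this for lone surrogates, which makes for a quick illustration (my example, not Rust's actual code path):

    lone = "\ud800"                                  # an unpaired surrogate code point
    wtf8ish = lone.encode("utf-8", "surrogatepass")  # apply the UTF-8 algorithm anyway
    assert wtf8ish == b"\xed\xa0\x80"                # not valid UTF-8, but well-defined
    assert wtf8ish.decode("utf-8", "surrogatepass") == lone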
Part of the problem is legacy Windows cruft. For a long time, to properly handle Unicode characters you needed to explicitly use the wide-char UTF-16 functions. The legacy narrow encoding is a system-wide setting and couldn't be set to UTF-8, thus only a subset of characters would be represented correctly. Only recently did they introduce the ability to set the narrow encoding for an application to UTF-8 with setlocale, which is a lot saner.
I've been bitten on a few small releases by forgetting that C# localises number->string conversion by default (which makes sense. But if you forget, and you're writing floats to csv files and the decimal points become decimal commas....).
I disagree that having localization for number formatting based on a system setting by default makes sense. Formatted numbers are needed for both human and machine consumption and only one of those can deal with unexpected formatting.
Maybe the galaxy-brain design principle is: if you're designing an API, make sure that where possible bugs occur in an area where programmers care about fixing them (data I/O) rather than somewhere that they neglect (user interface localisation). Voila: better software!
Except programmers test with their own locale and everything works there. Then the user gets an obscure error that the programmer is not able to reproduce because on their system a number from some internal config file was parsed incorrectly.
In the case of home-grown code, it could simply be a question of programmer awareness. There are still many outdated and/or unfinished tutorials that use WinAPI without any concern for enabling Unicode and wide-char support.
If we are talking about ready-made game engines like Unity and Unreal... it is probably a naive assumption about input being 1 byte wide, and things getting lost because of that in some gamedev-made script.
The amount of random encoding problems that still exist are so bizarre. I recently left a UK job after already leaving the country more than a year ago, and in their attempt to mail P45 form to my new address (in Moscow) the only bits that survived are the string "c/o" and the postal code.
I, too, have the Ł letter in my name, and yes, it is a sick joke that so many things even in a supposedly modern systems make an assumption that the world runs on ASCII.
In the case of the Windows operating system, the worst fact is that every single part of it behaves differently. Some parts display the path with a wrong encoding, but handle it correctly. A third-party app can display it correctly, but fails while trying to access any file. From what I remember, even the built-in PATH variable editor/manager goes through some arcane steps to display the letters in a wrong way, but getting them to work sometimes.
I can only imagine how much more pain it is for someone using any of the less widely-used writing systems or those with more advanced features compared to ASCII (Hebrew’s RTL, Arabic scripts mid- and final forms, etcetera).
Nope. Neither can ź, ć, ś, ą or ę. You can, and people do, write them as z, c, s, a and e when writing in a restricted character set, but that is not 'correct' and is not a bijection, i.e. „półka” and „polka” mean two different things.
There's also the case of technically-same-sounding-especially-recently ż/rz and ó/u (whose replacement would let you get rid of two 'non standard' characters), but for historical reasons these are not interchangeable.
I do find this sort of stuff fascinating and also faintly frustrating but of course my mother tongue is (in)famous for being a bit loose at first sight.
According to one of my employees (Polish) Ł sounds roughly like w as in win or water but not as in what. A quick read of this: https://en.wikipedia.org/wiki/%C5%81 doesn't help too much.
Does enforcing Ł instead of say w cause your written language to fail in some way? I don't want to cause offense, I want to understand the causes of difference.
'W' in Polish is already used, but for a different sound - it's pronounced like the English 'v'. 'V' in turn is not present in the Polish alphabet (in the sense of it not being present in words of Polish origin).
If you wanna change that, you might as well change the entire writing system of the language, eg. to be more in line with some other, more common writing system (ie. other latin alphabets or the cyrillic alphabet which would probably make the most sense phonetically). But no-one's gonna go for that any time soon.
I think we have found the disconnect: you quite happily use a word like "wanna", which is nonsense in English. It's allowed because it is understandable. Wanna is "want to".
Ooh, "gonna": That'll be "going to".
What's gonna to you is l bar for me or vice versa or something 8)
I can very much relate to this but also have very little sympathy here.
I have a special character in my name, an apostrophe, and it causes trouble regularly online and with tooling. A number of years ago I decided just to never use it when it came to anything to do with technical work be it email, logins or usernames.
Unicode characters are a pain to deal with and I have suffered from it first hand trying to handle them. At the end of the day it is much easier just to not use the special characters and move on with your life rather than be battling the constant frustration.
I'm sure these tools have lots of open issues, and you would be surprised at the amount of time, effort and testing that would be required to provide full Unicode support. Most people would see it as a very small positive and not worth the effort. I find it hard to disagree.
My legal last name is "Sirén". When I was younger, I almost always used "Siren", because it was easier to type. Then, ~15 years ago, I started noticing that American websites sometimes rejected it, because they considered it inappropriate. Sometimes "Sirén" would work, sometimes it worked but caused minor annoyances, and sometimes it would not work for technical reasons.
Both versions work most of the time these days, but I still run into trouble once in a while no matter which name I use.
Totally agree with the sentiment. It has gotten a lot better in the last 10 years. Very frustrating to have your name blacklisted like that. It does seem most systems have a very US-focused design.
I still find it funny that even in my home country you can't use a lot of local special characters in names. Also most airlines won't accept it so technically I'm not giving them my true name!
Well, in this case they were explicitly allowed; it just caused problems down the line when other systems attempted to consume them.
Strings come up again and again as a hard issue to deal with, especially once you start looking at Unicode. I think it would be very reasonable to assume only ASCII works, and even then it doesn't always work!
Unicode really wasn't practical at all back then. Unless your entire system end-to-end was built internally, you'd have to interact with some non-unicode software. There was also no agreement on a common UTF-8 encoding, and other unicode encodings were all broken anyway.
Names have been spoken and hand-written since forever yet somehow computers aren't good at that so we all tolerate converting them to printed-looking text. Nobody cares, it doesn't matter.
ASCII only is not appropriate in some locales, as the keyboards don't have a-z. This is why in Thailand people tend to use their mobile phone number as their password, because it can be typed on all the common keyboard layouts they will encounter.
Also, with Windows 10 users will often not even choose their username. It gets generated from their given name + surname (which is a whole different issue for people without one or t'other).
Since identifiers like usernames are seen by people they are susceptible to homograph attack and _do_ deserve to be treated a bit more carefully. Also you probably dont want usernames like ń̸̡͍̲̲̫̰̦̔͛̋̉͊̔̈̈̈́̀͑͘i̶̜̔̐̅̔̑̈̕͝͝g̶̢̭̮̲͕͉͔͙̳̥͖̉̏̇̎̊̈́̊̆̃̎̑͆̿͠ͅh̶̡̛̪͔̯̯͈̼̿͊̂̍͐͒͐͐̆̽͛̄̽͝t̸̛͔̮̆͊̋́̑̓̅̀̆͋̕ͅf̸̤̗̺̣̤̝̟̱͎̦̀͒̽̓̋̏͌͋̇͛ͅḷ̶̭̓̿́y̵͍̦̫̫̠͆͛͋̓͑͑͋̔͑́̔̽̚̚
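One partial defence, as a minimal sketch (my own example; nowhere near a complete confusables check): compatibility-decompose the handle and drop combining marks before comparing.

    import unicodedata

    def skeleton(handle: str) -> str:
        # strip combining marks after NFKD compatibility decomposition
        decomposed = unicodedata.normalize("NFKD", handle)
        return "".join(c for c in decomposed if not unicodedata.combining(c))

    assert skeleton("ńightfly") == "nightfly"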
I would have to do research on whether the list of valid code points depends on the Unicode version. For example, can regional indicator code points (https://en.wikipedia.org/wiki/Regional_indicator_symbol) appear in isolation? If not, is that different in Unicode < 6, where those code points weren’t assigned yet?
“Because of this complicated history and confusing changes of wording in the standard over the years regarding what are now known as noncharacters, there is still considerable disagreement about their use and whether they should be considered "illegal" or "invalid" in various contexts”
Edit: also, testing all code points likely is overkill and using code points in isolation likely isn’t enough. Most tests are better of with something like the big list of naughty strings (https://github.com/minimaxir/big-list-of-naughty-strings)
It's a pretty good test case. Similarly, we found a number of bugs in a Django application and its path handling, because I happened to be using Windows for six months while the rest of the team was on Linux and Mac.
I think the whole problem is keeping the character encoding consistent in the applications and their dependencies. Programmers often forget this because they avoid non-ASCII characters in their code.
Sometimes even "regular" ASCII surnames cause problems.
When written in the Latin alphabet, my surname is one letter.
I've had an amazing amount of problems with this not just due to technical limitations (like various forms marking the entry as invalid), but--much more aggravatingly--human limitations.
One particularly infuriating anecdote: at a past job many years ago, the email structure was lastname@company.com. I dutifully sent the IT person in charge of creating emails my desired email. The IT person wrote back an amazingly condescending email that as per the policy, emails had to be last names, not individual letters. I then had to go find a bunch of random websites which explained single-letter names and forwarded them to the IT person. They then obliged, but did not apologize for insulting me. That is not right that I had to put up with that.
> One particularly infuriating anecdote: at a past job many years ago, the email structure was lastname@company.com. I dutifully sent the IT person in charge of creating emails my desired email. The IT person wrote back an amazingly condescending email that as per the policy, emails had to be last names, not individual letters. I then had to go find a bunch of random websites which explained single-letter names and forwarded them to the IT person. They then obliged, but did not apologize for insulting me. That is not right that I had to put up with that.
Except single-letter last names are less common than people not following policy and/or abbreviating the name. It could simply be an honest mistake, and the email is just their standard response since they have other things to get to. Did you try simply pointing out that the letter was in fact your last name, instead of getting passive-aggressive?
The article offers a solution of idea.system.path=${root.dir}/JetBrains/Rider/system but doesn't mention the C:\JetBrains directory permissions. Directory permissions under %LOCALAPPDATA% (the location that works for people without a Polish character) should restrict write access to one user. With the Windows default behavior, creating C:\JetBrains would inherit permissions from C:\ - and wouldn't restrict write access to one user. Maybe 99% of the time this is irrelevant (i.e., there's no realistic threat from malicious actors who control unprivileged user accounts on your own development machine).
Still, it's a potential downside of the solution, and more motivation for the vendor to fix their code so that Polish characters can be used under %LOCALAPPDATA%.
If you are on a multi-user system, the path "C:\JetBrains" isn’t really ideal (what if other users also need Rider and have non-ASCII usernames?). That said, you can easily change file permissions on Windows if the default ones don’t work for you.
I'm not saying it's a good idea (even though I stupidly do it), I'm merely pointing out that there are reasons characters above U+0080 may end up in a username, other than someone intentionally putting them there.
As for the benefits, which is completely off-topic, Windows Store is actually pretty awesome if you completely avoid search (and you need to do the Microsoft account thing for it AFAIK). Windows has needed a system to update 3rd-party software, to compete with Linux package managers, and the store is a really good effort (there are still annoying warts that Aur, Deb, RPM do not have). If you're willing to be a bit dumb, there is convenience.
This is exactly why I don't do that initially - I don't mind my account being linked - but I've been bitten by the home path bugs multiple times, so I unplug my PC during setup.
Somewhat surprising that this is an issue with JetBrains, given that they are based in Eastern Europe, and would probably have more direct experience of these sorts of problems than US or UK based companies. OTOH maybe it's just a scale thing - bigger companies have more resources to handle these sort of cases, regardless where they're based (not that they always do...)
Character encoding is in a special class of problems. Like time handling.
If you pick up a halfway non-ancient framework in a somewhat common language with a somewhat non-terrible persistence like postgres, you just don't have problems. Just don't care, and it just works.
But it's super easy to derail that fragile correctness with something like MySQLs utf8-ish handling, or some OS's path handling, or 'efficiency', or a user or frontend dev submitting data in a wrong encoding. And then it gets mangled. And then the user is unhappy.
At that point, it becomes very hard to argue why one of the two things is wrong, and the other is not. While the user argues the other way around. Because both look correct, if you look from the right angle. And the only reason why I am right is because of some standard, while the customer is right because of money.
And yes, it is very 'surprising' that our software now functions correctly for Russian or Greek customers.
That it's a special class of problems doesn't mean it shouldn't be solved by now. Time handling should be solved too; amazing that an iOS app can't get current correct GMT.
It's not bizarre at all. Character encodings are a sort of language in themselves, and end up with all the problems that regular old languages have – there's a lot of variety, people can't agree on one particular solution, and there's not a lot of money in taking care of the edge cases.
It would be bizarre if we were at the point where we had perfect translations for everything, but still struggled with character encodings specifically.
for self-driving cars, the ISS and digital cameras, everything you do is blurry in a sense; a "good enough" approximation is actually good enough, while character encoding and transformations have to be done perfectly and precisely, and have a surprisingly big number of edge cases.
Sadly, there is even still software which fails to build or even fails to run when there is a space in a filename (as is super common in Windows file paths, as well as in autogenerated CI build folders). It's ridiculous to no end that software cannot handle paths correctly.
Oh, it's not common knowledge that you should not use UTF-8 in a Windows username? That has been the case since the Windows 95 days. Only recently has it supposedly improved, after Microsoft Account login became semi-mandatory.
A lot of adults today weren't even alive in '95. Also, the assumption that people are familiar with Windows vs other operating systems is becoming less and less valid. And as the world gets more globalised and remote, it can no longer be assumed that all technical people are of an Anglo-American culture.
I don't think this bug is anything to do with Windows, rather it is due to the way the paths are handled in the IDE's codebase. Presumably the same problem exists when using these IDEs in conjunction with a path containing non-ascii characters in the Linux or macOS world.
> Presumably the same problem exists when using these IDEs in conjunction with a path containing non-ascii characters in the Linux or macOS world.
Why would you presume that when the problem seems to be that one tool uses the system's native 8-bit encoding while another tool expects UTF-8 - under sane systems these are the same.
Isn't it some compilation option issue in the native part? I thought it's a line in the .sln, or an include/library in a C++ source, or something that has to be explicitly specified when building a Win32 binary.
On the contrary, the first bug happens because docker-compose tries to decode the path as UTF-8, but it is not UTF-8-encoded. ("'utf-8' codec can't decode byte")
The solution to this is extremely simple: don't validate usernames, period.
The rationale is from an article someone linked here ("Falsehoods Programmer's Believe About Names"):
> Anything someone tells you is their name is—by definition—an appropriate identifier for them.
If you try to validate by checking for profanity, knowing full well that people can have names that contain profane substrings, I have a tongue-in-cheek message for you - you are a fucking asshole.
Some years ago I used the + feature in my Gmail address, e.g. myname+ycombinator@gmail.com, to track down which services were giving away my email address. It happened more than once that I could not log in anymore at some point because they started to disallow the + character in email addresses. I also got phone calls from some companies complaining that I had misspelled my email address because their company name was in it.
Hehe, I did the same, although not with +, but using the catch-all feature of my provider. I still get a lot of spam and phishing attempts at my "dropbox@<mydomain>" address. I faintly remember they (Dropbox) had a breach some time in the past.
I've also had issues putting in my full name as my username. Lots of programs do not expect spaces in the path, and I experience a lot of errors which are resolved by changing the path to not contain a space.
> My username contains a "ł" character and because of it, this file cannot be processed properly.
What is so curious there? Some names consist entirely of non-Latin characters, and some software doesn't work with non-ASCII symbols. I just cannot understand why it is interesting.
>When I found out that the bug was in the Rider itself, I reported it to technical support. I also found a similar report for PyCharm. Unfortunately, things haven’t moved forward since then.
Similar to this, Node and NPM get very temperamental when you have a User folder with a space in it. I gave up on the community workarounds and just created a new account and copied my files over to fix it.
In CS, most algorithms assume an ASCII character set. I wonder if there are any string-related algorithms that completely break (functionally or complexity-wise) when given UTF-16 or UTF-8 character sets.
Asymptotic complexity can't change based on the character set, since you can just reuse the same algorithm with larger opaque datums. (The exception being algorithms with O(n^8) or O(n^256) complexity, but no one uses those anyway.)
A variable-width encoding can cause issues in principle, but useful algorithms already have to deal with strings that have variable-length physical representations anyway (e.g. "yes" vs "no"), so it tends not to be a problem in practice.
> In CS, most algorithms assume an ASCII character set.
They most certainly do not. E.g., a Turing machine assumes an alphabet Γ which is a set of some characters and is defined no further, as any exact definition is meaningless to the theory. (I.e., the algorithm is generic over any alphabet.) The alphabet need not even be text; e.g., for a Turing machine, the set of all octets suffices.
Even for something like Levenshtein distance, the only real requirement of the algorithm is that the abstract "characters" implement equality testing. For Unicode text, I'd start with graphemes, and then look for counterexamples.
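As a sketch of that genericity, here is a standard dynamic-programming Levenshtein that only ever compares elements for equality, so it works identically over bytes, code points, or pre-segmented grapheme clusters (the grapheme segmentation itself is assumed to happen elsewhere, e.g. in a Unicode library):

```python
def levenshtein(a, b):
    """Edit distance over any two sequences whose elements support ==."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

# The same function works on strings of code points and on lists of "graphemes".
print(levenshtein("łódź", "lodz"))                               # 3
print(levenshtein(["ł", "ó", "d", "ź"], ["l", "o", "d", "z"]))   # 3
```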
I guess they won't break correctness, but I do remember many algorithms (e.g. tries) assume you have constant-time random access to characters, which AFAIK is not possible in UTF-8.
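A small illustration, assuming the text is stored as raw UTF-8 bytes: byte indexing is O(1), but getting to the n-th character means scanning from the start and skipping continuation bytes.

```python
def nth_char(utf8: bytes, n: int) -> str:
    """Return the n-th code point of UTF-8 bytes; requires a linear scan."""
    count, start = 0, 0
    for i, byte in enumerate(utf8):
        if byte & 0b1100_0000 != 0b1000_0000:  # a lead byte starts a new code point
            if count == n:
                start = i
            elif count == n + 1:
                return utf8[start:i].decode("utf-8")
            count += 1
    return utf8[start:].decode("utf-8")

data = "Łódź".encode("utf-8")                            # 7 bytes for 4 characters
print(len(data), nth_char(data, 0), nth_char(data, 3))   # 7 Ł ź
```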
One way of working around such issues is to use subst (e.g. subst P: C:\Users\<username>\projects). That way the application thinks your project directory is actually located on P:\ or something like that.
I think it is a Java-related issue. A relevant issue occurs in Jaspersoft Report: you cannot install Jaspersoft Report on Turkish Windows no matter what.
it was 30 years ago when i discovered that it doesn’t really matter what my name is. the system i’m interacting with expects my name to be “john” or something like that. so i let it be.
30 years later and i completely dropped all non-latin chars from my name in any and all forms. from airplane tickets to passport to you name it.
and you know what? no one cared about non-latin. not even the government. i loled when i actually realised.
i’ve encountered zero issues ever since.
and it’s been the same for lots of my friends. they just adopted some western name. case closed, no more issues.
it all depends on how much importance you attribute to your name. for me it’s always been a random variable. for others it’s a matter of pride. but to the “system” it will be a “random list of chars”, sometimes latin, other times utf.
It's not strange to localize your name. In ASL for example, you could sign your English name letter-by-letter, but it's much more common to have a totally new sign for your name - usually a word combined with the first letter of your name. Taking part in a different system often means taking on another name.
That's the harsh way to put it. A more diplomatic way is that computing is not unique in having deeply ingrained artifacts of the language and culture that birthed it and developed many of the paradigms.
Take anything having to do with seamanship. There are many terms that date back to early modern English that simply don't make sense anymore yet are accepted and universal because the British Empire had a large and enduring influence on maritime matters and happened to be at the forefront of most modern developments until about 70 years ago.
In some cases this is actually built into laws and industry practice. Pilots speak English. That's the rules. Don't like it? Invent the time machine and beat Wilbur and Orville. For much the same reason, science speaks Latin.
This technical debt is difficult if not impossible to overcome, especially in regards to computers because we still haven't cracked general purpose AI. Software will only accommodate what it was written to accommodate.
Recognizing the problem and working to fix it is all well and good. But it's wise to understand that this won't be solved any time soon, so in the meantime it is pragmatic to operate in a way that maximizes compatibility.
After all, I still have to call it a Foc'sle even if I think that's dumb or isn't inclusive of my culture.
There's also the practical consideration that English, due to having a) an alphabet, b) letter shapes that aren't affected by surrounding letters, and c) no diacritics, is the easiest major language to store and display on a computer. Even if Silicon Valley had ended up in a country with a logographic writing system, I'd bet that the first character set would still have been Latin-based.
My name contains non-Latin characters (apparently strange, as we use a Latin-based language), but over 40 years of working with computers I have learned to avoid using the original form and to always convert to ASCII; yes, it is not my name, but my pride and sense of entitlement are not hurt at all.
Sometimes it is better to avoid being hit by the bus even if you are right.
The first idea was to change the username to one that does not contain Polish characters. It turned out that Windows does not rename the user's folder when changing the username, and manually renaming the folder was not an option: that way I could corrupt my profile in the system.
The end of the article is about how to change the directory where temporary files go to one that is not under the user folder.
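A sketch of that kind of workaround (the article's exact steps may differ): redirect the standard Windows TEMP/TMP variables to an ASCII-only directory, at least for the tools that choke on the profile path.

```python
import os
import subprocess

# Hypothetical ASCII-only scratch directory; adjust to taste.
safe_tmp = r"C:\Temp"
os.makedirs(safe_tmp, exist_ok=True)

# Point the standard Windows temp variables away from
# C:\Users\<name-with-diacritics>\AppData\Local\Temp, for this process and its children.
env = dict(os.environ, TEMP=safe_tmp, TMP=safe_tmp)

# Any tool launched with this environment (docker-compose is just an example here)
# will create its temporary files under the ASCII-only path.
subprocess.run(["docker-compose", "version"], env=env)
```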
Yep. For example, the name of the third-largest city in Poland is "Łódź", which might look like it's pronounced "lods", but is actually pronounced more like "wootch".
This is a pretty frequent thing to encounter, actually. Just a few years ago many websites still preferred ISO/Windows codepages to save space over multibyte Unicode, adding HTML entities to represent everything that wasn't in basic ASCII or their primary language's alphabet.
Fun fact: I was looking for an e-mail solution for a small company about a decade ago and found Zarafa. It seemed nice and I deployed it happily, only to find out it supported nothing but a hardcoded Western European ISO codepage. I hope they have switched to UTF-8 since then.
The fact that most speakers of the language have switched to a different pronunciation (I'm curious why, by the way) doesn't make it a fundamentally different letter. The same letters and even whole words are vocalized differently in the US and the UK (and in different regional accents within both countries), and nobody thinks of them as different letters or words. Ł is still the same letter, a direct counterpart to L in English, German, almost all Latin-based Slavic alphabets and pretty much everywhere else. I bet almost every Mikołaj happily drops the slash (some probably even change the whole name to Nicholas, the English counterpart) when they get a passport from an anglophone country.
Nevertheless I find it absurd it's 2021 and they still have to. It's almost 30 years since the introduction of Unicode in Windows NT and NTFS, probably also close to that in Java. Pretty much every serious programming language or database supports Unicode by default today.
I believe it's a bug in some app in the toolchain, as the Windows file system API is perfectly capable of handling non-ASCII symbols. I have always taken care to avoid non-ASCII symbols and spaces in my paths (including always installing almost everything to a custom directory outside "Program Files"), but c'mon, how many decades do we need to develop reliable handling of these?
I would also consider Windows' inability to (optionally) change the actual home directory name, or to distinguish between the user's full (display) name and their "username" (two distinct properties on Linux), to be "feature-bugs".
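For what it's worth, a quick sanity check supports the point that the filesystem layer itself copes fine with such paths (Python 3 on a modern Windows, macOS or Linux with a UTF-8 locale; the directory names are made up); what breaks is individual tools re-decoding those paths with the wrong codec:

```python
import os
import tempfile

# Hypothetical non-ASCII directory names, purely for the demonstration.
base = tempfile.mkdtemp()
path = os.path.join(base, "Michał", "zażółć gęślą jaźń")

os.makedirs(path, exist_ok=True)
with open(os.path.join(path, "próba.txt"), "w", encoding="utf-8") as f:
    f.write("it just works\n")

print(os.listdir(os.path.join(base, "Michał")))   # ['zażółć gęślą jaźń']
```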
> The fact most of the speakers of the language have switched to a different accent (by the way, I feel curious why)
It was a long process of L-vocalization [0] that started around the XVI century. The first segment of the population to be affected by it were peasants (which is also one of the reasons why the original sound quality of «ł» survived relatively long among Polish artists into the early XX century as a sign of professionalism, similarly to English's Mid-Atlantic accent [1]).
I suspect that L-vocalization’s proliferation was aided by multiple wars, partitions and occupations that followed, which caused many waves of both natural and forced internal migration and disappearance of most dialectal differences.
According to [2](PL) the pronunciation of «ł» as /w/ was codified as the standard around XIX/XX in both informal and formal settings. There are still remaining populations using the old sound quality, but they are mostly confined to the areas in proximity to other Slavic languages.
I actually take a hint of offense at "Ł still is the same letter, a direct counterpart to L". By what measure? It was always a separate phoneme. Just because it shifted to /w/ doesn't mean it wasn't distinct to Poles to begin with; it's just now clearly distinct to everyone, since the sound has become even more distinct! Pointing to the ways we Poles live with the situation doesn't mean it's ideal.
It was not "always a separate phoneme". English-L-like hard pronunciation of Ł still is valid Polish, although rare to encounter in real life outside certain regions in the east. In fact it is even considered sort of more literate (conservative/standard) in theory.
Would you also say Greek lambda has nothing to do with the English L? Or, slightly more relevant and complex example, actual Polish L with Slovak Ľ and Serbian Љ?
Can you provide an argument that a letter was created for Polish for something that was not a phoneme? It has always co-existed with the plain L, and while it has sometimes sounded close to L, I don't think one could say standard Polish ever merged the two.
You probably misunderstood me; I never meant to say that Polish Ł and Polish L are the same and have no reason to be distinguished. I meant that Polish Ł is a direct counterpart to English L and only differs significantly because of a regional/historical shift in pronunciation.
The problem is that some software just has problems with non-English alphabets because, roughly speaking, software was historically written to process only English text, and much of it still has not been fixed. Users of non-Latin alphabets have grown accustomed to this and have no problem writing "Иван" as "Ivan" (even though it normally reads rather differently in English; a more accurate phonetic transliteration would be "Eevan"). Heck, they even spell "Семён" (~"Semyon", the Russian counterpart of English "Simon") as "Semen" X-). But users of diacriticized Latin somehow get surprised by this.
If I could travel back in time to when ASCII was designed and give the engineers a hint, I would ask them to add first-class diacritics to the design, so anybody could add the slash for Ł, or the umlaut for Ü and Ö, using an extra byte. Sadly, even today we mostly encode Ü as a letter entirely distinct from U rather than as a combination of the latter with the umlaut, even though Unicode allows the latter, AFAIK.
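Unicode does indeed allow both spellings; a small Python illustration of the precomposed form versus the combining-mark form, and how normalization maps between them:

```python
import unicodedata

precomposed = "\u00dc"     # 'Ü' as a single code point
combining   = "U\u0308"    # 'U' followed by COMBINING DIAERESIS

print(precomposed == combining)                                 # False: different code points
print(unicodedata.normalize("NFD", precomposed) == combining)   # True: decomposed form
print(unicodedata.normalize("NFC", combining) == precomposed)   # True: composed form

# Ł has no such decomposition, though: its stroke is not a combining mark in Unicode.
print(unicodedata.normalize("NFD", "Ł"))                        # still 'Ł'
```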
When you have non-standard characters in your name, you quickly learn never to use them on computers, since even though most systems work fine, some don't. And you can't fix all the thousands of systems your name has to interact with.
I even had trouble booking flight tickets, since their security system couldn't parse my name, and I then had to go through a special security check because it returned errors. After that, never again. Not sure how they managed it, but they had some basic rules along the lines of "no real name can look like this, this is a fake person!" and just kicked it out.
From a programmer's perspective: the characters in my name are standard where I come from, but they are not standard to the international air travel security systems, likely developed by Americans.
Edit: You know how air travel security always transforms your name into letters from the English alphabet before parsing it? Yeah, it transformed my name, and the resulting string looked so bad that the system rejected it. The original name doesn't look bad, but after the transformations it did...
That is exactly what I meant. My name doesn't have non-standard characters from the perspective of my home country either; it is just normal letters of the alphabet, only not the English alphabet.
For a user, changing their account (probably creating a new user, since renaming apparently doesn't change the directory) is something they can do.
Changing all software to respect their perfectly valid name isn't something they can do.
They shouldn't need to change their name, but if they do, they can ignore all the broken software and go about their day.
This particular user is more capable than most, and found a workaround for this particular problem, which is good... But this is not likely to be the last of the problems.
Of course it would be better if all code was bug-free. But that's impossible. As a user, avoiding Unicode is a pretty easy way to avoid bugs like this; it's the rational thing to do.
Polish may be close enough that an approximation is available in English, but there's an awful lot of languages that don't have a large overlap with English characters.
In the Asian case above, if someone with that name did try to "convert to English" they are ironically just as likely to end up with Akihito Abe as the ASCII, which will be just as broken!
Assuming that hypothetical guy is an average Japanese male (leaning somewhat to the right), he'd just turn the IME off. Japanese input on the desktop consists of the following three states:
- IME On state. The IME captures keypresses and interprets them, generating the corresponding Kana/Kanji text.
- IME Off state. The IME passes keypresses through as engraved on the keytops.
- Direct Input state. The IME becomes dormant.
In the IME Off state, the keyboard behaves as a plain JP106 (or ANSI, if that's what it is) keyboard, like the one I'm typing on right now. The case where you would use conversion with the IME on for an English word is when you have reasons for the word to be in "full width" (usually typesetting reasons).
I don't think people should 'just know' that when Windows asks for their name at install time, they ought to use 7-bit clean ASCII for everything, no matter where they are in the world or how much they know about other languages. When Windows says "What is your name?", they ought to be able to use their name without things breaking.
I'm sure a computer savvy speaker of a fully-non-Latin language may still guess this is a good idea, but "computer savvy" doesn't cover everyone... and they shouldn't have to.
"Just use 7-bit-clean ASCII English" is not a solution to this problem.
Usernames and passwords are always 7-bit ASCII, at least in the Japanosphere, to the point that it would look odd for someone to log into a computer using their own legal name. To use a computer for any useful purpose as a Japanese person, you have to understand what "English" or "alphanumeric" or "half-width" means (non-tech terms for `char str[]`), and be able to constantly and quickly switch the IME between ASCII mode and multibyte mode with an at-most-two-key combination.
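The "full width" vs "half width" distinction mentioned above is itself a Unicode matter: full-width Latin letters and digits are separate code points, and NFKC normalization folds them back to plain ASCII (shown here purely as an illustration, not as something any particular system in this thread does):

```python
import unicodedata

fullwidth = "Ｒｉｄｅｒ２０２１"                      # full-width forms, common in Japanese text
halfwidth = unicodedata.normalize("NFKC", fullwidth)

print(halfwidth)                           # Rider2021
print(fullwidth == halfwidth)              # False: visually similar, different code points
print(len(fullwidth), len(halfwidth))      # 9 9 (same length, different characters)
```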
It might be a "you're holding it wrong" situation, but everyone has already learned how to hold it "correctly"; it would be a disruptive change to default to the "natural" hold you suggest.
They could use a different name as their Windows username (do people use their real names as their usernames? I never do). Or they would have to go through the pain of finding a real solution, like the author did.
Considering JetBrains seems unwilling to fix this bug, maybe the best solution of all is to switch to an IDE that works.
The problem is the technology, not the user using it in a reasonable way. Ł is older than computers, and the only reason computers struggle with it is a lack of foresight, or a choice by some of the people involved early on to make things harder for most of the world.
Obviously the IDE is at fault here. Rider has a bug with Unicode.
BUT, there is an easy workaround to avoid all Unicode related bugs: don't use Unicode. If that's morally objectionable for you, then you can keep fighting this fight.
I think it's reasonable to find that morally objectionable: English is the only language* that can be fully represented in ASCII, so pretending that ASCII is all you need excludes a large part of the world.
* yes, by and large. Many languages make do, but even the European languages that use the same script as English cannot be fully represented:
- Pretty much all mainland European languages use accents (simple example, in Spanish el and él are different words)
It's naïve of you to maintain the façade that English can be fully represented in ASCII. We've just had longer than other languages to adapt to that particular encoding technology, and the good luck to have a code set built to represent our language become the lingua franca of computer technology.
Avoiding Unicode, or anything beyond 7-bit ASCII, is like chiseling text into stone instead of using pen and paper because the pen might break. Fix the pen! Or replace it with a computer (and we're back full circle)!
Avoiding it is not morally objectionable, it's just stupid.