I think the recommendation to disallow any non-ASCII character is throwing out the baby with the bathwater.
How about code that wants to display some emojis? It would be cumbersome to use hex unicode everywhere. And while localisations should typically happen in a separate language file, it's very common to want some text in code intended for a single audience.
Blocking all the confusables might be tricky, and an allow list would be endless. Perhaps some magic pre-processor comment that says "allow unicode in this file".
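For what it's worth, the opt-in idea can be sketched in a few lines. This is only an illustration: the marker text `lint: allow-unicode`, the function name `check_file_text`, and the "first five lines" rule are all made up here, not taken from any existing linter.

```python
# Hypothetical marker; no real tool is known to use this exact text.
ALLOW_PRAGMA = "lint: allow-unicode"

def check_file_text(text: str, pragma_lines: int = 5):
    """Return (line_number, line) for each line containing non-ASCII text,
    unless the file opts in via the marker near the top."""
    lines = text.split("\n")
    # The opt-in only counts if it appears in the first few lines.
    if any(ALLOW_PRAGMA in line for line in lines[:pragma_lines]):
        return []
    return [(n, line) for n, line in enumerate(lines, start=1)
            if not line.isascii()]

print(check_file_text('x = "⌘"\n'))
print(check_file_text('# lint: allow-unicode\nx = "⌘"\n'))
```

The first call reports line 1, the second reports nothing because the file opted in. A real tool would probably want the marker to be harder to sneak in than a plain comment.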
> I think the recommendation to disallow any non-ASCII character is throwing out the baby with the bathwater.
The idea isn't to throw all non-ASCII characters out of source files, just to make them invalid in identifiers (variables, function names, etc.).
> How about code that wants to display some emojis?
Fine. You quote that emoji in a string, and it's golden.
But if you try to give a variable an emoji for a name, your code gets rejected.
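As a rough sketch of that rule (non-ASCII is fine inside strings, not in identifiers), here is one way to check Python source with the standard `tokenize` module. The function name `non_ascii_identifiers` is invented for this example, and a real compiler would enforce this in its lexer rather than in a separate pass.

```python
import io
import tokenize

def non_ascii_identifiers(source: str):
    """Return (line, column, name) for every identifier that contains
    non-ASCII characters. String literals and comments are not flagged."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # NAME tokens cover identifiers and keywords; strings are STRING tokens.
        if tok.type == tokenize.NAME and not tok.string.isascii():
            hits.append((tok.start[0], tok.start[1], tok.string))
    return hits

sample = 'label = "⌘ 🎉"\nπ = 3.14\n'
print(non_ascii_identifiers(sample))  # → [(2, 0, 'π')]
```

The emoji in the string on line 1 passes untouched; only the `π` identifier on line 2 is reported.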
That would close this particular attack (but not the BIDI one the article mentions). But there is probably already too much code out there with π=3.14 in it for this to be feasible.
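The BIDI case needs a separate check, since those control characters can hide inside strings and comments. A minimal sketch, assuming we only care about the Unicode embedding, override, and isolate controls (U+202A to U+202E and U+2066 to U+2069); the name `find_bidi_controls` is made up for this example.

```python
# Bidirectional embedding, override, and isolate control characters.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}

def find_bidi_controls(source: str):
    """Return (line, column, code point) for each BIDI control character."""
    hits = []
    for line_no, line in enumerate(source.split("\n"), start=1):
        for col, ch in enumerate(line):
            if ch in BIDI_CONTROLS:
                hits.append((line_no, col, f"U+{ord(ch):04X}"))
    return hits

# The controls are written as escapes so this example itself stays visible.
sample = 'if access != "user\u202e \u2066// check\u2069 \u2066":\n    pass\n'
for hit in find_bidi_controls(sample):
    print(hit)
```

Unlike the identifier rule, this one flags the characters anywhere in the file, which is the point: the reordering trick works precisely because strings and comments are otherwise allowed to contain anything.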
I really thought that using the Greek letter for pi (or theta, etc.) was something you do to show that your programming language supports Unicode identifiers, but that nobody actually does in real life. I wonder how people input this: do they know the Alt+xyz combo, do they select-copy-paste, or is there another way to write these characters that I'm not aware of?
Just to be clear, I don't mean people who are actually using Greek language for input - it's pretty obvious how they would type that character :)
Do you really have to write emoji in a string in the code? The same goes for international language characters. The sane thing is to use either JSON config files or i18n libraries.
If you are writing something intended for a single audience, using i18n libraries can be unnecessary overhead. And emoji can also be icons like ⌘ that are useful to display in the UI.