The next time someone tries to tell me that a true screen reader should use computer vision and machine learning (including OCR) rather than requiring applications to implement accessibility APIs, I will bring up this case.
"Why can't we just, you know, direct blind users to a special protocol that structures the data appropriately and then lets them parse it however they want?"
Me: 'We did! It's called HTML! Designers just broke it!'
IMO, HTML is still closer to that ideal than anything else we have. My guess is that given a random web application and a random non-web GUI (especially if the latter is multi-platform), the web application will be more usable with a screen reader.
I'd say markdown is even better than HTML for writing generic documents since it enforces simplicity. In particular, it forces a linear flow of the document and does not have any support for stuff like JS.
Is it possible for a developer to make a canvas accessible? For example, pronounce "You are on the road that goes from left to right, there is a shop above you and an inn below", like a MUD.
Real accessibility is about presenting the same information your other users have. So, instead of you typing the description, let each of the drawn objects have its own description and make them discoverable and navigable. I think Google was trying to make Flutter 2 components accessible, but it means starting from zero and building the same stuff anew.
Html could have been that - or better, it was at first - but instead of creating a more specialized solution for running rich apps we decided to exploit html.
Right now we are in what I'd call the worst of both worlds, because we rely on HTML to do things it wasn't designed for, and there's no longer purity in any HTML out in the wild.
How hard is it to program while being blind? What sort of development do you do? I understand that frontend is impossible, but what other difficulties do you face?
Are indent-based languages like Python harder than bracket-based languages?
Front end is not entirely impossible, just impossible for pixel-perfect designs. Otherwise, I know blind people who do FE, though I'm not sure how much of it is professional work.
Indent based languages are actually easier. Every screen reader has a way to announce indentation in code, while brackets could be confusing if not formatted or verbose if properly announced.
My main issues are dev tools with bad accessibility. Also, it takes me more time to get acquainted with new code, and sometimes homophones in the source code require extra attention. Filtering through logs is also a bitch in most cases. Besides the dev tools, you can summarize the rest as bad IO speed.
Do you have some tricks for how you handle filtering through logs? Or some ideas if there could be a tool that could help you or mitigate your most critical issue[s]?
I found filtering through logs a major pain even as a fully sighted person, so I wrote a tool to help me with that, but it's fully in a "TUI" paradigm (i.e. curses-like), so I presume it wouldn't help you much (https://github.com/akavel/up). No promises, given that the tool as-is scratched my itch, but I am honestly curious whether something similar could reduce your PITA, including whether this specific tool could be made useful for you through some minimal effort on my side.
Usually grep saves the day. I will check your tool, but what I need is a terminal command that can recognize the meta fields of a log record and put them on a line separate from the main message. Also, it must be installed everywhere I work, which is not so easy. Putting logs in a table with filtering capabilities might be best, but this means web access to the location of the logs, which is again tricky.
The idea of my tool is not really to help with some specific way of processing logs, but rather to make it easier to fiddle with grep and other Linux CLI filtering tools by shortening the feedback loop compared to the normal shell REPL. I'm not sure if that sounds in any way clear; it might also sound strange that the shell REPL is IMO too slow, or that this matters at all, but I found it enough of a problem that I invented a way to speed it up, and judging from the reception the tool got, it seemed to hit a nerve with quite a few people.

I can try to explain more if you are interested and/or don't really understand what I might even be talking about. I tried to explain it in the readme, but for many reasons I have not the slightest freaking idea to what extent it is understandable to you - not least because, for fully sighted people, one important way I tried to convey the idea is through an animated GIF of a terminal window showing how the tool is used. As someone said, with really innovative ideas it's often necessary to push them down people's throats to make them understood; this GIF is part of that effort, and fortunately seems quite effective, but even it is not enough for some people, and I'm assuming by default that it's for obvious reasons completely inaccessible to you.

I was pondering just now whether copying this animation to asciinema could make it in any way more accessible to you, but as of now I have strong doubts whether that would work at all (including that I have no idea whether the asciinema site is accessible to you at all, and that I suspect the terminal ANSI sequences generated by a library I'm using are "tiniest diffs", so although the result might look indistinguishable to fully sighted people, a screen reader might [or might not???] take them at face value and read them as a mess of random jumps and single-character changes on the screen).

That said, I'm more than happy if you managed to understand what the tool is doing and just don't think it could be useful to you, whether as is or with some accessibility improvement attempts. Or if you don't understand but don't feel like diving deeper either.
Usually by serializing each message as a single JSON line in a file.
Since it's all on one line you can still use grep, and since it's machine-readable you can pipe the grep output to anything that can parse JSON. Vanilla python3 works and tends to be part of most ops toolkits. Such tooling can split the fields out onto separate lines, or into a more reader-friendly format.
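To make that concrete, here is a minimal sketch of the idea (the script name and the "message" field name are just assumptions; adapt them to whatever your logger emits):

#!/usr/bin/env python3
# split_jsonl.py (hypothetical name): read JSON-per-line log records on stdin
# and print each meta field on its own line, then the message, then a blank line.
import json
import sys

for raw in sys.stdin:
    raw = raw.strip()
    if not raw:
        continue
    try:
        rec = json.loads(raw)
    except json.JSONDecodeError:
        print(raw)                      # pass non-JSON lines through untouched
        continue
    message = rec.pop("message", "")    # assumed field name
    for key, value in rec.items():
        print(f"{key}: {value}")        # one meta field per line
    print(message)
    print()

Then something like `grep ERROR app.log | python3 split_jsonl.py` keeps grep in the loop while making the output linear and easier to listen through.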
I've been struggling with eye strain and have considered trying to approach development in a fashion similar to that taken by blind devs. Any suggestions for guides or overviews for how I can get set up?
It depends on what you are working on and what you want to do. Generally, screen readers are not as good for programming as they are for plain text, so they will be a limited substitute for whatever you are using now. If you are okay with working slower, they can help you listen through code and tools' messages, providing relief for your eyes.
If you are using Windows, NVDA is the screen reader. JAWS is a bit too expensive for my taste without any significant edge over NVDA. The built-in Narrator is still immature in my opinion. VSCode has excellent accessibility with a dedicated and involved team. Visual Studio also has extremely good accessibility support, though I'm not using it. IntelliJ sucks. Not completely, but enough that people do not see the benefit of using it. Eclipse is not popular these days, but it has good accessibility as well, as far as I know. Sublime is not accessible.
If you are on Linux, the screen reader is Orca. It does not have the same level of support as the Windows stuff, but I know people who develop on Linux boxes, so it is doable. Emacs must be good enough, because it has a self-voicing plugin and people who like and use it. As far as I know, VSCode for Linux has some accessibility features, but I don't know how they compare to Windows.
If you are on Mac, your only choice of screen reader is VoiceOver by Apple. It is good but not always perfect, to my knowledge. I know people who use TextMate, Xcode, VSCode, and Emacs, but I don't have much feedback from there. It is totally doable though.
On Windows, I'm also using notepad++ as a secondary editor because it is faster and works better for large files. Also, it is a good note-taking tool.
We can connect offline if you need some more info.
I am very interested in how blind developers work. I have been pondering how to make computers and development more accessible. If you don't mind:
Do you have preference between CLI, TUI, or GUI dev tools?
Is highly symbolic code harder to understand using a screen reader than plain language code? By symbolic, I specifically mean any characters that are not alphanumeric.
I don't have preferences about the interface. As long as it is accessible, I can learn to work with it. E.g. VSCode does everything possible to make its interface accessible, and they continuously fix any reported issues.
When it comes to code, verbose is better. Abbreviations take effort to decode. I can remap some symbols to have different pronunciations, but it does not always work. E.g. I've made the screen reader speak the ":=" operator in Python as "assigned from", but brackets have nesting and orientation, and too many of them get nasty to listen to or follow.
It would be really cool to be able to hook into where words started and ended. Then you could add a background tone/frequency rising in pitch with indentation level (and maybe have no tone for the root level).
Oh, if only speech engines broke down the utterance process and made it more open...
Indentation level is a solved problem, and the start and end of words is also customizable behavior. Speech engines as a whole are open to customization. There are some problems, though, that are just not easy to solve at all. It is like with regular expressions and HTML. Hooking the SR into the language server might be an avenue of possible improvements, but the problem definition on my side is currently too vague to formulate correctly.
I've just realized I assumed speech-to-text and text-to-speech were similarly complex and unincentivized toward open development. (I wanted to play around with augmenting speech-to-text for some time.) TIL.
So how is indentation level typically handled?
And what other types of customizations are typically leveraged from an output-device standpoint? (Maybe there's a reference I can google for?)
Comparing the problem space to regular expressions and HTML immediately makes sense, that's a very intuitive way of putting it.
I can relate to being completely stumped about how to replace missing functionality with software, in my case organizing information (which is impaired because of autism). What does the problem space around the text-to-speech vagueness look like?
Screen readers have the benefit that they have two parts. One of them is the "explorer" so to call it and the other one is a synthesizer. The explorer hooks to the accessibility services and apis of the host system and produces a text representation of the objects discovered. The synthesizer receives the text representations and maps them to sound output.
The easiest way of customization is to get between those two parts and to convert the representation through some rules, regex for example. That's how my rule with the ":=" operator works.
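As a rough illustration (this is not any screen reader's actual API, just the shape of such a rule set):

import re

# Illustrative substitution rules applied to the text representation
# before it is handed to the synthesizer.
RULES = [
    (re.compile(r":="), "assigned from"),
    (re.compile(r"!="), "not equal to"),
]

def preprocess(text):
    for pattern, spoken in RULES:
        text = pattern.sub(spoken, text)
    return text

print(preprocess("total := price * qty"))  # "total assigned from price * qty"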
Indentation could be done either by announcing the number of spaces/tabs at the start of the line, or by defining how many of a given symbol make up one level and assigning a sound to each level that is played when the level changes. There is an option for doing both.
Screen readers have apis for extensions or scripts for more complex functionality. You can check those of Jaws and NVDA for examples. The apis are rather extensive and they allow for lots of customizations like improving support for a given program or general modification of the sr behavior.
I was thinking/imagining more along the lines of being able to drill down into phoneme pronunciation: adding micro-pauses to certain syllables or pitch-bending them based on rules, for example, or having a firehose of machine-readable annotations for a given utterance, including the exact start and end times/samples of individual phonemes in the audio stream. You could then mix your own audio track with additional augmentations into the final output, for example using your own synthesizer to modulate background tones representing the current indentation level. Yes, ridiculously complex; but by front-loading that complexity (and winning the data accessibility fights) it would be possible to do a lot of cool stuff...
I understand some people swear by JAWS as the generally best-in-class solution, which has admittedly put me off NVDA, as I feel I'd absorb a biased sense of what's possible or of how audio output software works in general. I guess I should just install NVDA already since it's the realistic option - if I started testing stuff in JAWS and talking about it, the only reasonable assumption people could make would be that I was using a copy that had drifted ashore from the high seas, which would be kind of true...
Depending on what synthesizer you use, you might be able to get into its internals. Keep in mind that each screen reader can use different synthesizers, so both JAWS and NVDA might use eSpeak, the Windows core voices, or something totally different.
In regard to the idea that Jaws is best in class, I'm inclined to disagree. Jaws might be a bit better in MS Office applications and UIA support, I haven't used it for years. However, NVDA has the better web story and until recently it was the screen reader that was actually working with VS Code.
I see, I'll have to have a deeper look. (I'm on Linux, so I think my options are espeak and possibly Festival.)
Thanks very much for the perspective on NVDA. I'll definitely have to give it a go! I've been interested specifically in Web accessibility for quite a while.
Well, this is highly subjective. I'm paid to do Python and Node.js from time to time, and Python really rocks for me. No small reason why I like Python more is the much better tracebacks. When viewed in a console, it is much more pleasant to have the erroring line at the bottom, which spares me copying the entire console into npp to hunt for the top of it.
That said, I know many blind devs who do Java, C#, Swift, C++ and so on. I had bad experiences with IDEs when I was starting to study software development in those languages and it has stayed with me, but it is not universal.
If I had the choice, I would not drop Python, but I might add some of the functional languages or Rust for the new ways of thinking they might teach me. So far I've looked at them, but I haven't done anything serious there.
With NVDA on Windows, when I read the comment normally, it's spelled out. When I read it character by character, I get "symbol FFF8" for each of the hidden Unicode characters. And when I move line by line through NVDA's linear representation of the web page, the hidden characters count against the length of the line for the purpose of word wrapping.
Narrator's behavior is weirder. If I turn on scan mode and move onto the line with the up or down arrow key, Narrator says nothing. If I read the current line with Insert+Up Arrow, Narrator spells it out like NVDA does. When moving character by character, Narrator says nothing for the hidden Unicode characters. And because Narrator doesn't do its own line wrapping but defers to the application to determine what counts as a line, the text only counts as one line.
Disclosure: I used to work on the Windows accessibility team at Microsoft, on Narrator among other things.
Still, it would not occur to most sighted programmers to review code using a screen reader. To me, this is another argument for having a truly diverse team (or community, in the case of an open-source project); a blind programmer who's already involved with the project would catch something like this. So in this particular case, blindness is truly not a disability.
Listen guys, don't get me wrong. As someone with Ø in my name, and both Å and Ø in my address, don't get me started on poorly written systems which cannot handle unicode properly. I've seen my name and address mangled in shipping forms, in airline tickets (every time) and even in my marriage-papers since I married abroad.
I literally have personal reasons for getting everyone, and I mean everyone, on the unicode bandwagon.
That said... Maybe it's because I'm a child of the late 70s and early 80s and learned to program on computers which simply didn't have non-ASCII characters at all...
But can't we all just sit down and admit that allowing non-ASCII characters in programming-language identifiers was a bad idea? Can't we, in the next revision of EcmaScript (or Rust, or whatever), mandate ASCII-only identifiers when in strict mode or using modules or whatever? Having invisible characters represent executable code is not just a dumb idea, it's so hazardous that you might call it borderline malicious.
There has to be some way to undo this damage, without breaking compatibility with the code which is already out there, right?
You can only type ~27% of my name with just ASCII (and even then one letter will not be exact)... and I agree with you. If anything I'd go a bit further and say that, sure, use Unicode in places where you can find arbitrary text like documents, messages, etc., but anything that has to do with the 'guts' of the computer should stay away from Unicode (or at least treat it as data, like how filenames are treated on Linux).
I disagree with getting everyone on the Unicode bandwagon though. IMO Unicode has introduced a ton of problems exactly because it tries to be a ton of stuff at the same time. I don't know exactly what a better solution would look like, but I have a very hard time accepting that such a convoluted and error-prone system is the best solution. IMO if decades later there are still issues with getting it right, then there is something fundamentally wrong with the system itself and not with the applications and developers trying to work with it.
An existing working solution, even if not perfect, patched, with a lot of baggage and technical debt, is infinitely better than a not-yet-invented ideal, perfect solution.
And even if the perfect solution existed right now, in a few decades it would be as filled with baggage as the current one.
Sometimes one has to realize that hard problems are hard.
Adding a variable decorator/annotation like @Unicode(german,french) would be a good stop-gap. You could only use ASCII characters unless you specified the script that you want to use. One could even set a max limit on how many scripts per variable. Because while I have used German characters in variables before (only if I'm referring to some law or spec), I never had a use case for more than 2 scripts within one variable.
The multiple-scripts-per-variable thing is implemented in Rust via a lint. For the explicit enabling of single scripts, I have suggested that for Rust, but sadly people preferred allowing all identifiers (while giving an option to allow only ASCII, but I'd argue this is unfair to anyone who only wants to use a specific non-ASCII language: why should they suddenly have to allow all languages in their code base?). There are also practical concerns, like who says what a language is, which characters it contains, what that language is called, etc.? Someone has to maintain all these lists.
For your information the relevant Unicode specification is the Script_Extensions property [1]. (You can't easily filter by languages, so you should filter by scripts.)
I think this is a good idea because once in a while you need to write non-ascii characters in names.
This mostly comes up when implementing tax rules or government administrative divisions as some countries have names/concepts which have no good translation into English, so you are left with using the non-English name, which often contains non-ASCII characters.
The issue with this is less that this is possible and more that a lot of javascript ends up in production without ever getting compiled, linted, type-checked, etc. Stuff like this is designed to bypass what little human oversight there is to prevent bad things from happening. What is actually visible also depends on what fonts you have installed on your system. So, it's less clear cut than you think.
The problem is not so much that humans can't see this but that they are not looking very hard to begin with (otherwise, they'd be using the appropriate tools) and that we should rely less on them actively looking. Blind trust that things will be fine is the root problem here.
> The problem is not so much that humans can't see this but that they are not looking very hard to begin with (otherwise, they'd be using the appropriate tools) and that we should rely less on them actively looking.
And simply not allowing non-ASCII identifiers in the first place would be a move in that direction. Now you have one thing less to look for.
> But can't we all just sit down and admit that allowing non-ASCII characters in programming-language identifiers was a bad idea?
It's a bad idea only if all members in your team can easily produce and comprehend an ASCII-only code.
> Having invisible characters represent executable code is not just a dumb a idea, it's so hazardous that you might call it borderline malicious.
Not if those invisible characters do affect the rendering. Invisible formatting characters like ZWJ and ZWNJ are allowed because they are used in some scripts. The relevant Unicode specification [1] even provides a guideline to limit ZWJ and ZWNJ strictly to the context where they do affect the rendering.
That said, the Hangul filler and half-width Hangul filler were mistakes. They are purely legacy characters and never have been used in practice, so I encourage new languages to exclude them from the default (X)ID_Start/Continue set (Unicode can't do that because of the compatibility, maybe they can introduce another pair of properties without those characters).
> The relevant Unicode specification [1] even provides a guideline to limit ZWJ and ZWNJ strictly to the context where they do affect the rendering.
Which is exactly what I am suggesting by saying non-ASCII characters should be banned from being used as identifiers, not from being present in the code-file all together or in the form of strings, etc.
If the formatting of your output in your applications (as seen by the user) depends on the names you've declared your variables with, then you are doing something horribly wrong.
You seem to think of those formatting characters as something that should live in a higher-level protocol like HTML. They are not. They are used when two consecutive abstract characters can be combined in two or more different ways, and those different renderings frequently have different meanings. That's why they can't be simply removed when normalized; doing so would destroy the text.
We seem to be talking past one another. What I'd like to see banned is non-ASCII in identifiers (variable names and the like) and nothing else.
While you respond as if I want to banish anything non-ASCII from all parts of all code-files except from HTML-templates. That’s certainly not what I’m advocating.
The following is IMO perfectly fine:
var greeting = “Hello (cowboy emoji)”;
The following is IMO not:
var (emoji) = “Let’s party!”; // note identifier contains non-ascii
Do you still disagree? If so, can you outline why?
Okay, I think I see where you got confused. There are multiple levels of Unicode identifier support and you are probably not aware of all possible levels. Those levels are:
1. Identifiers can contain any octet with the highest bit set. Different octet sequences denote different names.
2. Identifiers can contain any Unicode code point (or scalar value, the fine distinction is not required here) above U+007F. Different (but possibly same-looking) code point sequences denote different names.
3. Identifiers can contain any Unicode code point in a predefined set, or two if the first character and subsequent characters are distinguished. Different code point sequences denote different names.
4. Same as 3, but these predefined sets derive from the Unicode Identifier and Pattern Syntax specification [1]---namely (X)ID_Start/Continue.
5. Same as 4, but now identifiers are normalized according to one of the Unicode normalization algorithms. So some different code point sequences now map to the same name, but only if they are semantically the same according to Unicode.
6. Same as 5, but also has rules to reduce unwanted identifiers. This may include confusable characters, virtually indistinguishable names and names with multiple unrelated scripts. Unicode itself provides many guidelines in the Unicode Security Mechanisms standard [2].
Levels 3, 4 and 5 are the most common choices in programming languages. In particular, emojis are not allowed at level 4, so your example wouldn't work in such languages. For example, JavaScript is one of them, so `eval('var \u{1f600} = 42')` doesn't work (where U+1F600 is a smiling face). Both Python and Rust are at level 5. Possibly unexpectedly, both C and C++ are at level 3. Levels 1 and 2 are rare, especially in modern languages; PHP is a famous example of level 1.
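A quick way to see level 5 in action in Python (just an illustration): U+2126 OHM SIGN is NFKC-normalized to U+03A9 GREEK CAPITAL LETTER OMEGA, so both spellings name the same variable.

import unicodedata

print(unicodedata.normalize("NFKC", "\u2126") == "\u03a9")  # True

exec("\u2126 = 42")   # define a variable spelled with OHM SIGN
print(Ω)              # read it back spelled with GREEK CAPITAL OMEGA: prints 42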
Level 6 is a complex topic and there are varying degrees of implementations (for example Rust partially supports the level 6 via lints), but there is a notable example outside of programming languages: the Internationalized Domain Names. They have very strong constraints because any pair of confusable labels is a security problem. It seems that they have been successful in keeping the security of non-ASCII domains on par with ASCII-only domains, that is, not fully satisfactory but reasonable enough. (If you don't see the security issues of ASCII-only domains, PaypaI and rnastercard are examples of problematic ASCII labels that were never forbidden.)
I argue that the level 3+ is necessary and the level 5+ is desirable for international audiences. The level 5 would for example mean that `var 안녕하세요 = "annyonghaseyo";` (Korean) is allowed but `var (emoji) = "oh no";` is forbidden. I have outlined why the former is required in the last paragraph of [3]. Does my clarified stance make sense to you?
To be clear, I’m completely oblivious to what Unicode identifiers are. As such I’m not talking about them, and they are out of scope with regard to my point.
What I am advocating is that identifiers used for symbols in the programming language (variables-names, function-names, class-names, etc), should be strictly ASCII-based.
That’s simple, understandable and should be a sane default anywhere.
My opinion is that since nobody without a doctorate in Unicode actually fully understands Unicode, having a rule-set for identifiers built on top of the already bewildering Unicode rule-set is a sure-fire way to engineer for unexpected consequences and/or security issues.
Sure. Allow it if you must. But you must opt in to use it. It should be a non-default feature everywhere where it’s available.
> That’s simple, understandable and should be a sane default anywhere.
This is the usual canned reason given to reject any internationalization effort, and it is likely only "simple, understandable" and "a sane default anywhere" to people like you. As you didn't explain why they are simple and understandable in general, I don't see how your arguments are universally applicable.
> My opinion is that since nobody without a doctorate in Unicode actually fully understands Unicode, having a rule-set for identifiers built on top of the already bewildering Unicode rule-set is a sure-fire way to engineer for unexpected consequences and/or security issues.
That can be said about almost all security issues, not just Unicode. That doesn't make you avoid writing anything, does it? For the record, it is a valid choice to not write anything, but we normally exclude that choice when we are talking about the technology. And the "bewildering Unicode rule-set" is a one-off thing; it is not as though Unicode produces incompatible standards every year. (Python 3 adopted Unicode identifiers 14 years ago [1] and implementations never changed, only the underlying databases have been updated.)
Ideally that lint would be on by default though. Most code doesn't use non-ASCII identifiers. It hasn't happened, though, because of, uhm, political reasons.
Most code made by English speakers contains English words and Latin characters, so other languages and alphabets must be abandoned, and their native speakers must be imprisoned until they understand their mistakes.
First thing I did when I first read the story was check my editor. I already had the "Zero Width Characters locator" plugin installed, but that covered less than a handful of specific space character type codes.
Still, the result was good: Looks like IDEA editors like Webstorm show invisible characters with colored background and a warning.
My test, both for that first article and now for this one, was to copy the example code they contained or linked to from the browser into an open file.
Interesting. PhpStorm highlights the variable after `timeout` but does not highlight the variable after `http://example.com/`. Even pressing F2 to go to the next error goes to the first variable (the highlighted one) but not the second.
However, placing the cursor on either does highlight the second.
I'm using the Darcula scheme. Your screenshot obscures the second occurrence, so we cannot see if your light theme has the same issue with the second occurrence not being highlighted as Darcula has.
You are right, I missed the other one, it is not reported. You can see there is something because it takes space, but you have to deliberately go there to see it. There also is no warning from having the "No trailing spaces" setting active, so it is not seen as a space character even if it shows as such.
I'll write an issue on YouTrack; I'm sure they'll fix it. Of the well over a hundred issues I have ever reported, about two thirds were fixed (the rest are obsolete, only a few are really still open).
Yes, I file bug reports with lots of places, and Jetbrains is one of the best for actually doing something with them. It is one of the few non-FOSS applications that I am willing to integrate into my workflow (hmm, the only one I think).
EDIT (new comment because edit-period is long gone):
It's not too severe an issue, maybe not one at all(?), at least in this concrete example, because after removing the first occurrence of the hidden variable it now becomes a "not defined" real error and not just a warning in the second location.
Some time ago I managed to add a zero-width character to my PHP code. Because it had no width, PhpStorm did not highlight it and I had absolutely no clue why there was an error in my code.
So it only highlights when it has width.
Edit: just added some non-space characters and at least in Rider they are now displayed as a warning. So I think this is fixed now.
While not as fancy, font choice may save you here too. I use vim and while the editor doesn't treat this character as special, my font (Iosevka term) doesn't include this character, and so it's rendered as the generic "missing unicode" glyph with the code inside it.
A similar thing to the Reddit post mentioned in the article happened to me too: I used a not-a-space character that looks like a space once, the text editor autocompletion remembered it and would occasionally substitute it for space. The code looked OK, but compilation failed or threw syntax errors in run time. This continued for several years until I completely reinstalled the editor, with full cleanup.
Hey, you have missed U+FFA0 HALFWIDTH HANGUL FILLER which has about the same property as U+3164 HANGUL FILLER!
I surely expected this coming ever since I saw the purported Trojan "attack", as the Hangul fillers are pretty much the only characters that are (X)ID_Start and have no visible glyphs [1]. If (X)ID_Continue is also considered, ZWJ and ZWNJ would be other contenders. Attacks using those characters have a much better chance than the Trojan "attack", but you need very specific code to execute the attack. It should be obvious that a typical coding convention easily prevents them.
Much like the purported Trojan "attack", this kind of attack needs better code review and tooling. You don't need to remove non-ASCII identifiers from existing languages: they have their uses when the entirety of your team speaks languages not written in the Latin script. But you should be able to catch a new use of non-ASCII characters throughout your code base and compare that with your expectation.
[1] The Hangul filler comes from a legacy mechanism of KS X 1001 for unencoded Hangul syllables (it had only 2,350 out of 11,172 modern syllables). The half-width Hangul filler probably comes from a duplicate encoding of the filler in the IBM code page 933 to ensure round-trip conversion. Both are never used in practice, except for probably the Hangul filler that was briefly implemented by Mozilla and removed due to the compatibility issue.
Perhaps one could do something similar in JS as well. Like have a config that makes an interpreter fail if it encounters unescaped Unicode in variable names. It does not prevent any Unicode variable names, but you just have to escape them if they are from some list of "abusable characters".
(At least Chrome seems to be happy with `var \u6D4B\u8BD5 = 1;`)
Yes, we do. People from all over the world write software too. They should be able to use the words they know in code.
Also, it's totally cool to have mathematical symbols in code. λ, for example. Much more readable than the word lambda. The only reason these symbols are hard to type is our keyboards suck. They can be made easy to type with editor support though.
>Yes, we do. People from all over the world write software too. They should be able to use the words they know in code
My native language has non-ASCII characters and I do not expect nor do I want to be able to type them outside string literals. Specifically for the reasons stated in the blog post, among others. Writing in my native language is far, far down in the list of priorities as a professional coder, when security / compatibility are there too. Suggesting that non-native English speakers have to be able to code in their native language also would suggest that non-native coders do not take security / compatibility seriously, which would mean that they are unprofessional. I'm pretty sure that it's not your intention to suggest that, but that's kind of how it comes across. With all the problems eliminated by the use of English and ASCII, it would strike me as amateurish to not use English and ASCII wherever possible.
> non-native coders do not take security / compatibility seriously
That's not what I said at all. I don't see how you came to this conclusion.
> With all the problems eliminated by the use of English and ASCII, it would strike me as amateurish to not use English and ASCII wherever possible.
Not everybody speaks english. I've taught programming to quite a few people and they all attempted to use normal characters while writing code. There's absolutely no reason why that shouldn't work. I don't see how characters like ç or ã or ü could possibly cause security issues. Go ahead and ban the invisible unicode stuff but there's absolutely no reason why these common letters shouldn't work.
It is funny that you are using the existence of a segment of the population that I am a part of, to make your claim but aren't willing to listen when a member of the segment is trying to explain how non-ASCII characters and coding do not mix well.
Sure, you could make a fix for this specific case, but the problem mentioned in the blog post is not even close to the only problem of non-ASCII characters. In theory, yes, we could make a language and a full suite of tooling that would play nice with non-ASCII characters. But it's not like the whole non English speaking world is waiting for this to happen. People code in English even in teams where everyone speaks Finnish. Nobody even questions it, because it's so obvious that all code should be in English and ASCII. Everyone has shot their foot, putting in non-ASCII characters in the source code at some point of their career, if they have ever dared to try. That's how the reality is, and at the same time I hear people saying that the existence of those Finnish programmers means we have to have Unicode in source code.
>That's not what I said at all. I don't see how you came to this conclusion.
I didn't say you said it. I said that's how it (probably accidentally) comes across when you talk about something so carelessly. Non English speakers care about compatibility and security and take those seriously, therefore we pretty much always write code in English and ASCII.
> It is funny that you are using the existence of a segment of the population that I am a part of, to make your claim but aren't willing to listen when a member of the segment is trying to explain how non-ASCII characters and coding do not mix well.
Why is it funny? I'm also a member of that group. English is not my native language.
> But it's not like the whole non English speaking world is waiting for this to happen.
I don't think we should have to wait for this to happen. In many ways, it's already happened: most modern languages already support unicode symbols.
> People code in English even in teams where everyone speaks Finnish. Nobody even questions it, because it's so obvious that all code should be in English and ASCII.
Relatively few people speak english in my country. I have only a few friends who do. A whole team of people writing code in english just doesn't seem likely where I live. I actually tried writing english code in such a context once, the result was a mixed language mess that I quickly reverted back to my native language. Unicode support is great because it makes the non-english code much more readable.
Europeans in general seem to know english very well. This is not the case everywhere. Somehow making english a requirement for programming just doesn't sound fair to me.
It applies to other contexts besides code. For our user table we have a MariaDB collation based on the Unicode confusables list, which avoids confusable usernames (they are treated as already existing).
Good code is maintainable code. And while you, as the original programmer, might be perfectly comfortable writing your code using Arabic variables and comments, what if the next person who has to maintain the code is from Korea? Or Russia? Or France? Or China?
OK, maybe you're a small startup in Taiwan and so you don't care about the next maintainer in your company not being able to read or write Chinese. What if you decide to open source your code? Or Meta decides to offer you a zillion dollars to buy you out, but after they do their due diligence they realize the code is utterly unmaintainable: they want to outsource internationalizing the code so it will work in Brazil, which requires native Portuguese speakers (who can preferably be paid low, low wages), but those people can't understand the code because it's using Chinese variables and comments. And then Meta decides to back out of the deal?
If you're likely to work with an international team, it makes sense to use english. That's not always the case though. Plenty of those low-paid brazilian programmers you cited will never do that. Many of them don't speak english to begin with.
For example, the school I went to had a simple web application for student feedback. Attachments were allowed. People started running into issues due to non-ASCII characters in file names. I reported the issue to the IT department and even helped them fix it. The Python code was written in portuguese, accents and everything. Why shouldn't accents be used in this case? It's unlikely this code will ever be used in an international context.
ASCII is the standard for code for good reason. Everyone can type it. Put whatever you want in comments, but you shouldn't make people have to copy/paste your variable names.
It won't. The same approach works just fine in your build specification or other config files. And it doesn't solve the root of this problem, which is that you are compiling source code you don't control and don't audit closely into your binary. Sneaky text is not the only way of getting malicious code through code review.
> 1. It creates a security consideration with confusable identifiers (and lints don't always catch these)
O/0 and I/1/l are confusable characters within ASCII. I'm not kidding here, they are actual entries in the Unicode confusables database [1]. But no one wants to remove those characters from identifiers.
> 3. It may not render correctly depending on fonts
So does Unicode in comments and string literals. In fact the purported Trojan "attack" was mostly about string literals. So why should they be allowed in strings but disallowed in identifiers?
> 4. It may be hard to type depending on keyboard layout
Did you know that not every Latin keyboard layout supports a backquote (`)? This was the actual reason that the repr(expr) shortcut got removed from Python 3 [2].
> 5. There really isn't a good reason to use non-ASCII idents anyway
My canonical answer from experience is that not every programmer who can understand English documentation can easily write and comprehend English in general. For those people, non-ASCII identifier support is a great relief, as it frees them from choosing "correct" English identifiers. You can disallow them for your project if you want (or conversely, make them an optional feature disabled by default), but they are relevant for someone else.
> it frees them from choosing "correct" English identifiers
Even if you have fluent English skills, sometimes translations just confuse the issue. It's sometimes better to use an untranslated word instead of introducing ambiguity, especially when a term originates from a local law.
> O/0 and I/1/l are confusable characters within ASCII.
You're mixing up two different ways that people use the word "confusable": things that look similar in some fonts, versus things that look exactly the same regardless of font. I want the latter to be banned from source files but not the former.
Confusables are a defined concept in Unicode [1]. And there seems no other good way to define "confusables", since many if not most pairs of characters are distinguishable in some but not all fonts and you can always make a font that distinguishes every code point (I once did that, for example distinguishing Latin-Greek-Cyrillic homoglyphs in subtle ways).
But we are talking about Unicode identifiers, and the Unicode recommendation doesn't allow BiDi markers in identifiers and has a provision to limit the use of ZWJ and ZWNJ in them.
Non-ASCII identifiers can be useful for maths too. E.g. I use λ sometimes, especially in Python where "lambda" is a keyword. (I have AutoHotKey and Espanso hotstrings to make typing such symbols easy.)
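Purely illustrative:

λ = lambda x: x ** 2   # the keyword "lambda" is taken, the letter λ is not
print(λ(4))            # 16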
This argument seems to assume that people who do not know English cannot figure out how to use ASCII character set for identifiers. If so, it is rather incorrect. Is there something else here, maybe?
The compiler disallowing them globally might count as that. But individual crates enforcing an "ascii only" policy makes sense, if they never plan to use non-ascii.
Personally I'd prefer even one step further: the compiler would disallow them by default, and you can opt into specific character sets/languages at a crate level. e.g. `AllowSpecialCharacters("de")` to enable on special characters common in German.
There's definitely a benefit to using a linter and a tool such as prettier. Using prettier pushes the hidden character onto an additional line in the checkCommands array which makes it much easier to spot that something is wrong even if you're not using the trailingCommas setting.
I think this eslint rule would also be able to defend against the initial destructuring of the query object, by defining a regex that identifiers have to match which would exclude those invisible characters: https://eslint.org/docs/rules/id-match
I imagine this can be defended against pretty easily with a lint rule that prevents these Unicode characters in variables. Pretty ingenious little hack though.
The eslint rule id-match, which requires identifiers to match a specified regular expression, would be useful here; for example, a pattern that only permits ASCII characters in identifiers would exclude these invisible ones.
A malicious PR could also add the character to your eslintrc too though. You'd be forgiven for seeing the line change in the diff and thinking it was just some reformatting.
That would show up in a diff and would hopefully elicit a question in code review about why the .eslintrc file was being changed in this way. This is another good argument for a comprehensive code review process.
Also, you could lock this file down with a CODEOWNERS file so only certain trusted contributors could modify the lint configuration. You could also do exclusionary pattern matching to make sure none of the bad characters exist in identifier names... Or you could write your eslint configuration as a separate module to be npm installed... or you could write an eslint rule plugin that disallows non-ASCII identifiers and then npm install that... lots of different ways to skin this cat to add security.
It seems like things displaying diffs could use a specific color for lines only changed by formatting or indentation (indentation can have significant meaning like in Python but this would probably be good enough)
I believe that Git diff - which has features not supported by regular diff such as --word-diff - can differentiate between whitespace-only line changes. The Jetbrains IDE, which I believe uses Git diff behind the scenes, will show who originally wrote a line even if it has been whitespace-reformatted later.
This whole story has been stupid. These ideas have been around for ages and are not novel to the security community. Yet we've seen headlines like "all programs ever are vulnerable to this new hack." The root cause is not unicode characters but instead untrusted text. It isn't like a malicious library would be unable to sneak backdoors in through ascii source anyway. Heck, we just had a big kerfuffle over this happening in the linux kernel this year.
Or worse! Go look at the dependencies for some large enterprise system built in java. How many raw jars do you think are being included in there? Has anybody looked at the bytecode of these jars?
The point is not that the vulnerability is a trait of JavaScript, but to demonstrate how different Unicode characters can be used to create a vulnerability, exemplified by a piece of JavaScript.
Yes, I understand that, but the title and content don't explicitly mention the "detail" that the same attack vector exists for other languages and data formats.
It isn't, any relatively dynamic language is going to have these or similar issues. Many moons ago I saw similar examples in bash, I'm sure they are possible in PHP, ..., ..., ...
In fact, even the stricter languages probably do too: the “accidentally run something malicious via care-free use of exec” issue exists in just about every language that has “exec” or similar - it is a data-trusting error in the programmer's logic, not an issue with the language itself. The dynamic nature of some of JS's syntax is just one way to pollute the data being fed to exec, among the other sources (user input, being too trusting of config in the DB or filesystem, and so forth).
Javascript is a very good option to use for examples though: most devs know it well enough and it is everywhere so the potential scale of the danger is obvious, even more so in light of people being far too trusting of dependencies pulled via NPM and the recent examples of malicious updates getting into common packages.
Maybe the title could be a bit less click-baity, though I'm not sure what would be used instead that wouldn't be overly wordy for a punchy article title.
>> That said, nothing on my buildchain actually throws an error or warning.
Use hooks for CI on pre-commit / merge and pull requests, e.g. like this Git hook which would catch bidirectional Trojan Source characters:
#!/usr/bin/env python3
import sys
import subprocess

# Bidirectional override/isolate control characters (LRE, RLE, LRO, RLO,
# LRI, RLI, FSI, PDF, PDI) used in Trojan Source attacks.
bidi_chars = '\u202A\u202B\u202D\u202E\u2066\u2067\u2068\u202C\u2069'

# Each stdin line is "<old-sha> <new-sha> <ref>", as a Git receive hook provides.
for line in sys.stdin:
    old, new, ref = line.split()
    diff = subprocess.run(['git', 'diff', old, new],
                          stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT,
                          text=True)
    if diff.returncode != 0:
        print(diff.stdout)
        sys.exit(f'git diff ended with rc={diff.returncode}, receive TERMINATED')
    if any(c in diff.stdout for c in bidi_chars):
        print(diff.stdout)
        sys.exit('Possible Trojan Source Attack, receive REFUSED')
I wish GitHub/GitLab would provide such features out of the box, following best practice, so people can stop pasting them from the web or reinventing their own version in every team...
My editor (vim) will warn me with a loud visual red block for any non ascii char outside a string literal. But I do not think that is enough. Compiler and interpreter must be more strict.
After seeing this thread I added the following to my vimrc:
highlight link NonASCII Error
autocmd Syntax * :syntax match NonASCII "[^\d0-\d127]"
Obviously haven't been using it long, and I'm not confident enough in my vim knowledge to vouch for its correctness, but it works in the limited amount of scenarios I tested so far.
It has `editor.renderControlCharacters`, but only recently started natively displaying a few dangerous, previously invisible ones (directional overrides) [1]; besides that, you had to use an extension that adds highlights for non-ASCII non-whitelisted [2] or predefined [3] characters.
This was originally done with the goal of trying to hide/encode one program within another using non-displayable characters (such as zero width spaces), I just never got around to it. But reading this article has kind of reignited that interest for me and I think I might take another crack at that soon.
Ah, the famous Hangul filler. That's actually a Unicode bug they have refused to fix for some years now. It's still listed as an identifier character. I fixed that in my interpreter, cperl.
The next bugs are actually all JavaScript bugs, as they accept Unicode identifiers but don't check against the Unicode security guidelines, ignoring any profile. Accepting bidi, mixed scripts, unnormalized identifiers. This is very common, 99% of all interpreters and compilers don't care about Unicode security at all. They are rather proud to accept everyone, and point fingers at colleagues who only accept ASCII english.
Identifiers need to be identifiable by a human. That's the whole point. And the system needs to block illegal identifiers.
Similar to filesystem drivers, which treat path names as identifiers, but the driver writers think they are beyond such human issues. For them there is only garbage in, garbage out. Their path names are certainly not identifiable. A directory can contain bidi names, or mixed Russian and Greek scripts that all look the same, or names that are simply not normalized. There can be multiple visually duplicate names, and you never know which is which. At least with domain names they came up with a punycode solution, but that was only the tip of the iceberg, and a rather awkward workaround.
I think the recommendation to disallow any non-ASCII character is throwing out the baby with the bathwater.
How about code that wants to display some emojis? It would be cumbersome to use hex unicode everywhere. And while localisations should typically happen in a separate language file, it's very common to want some text in code intended for a single audience.
Blocking all the confusables might be tricky, and an allow list would be endless. Perhaps some magic pre-processor comment that says "allow unicode in this file".
> I think the recommendation to disallow any non-ASCII character is throwing out the baby with the bathwater.
Not throwing out all non-ASCII characters from code-files. Just throwing them out as being invalid identifiers in your code (think variables, function-names, etc).
> How about code that wants to display some emojis?
Fine. You quote that emoji in a string, and it's golden.
If you try to make a variable with the name of an emoji, however, your code crashes.
That would close this particular attack (but not the BIDI one the article mentions). But there is probably already too much code out there with π=3.14 in it to be feasible to do this.
I really thought that using the Greek letter for pi (or theta, etc.) was something you do to show your programming language supports Unicode identifiers, but that nobody actually does in real life. I wonder how people input this: do they know the Alt+xyz combo, do they select-copy-paste, or is there another way to write these characters that I'm not aware of?
Just to be clear, I don't mean people who are actually using Greek language for input - it's pretty obvious how they would type that character :)
Do you really have to write emoji in the code string? Similarly with international language characters. The sane thing is to use either json config files or i18n libraries.
If you are writing something intended for a single audience using i18n libraries can be unnecessary overhead. And emoji can also be icons like ⌘ that can be useful to display in the UI.
Yes, the Unicode characters are a problem. But do the norms and tooling play a role here as well?
Explicitly casting types, like String parameters to integers, would make this much more explicit. The convenience of accessing parameters via destructuring, versus explicitly calling request.getParameter("\u3164"). Having a static array of permissible commands declared elsewhere.
There's something to be said for verbosity and explicitness. Where the tooling and norms shun it, these 'invisible' backdoors gain an advantage.
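A minimal sketch of that style in Python (the command name "mytool", the parameter names, and the allow-list are made up for illustration):

import subprocess

ALLOWED_COMMANDS = {"status", "version"}       # static allow-list declared up front

def handle(params):
    command = params.get("command", "")
    timeout = int(params.get("timeout", "5"))  # explicit cast: garbage fails loudly
    if command not in ALLOWED_COMMANDS:
        raise ValueError(f"command not allowed: {command!r}")
    # no shell, no string interpolation: only vetted values reach the subprocess
    result = subprocess.run(["mytool", command], capture_output=True,
                            text=True, timeout=timeout)
    return result.stdout

Looking parameters up one by one, instead of destructuring whatever arrived, means a key spelled with an invisible character simply never gets read.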
Some examples of historic attacks you could embed in algorithms:
“Salami slicing” is a kind of embezzlement where eg an insider programs the computer to credit small amounts to the last account (and then opens an account with a name beginning with Z).
In the 90s there was a massive hushed up scandal where the programmers developing the early Barclaycard made the pseudo random number generator for pin codes just issue three distinct pins. This meant that a stolen card could be easily used because they could guess any pin in three goes before the ATM swallowed the card.
This is hardly an exhaustive list. It’s just to get people’s cogs turning… :)
> In the 90s there was a massive hushed up scandal where the programmers developing the early Barclaycard made the pseudo random number generator for pin codes just issue three distinct pins. This meant that a stolen card could be easily used because they could guess any pin in three goes before the ATM swallowed the card.
Took some digging to find any working links these days. The three pin thing is on page two but it doesn’t name which bank; I may have misremembered and it might not have been Barclays. The whole article is a good starting point for digging into other vulnerabilities and exploits too https://www.theregister.com/2005/10/21/phantoms_and_rogues/
Compilers and interpreters need a new pass to detect these characters in code and treat them as hard errors. This doesn't stop their use in comments where presumably they are still ok.
Alternatively, there needs to be an uptake of the use of code linters and pretty printers.
Just being curious, I pasted the example into Geany and VSCode, and in both the invisible character was visible :) I can't remember setting any special character / whitespace visibility options, but I think it is good to have this kind of option always on.
It is. The way it was explained to me, all the APIs you use are in English so naming variables in your local language is futile at best and would just require constant context switching.
I remember many moons ago MooTools announced international API translations as an April Fools joke. It did make me wonder if there’s an interesting programming experiment to be done there… but I’m a native English speaker so I’m not best positioned to know!
Not my experience. Plenty of codebases have variables in local language in France Germany and Sweden (where I have experience).
I have actually encountered a lot of problems with English codebases in those countries, as they often try to translate regional concepts that are not directly translatable. This is particularly annoying when it comes to administrative stuff, where one English word can refer to different local concepts (e.g. geographical divisions of the territory) and the translations are always clumsy. I have even seen nasty bugs come from there, where a "county" had different meanings in different places in the code because different teams had different ideas of what a county was but didn't discuss it.
While this is an interesting hack, the larger issue in the example is allowing any query parameter to write into a subprocess. exec() immediately throws flags for me, especially when it isn’t necessary like in the case of making an http call. Even when it isn’t passing arbitrary inputs from the web to the command line, it’s susceptible to DoS that could crash the whole kernel instead of just the web server. I get that this is just a contrived example to show the risk of hidden characters, but please don’t use process.exec() unless you have no other options.