DuckDuckGo \u202E (duckduckgo.com)
348 points by zeepzeep on Feb 15, 2022 | 118 comments



Everyone here is asking if this is an "intentional easter-egg" or an "accidental bug"

But what about accidentally working-as-intended?

Sure it's a little trickier to read, but it's certainly not a "bug" that will cause any damage / danger / instability / etc.


I don't get your take.

Even the most strict definition of bug doesn't imply it has to "cause any damage / danger / instability / etc." to be one.

And I won't call it "working as intended" when the purpose of this feature is to provide an answer for a human to read, and it failed at that.


I'd warmly beg to differ. I personally think it's illustrating how it is supposed to work, most eloquently.


I propose "accidental feature" for this sort of thing.


I like it, surprised the legions of Skyrim players haven't already coined that term


“It’s not a bug it’s a feature”


Problem is, this behavior is so outside of the range of common expectations, it's really hard to say if it's harmless or not and what are the worst cases for (ab)using it.


It's telling that the description (https://unicode-explorer.com/c/202E) even acknowledges that 202E is commonly used as an exploit. "The Right-To-Left Override character can be used to force a right-to-left direction withing a text. This is often abused by hackers to disguise file extensions: when using it in the file name my-text.'U+202E'cod.exe, the file name is actually displayed as my-text.exe.doc - so it seems to be a .doc file while in reality it is an .exe file."
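To make the quoted file-name trick concrete, here is a deliberately simplified sketch of what a renderer does with such a name. It assumes the only bidi control present is U+202E, optionally closed by U+202C (POP DIRECTIONAL FORMATTING), and that the affected text is plain Latin. `simulate_rlo_display` is a hypothetical helper for illustration, not the full Unicode Bidirectional Algorithm.

```python
RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def simulate_rlo_display(text: str) -> str:
    """Approximate the visual order of text containing one RLO.

    Simplification: everything between the RLO and the next PDF (or the
    end of the string) is shown reversed; the controls are invisible.
    """
    before, sep, rest = text.partition(RLO)
    if not sep:
        return text  # no override present, nothing changes
    overridden, _, after = rest.partition(PDF)
    return before + overridden[::-1] + after


stored_name = "my-text." + RLO + "cod.exe"
print(simulate_rlo_display(stored_name))  # what the user sees
print(stored_name.endswith(".exe"))       # what the system acts on
```

The stored string still ends in ".exe"; only the displayed order changes, which is why the disguise works.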


> "accidentally working-as-intended"

An expression in French for this: "Tomber en marche" (literally: falling into walking). When something breaks we say it "tombe en panne" (falls into being out of service), when something works we say it "marche" (walks). So this expression is like "falling into a working state".

I wonder about the ratio of unknown bugs vs features that accidentally work, in the wild. Such features are time bombs waiting to explode during the next refactoring.


The page redirects to u202E (no backslash) which is a normal word. I think it's an Easter egg.


I don't think it's intentional. It just recognizes a Unicode code point written in the uXXXX syntax, even without the backslash, and then includes the literal character in the info box without any consideration for special characters.

For example this shows an @: https://duckduckgo.com/?q=u0040&ia=answer
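A rough sketch of how such an instant answer could be built, and how it could avoid echoing control characters literally. This is not DuckDuckGo's code: `describe_query`, the accepted query syntax and the output format are assumptions modelled on the info box text quoted in this thread.

```python
import re
import unicodedata
from typing import Optional


def describe_query(query: str) -> Optional[str]:
    """Build an info line for queries such as 'u0040' or a backslash-u escape."""
    match = re.fullmatch(r"\\?[uU]\+?([0-9A-Fa-f]{4,6})", query.strip())
    if match is None:
        return None
    code_point = int(match.group(1), 16)
    # Reject values outside Unicode and lone surrogates (not encodable as UTF-8).
    if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        return None
    char = chr(code_point)
    name = unicodedata.name(char, "<no name>")
    utf8 = " ".join(f"0x{b:02X}" for b in char.encode("utf-8"))
    # Characters in the "Other" categories (Cc, Cf, Co, Cn) are described
    # but not echoed, so they cannot affect the surrounding text.
    printable = not unicodedata.category(char).startswith("C")
    shown = char + " " if printable else ""
    return f"{shown}U+{code_point:04X} {name}, decimal: {code_point}, UTF-8: {utf8}"


print(describe_query("u0040"))
print(describe_query("u202e"))
```

With the category check in place, the U+202E answer describes the character without containing it, so the text is not reversed.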


I don't know. I feel uneasy when the info banner reverses all the text ("This Instant Answer was made by the DuckDuckHack Community.").

Because the text looked very odd to me I highlighted the nonsensical text "noitatneserper lausiv" and context-menu searched it on Google. To my surprise it googled for "visual representation", and while retrying because I thought that maybe Google's engine auto-"corrected" the text, I noticed that even the text in the context-menu stated that it would google for "visual representation".

Then seeing that it was "noitatneserper lausiv" in reverse, maybe also in combination from the first hit "U+202E RIGHT-TO-LEFT OVERRIDE - Unicode Explorer", it felt like the browser had done something it should not do by actually applying the reversion to the info box.

When inspecting the HTML tag of the info box it displays the string "&#x202E U+202E RIGHT-TO-LEFT OVERRIDE, decimal...", but whenever I try to do something with it, it gets either reversed or messed up.

Another bug: When I select the entire text in the info box, I get " U+202E RIGHT-TO-LEFT OVERRIDE, decimal: 8238, HTML: No visual representation, UTF-8: 0xE2 0x80 0xAE, block: General Punctuation" <-- (btw, this was NOT what I had first entered into the textfield before this edit)

And trying to append a double quote to the text above, it inserts it at the beginning of the line, actually after the E202+U. When I expand the textbox so that the entire paragraph is in one line, E202+U moves to the end.

All this is creepy and I bet that it won't be long until an exploit with this uncontrollable Unicode character will hit the first vulnerable servers and browsers. This feels like Unicode is playing with fire.

Edit: From https://unicode-explorer.com/c/202E

> The Right-To-Left Override character can be used to force a right-to-left direction withing a text. This is often abused by hackers to disguise file extensions: when using it in the file name my-text.'U+202E'cod.exe, the file name is actually displayed as my-text.exe.doc - so it seems to be a .doc file while in reality it is an .exe file. There's even an xkcd comic for this character!


Probably a bug, browser history shows the title as "[object Object]", which is what happens if you print an object value that cannot be 'adequately' serialized to string in javascript e.g. ({}).toString()


The title of the page is "u202e at DuckDuckGo" which doesn't even have any funky unicode in it.

So you might have something else going on if it shows up as "[object Object]" in your browser history.


It is almost certainly an accident, but it might have been left in on purpose!


Ha, it changed. It was indeed a bug


You still have to be mindful of \u202e in anything new that you're writing, but browsers do a much better job of not having it bleed across elements like they did back in the 2000s.

Back in the era of forums that didn't support unicode correctly (2005ish?), it was trollish fun to post messages containing \u202E and watch the UI and all subsequent messages and elements get messed up. (One stray \u202E would flip the entire page contents following it.) I never took it to a level of abuse since it was easy to remove and then ban offenders, but it was fun in a one-off thread, and it always had great reactions.

I patched my own software to handle it, but I don't recall anyone really abusing it in a widespread manner. (Contrast this with the era of prolific and widely abused AOL/AIM exploits that would kill your IM client with malformed messages.)

IIRC, a bunch of messaging clients also didn't (or still don't) handle \u202e termination and it sometimes bled into new messages and even the text input box. That was pretty horrible and unfixable without restarting.

Obligatory XKCD: https://xkcd.com/1137/

Some shenanigans in the wild:

https://www.reddit.com/r/Unicode/comments/hc1rxi/i_put_a_rig...

https://twitter.com/mkolsek/status/1237123571341803522

(These are way tamer than the effects used to be.)

(Also, HN filters it out. I tried to have some fun. :P)


I have seen it used maliciously in the wild. Email attachments like invoice.tab.pdf are actually Batch files named invoice.fdp.bat with this character inserted.


Reversed: U+202E RIGHT-TO-LEFT OVERRIDE, decimal: 8238, HTML: No visual representation, UTF-8: 0xE2 0x80 0xAE, block: General Punctuation


Similarly, if I try https://www.google.com/search?q=u202e, the second result I currently get (YMMV) is from https://unicode-table.com/, and almost the entire snippet shows up backwards in the search results.


It's backwards on the original too: https://unicode-table.com/en/202E/


Yup the meta description field as written is flipped in the serp

What is even more hilarious is if you copy/paste out of the developer tools, that is also backwards after pasting.


https://unicodeplus.com/U+202E too. You can see the point where it switches from Left-To-Right to Right-To-Left.


Stacking combining diacritics[1] is also fun, to make extremely tall text.

Also fun is enumerating all the characters in the Private Character section[2] to see what UI symbols are able to be inserted into unintended places.

[1] https://www.unicode.org/charts/PDF/U0300.pdf

[2] http://www.unicode.org/faq/private_use.html https://www.unicode.org/charts/PDF/UE000.pdf
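For anyone who wants to try the stacking trick, a minimal sketch: it appends the same combining mark repeatedly to a base letter. `stack_marks` is a hypothetical helper, and U+0301 COMBINING ACUTE ACCENT is just one arbitrary mark from the chart in [1].

```python
import unicodedata


def stack_marks(base: str, mark: str = "\u0301", count: int = 10) -> str:
    """Return the base character followed by `count` copies of a combining mark."""
    if unicodedata.combining(mark) == 0:
        raise ValueError("mark must be a combining character")
    return base + mark * count


tall = stack_marks("a")
# One user-perceived letter, but many code points.
print(len(tall))
print(sum(1 for ch in tall if unicodedata.combining(ch)))
```

How tall the result actually renders depends on the font and the text shaper.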


> Stacking combining diacritics is also fun, to make extremely tall text.

A bit OT, but here is a classic example of that (the much upvoted stack overflow post on parsing html with regex):

https://stackoverflow.com/a/1732454



I always wondered how people get these funny Twitter names, thx!


If there was ever a clear signal that working with Unicode is incredibly hard, it would be the fact that no one on HN can decide if this is accidental or intentional.


Let me take a stab at a definitive answer:

– It is unintentional for DuckDuckGo. The code for DuckDuckGo works correctly but no one who wrote that code thought about whether a reversal would happen.

– It is intentional for the browser. The code for the browser works correctly and someone who wrote that code actively thought about how to make a reversal happen.

I don’t think ‘accidental’ is the right word to use in either case because the outcome is what you would want.


The reason I used "accidental" is because it's not a bug (and you've alluded to that same conclusion too). You could argue it's accidental from the perspective of DDG if it happened by chance rather than design. But the distinction between "accidental" and "unintentional" is nuanced and I'd already offered "intentional" as the alternative option, so I'd argue you can pretty much use them interchangeably in this specific situation.


It certainly looks like a simple template that DDG applies consistently to all queries for a Unicode code point escape. It's the exact same template for a query for a more straightforward one, like u0041.

So I think it's fair to say that it's not intentional in the sense of being a deliberately added easter egg. Of course, they might be aware of the behavior and decided to leave it that way.


The hardest problem in software engineering: to close with as-designed or out-of-scope.


And some of us don't even get what this is about. Should I be seeing DDG doing something particular here?


The "answer" tab is right to left


I had that turned off. Thanks for explaining it.


Joining two pieces of text and having one destroy meaning in the other is certainly a bug, most commonly a security bug. If you look at the search results in the original link, much of the discussion involves using it to hide file extensions and similar information hiding attacks.


A significant portion of the problem seems to be that some people can't even identify what's going on because the tools they're using to inspect the page are also showing it reversed.



It's accidental, because other characters are also displayed: https://duckduckgo.com/?q=u20aa


Yes, but this is not a printable character.

None of these will be shown, but ddg will recognise them as control characters though. https://www.compart.com/en/unicode/category/Cc


It's intentional, because there is no RTL override in the HTML source, the string is merely reversed.


> no RTL override in the HTML source, the string is merely reversed

What? After opening the source, ctrl-f "representation" selects the reversed word. The source view just happens to interpret the RTL override.


but there is, see:

  document.querySelector(".zci__body").textContent.charCodeAt(0)
  document.querySelector(".zci__body").textContent.substring(1)


Our programming languages might need a unicode aware string concatenation operator, similar to locale aware capitalization. Joining LTR text to RTL text seems like it should result in combined LTR + RTL text, not letting the LTR marker override and change meaning.


It does look like HTML supports this via the <bdo> tag [0]:

  data:text/html,<bdo>&%23x202E;reversed</bdo>&nbsp;not reversed

So I guess this should be used to wrap any user-supplied text that allows arbitrary unicode.

Or using Unicode:

  data:text/html,&%23x2068;&%23x202E;reversed&%23x2069;&nbsp;not reversed

[0] https://developer.mozilla.org/en-US/docs/Web/HTML/Element/bd...
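The same Unicode-only wrapping can be done when building strings in code. A minimal sketch, assuming the user text contains no stray U+2069 of its own (which would end the isolate early); `isolate` is a hypothetical helper name.

```python
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(user_text: str) -> str:
    """Wrap user-supplied text so its bidi controls stay inside the wrapper."""
    return f"{FSI}{user_text}{PDI}"


# Mirrors the second data: URL above: the override ends at the PDI.
label = isolate("\u202ereversed") + " not reversed"
print([f"U+{ord(ch):04X}" for ch in label if ord(ch) > 0x2000])
```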


Umm, there's a little info button to the right that says that this 'quick' answer was proposed by a DuckDuckHack community author.


Are there any lists of unicode characters (like the OWASP one) that should be blacklisted from most apps (not just for XSS, but even for desktop apps)?

Are there any good security guides/best practices for unicode sanitation?


How are users supposed to write "עבור אל duckduckgo.com כדי לחפש באינטרנט" without \u202E? It's perfectly normal for RTL languages to switch text direction in the middle of a sentence.


That should just render correctly thanks to the BiDi algorithm. The "override" control characters are a heavy hammer, and are extremely rarely needed. In fact, at this point I think it's likely that malicious use of these code points significantly outweighs correct use.

There are legitimate uses of BiDi control characters. My favorite one from my time on Android was the string "Google+", which would render as "+Google" in an RTL paragraph. The translators would usually "fix" this by just flipping the string so that it was "+Google", which would render correctly, but be incorrect when cut'n'pasted, read by a screen reader, etc. The correct solution is to use a left-to-right mark after the plus sign, so that the plus sits between two left-to-right characters. The string "Google+\u{200e}" renders correctly in both LTR and RTL flow. And these "mark" characters are basically harmless, they cannot profoundly change the order, they just fix some of these ambiguous cases.

Correct use of BiDi control characters is explained here: https://www.w3.org/International/questions/qa-bidi-unicode-c...


And then you get Arabic and English text quoted in Japanese vertical RTL text and that's the story of how I actually died.


Do you read left RTL, middle LTR, right RTL; or right RTL, middle LTR, left RTL? (Just curious.)


Imagine it was the other way around: like you wanted to reference תֵּל־אָבִיב-יָפוֹ in the middle of an English sentence. That, but reversed.


It is different though, because with LTR RTL LTR:

1) once you've read the first LTR you end in the right place for the RTL

2) even if you consider the LTRs to be two distinct documents it doesn't change the order. I can imagine the LTR 'breaking' the RTL text such that the sections are treated separately.

But I understand what you were saying, so thanks for the answer!


On 1, no you’re not. After the first LTR you’re at the end of the RTL.

On 2, I’m not sure I understand. The situation is the same for both primary directions.


> Do you read left RTL, middle LTR, right RTL; or right RTL, middle LTR, left RTL? (Just curious.)

YES.


please don't blacklist U+202D and U+202E or the Private Use Area. my conlang has a right-to-left cursive script, and it's not in Unicode. the characters live in the PUA and my font renders them as a fallback. there's no mechanism for fonts to ask for RTL, so I have to use bidi override.


I do think it's kind of sad that the PUA doesn't have various areas with different properties (RTL, whitespace, joining, etc)


Not a full security guide, but if you haven't seen this before it's useful to have...

https://github.com/danielmiessler/SecLists/blob/master/Fuzzi...


I've seen this before but either this is new since last time or I missed it, either way: lol

    # Human injection
    #
    # Strings which may cause human to reinterpret worldview
    
    If you're reading this, you've been in a coma for almost 20 years now. We're trying a new technique. We don't know where this message will end up in your dream, but we hope it works. Please wake up, we miss you.


I'm not falling for this one, I know no one misses me!


For unicode security considerations see http://www.unicode.org/reports/tr36/

The report is divided into visual and non-visual security issues. Our old friend RTL override is covered, but mostly in the context of URLs.


Put it inside a <span dir="auto"> ?

Anyways Unicode category Cf is probably what you are looking for, but blocking them is probably wrong as they serve an important function.
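A small sketch of reporting Cf characters instead of blocking them, so a human or a later policy step can decide what to do. `find_format_chars` is a hypothetical helper; it relies only on Python's `unicodedata` module.

```python
import unicodedata
from typing import List, Tuple


def find_format_chars(text: str) -> List[Tuple[int, str]]:
    """Return (index, description) for every category-Cf character in text."""
    return [
        (i, f"U+{ord(ch):04X} {unicodedata.name(ch, '<no name>')}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


print(find_format_chars("invoice.\u202efdp.bat"))
# ZERO WIDTH JOINER is also Cf and is needed by some scripts and emoji
# sequences, which is one reason a blanket block is risky.
print(unicodedata.category("\u200d"))
```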


I don't think this is a good place for a blacklist. Text effects should be encapsulated and reset at the end of the text block, the way bold or italic effects are.



Well done. I don't understand this TFA, nor do I understand (fully) the xkcd cell. But I get the connection. Thanks :)


U+202E is a Unicode codepoint, a control character that signals that letters (or other characters) should be printed right-to-left, as in Arabic or Hebrew.

What the DDG link illustrates is that when someone searches for information about that codepoint, DDG's autogenerated answer section accidentally _uses_ that control character (reversing the answer text) instead of just printing the codepoint.


Oh that's cute! Translation for anyone curious / lazy:

Punctuation General :block ,0xAE 0x80 0xE2 :8-UTF ,representation visual No :HTML ,8238 :decimal ,OVERRIDE LEFT-TO-RIGHT 202E+U

Love the demos :)


I'm not sure whether this is a bug or a feature^Weaster egg


Oversight, probably. By default, the code point is displayed next to that description, and they don’t turn that off for bidirectional control characters.

https://duckduckgo.com/?q=u1f4a9

(Yes, I have that one memorized)


If you look down the page, some preview elements are also reversed. I think this may be accidental.


I'm out of the loop, what kind of Easter Egg is it?


The text in the instant-answer bar is reversed for this result. Which could plausibly either be on purpose, or a result of the character itself being inserted and not escaped, so having its intended effect.


The funny thing is that search queries preceded by a backslash on DuckDuckGo are supposed to take you to the first search result, but that functionality seems to be buggy anyway:

https://www.reddit.com/r/duckduckgo/comments/sp9e5r/backslas...


Reminds me of searching for the terms "do a barrel roll", "recursion" or "askew" on Google. I'm sure there's plenty of others.


Instantly reminded me of a relevant xkcd: https://xkcd.com/1137/


Hey that's new to me, I'll use this, thanks.


Can anyone explain what this is all about? I’m looking at the link and threads and have absolutely no idea what’s supposed to be significant here


the Unicode codepoint with hex value 202E says "from here on, render the rest of the text from right to left" (something that's useful for Arabic scripts, for example).

Duckduckgo shows info about the codepoint and the codepoint itself in a box between the search field and the actual results, and in it, the text is rendered reversed (right to left), because that's what the codepoint tells the browser to do (and DDG doesn't have extra logic yet to either inject another "now render from left to right again" marker, or otherwise prevent it from messing up the info box).


Thank you!!


Hahahah #metoo!


Easter egg or bug?


bug egg? it's also an instant answer from the community (the little info icon on the right hand side) so perhaps just presented that way due to how it was delivered by that specific community member.


Poe's Law applied to coding easter eggs? :D


That's the question!

(I think it's unintended though)


Easter bug?


"This Instant Answer was made by the DuckDuckHack Community.

Developer: Cosimo Streppone

Developer: mintsoft"


And somehow, the "external link" icon is outside the scope of Unicode.


> This is often abused by hackers to disguise file extensions: when using it in the file name my-text.'U+202E'cod.exe, the file name is actually displayed as my-text.exe.doc

So every programmer has to know about and support U+202E, but not filesystem programmers?


More like UI programmers? It seems that almost everyone has agreed that text-processing smarts inside a filesystem are a bad idea (see: the NTFS collation table, the APFS transition away from ancient-version-NFD-but-not-quite), although there is that island of (admittedly very smart) -insensitive but -preserving holdouts (casing on Windows, normalization on ZFS). Linus rants on the topic[1] passionately, if not very informatively.

Note that U+202E is a control code that has effect on display, not the logical order of the text (much like, say, a bare CR), so I can’t say what the filesystem is doing wrong here (except maybe for not rejecting this outright, but see re smarts above, this probably needs to be done on a higher level). You don’t blame the filesystem for believing the filename "A\rB.txt" starts with A and not B, do you? Even though ls will say otherwise.

Bidi IRIs (which are at that higher level) are kind of horrendous, though.

[1] https://yarchive.net/comp/linux/utf8.html


That's pretty much correct. Most of the filesystems I'm aware of just treat filenames as a "string of bytes" with some list of characters that aren't allowed, and perhaps a few other rules. Other than that, it's a free-for-all on names.


What do you want the filesystem programmer to do?


> What do you want the filesystem programmer to do?

Replace:

    if(bytestring_ends_with(filename, ".exe")) execute_file(...);

By:

    if(last_displayed_glyphs_equal(filename, ".exe")) execute_file(...);


The filesystem isn't executing anything so if anything you'd want the file manager or shell programmer to handle it. But yours is a terrible solution that would mean everyone else interacting with the filesystem to handle it too. Better to adjust the display code to treat extensions specially (if it doesn't already) and make sure that it is clear to the user what the real extension is.
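One way the display side could do this, as a sketch only: spell out invisible control and format characters and show the real extension separately. `display_name` and its output format are made up for illustration; no particular file manager is claimed to work this way.

```python
import os
import unicodedata


def display_name(filename: str) -> str:
    """Render a file name with invisible controls spelled out,
    followed by the extension taken from the stored (logical) name."""
    visible = "".join(
        f"<U+{ord(ch):04X}>" if unicodedata.category(ch) in ("Cc", "Cf") else ch
        for ch in filename
    )
    _, ext = os.path.splitext(filename)
    return f"{visible}  [type: {ext or 'none'}]"


print(display_name("my-text.\u202ecod.exe"))
print(display_name("A\rB.txt"))
```

The stored name is left untouched, so nothing else that interacts with the filesystem has to change.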


    if (!isascii(c)) panic("stupid user");


  если (!кои(с)) авост(«тупой оператор»);
You wouldn’t want to live in that world, would you? I know I wouldn’t, and I have that as my native script and most of my filesystem in Latin. I’ve spent my childhood with a computer that ran a VGA-chargen-reprogramming hack at startup and later had to maintain a website stored in an encoding designed to preserve legibility after Latinization through amputation of the 8th bit (in case you’ve ever wondered where the illogical order of KOI-8 comes from). I do not want that world back, however fondly I remember my 286.


I probably wouldn't mind it if it were the lingua franca in computing.


> I probably wouldn't mind it if it were the lingua franca in computing.

And in programming, I don’t! It’s more like a weird pidgin lingua celto-germano-franca with funky morphology, but I love it nevertheless. I’ve read the Unicode identifiers spec, and frankly, however much I like my Agda with that special Unicode maths sauce, I’m not sure I’d be better off with that in my compiler.

An old and grizzled plant worker who needs a new computer-operated lathe, though, will rightfully tell me to take a hike if I try to sell him a machine that only speaks and accepts a foreign language, and his boss will support him. (It depends on the country: a French person will look down on you if you don’t try to speak their native language to them, and a Norse one will think you’re looking down on them if you do.) I might be able to hold out for a couple of decades, but ultimately, my computer will speak the lingua franca to computing professionals and the native language to users, or somebody else will build one that does.

This means user-facing, user-specified identifiers such as file names will need to support at least these two languages, and given a requirement for exchanging data in a global network, essentially every other one as well. You might try to tell users they’re supposed to use some other kind of identifier, but given these are still going to need to be human-readable, integrity-critical, equality-supporting, globally-exchangeable identifiers, I don’t see how that does anything except rename the problem.


Same works for urls.


What's next, searching for the word death causes you to die?


That would be an interesting instant answer.


Gives a whole other meaning to “I’m feeling lucky!”


Also known as "Top Gun"


Where does DDG get its search results? Do they scrape Google? If so, how do they not get banned, both technically and legally?



They have their own web crawlers, as well as a deal with Bing (And perhaps others)


Extremely bad design. This kind of complexity should have been moved to some kind of post-processing spec rather than core Unicode. It's already causing issues and will cause more. The more universal something is, the more effort should be applied to keeping it simple.


... It’s not clear how? Except by telling every speaker of Arabic and Hebrew saying they want some of that delicious “plain text” action to go screw themselves (there are no purely-RTL texts, only bidirectional ones, not least because of the Indic numerals). AFAIU (at least from the full-length horror novel that is the CDRA) IBM tried presentation-order (and no-complex-shaping) RTL text for decades and gave up, so Unicode bidi is essentially the result of said giving up (and the “Arabic Presentation Forms” block the foul-smelling corpse of the idea).

Specify the dominant direction of your user-input-containing elements, people, and/or enclose the input in U+2068 FSI ... U+2069 PDI (after balancing outstanding bidi controls inside).
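A sketch of what "balancing outstanding bidi controls" before wrapping might look like. It follows my reading of how UAX #9 pairs the controls (PDF closes embeddings and overrides, PDI closes isolates along with anything left open inside them) and has not been checked against a reference implementation; `balance_bidi_controls` and `isolate_untrusted` are hypothetical names.

```python
EMBEDDING_OPENERS = {"\u202a", "\u202b", "\u202d", "\u202e"}  # LRE, RLE, LRO, RLO
ISOLATE_OPENERS = {"\u2066", "\u2067", "\u2068"}              # LRI, RLI, FSI
PDF = "\u202c"  # closes an embedding or override
PDI = "\u2069"  # closes an isolate
FSI = "\u2068"


def balance_bidi_controls(text: str) -> str:
    """Drop stray closers and append closers for anything left open."""
    out = []
    stack = []  # closers still owed, innermost last
    for ch in text:
        if ch in EMBEDDING_OPENERS:
            stack.append(PDF)
        elif ch in ISOLATE_OPENERS:
            stack.append(PDI)
        elif ch == PDF:
            if not stack or stack[-1] != PDF:
                continue  # stray PDF: drop it
            stack.pop()
        elif ch == PDI:
            if PDI not in stack:
                continue  # stray PDI: it would close an outer isolate
            while stack[-1] != PDI:
                out.append(stack.pop())  # close embeddings left open inside
            stack.pop()
        out.append(ch)
    out.extend(reversed(stack))  # close whatever is still open
    return "".join(out)


def isolate_untrusted(text: str) -> str:
    """Balance the input, then wrap it in FSI ... PDI."""
    return FSI + balance_bidi_controls(text) + PDI
```

Balancing first matters because an unmatched U+2069 in the input would otherwise end the wrapper's isolate early.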


> Except by telling every speaker of Arabic and Hebrew saying they want some of that delicious “plain text” action to go screw themselves

The problem is not with Arabic or Hebrew. The problem is that this modifier affects other languages and characters in a way the vast majority of people clearly wouldn't expect (otherwise the story wouldn't make it to the front page).

> Specify the dominant direction of your user-input-containing elements, people, and/or enclose the input in U+2068 FSI ... U+2069 PDI (after balancing outstanding bidi controls inside).

The level of arrogance packed in this sentence is just mind-boggling.

There are many other "Easter eggs" in various basic technologies. I can assure you that no matter how high of an opinion you have about yourself, if you write any production code at all, you are guaranteed to be using something that contains other Easter egg design decisions. You're not aware of them, you're not mitigating them and therefore whether they will explode on you is mostly just a matter of luck.

Minimizing "Easter egg" design decisions is the only long-term viable way to get complexity in our already complex environment under control.


>> Specify the dominant direction of your user-input-containing elements, people, and/or enclose the input in U+2068 FSI ... U+2069 PDI (after balancing outstanding bidi controls inside).

> The level of arrogance packed in this sentence is just mind-boggling.

It’s not arrogance, really, it’s just that I’ve been reading on this exact thing for the last couple of months, and the relevant knowledge is rather unpleasantly smeared over multiple documents in several places (W3C and Unicode.org mostly), so I tried to condense the recipe into a single sentence and drop some terms an interested person could look up: I was attempting to pack information. I see now how that could come off as arrogant, but can’t think of appropriate circumlocutions that could ward that off without turning it into a full bidi-in-HTML tutorial (which I am not qualified to write, for one thing). I already write too many unsolicited tutorials in my comments, this is me trying not to :(

> There are many other "Easter eggs" in various basic technologies. I can assure you that no matter how high of an opinion you have about yourself, if you write any production code at all, you are guaranteed to be using something that contains other Easter egg design decisions. [...]

I’m aware I have limits! I know lots of those! I discover new ones every day!

(I dread the day I need to figure out how an 802.11 retransmission works and how to fight one, for one thing. I can’t do post-2010 JS frontend to save my life, and my database knowledge is somewhere around “there were those guys with the normal form, I think?”. Limits? I’ve got ’em.)

I also expect that once I know about a footgun, I have a responsibility to avoid it, and that people who have just encountered such generally want to hear how to avoid it as well. I’m not entirely competent at the communication part. Sorry.

As to the actual issue... I could say that if you’re handling multilingual text, then you should damn well know how multilingual text works, that it’s not peripheral to your problem.

But I don’t actually believe that, not completely: I think this bidi thing is needlessly hard and we should have directional-stack-balancing and directionality-isolating functions in our standard libraries the same way we have URL-escaping or HTML-quoting ones. Perhaps even have the templating handle most of these cases automatically. It’s like with SQL injection: I don’t have a right to complain people are writing vulnerable queries if we don’t have convenient tools to write correct ones. Unfortunately, in the bidi case, we don’t, so we’ll have to treat this like spun glass until someone makes them.

(That’s part of why I’ve been looking into this so much lately.)

[Previously]

> The problem is not with Arabic or Hebrew. The problem is that this modifier affects other languages and characters in a way the vast majority of people clearly wouldn't expect (otherwise the story wouldn't make it to the front page).

As far as I know, this is not solvable. Or rather, this specific thing is, and the right-to-left override (U+202E RLO) is kind of a screw-up due to this kind of nonlocal effect on surrounding text (it might even be a holdover from the IBM days?), but you can’t design RTL such that it can be ignored by unaware programmers, with or without directional controls. Last I checked (several years ago), a post in Hebrew would wreak considerable destruction on an LTR Facebook news feed, no controls required.

The problem is of distinguishing a white zebra with black stripes from a black zebra with white stripes: Are you looking at RTL text with LTR pieces inside or LTR text with RTL pieces inside? (If you don’t see why this would change the layout, the Unicode Bidirectional Algorithm spec has examples.) What if the pieces themselves include opposite-direction quotes? How do you know where the pieces end in the presence of characters with no intrinsic direction (punctuation, emoji)?

You can encode everything in LTR display order. Your RTL-script users, DBAs, search engine developers, etc. will hate you.

You can require explicit indicators. If this needs to work in plain text (and it does, if Arabic and Hebrew are to do plain text at all, because RTL text requires embedded LTR pieces fairly often), you’ll have to express that in format controls. But then if a user manages to drop a right-to-left switch into English text, which couldn’t care less about RTL, the text will get completely messed up and the user gets to complain why RTL influences English. You may try to completely disallow controls in markup that has alternative ways of expressing directionality, but then your input method, your clipboard, etc. needs to know about every possible kind of markup, or every markup processor needs to generate equivalent controls. To at least limit the scope of the disaster, you declare that the effect of the controls ends at a paragraph boundary, but then you need to tell where that is, and the kind of “plain text” you inherited has no good way of distinguishing a mere hard line break from a paragraph terminator except by not-so-plain “protocol” conventions. So you’ll need to guess.

You can ditch explicit indicators and guess. Your processing algorithm will need to know which scripts have which direction, of course, but that’s not a problem. Given the presence of quotations and such in plain text, it’ll also need to learn about paired delimiters and which of them pair with which others, and try to recover when the pairs are wrong or unbalanced, because users are awful. Because of the aforementioned zebra problem, you’ll also need a way to guess which direction of a piece of text is the main one, which seems intractable without godlike NLP, so maybe just take the first character with a definite direction and tell people who start sentences with an opposite-direction fragment they lose? Overall, the whole guessing game becomes so complex it’s completely impossible to reliably embed an arbitrary fragment of user input inside your text unchanged (without inserting visible compensating delimiters, for example), so some kind of format controls that manipulate a stack of directions are called for.

The Unicode design does most of the above; it is complex and could undoubtedly be simpler—there’s like three generations of “no, that’s a bad idea, let’s try again” in there. But it seems like some indication from a programmer that they want to insert this inner thing, that should remain intact, into this outer thing, that shouldn’t get messed up in the process, would be required in any logical-order design at all; you won’t be able to just concatenate byte sequences. It’s acting on that indication that could stand to be easier.
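For what it's worth, the closest thing to that "indication from a programmer" in current Unicode is the isolate pair: FIRST STRONG ISOLATE (U+2068) and POP DIRECTIONAL ISOLATE (U+2069). A rough Python sketch of what using it looks like follows; the helper names `isolate` and `embed` are hypothetical, and only the two code points come from the standard.

```python
FSI = "\u2068"  # FIRST STRONG ISOLATE: direction is taken from the fragment itself
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE: ends the isolated run


def isolate(fragment: str) -> str:
    """Wrap a fragment so its direction should not leak into the outer text.

    Hypothetical helper. It does not inspect the fragment, so a fragment that
    already contains its own unbalanced isolate controls could still escape.
    """
    return f"{FSI}{fragment}{PDI}"


def embed(template: str, **fragments: str) -> str:
    """Fill a str.format template, isolating every inserted fragment."""
    isolated = {name: isolate(value) for name, value in fragments.items()}
    return template.format(**isolated)


# A user name that starts with a right-to-left override (U+202E).
message = embed("User {name} left a comment.", name="\u202eevil")
print(repr(message))
```

The point of the sketch is the shape of the API rather than the two characters: the caller says "this is an inner thing" instead of concatenating strings, which is the part that plain concatenation cannot express.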


> I could say that if you’re handling multilingual text, then you should damn well know how multilingual text works, that it’s not peripheral to your problem.

Trouble is, anyone using Unicode and accepting user inputs is effectively handling multilingual text, unless they explicitly filter it out. Which includes the vast majority of websites and even web-based user interfaces for standalone hardware.
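As a sketch of what "explicitly filter it out" might look like, assuming Python and a deliberately blunt policy of dropping every explicit embedding, override, and isolate control: the names below are hypothetical, and the character ranges (U+202A to U+202E and U+2066 to U+2069) are the explicit directional formatting characters. The implicit marks (LRM, RLM, ALM) are left alone here, which is a judgment call and not a recommendation.

```python
import re

# LRE, RLE, PDF, LRO, RLO (U+202A..U+202E) and LRI, RLI, FSI, PDI (U+2066..U+2069).
BIDI_FORMAT_CONTROLS = re.compile("[\u202a-\u202e\u2066-\u2069]")


def has_bidi_controls(text: str) -> bool:
    """Report whether the text contains any explicit directional control."""
    return BIDI_FORMAT_CONTROLS.search(text) is not None


def strip_bidi_controls(text: str) -> str:
    """Remove explicit directional controls, leaving all other characters."""
    return BIDI_FORMAT_CONTROLS.sub("", text)


query = "\u202ehello"  # user input that starts with a right-to-left override
print(has_bidi_controls(query))
print(repr(strip_bidi_controls(query)))
```

Note the cost: a filter like this also removes controls that Arabic and Hebrew writers inserted on purpose, so it trades one group's surprise for another's.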

> As far as I know, this is not solvable.

I am sure it is solvable in the sense that it is possible to make the behaviors less surprising and complicated without sacrificing people's ability to use right-to-left languages. There would have to be a discussion about underlying assumptions and real-life usage to achieve that, however.

Generally, though, I don't see a legitimate use for ever reversing left-to-right languages when displayed to the user. That's not what anyone would expect, not even the writers of right-to-left languages. And the myriad malicious uses are kind of obvious. And the long-term effect of people abusing these will be websites banning more control characters, which will affect users of Arabic and Hebrew.

Also, with the way Unicode is being developed it is increasingly unclear what "plain text" even means these days. AFAIK, there isn't even a formal definition of that term. Maybe that's where the discussion should really start. What capabilities separate "plain text" from other things?


There is boustrophedonic ancient Greek and other languages. Unicode is kept generalized to support such schemes.

https://en.wikipedia.org/wiki/Boustrophedon


I strongly disagree. This is a necessary part of shared text content, and pushing this type of functionality into another layer makes a lot of content inaccessible in basic text format. This is precisely the type of control character that makes Unicode such a powerful and successful system.


Bad design by what definition? Unicode is all about unifying ALL the characters from ALL the character sets into a single one, while also being compatible with 7 bit ASCII.

Emojis made it into Unicode because Japanese had custom emoticons, that just had to be brought into Unicode. Then someone discovered them on iOS and they skyrocketed in popularity.

If you want everyone to use Unicode, you truly have to account for everyone's use cases. No exceptions. Even if it means including Emojis, Ancient Egyptian hieroglyphs[0], or something as irrelevant to every language using the Latin script as an "RTL override character".

[0] https://en.wikipedia.org/wiki/Egyptian_Hieroglyphs_(Unicode_...

I'd say it's perfect design, with pretty good implementation too.


&#8238;

damnit hn


It's intentional, if you inspect the `innerText` you'll see it's reversed there too:

    zero_click_wrapper.innerText.codePointAt(0)
Evaluates to 32. And if you think 32 = 0x20 could mean the next one would be 0x2E, then no, codePointAt(1) is 0x55.


`innerText` doesn't include the RTL marker, probably due to the fact that it is supposed to reflect the "rendered" appearance of the element (i.e. deleting certain invisible characters). However, `textContent` shows the RTL marker as expected.

I'm on the side of this being an unintentional effect.


> `innerText` doesn't include the RTL marker

I'm too under the weather to dig into this, but this might be a mismatch between Firefox and the spec. I don't see in the spec [1][2] where this character could be removed since it shouldn't count as whitespace for whitespace processing.

It looks like in Chrome `innerText` contains the override. And the innerText spec is only 6 or so years old (!) so it wouldn't be too surprising if there were a lingering incompatibility.

[1] https://html.spec.whatwg.org/multipage/dom.html#the-innertex... [2] https://drafts.csswg.org/css-text/#white-space-processing


Why can't I just disable RTL on my system?

I do not speak a word of Arabic. There is no circumstance in which my life will be materially improved by correct RTL text rendering. I might want proper display of individual characters so I can copy-paste them, but I have no use for RTL text.

On the other hand, RTL causes a lot of unpleasant problems like this. Why can't I simply coerce all foreign languages into LTR?



