> If you want to see bizarre sort rules, look up how french sorts accent charact...

bawolff · 2025-05-19T18:11:52 1747678312

Here is a blog post talking about it https://archives.miloush.net/michkap/archive/2004/12/31/3447...

Or a more technical version at https://www.unicode.org/reports/tr10/#Backward

Another case that is kind of weird is Thai https://www.unicode.org/reports/tr10/#Rearrangement

thaumasiotes · 2025-05-19T20:33:46 1747686826

> Here is a blog post talking about it

I notice that the post suggests that the Académie française specifies that accents should be sorted in reverse, and includes a link over the words "Académie française", and yet that link doesn't go to a supporting document.

A while ago I complained on this forum that Amazon's hyphenation for Kindle ebooks is abysmally bad. (Which is still true.) Someone responded to say that the hyphenation algorithm for English requires this. I pointed out that the hyphenation algorithm for English is a lookup table; each word has its hyphenation defined in the table, and when you need to hyphenate a word, you look up the hyphenation points.
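To make the lookup-table idea concrete, here is a minimal sketch in Python. The table contents and the names `HYPHENATION_TABLE` and `hyphenation_points` are made up for illustration; a real table would come from a dictionary, and the two entries below are placeholders rather than authoritative hyphenations. The point is only that a word missing from the table has no defined answer.

```python
# Hypothetical, tiny stand-in for a dictionary-derived hyphenation table.
# The entries are illustrative placeholders, not authoritative data.
HYPHENATION_TABLE = {
    "algorithm": "al-go-rithm",
    "table": "ta-ble",
}


def hyphenation_points(word):
    """Return the character offsets where `word` may be broken,
    or None if the word is not in the table at all."""
    entry = HYPHENATION_TABLE.get(word.lower())
    if entry is None:
        # The table says nothing about this word; do not guess.
        return None
    points = []
    offset = 0
    # Every piece except the last one ends at a permitted break.
    for piece in entry.split("-")[:-1]:
        offset += len(piece)
        points.append(offset)
    return points


print(hyphenation_points("algorithm"))  # [2, 4]
print(hyphenation_points("cetot"))      # None
```

Returning `None` for an unknown word, instead of falling back to some guess, is the design choice that matters here.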

Another response linked me to a paper describing how this table can be stored as a set of rules that provide hyphenation points in arbitrary letter sequences rather than dictionary words. That paper is very clear about its goals; it is an advance in data compression, proposing a method of storing a lookup table that takes less space than the table does. It carefully goes over how to produce the ruleset from the table.

But somewhere along the line, people confused the data compression algorithm (of storing the lookup table as a ruleset) for the hyphenation algorithm. They will now tell you with a straight face that a single ruleset that seems to have gone around represents the hyphenation algorithm for English, and that it applies even if the word you want to hyphenate wasn't in the table that ruleset was prepared from. And this is false.

It looks to me like something similar has happened in English speakers' understanding of French sorting order. It's very easy to explain why the example quadruplet has the sorting order it does:

    cote
    côte
    coté
    côté

(Note that the Stack Exchange question from 2024 and the blog post from 2004 use exactly the same example.)

These four words have two pronunciations, and words that share a pronunciation are grouped together. After that, "cote" comes first because it bears no accents, and within each pair "o" comes before "ô" for the same reason.

What's happening here is that although French generally pretends that "e" and "é" are the same letter, they aren't, which forces -e (not pronounced) to come before -é (pronounced!). "o" and "ô" actually are the same letter, and can be ordered flexibly.

The rule "sort the accents in reverse" arises as a coincidence; it happens to be the case that this distinction is most significant at the end of French words. But French speakers would reject this ordering:

    cetot
    cétot
    cetôt
    cétôt

This doesn't come up because those words don't exist.
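For reference, the mechanical rule being discussed ("sort the accents in reverse") can be sketched in a few lines of Python. This is a simplification and not the Unicode Collation Algorithm itself: it assumes base letters can be compared as plain strings and accents by code point, and the names `accent_key` and `backward` are made up for this example. It shows that the backward rule yields both the "cote" ordering and the "cetot" ordering above, while a forward accent comparison yields neither.

```python
import unicodedata


def accent_key(word, backward=True):
    """Sort key: base letters first, accents as a tie-breaker.

    With backward=True the accents are compared starting from the
    end of the word, which is the rule described for French.
    """
    base = []
    accents = []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and accents:
            # Attach the combining mark to the preceding base letter.
            accents[-1] += ch
        else:
            base.append(ch)
            accents.append("")
    if backward:
        accents.reverse()
    return ("".join(base), accents)


real = ["côté", "coté", "côte", "cote"]
made_up = ["cétôt", "cetôt", "cétot", "cetot"]

print(sorted(real, key=accent_key))
# ['cote', 'côte', 'coté', 'côté']
print(sorted(real, key=lambda w: accent_key(w, backward=False)))
# ['cote', 'coté', 'côte', 'côté']
print(sorted(made_up, key=accent_key))
# ['cetot', 'cétot', 'cetôt', 'cétôt']
```

The last line is the ordering I claim French speakers would reject; the code only shows that the mechanical rule produces it.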