What's the most efficient language? (yakkomajuri.github.io)
65 points by yakkomajuri on Jan 23, 2022 | 112 comments


For folks who are interested: this is actually an already well-studied topic! It is called "language complexity". The consensus is that in aggregate (all language artifacts like writing, conjugation, syntax) no language is more complex than any other, since complexity on one dimension (say, Chinese writing for Mandarin) is compensated for in another (Mandarin morphology or conjugation).

This is called the compensation hypothesis, at least in phonology (the study of how we speak a language).

Pixels per character is certainly an interesting dimension of complexity and I would encourage the author to try to get this ready for publication at SIGMORPHON or something like that!

See https://oxford.universitypressscholarship.com//mobile/view/1... for more on complexity


> The consensus is that in aggregate (all language artifacts like writing, conjugation, syntax) no language is more complex than any other, since complexity on one dimension (say, Chinese writing for Mandarin) is compensated for in another (Mandarin morphology or conjugation).

That is broadly the consensus, with a couple of exceptions:

- Writing is not part of the language and doesn't factor into complexity anywhere. Chinese writing is much more complex than the writing systems of most other languages, but that's just not relevant to the spoken language.

- Some languages are believed to be generally simpler than average due to having gone through a phase involving a large number of adults learning the language. Mandarin is one of those languages, as is English.


I've taken note to read up more on language complexity - had come across it before but didn't dig too deep!


> The consensus is that in aggregate (all language artifacts like writing, conjugation, syntax) no language is more complex than any other, since complexity on one dimension (say, Chinese writing for Mandarin) is compensated for in another (Mandarin morphology or conjugation).

This doesn’t sound right; it’s easy to imagine a terribly inefficient language (replace every letter/phoneme in English with thousands), and probably easy to make a language that’s worse in every aspect than an existing one.

Which means you should be able to go the other way, and identify a language more efficient in every aspect than an existing one, unless they’ve all hit maximum optimality within their constraints.

But that’s unlikely, because language is burdened by the constraint of history, which tends to lock inefficiencies in place in favor of “minimum disturbance” when introducing change. And since we don’t care about that constraint when judging language efficiency, it should unlock some optimizations not yet applied.


> This doesn’t sound right; it’s easy to imagine a terribly inefficient language (replace every letter/phoneme in English with thousands), and probably easy to make a language that’s worse in every aspect than an existing one.

Maybe GP was referring to actual languages, which are optimized by use, rather than possible languages.


Sure, but if they can be effectively compared in theory they should also be comparable in reality.

And given that languages were developed over different histories and constraints (and lengths of time), it seems to me that a “terrible” real-world language is likely available and identifiable. And though perhaps difficult to compare, then there should be a “best” language, or at least, “top-N” class of languages, that are clearly superior to their peers.

It’s highly unlikely that all languages are comparably well-optimized.


If you can show this, you could get a publication out of it. Until then, you might find that reading the existing literature referenced by the commenter above is helpful.


I’m not claiming I can do so, and asking me to do so is a cop-out — I’m simply reasoning about whether it’s possible to qualify a language at all — to do better than “they’re all the best” — because we know how to define something downright awful.

But yes, at the top of the optimization spectrum you will always see trade-offs, as languages can’t push every boundary simultaneously. But it’s difficult to imagine that every language has reached that point. Or that no two languages in the world fall under similar goals/constraints and evolve toward the same targets, such that one would be better than the other at those shared goals.

How to find such a pairing, or define it, is a different issue. But identifying a terrible language should be possible, and thus I would expect we can do better than “everyone’s a winner”.

And yes, I can read the literature (give me a minute, gotta ship the book, and probably any existing counterarguments first…), or you could just tell me where my reasoning has failed, since I’ve laid it bare (I’m assuming you’ve read the existing literature, to determine that the answer I seek lies there… you’re not the kind of guy to send me on a wild-goose chase, are you?)


The difference between the "equal complexity" argument and your argument is a matter of degree, not kind. I don't think anyone believes either extreme of the argument (e.g. I don't think anyone believes that, for any reasonable real-valued metric you can possibly come up with, any two languages will share the exact same value to the millionth decimal point).

The implicit contention of the "equal complexity" argument is that in fact the constraints that shape different languages are basically the same. Although on the edges the constraints are different, the overwhelming bulk of human experience and emotion is shared among all human populations. Hence we should expect all widely used real-world languages to converge at approximately the same "efficiency." While two societies from opposite ends of the world may seem completely different, compared to the fundamental human emotions that drive the vast majority of communication, those differences are minuscule. Likewise, even though different languages have been around for different amounts of time, beyond a certain time-frame (say several hundred years), they've had enough exposure to the constraint process to also converge on the same "efficiency." Just like any asymptotic process, beyond a certain cutoff additional time doesn't really matter. If you believe in a single origin hypothesis for language development in humanity, then there's an even stronger version of this argument, which is that all natural languages are of exactly the same age.

Hence, another way of framing the "equal complexity" hypothesis is the hypothesis as you make a metric a better representative of true "complexity," the metric is increasingly likely to assign different languages similar scores, resulting ultimately in languages with roughly (but not exactly) the same scores. A uniquely "terrible" (or "excellent") language likely indicates a human society with uniquely different emotions and mental frameworks (e.g. a group of humans who are incapable of getting angry), which currently does not seem to be the case for the overwhelming majority of human societies (various observations of the sort "this society is way more polite than this society" seem like trivialities next to things like "this society has no concept of truth"). There are nonetheless cases on the fringes which seem like interesting exceptions (the most famous is the Pirahã language) and have therefore inspired a lot of controversy ("this society cannot count" definitely counts as a big difference in constraints), but they seem confined to an extraordinarily small sliver of cases (and being so small and little-understood, also have a lot of controversy surrounding them).

Ultimately the "equal complexity" hypothesis is something of an empirical hypothesis, not a theoretical one. I don't think even among its proponents anyone would say it is literally impossible to have one language be more complex than another. Rather the contention is that for languages with sufficiently many native speakers, complexity tends to converge rather than diverge, and moreover this convergence is fast enough that for the overwhelming majority of natural languages, they are approximately of the same complexity.

All that said, in the absence of an agreed upon metric for complexity, there isn't any good way to have a rigorous argument for or against the "equal complexity" hypothesis. From my point of view, the hypothesis is sufficiently vague that I don't expect to ever see a good resolution. For me, the practical point of the "equal complexity" hypothesis is as a rule of thumb to keep in mind that provides an overarching nonrigorous intuition for several observed patterns that do exist across all languages and helps a lot with language analysis and learning (e.g. if one stumbles across a feature of a language that seems very complex, one should immediately be on the lookout for what kind of simple speech and writing patterns it enables, since those are likely to be the idiomatic version of what you're trying to say. And vice versa, if you come across a feature that seems astoundingly simple, think about where it fails and what other circumlocutions could be employed. These two rules alone are astoundingly helpful for developing idiomatic speech and writing in a foreign language.).


What’s the best animal? Sport? Operating system? In every case, it raises the question “best at what?”. Same here.


I think something is more complex the more rules it has. For example, if every instance of "is", "am", and "are" were turned into "be", English would be less complex because there would be fewer rules, e.g. "I be happy. You be sad. He be bored". I believe this is how Mandarin does it.


Each of those conjugations has a different subject. By standardising them all into the same word, it's possible that you make it more efficient for the producer (speaker or writer), but the receiver (listener or reader) has to put more effort into understanding.

This is why languages evolve. If something is pointless, it tends to disappear from the language over time. And it's why artificially produced languages just don't seem to work - they are perhaps more efficient at communicating across boundaries, but less efficient in the primary use case of communicating within a boundary.


I see what you're saying but in real life pretty much all languages are right on the efficient frontier. Just for instance, it turns out that in bits/second every language is in about the same range, with less dense languages spoken faster. It's not obvious a priori but the evidence bears it out. There are presumably small differences but they're pretty much washed out by noise.


Uh? Given that I'm French, I doubt very much that this is true. Foreigners learning French have to spend a lot of time learning stupid things such as: is it a she-stone (une pierre) or a he-stone (un caillou)?

Knowing whether to write an or en when both sound the same (on/om, ai/è/es/est/ê ...). Lots of irregularities, too.

All these things bring a lot of complexity without any real meaningful gain.

At the opposite end of the spectrum, a language like Esperanto is really easy to learn.


>All these things bring a lot of complexity without any real meaningful gain.

This assertion doesn't contradict the author's claim that no language is overall more complex than another. The usefulness of the complexity isn't part of the argument.

>At the opposite end of the spectrum, a language like Esperanto is really easy to learn.

Esperanto is a constructed/invented language, deliberately created to reduce complexity. It's implied that the author was comparing the complexity of different languages that evolved naturally.


Moreover, the implied consequence of the "equal complexity" thesis is that over time, were Esperanto to be more widely used as a mother tongue for a large population, it too would collect additional complexity, as certain stylistic choices would start hardening first into idiomatic and non-idiomatic language patterns and then further into new grammatical and phonological rules.


I'm not so sure: now we have academies which try to codify language changes.


Haitian Creole speaker here. We're in a weird spot, where the majority of the vocabulary comes from French, the grammar from West African languages, with heavy influence from Spanish and English. Because the majority of the population historically couldn't write, the language is easy to learn and speak. But the writing is not as precise as French or English, making it a bit tedious when explaining things. Verbs have only one form, with marker words to indicate mood and tense. Numerous words have multiple definitions or are very vague, so meaning is usually conveyed by using comparisons and analogy. Easy to do when talking, adding a storyteller-like quality to a conversation, but tedious when writing. The redeeming quality is that the grammar is simple and the orthography more so.

https://en.wikipedia.org/wiki/Haitian_Creole


It's interesting, but flawed. Pixels don't matter, the cardinality of the set of characters does.

You could replace every kanji with a minimal visual entropy version and make a "more efficient" language by the pixel complexity metric. Sure, it'd be hard to read.

Information theory applied to linguistics is pretty well-trodden ground. Some of the earliest applications of information theory were answering questions like this correctly.

Importantly, languages do not evolve to be "maximally efficient" in the bits-per-second sense. They evolve to be maximally effective. You'll find that the difference between entropy and word length (written or spoken) is almost entirely spent on channel coding. When you lose characters, or parts of characters, or words, or phonemes, the meaning can be recovered. If you were running at entropy, a lost character would make the entire message nonsense.

Edit: the same goes for visual entropy too. The "most efficient" language is the one I just made up: where the language with the largest cardinality of words (I'll pick Kanji) has each character mapped to a unique bitmap of noise. Easier to just leave the characters as they are and let the computer do that part ;)
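The channel-coding point lends itself to a quick demonstration. Below is a toy Python sketch (an illustration of the information-theoretic idea, not a linguistic experiment): raw text keeps enough redundancy to survive a lost character, while the same text squeezed toward its entropy does not.

```python
import zlib

text = ("Natural languages carry redundancy well above their entropy, "
        "so a lost character rarely destroys the message.")

# Damage the raw text: drop one character. A reader barely notices.
damaged_text = text[:4] + text[5:]

# Squeeze out the redundancy, then damage the stream the same way.
compressed = zlib.compress(text.encode("utf-8"), 9)
try:
    zlib.decompress(compressed[:-5])  # lose a few bytes from the end
    survived = True
except zlib.error:
    survived = False

print(damaged_text)  # "Natual languages carry redundancy ..." - still readable
print(survived)      # False: near entropy there is no slack left to absorb loss
```

The same trade-off shows up in any lossless codec: the closer the encoding sits to the source entropy, the more a single corrupted symbol costs.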


Using a fixed, non-trivial passage is helpful for normalizing across languages.

One wonders what the literature has to say regarding generalizing information across topics.

Is it easier to convey some topics in some languages? For example, Philosophy in Greek?

In our day, such inquiry may not be possible due to Political Correctness.


It is obscene to suggest political motivations for a strawman that does not exist. If you did a simple google search for "linguistics information theory" you would find that there are still studies constantly coming out.

Divorce yourself from your political tribe for just one moment out of the day.


Hey, you are right, totally. But the tone of your reply is off.

Take it easy, it is not so important.


Not at all. People are so consumed by feed that they benefit from a shock. I doubt some words on a screen from a human will do that, but I'd like to think that I tried.


Well, I honestly was worried about you. Glad you are not annoyed.


Indeed, ours is a time rife with obscenity, alas.


> This was a surprising one to me. Simplified Chinese was expectedly more efficient than Traditional Chinese, but both were beaten out by Cantonese (which also uses traditional characters).

Let's put aside the fact that this guy just equated "Chinese" with "Mandarin". The claim he's making suggests his entire measurement mechanism is broken.

Simplified Chinese is more "efficient" than Traditional Chinese only in that it requires fewer strokes to draw each character. But the author spent quite some time harping on information. Given that Simplified and Traditional Chinese have (for purposes of discussion here) identical characters, just represented differently, from an information standpoint they are exactly the same. Information means something, and it is not the number of pixels.

I know this is outside the topic of the article, but it's worth mentioning that many Chinese scholars think Simplified Chinese has been a disaster. Traditional Chinese characters have strong visual relationships to one another depending on how they sound, whether they have similar meanings, or whether they were derived from the same historical source. This assists in memorizing them: a Chinese person might come across some character she doesn't know but that looks like a set of other characters, and can make a good guess as to what it actually is based on its relationship to them and its context in the sentence. The designers of Simplified Chinese broke a very large number of these rules: similarity didn't matter any more, just reduction in stroke count. As a result, Simplified Chinese is much harder to learn.


Imagine you are limited to only pen and paper and need to communicate with someone using written Chinese. Then even though what you convey will have exactly the same meaning and character count, Simplified will consume less effort (less ink and paper as well) to deliver the same message than Traditional Chinese. Language studies treat a token as the basic unit of information, but in engineering and in practice, OP's idea of pixel measurement is better: you can't transmit information to humans as "tokens", only via a displayed form, i.e. pixels. As for Simplified Chinese being much harder to learn, let's just say Mao proved you wrong, with the vast majority of Chinese learners (outside of China) picking up the Simplified forms much faster and with less effort than the Traditional forms. Even the Taiwanese use the simplified form to write the "tai" in Taiwan.


I'm sorry, but information means something, and it does not equate to number of strokes. A character can have millions of strokes and still convey the same information as a character with a single stroke.

There are other measures of "efficiency" of course. If the author wanted to argue for how long it took to write a sentence, then sure, Simplified is definitely more "efficient" than Traditional in this context. On the other hand, it's pretty odd to be defining languages in terms of written form rather than their spoken form. (And I'm not sure why Cantonese would be more efficient than Simplified Mandarin as Cantonese is normally written using traditional glyphs.)

Anyway, the author inserted information into this discussion. And if efficiency is in terms of information, then this pixel argument doesn't hold water.

As to the difficulty of Simplified Chinese: this is a well-studied topic with a lot of scholarly analysis. I am pretty sure the literature as a whole strongly disagrees with your claim.


Uh, what? Once we're talking about simplified vs. traditional, it should already be clear that the discussion is about the languages in written form? In fact, the entire article makes clear that it is about written language.

Not using "information" in a scientifically proper way is a fine criticism; I just don't understand why you seem to think the author did not define "efficiency" when it is one of the first points made.

Re: Simplified Chinese, I guess I'm bad at searching? Care to provide a review paper?


Hey, appreciate the comment.

The article is not about Chinese so I can't cover every bit of nuance associated with the language group. I have studied it for a bit, so I'm not completely lost in this space.

But putting aside the discussion of Simplified Chinese being better or not, the Limitations section does say:

"Maybe the fact that Chinese characters were originally representative drawings helps association in the brain despite the extra strokes?"


Once I took a look at the original passages used, it's clear what went wrong. The "Cantonese" passage is neither in colloquial Cantonese nor in proper modern standard Chinese. It just reads like a poorly machine-translated jumble of characters.

Traditional Chinese:

> 當您開始使用 Google 服務,即表示您信賴我們對您個人資訊的處理方式。我們深知這份責任重大,因此會盡力保護您的資訊,並為您提供相關的管理功能。本《隱私權政策》旨在協助您瞭解 Google 收集的資訊類型以及收集這些資訊的原因,也說明了您可以如何更新、管理、匯出與刪除資訊。

"Cantonese":

> 在您使用我們的服務時,您將個人資料托付給我們。我們明白這項責任重大,因此我們會盡力保護您資料,並確保一切由您做主。此《私隱權政策》旨在讓您瞭解我們收集的資料類型、原因和更新、管理匯出以及刪除資料的方法。

Here are a few problems that immediately jump out just from an initial reading:

* The Cantonese version uses "us"/"our services" instead of fully spelling out "Google"

* Cantonese speakers don't usually use 您 as it is pronounced the same as 你

* "Data" is usually 資料 and not 資訊; similarly for 了解/瞭解 and a few others

* Finally, the grammar and particles/conjunctions used would be totally different if colloquial Cantonese really is desired.


> Traditional Chinese characters have strong visual relationships to one another depending on how they both sound, or whether they have similar meanings, or whether they were linguistically derived from the same historical source.

This is not incorrect, but it's also less useful than you'd think. Chinese writing has been around for thousands of years, and both pronunciation and meaning have changed greatly in that time, so many combined characters are quite impenetrable. For example, the moon radical 月 yue plus the sun radical 昜 yang means, wait for it, 肠 chang "intestine". Now a student of Chinese will know that as a radical 月 is often (but by no means always) associated with internal organs, but all that really helps you do is avoid ordering dishes on the menu with an unfamiliar 月-something character.


> Given that Simplified and Traditional chinese have (for purposes of discussion here) identical characters, just represented differently, from an information standpoint they are exactly the same.

This is not quite true; sometimes separate traditional characters have the same simplified form. For example, traditional 後 ("behind") and 后 ("empress") share the simplified form 后.


I wonder if Hebrew scores so high because most words leave out the vowels. There are technically annotations on the consonants indicating which vowel should be pronounced, but outside primary school and language schools for immigrants, nobody seems to use them. Arabic has a similar setup. I'm not familiar with Gujarati, but I wouldn't be surprised if it turns out to have a very similar setup.


Hebrew also omits the verb "to be". It is assumed in most cases, and tenses and possessives are denoted in suffixes. Prepositions become prefixes, as does the word "the."

So "I am happy" is

אני שמח

(Ani Sameach lit. "I Happy")

And

"The book is in the house" is

הספר נמצא בבית

HaSefer (the book) Nimtza (is available) B'Bayet (in [the] house).

Six words become three.

(Hebrew was my second language, after Yiddish. Grew up in Brooklyn, and learned "English" on the streets.)


> Six words become three.

Sure, but should you be counting words or morphemes as a measure of complexity?


Well ה (the) and ב (in/at) might be comparable "words" too, so more like 6 words become 5.


Interesting. Spanish, on the other hand, is very flexible about omitting the subject.

So "I am happy" would become "(Yo) estoy feliz" or (I) am happy.


So English was your third language? That's fascinating


Most Indian scripts, including but not limited to Gujarati, are abugidas rather than true syllabaries: each letter has a default vowel (a) and the shape is altered in systematic ways to indicate a different vowel.


glsh cld wn ths cntst hndly if we chs crflly nd use th occsnl matrs lctinis


“English could win this contest handily if we chose carefully and use the occasional …”?


My guess is “amateurish elocutionist” but I needed computer assistance:

    $ for t in matrs lctinis; do echo; grep '.*'$(echo $t | sed 's/./&.*/g') /usr/share/dict/words | grep -Ev "s$"; done
    
    Amaterasu
    amateurish
    amateurism
    humanitarianism
    masterstroke
    materialism
    materialist
    materialistic
    materialistically
    maturest
    metatarsal
    miniaturist
    misanthropist
    
    elocutionist
For my own amusement, I also came up with another grammatically plausible “parsimonious” (inserting the fewest extra letters) interpretation:

> Galosh-clad win this contest handily if we chase carefully and use the occasional mantras, elocutionist!


This is great! As someone who immediately recognized it as 'mater lectionis' I wondered how someone unfamiliar with this aspect of Hebrew would attempt to solve the puzzle.


Probably "matres lectionis": https://en.m.wikipedia.org/wiki/Mater_lectionis (probably should have been "mater lectionis" (mtr lctinis?) given the rest of the sentence ;) )

I didn't know this term existed, but the letters felt like Latin to me.

Presumably the point is to use some well-chosen vowels where they are necessary for disambiguation.


Just so, and this is the reason the trick works so well in Hebrew: aleph, vav, yod, and ayin, serve as vowels in most contexts where such are needed.


I speak both a 'most efficient' language (Arabic) and a 'least efficient' (Malay).

My take is that while Arabic is a very concise language, the learning curve is really steep. Verbs have to be conjugated to accommodate pronouns as well as tense. So in order to say 'I ate', the root word akala has to be conjugated to akaltu. Sometimes the conjugation becomes so complex that it hardly looks like its root word.

Malay, meanwhile, while seemingly less efficient, is much more straightforward. There is no need to conjugate verbs. Instead, to say 'I ate', you say saya (I) sudah (already) makan (eat). It's very accessible to beginners.

I think it's also interesting to see how complex a language can get in order to become efficient. Like can you combine more than pronouns and tenses in a single word to make it more efficient.


> I think it's also interesting to see how complex a language can get in order to become efficient. Like can you combine more than pronouns and tenses in a single word to make it more efficient.

Spanish has an interesting way of linking verbs and pronouns together:

For example: dáselo would be translated to English as: "(You) give it to him/her"

Or: pasáselas would be "(You) pass those to them"


Just like a programming language, in fact.


Humans are mutually programming each other using homoiconic phonetic languages.


Or like CPU architectures. CISC vs RISC


For anyone wanting to undertake analysis like this, a fairly common choice of text for comparing languages is the UN Universal Declaration of Human Rights. It is translated into a staggering number of languages.

See https://www.ohchr.org/en/udhr/pages/introduction.aspx


About 10 years ago, I used the translations of KDE texts to estimate how much space texts in a given language could take when translated into another language (we had an internal web page editor that only allowed putting texts and widgets at fixed x-y coordinates, so allocating extra-space for translation was a sad work-around). http://fabsk.eu/i18n/#en;fr


Some years ago I came across a handbook of Pitman shorthand.

I figured most of the time I spend reading is spent scanning the words with my eyes. Shorthand is so much more concise, it would surely save me time if I wrote a text-to-shorthand converter [ViolentMonkey], and then learned to read shorthand.

After much time learning shorthand, and fiddling with programming a conversion script, I got frustrated. So I wrote to one of the world's experts on Pitman shorthand (you can find anything on the net) and asked her about my project.

Surprisingly, she said that even she, a world expert, reads Pitman at the same speed she reads non-shorthand text. The density of the shorthand, and the fact that the mind has to decode it, make it slower to read!!

Anyways, I never finished, but at least I learned the basics of shorthand :)

For the uninitiated: before computers, dictation was very slow, and reporters needed a faster way to record what they were hearing. They came up with various scripts, wherein lines represent sounds, and for a while these were taught in every college. To this day, shorthand is faster than typing, and there are reporters' chorded keyboards that apply a shorthand to typing, which is even faster. Much faster.


My vote is for Russian.

You know that scene at the end of Rocky IV, after Rocky wins the match and makes a speech, and the match announcer translates each sentence into Russian?

There's a point in that speech where Rocky seems to say a really long sentence, and then the announcer seems to translate the whole thing in, like, 3 syllables.

I always thought it was some kind of joke, as if the announcer decided most of that stuff wasn't that important to translate and decided to just skip to the gist of it.

It wasn't until I learned a bit of Russian myself that I realised it was an honest-to-God translation of the whole thing, and it's just that Russian can be super-efficient sometimes.


This reminds me of the anecdote by Jimmy Carter about a Japanese translator doing what your initial impression described for a joke spoken in English. [0]

[0] https://vimeo.com/36578569


"The magazine had a little travel article, and it was written in English on one page and Thai on the other."

Interesting Sunday study, but language is supposed to be spoken. Arabic and Hebrew skip vowels in written texts, so they seem more efficient. But filling in these vowels requires extra cognitive work. That's why both languages offer extra notation for people who aren't that experienced with the written form. Does it make a difference when speaking a language?


Counting pixels in a block of text to determine information density doesn’t sound like a very robust method to me. You could get very different results depending on the font you use to render the text.


It is worse than just sensitivity to different fonts; it ends up being sensitive to totally incidental details of the glyphs. In a Latin script, all the letters carry about as much information as each other, but this metric would treat a lowercase i as much more efficient than a capital W.
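That bias is easy to demonstrate with a toy version of the metric. The 5x7 bitmaps below are made up for illustration (not any real font's glyphs):

```python
# Hypothetical 5x7 bitmaps; '#' marks an "on" pixel.
GLYPHS = {
    "i": [
        "..#..",
        ".....",
        ".##..",
        "..#..",
        "..#..",
        "..#..",
        ".###.",
    ],
    "W": [
        "#...#",
        "#...#",
        "#...#",
        "#.#.#",
        "#.#.#",
        "##.##",
        "#...#",
    ],
}

def ink(ch):
    """Pixels-per-character metric: count the 'on' pixels of a glyph."""
    return sum(row.count("#") for row in GLYPHS[ch])

print(ink("i"), ink("W"))  # 9 18 for these toy bitmaps
```

Both glyphs encode exactly one letter's worth of information, yet the ink-counting metric charges 'W' twice what it charges 'i' here.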


You are right, but it is still interesting to compare languages that use the Latin script. I find it interesting that, setting aside the non-Latin-script languages, the front-runner is Lithuanian, the second is Swedish, and English comes third by quite a margin.

Lithuanian makes quite heavy use of diacritics, but it seems to pack more information than the others. And other languages with similar diacritics are not close to it.

Although the sensitivity to different fonts and details is real, we can draw some interesting conclusions from this Sunday exercise.


Isn't that intentional? Picking sprawling glyphs for common items, and simple ones for rare meanings would be "inefficient" in some interesting sense. Indeed 'i' is the most common letter in Latin.


English could be rendered in a 4x3 font (or even 3x3 font)..

In this test it'll comfortably beat out any other language.

See: https://fontstruct.com/fontstructions/show/325977/4x3_pixel


1x8 pixel font: https://dotsies.org/


That's cool, but looking at that font, I'm not sure I'd be able to tell an 'e' from a 'c' if there were no context. E.g. "figure e5" wouldn't be clear to me (but "colour" would be).


This might be a good use for fractal dimensional analysis: a letter would have a dimension between 1 and 2, since it covers a portion of a plane. It might better measure the complexity of a glyph, rather than just its weight.

(https://en.wikipedia.org/wiki/Fractal_analysis)
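A minimal box-counting sketch of that idea, using a hypothetical 8x8 glyph and just two box sizes (a real fractal-dimension estimate would fit a slope over many scales):

```python
import math

# Toy 8x8 bitmap of a stroke-like glyph (made up for illustration).
GLYPH = [
    "........",
    "..####..",
    "..#.....",
    "..###...",
    "..#.....",
    "..#.....",
    "........",
    "........",
]

def boxes_hit(bitmap, size):
    """Count size x size boxes that contain at least one 'on' pixel."""
    hit = set()
    for y, row in enumerate(bitmap):
        for x, cell in enumerate(row):
            if cell == "#":
                hit.add((y // size, x // size))
    return len(hit)

# Box-counting dimension estimate from two scales:
# N(s) ~ s^(-d)  =>  d = log(N(1)/N(2)) / log(2)
n1, n2 = boxes_hit(GLYPH, 1), boxes_hit(GLYPH, 2)
dim = math.log(n1 / n2) / math.log(2)
print(dim)  # 1.0 here: this toy glyph behaves like a pure 1-D stroke
```

A heavier, more space-filling glyph would push the estimate toward 2, which is exactly the "complexity rather than weight" distinction the comment is after.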


Another approach might be to take some large, well-translated text and measure the compressed sizes of it in different languages, with a good enough compression algorithm that takes into account character-set size differences as well as common word lengths, etc.
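A rough Python sketch of that approach, using zlib on the first sentence of the UDHR (suggested elsewhere in the thread as a comparison text) in English and German. With snippets this short the compressor's overhead dominates, so a real comparison would need long, well-matched parallel corpora:

```python
import zlib

# Parallel texts: UDHR Article 1, first sentence.
samples = {
    "english": "All human beings are born free and equal in dignity and rights.",
    "german": "Alle Menschen sind frei und gleich an Würde und Rechten geboren.",
}

for lang, text in samples.items():
    raw = text.encode("utf-8")
    packed = zlib.compress(raw, 9)
    print(f"{lang}: {len(raw)} bytes raw, {len(packed)} bytes compressed")
```

Measuring UTF-8 bytes rather than characters also sidesteps the question of how to count multi-byte scripts fairly.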


This is a metric that is relatively simple to calculate but probably not a good approximation. If I were to do that, though, I would choose the effective bit length of a symbol in the language, i.e. log_2(number of symbols in the writing system).
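That per-symbol figure is trivial to compute. The inventory sizes below are rough assumptions on my part (e.g. "common hanzi" is a ballpark, and hangul's 11,172 theoretical syllable blocks overstate everyday use):

```python
import math

# Effective bits per symbol: log2 of the writing system's symbol inventory.
inventories = {
    "Latin alphabet (26 letters)": 26,
    "common hanzi (ballpark)": 3000,
    "hangul syllable blocks": 11172,
}

for name, n in inventories.items():
    print(f"{name}: {math.log2(n):.2f} bits/symbol")
```

Of course, a large symbol inventory also means fewer symbols per message, which is the other half of the efficiency question.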


A lot of valid criticism here about the measure of complexity, but considering the author’s initial motivation, I think the approach makes sense: “The magazine had a little travel article, and it was written in English on one page and Thai on the other. The Thai version was so much shorter that I started to wonder if it was more efficient.”

If the goal is to identify which language would have produced the most subjectively visual “short article” per the description above, the approach sounds ok. For example, concerns with serifs, variations in character sizes, the information content of characters, etc become moot because they contribute to visual clutter and typography, which the author would want to count against the language.


> The Thai version was so much shorter

I don't think that Thai compresses much. Here is the comment above that I replied to, Google-translated into Thai (it will be somewhat incorrect, but a reasonable guess as to length; I would guess English with spaces removed would be a similar length):

การวิพากษ์วิจารณ์ที่ถูกต้องมากมายที่นี่เกี่ยวกับการวัดความซับซ้อน แต่เมื่อพิจารณาถึงแรงจูงใจเริ่มต้นของผู้เขียนฉันคิดว่าวิธีการนี้สมเหตุสมผล: "นิตยสารมีบทความการเดินทางเล็กน้อยและเขียนเป็นภาษาอังกฤษในหน้าหนึ่งและภาษาไทยในอีกหน้าหนึ่ง เวอร์ชั่นภาษาไทยสั้นกว่ามากจนเริ่มสงสัยว่ามันมีประสิทธิภาพมากกว่าหรือไม่"

หากเป้าหมายคือการระบุว่าภาษาใดที่จะผลิต "บทความสั้น" ที่มองเห็นได้อัตนัยมากที่สุดตามคําอธิบายข้างต้นวิธีการฟังดูโอเค ตัวอย่างเช่นความกังวลเกี่ยวกับ serifs การเปลี่ยนแปลงของขนาดตัวละครเนื้อหาข้อมูลของตัวละคร ฯลฯ กลายเป็น moot เพราะพวกเขานําไปสู่ความยุ่งเหยิงทางสายตาและการพิมพ์ซึ่งผู้เขียนต้องการนับกับภาษา


Thai script "compresses" quite well compared to English because it does not use spaces between words, the vowel "a" is implicit, and many (but not all) other vowels are diacritics, not full letters. However, tone marking is inefficient (diacritics and silent letters), there are many duplicate letters for consonants, and the writing system and spoken language have diverged quite a bit.


> This was a surprising one to me. Simplified Chinese was expectedly more efficient than Traditional Chinese, but both were beaten out by Cantonese (which also uses traditional characters).

My first reaction to this was: "What? Cantonese?" I just knew something was wrong.

So I checked the data and immediately saw the issue. `yue_Hant_HK` ("Cantonese"), `zh_Hant_HK` (Chinese, Hong Kong) and `zh_Hant_MO` (Chinese, Macau) all use the same text, while `zh_Hant` (Traditional Chinese) and `zh_Hant_TW` (Chinese, Taiwan) use a different one.

As it turns out, there is no Cantonese here, just plain written Chinese (書面語, "written language" as we would call it). Both Hong Kong and Taiwan use the Traditional Chinese script, but the two don't use exactly the same vocabulary due to regional differences. For example, the term for "privacy" is "私隱" in Hong Kong, but "隱私" in Taiwan. This is why there exist two versions of the Traditional Chinese translation of the same text. Since they were likely done by different people (assuming they are not machine translations), they have different translation styles, which contributes to the difference in length between the two paragraphs.

(Also, Cantonese can be written in Simplified Chinese, but that's all I will say regarding this topic.)


The writing of Cantonese is a complicated affair.

The only analogy I can cook up is this: imagine if, formally, everyone wrote English in German, but when you speak you speak English, and when you read English you'd see German words and sentence structures, but you would preprocess it into English before understanding it. And with different levels of formality, would You with underschiedenly Germandegrees speak English. But in no case would you speak German.


> Minä rakastan kahvia

This is like saying "I feel intimate with coffee". Finns wouldn't say that. Rakastaa is translated as "love", but it's the kind of love reserved for immediate family or lovers, not just things you like a lot, and kahvi is in the wrong case, like having the wrong preposition.

Minä pidän paljon kahvista is a more normal way to say "I like coffee a lot". The English "to love" doesn't really have a good translation in Finnish.

And, since it's about efficiency, you can lop off the pronoun, since the verb conjugation implies it:

Pidän paljon kahvista

And, just a bit of soft context, the Finnish way of speaking is often terse, so if a Finn is saying they like something, it often had the same degree as when an American says they love something, otherwise they wouldn't bother mentioning it. So, you can get rid of the "much", too:

Pidän kahvista


The method is of course bizarrely disconnected from the goal, but the author does specifically say he doesn't care, when someone kindly pointed that out.

One is left musing about the overarching purpose of such a document, given that.

If one were to continue down this road, instead of wisely and immediately turning tail and heading towards semioticians and other people who know what they're talking about: try compressing each font glyph instead. The big ones? Have a lot of 'stuff' in them.

This isn't uh. Robust, you feel me, against things like yeah. Serifs. For example but, it ain't nothing.

Counting the black, on the other hand, I must concur this matters more to the person in the department who orders toner, than to the information theoretician of graphology.


I think the measure here is "information density on a physical surface". For example, someone commented on a 4x3 font for English: https://news.ycombinator.com/item?id=30046035. This allows you to put way more information on a pixel surface, but at the cost (at least to my eyes) of time spent reading, as this font is harder (at least to me) to read. I would compare it to a more efficient compression algorithm that takes more time to encode/decode.

This could be used when storing information on a physical media, like a book or a microfilm.


1x8 pixel font: https://dotsies.org/


I like it, but I would describe it as an encoding more than a font. I also think it's 1x5, not 1x8. In that case, it would be more than twice as efficient as the 4x3. In a way, it's a restricted ascii subset.


Wasn't there recently an article about bits of information per second, finding that all languages deliver information at more or less the same rate? (Tone differences, stress, and other auditory cues also carry information.)

https://www.science.org/content/article/human-speech-may-hav...

Given that Chinese writing is used for communication by various peoples, each with their own language, the main component in learning Chinese is not learning the language (how to pronounce words) but what the characters mean, universally across all those languages; even English or Spanish speakers could communicate in written Chinese if they knew the meanings of the characters. This is a big advantage when various peoples have to live together in one country, but the main disadvantage compared to alphabetic writing is that it is much harder to mass-educate people to learn thousands of graphical characters and their combinations than to learn 20-40 alphabetic letters that (more or less) correspond to sounds.

IMO, English is the worst alphabetic language, and it stands out because it has not modernized its alphabet (like the rest of Europe did 100-200 years ago); most people learn English not from what they can read (it doesn't help that the 20+ British dialects can pronounce words differently) but by binding the written forms to memory, almost the same way the Chinese do. From my experience, there is a staggering amount of written illiteracy among native British people that I have never seen among other Europeans. Because of this, my English has become worse than it used to be (some of the mistakes cannot be blamed on the keyboard), and I don't care about errors anymore, because I have adapted to the locals.

PS: I think that knowing different languages is like being a different human. There are some languages that make you act stressed and talk fast, and then there are some languages that are slower, where you can think before saying something. There is also a difference in how jokes are presented.


I find there’s another dimension to communication efficiency which is nothing to do with cramming information into bits, but rather is about working with a close-knit group.

What does it mean to be a close-knit group? When you spend a lot of time working, training or living together, you tune into each other’s frequency. You know what they are doing and how they are doing it. You each build a mental model of the others. You can focus contact time on the out-of-model communication that really matters. You can start to make accurate assumptions about what they are thinking. (This is infinite communication efficiency: information transmitted via zero bits!).

It’s like a kind of compression where your experiences together build a pre-shared dictionary - which of course is exactly what a written language is; this is a further optimization through customization.

I’ve seen this kind of “co-experience-bond” communication happen in small teams in the workplace, and I’ve heard third-hand about it emerging in military squads.

And of course my wife and I can communicate across a room with a look and an eyebrow.


I don't think the analysis is good here. If we want the most efficient written language (a written representation of some information), that could be emoji. You can't beat this: I ♥ . Three characters, and it can be translated (by the reader) into almost any language without knowing the target language's letters or grammar.

[edit]: looks like HN doesn't allow U+2615 (hot beverage) character.


> You can't beat this: I ♥ .

It even works with very young kids.

I've "chatted" with my god daughter in WhatsApp using emojis way before she learnt how to read and write!


Emojis are pretty complicated to either draw or type. While drawing a heart is imo easier than writing the word love, typing it isn't.


> Three characters and can be translated (by reader) to almost any language without knowing target language letters and grammar.

You have to understand what a heart symbol is, and know that "I" is an english pronoun and how it's used. According to google translate, "I love" in Chinese is 我愛. That's only two characters. Emoji is very limited in what it can express, and unusable if you want to write with anything else than a keyboard.


I've often felt that English is a prime candidate for spelling reform. Being the modern lingua franca, it makes sense to remove the vestigial remnants of other languages that add unneeded complexity. For some reason English majors seem to be dead set against it, as they love etymology.


This is interesting. Most efficient? I think Spanish ought to be higher on the ranking.

- Spanish writing/reading is WYSIWYG. (English isn't: we write "walk" but read it as "wôk".)

- Links to other Romance languages. Communication with Portuguese, Italian, French, etc., albeit difficult, is not impossible.

- Learning other Romance languages. This one is interesting. Going from Spanish to Portuguese, Spanish to Italian, etc. is quite interesting, since you find the connections between them really funny and insightful.

Caveat: the same could be said for the other Romance languages, but they aren't WYSIWYG, e.g. Portuguese "vermelho" has the "lh" combo, which adds a soft i between "l" and "h"; French "cuisine" drops the final "e" when pronounced; etc.


I am not sure whether I agree with the analysis; it may take me more time to process whether the pixel metric really means efficiency.

However, one interesting point is that the author concludes that Hebrew and Gujarati are the most efficient languages. The cultures to which these languages belong are both stereotyped as very efficient and pragmatic businesspeople. So maybe they have a general culture that is reflected in their day-to-day activities, even writing.

Arabic [which is third on the list] also belongs to a people traditionally associated with trade, before they found all the black gold.


> Out of all the French dialects included in the dataset, Canadian was the only one with different wording. I'd be curious to hear from someone who speaks French about whether the Canadian version has words that are actually not used elsewhere or if it's just a matter of choice of words.

As a fr_CA speaker, I think it's a bit from column A and a bit from column B. Though regional differences exist everywhere, so I can't be absolutely certain. Proximity to US and UK English probably affected Canadian French differently. Levels of language (registers) also affect the variety and frequency of anglicisms.


Comparing the fr to the fr_CA texts though, the Canadian one is wordier for no good reason. E.g., "nous nous efforçons de les protéger, tout en vous permettant d'en garder le contrôle." vs "nous faisons tout notre possible pour protéger vos renseignements et pour vous permettre de les gérer."


I think font selection is important to make it a fair comparison.

For each language, the minimal resolution "pixel font" that still works should be used. (E.g. Chinese hanzi are going to require more resolution than the latin alphabet.) The fonts used in Minecraft might be a good start.

However, once you do that, it probably doesn't make sense to only count the "on" pixels. Because in a minimal font, "off" pixels will probably contribute as much information as "on" pixels. So you should just count the total number of pixels required to render the piece of text.
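A toy illustration of the difference between the two counting rules, using a hypothetical 3x5 bitmap font with just two glyphs:

```python
# Hypothetical 3x5 bitmap glyphs; '1' marks an "on" pixel.
FONT_3X5 = {
    "i": ["010", "000", "010", "010", "010"],
    "w": ["101", "101", "101", "111", "101"],
}

def on_pixels(word: str) -> int:
    """Ink-based metric: count only lit pixels."""
    return sum(row.count("1") for ch in word for row in FONT_3X5[ch])

def total_pixels(word: str) -> int:
    """Area-based metric: each glyph occupies its full 3x5 cell,
    whether its pixels are on or off."""
    return 3 * 5 * len(word)

# The ink metric calls 'i' nearly 3x cheaper than 'w';
# the area metric treats them identically.
print(on_pixels("i"), on_pixels("w"))        # 4 11
print(total_pixels("i"), total_pixels("w"))  # 15 15
```

With the area metric, a language's score reduces to characters-rendered times cell size, which matches the intuition that in a truly minimal font every pixel position carries information.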


1x8 pixel font: https://dotsies.org/


We have a different definition of "still works" :-)


This is sort of like saying “which metabolic process is most efficient?” across organisms. If you’re sufficiently reductive you can answer it, but only by stripping the emergent phenomenon out of the evolutionary context in which it emerged, i.e. by choosing an arbitrary definition of “efficiency”.

If you ask instead which process or language is “most fit” for its context, the answer is probably “they’re all pretty fit, except the ones we tried to design in a lab.” Seems like it ends up being more interesting to ask why they ended up being different and how it made them more fit.


This paper measures the efficiency as Shannon information calculated on a Google corpus correlated against word length: https://www.pnas.org/content/108/9/3526

This paper shows that speaker population size correlates at close to 0.9 with efficiency: https://royalsocietypublishing.org/doi/10.1098/rstb.2015.019...


Clearly this isn't intended as a serious way of "ranking" languages, but I wish there was a way to measure the "efficiency" of languages from the same geographic setup (and, preferably, family) but in different time periods, e.g. Akkadian vs. Babylonian vs. early Hebrew vs. early Arabic.

One would expect the efficiency to improve over time, but maybe this is not true.

Is there anything in this topic that one can read?


I’m also curious how modern languages compare to older variants. Even modern English to Shakespeare stuff.

I think to measure efficiency of a written language, you’d have to time how long it takes people to read it (and correctly answer some questions to prove they understood it).


"Spanish from Spain ranking at 14 and Latin American Spanish landing at 32"

Interesting result.

I tried to learn Spanish a few years ago, and read that South American Spanish speakers were unhappy that people from Spain would say their Spanish is the true and better one, and that Spain decides what is Spanish and what isn't.

If that experiment holds true, their decisions at least led to a more efficient Spanish, haha.


The article continues: “The vast majority of differences are purely arbitrary word selections, though.”

He’s not measuring a property of languages, but of particular translations of the privacy policy.

By the way, sometimes North Americans encounter attitudes that UK English is in some way more authentic or better than their variety. That’s silly, too, and ahistorical. The difference with Spanish is that, like the French, they have an Academy that hands down official language judgments, and these are taken seriously by publishers, etc. The Academy is based in Spain, so they have a way to sort-of enforce their linguistic dominance. English is more anarchic.


It's not just the absence of an academy resisting linguistic change, it's that the people who brought English to North America were English, so the dialects that have evolved in North America have been evolving from Old English just as long as the dialects in England. And moreover, languages don't change at a uniform rate. Until recently the center of linguistic change in the anglosphere was London, so British English, and particularly London English, was more innovative than North American dialects. I expect things have gotten more complicated since there are populous power centers outside of London now.


Also interesting is how the phonetics of some languages & dialects can result in faster talking.

I heard that Tamil, having far fewer consonants and lacking the ones that need more stress (I forget the term for this), is much faster to speak. Although Malayalam sounds faster to me, probably due to the accent with which it is spoken.


I guessed Hebrew would be among the densest ones.

In my experience, using ktiv haser would make Hebrew win hands down. It's impressive to see how much bigger the English translation is than the original text in translated Bible books.


I’m a little bit confused at the assertion that “cart” and “cat” are phonetically similar enough to need an extra letter. They’re completely different sounds to my admittedly en-GB ears.


I suggest you read it again. That is not the assertion at all:

> Consider the r in cart for instance. Without that r the word would clash with an existing word - cat, so the letter is significant in establishing meaning. The u in color is not, however.

The point is that what we call a "cart" and a "cat" are different things, and that the additional letter "r" helps us differentiate these two things, because we get two different words (rather than, say, calling both things "cat" or "cart"). By contrast, there is nothing called "color" in British English, so dropping the "u" from "colour" creates no ambiguity in meaning.


While interesting, and though the methodology has drawn valid criticism, I wish there were a figure plotting the results clearly. Such qualitative reporting made the whole thing feel more opaque.


The Slovak language has a few single-letter words:

a - and

i - and also (archaic)

k - to, towards

o - about, at

s - with

u - near, next to

v - in, inside

z - from


I started learning Russian when the pandemic started, and I found it to be more concise than the other languages I'm familiar with.

Many phrases omit the verb, the case system precludes the need for connecting words, and also many little words are two or more words in other languages (туда - to there, сюда - to here, etc).


Slavic languages are quite simple to pick up because their writing and reading systems are almost the same: you can predict the sounds from consistent rules of pronunciation. Especially the Slavic languages that underwent modernization and simplification in the late Middle Ages when adopting the Latin script.

If they got rid of the complex morphology à la English, they could become the ultimate efficient languages.

https://en.m.wikipedia.org/wiki/Orthographic_depth


Congratulations, you just made the distinction between synthetic and analytical languages.


This is more to the general readership.

Well, no, he's discovered the difference between non-logographic orthographies that map spoken sound to graphemes at a close to one-to-one ratio (like Spanish or German, where 'a' usually means /a/) and those that don't (like English or Irish), which assign one grapheme to a multitude of speech sounds. This is old news to most of you, but consider 'g' or 'sh' (which represents one phoneme): 'g' can alternately represent the sounds in Geronimo, good, through, gnat, tongue, and probably others I'm forgetting (ng). Plenty of other graphemes follow suit.

Analytic and synthetic languages alter meaning through predominantly different morphosyntactic mechanisms (and then meaning and pronunciation follow). Synthetic languages are like Sanskrit or Turkish: many changes in meaning come from altering the word with a suffix or the like, or through phonemic alterations like vowel harmony (we still have a little of both in English, perhaps). Analytic languages like Chinese, English, or French rely on altering the word order to accomplish mostly the same thing. Again, it's a spectrum, and English has its fair share of analytic features. Synthetic languages might be a bit easier to learn, but any argument for the superiority of a single synthetic language has to account for a plethora of typological features, like pitch, morphosyntactic alignment, pronunciation, pragmatics, elisions, clitics, particles, and so forth, that must be learnt. Any argument for the superiority of synthetic languages as a whole runs into trouble at least at the point where languages seem to alternate between the two extremes.

To go off topic, because it's Sunday and I'm bored: there is no strong deductive proof among linguists that words exist universally. I mean that many languages, especially the lesser-contacted ones, and especially those in North America, feature one l o n g word or two that convey the same meaning as ten words in English. To a speaker of Mohawk the category 'word' has to have little use. As does syntax (but not morphology!). This leads me to wonder how much the word is a written convention, or one limited geographically. We once assumed that there were at most three genders; since then we've discovered languages with more than seven, and with zero, genders (noun classes). Likewise, other languages have different parts of speech. Korean features a prominent topic marker and a class of adjectives that occupy the verb's position in the sentence and function like a predicate. They need those words to make sense of communication; English speakers really don't. Is the word, perhaps, also a concept that some groups have need of and others do not?

On to the main topic, and particularly addressing the OP: be careful not to confuse languages with the script(s) they're written in. The two do not correlate beyond giving an archaeologist the ability to tell a logographic language from an alphabet. Your project is cool when looking at various scripts from around the world. Secondly, be careful not to claim, even in passing, that rapidity/efficiency is superior. The Japanese nobility used to take eight seconds before beginning or continuing a conversation, to allow for contemplation. The Ents had a similar convention. There are benefits to the slow and inefficient; clarity in speech is only one example.

Anyhow, an afternoon well spent.


With the exception of "z - from", all of these are true in Bulgarian. It's just that we use "ot" which means "from".

I think I am going to like central Europe.


"od" also means "from" but that's 2 characters.


Glad to see Cantonese tops the chart. Maybe not that surprising after all.


Efficiency comes from mastery.



