Language models cost much more in some languages than others (yenniejun.com)
260 points by yenniejun111 on May 18, 2023 | 171 comments



Interesting article. I had no idea of this phenomenon. It's very well explained too.

Tangential remark: substack is becoming as annoying as medium now. Especially on mobile. One big popup asking to register. One constant toolbar asking to register. One constant toolbar asking to install the app. Many interruptions in the main text for subscribe and share reminders.

That didn't take long for it to go bad :( I only heard about substack about a year ago when Snowden's blog was in the news and they were still saying they'd keep it clean (like medium promised initially as well). And it was pretty clean then.

I was even thinking of putting my own blog there (which is free and unmonetized) but no.

On medium it's become so bad now that I don't even open their links anymore unless it's really something I am so curious about I'm willing to put up with the experience. I really hope substack doesn't go the same way.

Sure they have to make money but alienating your userbase doesn't seem a great way to do so in the long term.


It's not a "Tangential remark" if your tangent is 10 times longer than your commentary on the article!

> Please don't complain about tangential annoyances—e.g. article or website formats, name collisions, or back-button breakage. They're too common to be interesting.


> Tangential remark: substack is becoming as annoying as medium now. [...] On medium it's become so bad now that I don't even open their links anymore.

FWIW I find that Reader Mode works fine for making posts on substack and medium interruption-free, both on desktop (Firefox) and mobile (Safari).

I realise it's possible that either of them might try some Reader Mode defeating hijinks in the future, so you still might want to avoid putting your blog there if you don't want that possibility looming beyond the horizon. But when it comes to reading existing stuff that other people have written, an accessible version should be just one or two clicks away.


I've definitely been on Medium pages where Reader Mode only shows one paragraph of the article... it doesn't happen all the time though. They may be intentionally trying to defeat Reader Mode, or it may have just been an accident, I'm not sure.


I have found that for many websites, putting it into Reader Mode and then refreshing the page gives a better result.


Not sure if it works 100% of the time but if you can't get it to display in Reader mode or find some other way, sometimes refreshing and hitting the "x" in the address bar at the right moment will allow the content to populate before you cancel the rendering of the paywall/signup crap.


Given the recent public disclosures of their finances through 2021, I would expect the experience to continue to deteriorate.

https://www.theverge.com/2023/4/7/23674178/substack-burn-202...


Yeah, I think they had the right idea at the start: a simple, minimal, low-burn platform. Once they took VC money the only plausible fate was Doctorow's "enshittification". They had no moat and low network effects. If they'd followed the frugal path they could have held on to the user focus. But once you borrow money from Fat Tony, your primary job is making sure Fat Tony gets paid.


I guess BeeHiv took their lunch? Also, aren't the network effects the notes app and referrals to other newsletters?


Jesus, how can a blog platform burn so much money? Where does it go? Wages? What do those people do?


As you can see, "Partnership expenses" increased almost 20 fold. They paid people to write on substack.


That's just the negative revenue side. They also spent another $16 million beyond that, primarily on salaries ($10mm).


Is this really necessary? Competing on merit no longer works?

I see platforms burning money left and right and can't really understand how this is "the new norm". Especially with something simple like a blog platform.


> Is this really necessary? Competing on merit no longer works?

No longer?

Not trying to say no one has ever competed on merit alone, but I don’t think there exists a bygone era where people didn’t use $$$ to gain leverage in a market.


Yeah sure, with all the advertising madness going on, merit is something that will get you a polite smile at most.

But why even bother with something like a blog platform? You're not even innovative (looks like substack is simpler than medium so... cutting back on features?), so why pour all that money into it? Just to have yet another shitty platform and take investors' money while trying to grow it?


Disable JS and refresh for no crap viewing.


So, just like every other site?


> This led me down a rabbit hole of trying to find a font that could render all of the language scripts. I went on Google Fonts to find this perfect font and found that one did not exist.

This seems like a rather odd example of "technological disparity"; it's just modular and "one huge font file" isn't in wide use because it's unwieldy.

Install/add an additional Chinese font, or Tamil font, etc. as needed. Most use cases don't need "all the scripts" and a modular approach is much better as fonts are large: NotoSans-Regular is 590K; NotoSansCJK-Regular is 26M, etc. In total Noto Fonts is 373M on my system.

And it contains hundreds of thousands of different glyphs, and requires expertise on dozens of writing systems. Creating such a font is a significant effort, which is why few "universal" fonts exist, and why there are many "for this script" fonts.

---

I wonder how one could design a better Morse code for Chinese; Japanese Morse code used Kana, rather than Kanji, and Hangul can be composed from smaller blocks. As near as I can find, Chinese is kind of an outlier here. Any system I can think of would probably be equally difficult to use or error-prone (due to either operator error or line noise mangling things).


> Hangul can be composed from smaller blocks

Not “can be”, “is”. Hangul was designed; its alphabetic and morpho-syllabic features were intentional.


What is even the point of pedantic "well, duh" comments like this?

"English {can be,is} written using the Latin character set"

"The symbol 'e' {can be,is} used as a vowel"

"Italian {can be,is} spoken in Rome"

"Python {can be,is} used to program a computer"

"A car {can be,is} used to drive on the road"


Because not everyone knows Hangul and could learn something wrong.


But it's not even strictly correct from a technical point of view. In Unicode, Hangul syllables can either be encoded directly as a single codepoint, or composed from multiple codepoints each representing a single jamo.

One could conceivably devise a Morse code encoding that simply uses the Unicode codepoints of the syllables. It would be a bad way to do it, but it's an option. Composition isn't a hard requirement for Hangul encodings.
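For the curious, both encodings are easy to see side by side. A minimal sketch using only Python's standard library `unicodedata` module:

    import unicodedata

    syllable = "한"  # single precomposed codepoint U+D55C

    # NFD splits the syllable into its constituent jamo codepoints
    jamo = unicodedata.normalize("NFD", syllable)

    print([hex(ord(c)) for c in syllable])  # ['0xd55c']
    print([hex(ord(c)) for c in jamo])      # ['0x1112', '0x1161', '0x11ab']

    # NFC recomposes the jamo back into the single-codepoint form
    assert unicodedata.normalize("NFC", jamo) == syllable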


You could use pinyin or bopomofo as a base.

Written Chinese can be thought of as a syllable alphabet with 100s of ways to write each syllable. For a fluent reader it is easier to read with those contextual hints, but strictly speaking it is not necessary.

Spoken Chinese works just fine without them.

Morse code usually has its own vernacular so it is easy to get around the lack of characters.


> You could use pinyin or bopomofo as a base.

Ah yeah, that would be the obvious solution today, but neither were invented yet in the 1880s – I was kind of thinking from that viewpoint: "how could they have done better in 1880?", rather than "how could we do better today?"

The best I could think of is a UTF-8-like scheme, where a prefix selects the number of digits and fewer digits represent more common characters. I'm not entirely sure how well that would work in practice as I operated a telegraph exactly once in my life (as a child). I know figure/letter/Cyrillic shifts in telegraphs could cause some problems if the shift got lost or was garbled, but I suppose prefixing every character solves this – but then again, why didn't they do this for Baudot code/ITA?


I hadn’t considered the late 1800s as a context, but I’m gonna stick to my guns and say Wade-Giles.

Wade's incarnation goes back to the 1860s.

Since you don’t need the alphabet as an intermediate, you can encode each initial and final from the table[1] as an individual character to balance the number of encodings per symbol with the number of symbols per word.

Since Morse typically encodes simple messages with its own vernacular I think you can safely drop encoding the tones.

[1] https://en.m.wikipedia.org/wiki/Wade–Giles_table


> I wonder how one could design a better Morse code for Chinese

In terms of efficiency, I guess you might start by ranking characters by frequency, and build a Huffman code? Then think about adding parity bits or sync symbols or whatnot.

It's hard to imagine people learning it, rather than painstakingly looking up each character in a book, but I suppose it'd be similar to learning any other numeric mapping. Wikipedia says "Chinese expert telegraphers used to remember several thousands of codes of the most frequent use".
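For what it's worth, here's a toy sketch of that idea in Python. The character frequencies are made up for illustration; a real code would rank characters from a large corpus:

    import heapq

    # Made-up frequencies for a handful of common characters; a real code
    # would be derived from a large corpus ranked by character frequency.
    freqs = {"的": 4.1, "一": 1.6, "是": 1.3, "不": 0.9, "了": 0.8, "人": 0.7}

    # Build a Huffman code: repeatedly merge the two least frequent groups,
    # prepending a dot to one branch and a dash to the other.
    heap = [(f, i, ch) for i, (ch, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    codes = {ch: "" for ch in freqs}
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, group1 = heapq.heappop(heap)
        f2, _, group2 = heapq.heappop(heap)
        for ch in group1:
            codes[ch] = "." + codes[ch]
        for ch in group2:
            codes[ch] = "-" + codes[ch]
        heapq.heappush(heap, (f1 + f2, tiebreak, group1 + group2))
        tiebreak += 1

    # More frequent characters end up with shorter dot/dash sequences
    for ch, code in sorted(codes.items(), key=lambda kv: len(kv[1])):
        print(ch, code)

As you say, you'd still want parity or sync symbols on top, since a garbled dot or dash would otherwise corrupt everything that follows.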


Please forgive my extremely ignorant question, but could Chinese be written phonetically, maybe using the IPA or some existing/custom script?

If so, that should require much less memorization than memorizing thousands of Chinese characters, and a "morse code" could be made from the phonetic script.


The Korean language dropped Kanji altogether postwar, and one funny anecdote that circulated on the bad parts of the Internet in the 2000s was that the cause of a minor train fire in South Korea was a requirement for fireproof wood being misinterpreted as incendiary wood, because the language had relied on Kanji to distinguish between the two. CJK languages can be used phonetically only, sort of as a backup mode, but not really.


If only they'd used english and required inflammable wood!


Many Latin-derived languages have that one problematic word. It's not only English.

The opposite of it would be "non-inflammable", so it's not really ambiguous. It's just anti-systemic and confusing. "Flammable" isn't supposed to be a word, even though people keep using it.


I think the ambiguity in inflammable lies in the in- prefix rather than in the second part. If someone told me that incombustible and inflammable used to point at the same property at one point, I’d believe that.


> "Flammable" isn't supposed to be a word, even though people keep using it.

On the contrary! "Flammable" has been used in English for centuries (https://www.etymonline.com/word/flammable) and at this point it's used intentionally to make it clear that we really do mean something can burn.


And "inflammable" has been around for so long that it has spread to a big handful of languages. One of those being English.

Flammable is the obvious construction you get from the Latin, but it's not exactly the word that spread into English. It seems to have been created there later.


> but not really

Languages will evolve to disambiguate those situations. Vietnamese and Korean already have done so. They have no practical issues communicating with their writing systems.

It's not impossible for Chinese to do the same. Native speakers who learned Chinese through Chinese characters will always find it the most natural, and would surely consider it a pity to lose characters, but if an entire generation is hypothetically brought up reading pinyin only, they will find it the most natural and find ways to disambiguate words on their own. Not saying I support such a shift happening in Chinese, just saying that it's not impossible for the human brain and society to cope with, if it were to happen, and it's not something we (the aging generation) have much say in.


Sure it can. In fact, it regularly is if you're entering stuff in on a phone. One of the major ways people enter text on phones is by typing stuff in using pinyin and selecting the appropriate character.


Absolutely not, since everything about Chinese is not based on pronunciation. Even the meaning of a word changes with the pitch. If you don't specify the pitch in the phonetic transcript, the level of uniqueness in the pronunciation of vocabulary is much lower than in other languages.

Some may say Japanese can express their language solely in pronunciation (Kana), but this is not true. There are easily 20+ meanings for the same pronunciation, and they are hard to distinguish even with context provided. Just open any Japanese dictionary and you will see that it is a completely broken language without the expressive power of Kanji (or Chinese characters).


For a simple example in practice, many Japanese names can be written multiple ways based on the kanji used, ending up with different meanings and different implications, while being pronounced exactly the same.

For example, the common name Michiko can be written in many different ways with many different meanings, such as:

    美智子 — "beautiful wise child"
    美千子 — "child of a thousand beauties"
    見知子 — "child of recognition"
    道子 — "child of the way"
    路子 — "child of the road"
    倫子 — "child of morals"
    皆子 — "child of all"
    通子 — "child of passage"


You should look at something like "sousou" or "koukou"; the range of meanings spreads from noun to verb to adjective to everything.


This is not that unusual for languages. For example,

> range

Can mean a wide open area, a set of things, a stovetop, a measure, to measure, to vary between extremes, to line up, or to pick an opposite side.


No, "range" do not share the same prounciation with other word.

https://www.homophone.com/search?type=begin&q=range

I think what you misinterpreted is that you think "sousou" and "koukou" are words in Japanese; no, they are sounds, the pronunciation.


I think they were just making the more general point that most languages are very context-sensitive.

“Dear” and “deer” if you want an explicit English homophone.


Please consider the amount. How many collisions can happen for one pronunciation in English? It is on average 2, maximum about 4. Chinese is like 6-10, Japanese is like 10-20.

If you don't do the comparison, of course there is no difference.


Are you saying Japanese speech isn't fully comprehensible without seeing the written characters?

If not, what are you saying?

I mean, I'll believe you that there are more homonyms in Japanese, sure, but it must still operate as a spoken language somehow.


Just to give more examples:

"Look at a doctor", "consult a doctor", "supervise a doctor" - all sound the same, have the same phonetic transcript in Japanese, and only differ in the Kanji/Chinese character. Either you extend the sentence and write more words, or show that you are sick and have a runny nose; otherwise it is hard to convey the idea clearly without Kanji/Chinese characters.

Often you will see people talking in Japanese say "erh!?" then stop, and both of them are confused. Half of the reason is that people are just guessing what their counterpart is trying to convey, but as the conversation goes on and more context accumulates, they find that they are not talking about the same thing.

The less context info you have, e.g. situation, place, visual clues, emotion, intonation, the harder it is to comprehend.


What is a "word"? Whether a sequence of sounds/symbols is one word with two unrelated meanings or two different words that are coincidentally represented the same way is quintessentially a semantic argument.


You could add tones as numbers, a la "yi1 wan3 shui3 jiao4, duo1 shao3 qian2?". But that's really clunky to read. It's not much extra effort to add tones as accent marks; that would work OK for spoken language, since you already have to make sure it's unambiguous. But it wouldn't work as well for texts primarily designed to be written, as there remain a lot of homophones.


And this creates a practical/political problem: people who speak a different dialect cannot recognize what this encoding is trying to express, even though they literally share the same writing system.


I mean, yeah. You give up one thing and gain another. For instance, it would be much easier for people speaking those other "dialects" to learn to speak and understand putonghua, if they were seeing a phonetic form of putonghua all around, rather than seeing characters which they read in their own words.

Alternately, we could all replace our spoken languages with Chinese characters. Then we could all read Chinese subtitles, and read each others' languages to some extent.

But I think you'd be hard pressed to find anyone not already using Chinese characters for their own language who would consider the benefits worth the cost. It seems to me (as an outsider who's been studying the language for a few years) a sort of linguistic "Stockholm syndrome".


How do they recognize their own language when they speak it?


The case GP is talking about is (say) having subtitles in Chinese characters (in Mandarin), which can be read by people who speak Hokkien or Cantonese; or having street signs in Hong Kong written in "Written Cantonese", which can be read by people from Beijing.

The answer to the question, "How does someone from Beijing understand someone speaking Cantonese?" is generally, "They don't."


If someone from Beijing doesn't understand someone speaking Cantonese, having a phonetic/phonemic writing system per dialect will not be any worse.


I think GP‘s point is that right now they can read the same text even if they can’t understand it spoken. Moving to a phonetic script would mean they can no longer read the same text, ergo worse


A phonetic/phonemic script would be worse for mutual intelligibility for those privileged enough to know written Chinese, which is notoriously difficult to learn.

But it might be better for literacy in general because it would be much easier to learn. And, though it is a comparably much more trivial issue, better for creating something like a morse code.

That said, I know there are many other issues apart from simple ease of learning that figure into what language people learn or want to use. Written Chinese has a rich history and culture, and it's tied to people's identity, so of course jettisoning it for something arguably more practical, with the added downside of loss of mutual intelligibility among those who know written Chinese, would likely result in strong opposition.


If you think the counter-arguments are coming from historical and cultural concerns, then you are making a wrong assumption here, at least in this branch. You are literally the first one to mention that argument.


Think of Chinese as one script but multiple languages. The spoken languages have about as much diversity as European languages, but they all share a mapping to characters which are meaning based, not pronunciation based.

So while you can use Mandarin pinyin (though there are a lot of homophones) to serialize to a pronunciation, it wouldn't be understandable by a Cantonese speaker.


Hmm... so would written Chinese be one language, and all the spoken languages different languages? They just developed a script-based lingua franca.


"Language" vs "Dialect" is a matter of degree. If you have a tiny bit of exposure to, say, German or Icelandic, you can actually catch a fair amount of what's said in those languages given the context. The same is true for Mandarin and Cantonese. A lot of the words really are just different pronunciations of the same word; 我 (me) is wo3 in Mandarin, and ngo5 in Cantonese; 你 (you) is ni3 in Mandarin and nei2 in Cantonese. A lot of the grammar concepts are the same; you don't conjugate verbs, you instead have "markers" that indicate things like "completed action" or "experienced action"; measure words, and so on.

But a lot of the grammar and words are just different -- at least in the spoken language; and typically subtitles are all in Mandarin.

So my son watches Peppa Pig in Cantonese; I can read most of the subtitles in Mandarin. Peppa will say, "Da4 di, nei2 tai2 ha2!" (Look, Daddy!); but the subtitles will say 爸爸,你看一下! (ba1 ba! ni3 kan4 yi2 xia4!) Note that only one of those four words is the same (你/ni3/nei2). "da di" (Cantonese) has been replaced with "baba" (Mandarin); "tai2" is replaced with "kan4" (a different character); and "ha2" has been replaced with "yi2 xia4" (an extra word).

"It's all Chinese" is a sort of fiction; and from my outsider's perspective, a fiction which heavily favors Beijing and the Han majority at the expense of the various minority groups. Written Mandarin is pretty close to spoken Mandarin; "official" written Cantonese is very different than spoken Cantonese -- to the extent that the verb "to be" is a completely separate word.


> Absolutely no since everything about Chinese is not based on pronunciation.

This conflates so much. Hanzi (mostly) isn't pronunciation-based[1], but Hanzi is simply one, and a secondary one at that, representation of the language. The primary representation of the language is the spoken representation, and unless Mandarin speakers spend their time in silence, it's very much pronunciation based.

[1] But it makes plenty of use of rebuses, where you have a character made up of two characters mashed together into a single character, one part indicating the kind of thing (such as a tree), and another giving a word it sounds like.

> Even the meaning of a word change by the pitch.

No, these are distinct words that have different tone contours. We wouldn't call "bat" and "pat" the same word in English, even though the only difference is that they differ in voicing, nor would we call "stoop" and "stop" the same word when they differ only in vowel length. The same goes for languages with contrasts in aspiration, ingressive vs egressive consonants, and so on. Tone is no different.

> Just open any Japanese dictionary then you will see that it is a completely broken language if without the expressive power of Kanji (or Chinese characters).

Japanese speakers must have awful trouble speaking with one another in that case.


> No, these are distinct words that have different tone contours. "bat" and "pat" ... "stoop" and "stop" the same word

I don't want to repeat myself, but your examples are not the situation in the Chinese speaking system. When you write out the sound of different words, they are literally the same, e.g. "stop". The distinction is whether you pronounce it in C major or E or G. The same goes for many Southeast Asian languages, although I'm not an expert on those.

> Japanese speakers must have awful trouble speaking with one another in that case.

In fact a large chunk of jokes (either modern or traditional) from Japan is due to different vocabulary sharing the same pronunciation.

Even a fair proportion of detective stories' twists come from different interpretations of the same pronunciation. Like someone saying something on the phone that is misunderstood by the other party, or someone writing down a message before dying in Kana (pronunciation) instead of Kanji (meaning). The whole story is just playing with those linguistic hash collisions.


Yeah, I don't speak Chinese, but there are many European languages that use accents to change the sounds of vowels. In Portuguese, the difference between grandmother (avó) and grandfather (avô) is just in how you pronounce the o. I am extremely skeptical that there is any language humans speak for which it would be impossible to create a phonetic writing system. It may need features that the Latin alphabet lacks, but it should be possible.


no, this is not correct. you don't need perfect pitch to speak proper chinese, it's about the relative pitch contours. C vs E vs G has nothing to do with it. and native chinese speakers are not conceptualizing it as "pronunciation" + "pitch" as two separate things that exist alongside each other. the pitch is a fundamental part of the pronunciation, not external to it

i don't know japanese but it sounds like your example is just saying that japanese humor has a lot of puns


Vietnamese is a tone language, and it is written in a Latin script (with a number of accent marks). If it can be done with Vietnamese, it can be done with Chinese. For that matter, lots of tone languages are written in Latin scripts.

The problem with Chinese is not that it is a tone language (not absolute pitch, but relative pitch); the problem is that there are many regional dialects with different tone systems, and other pronunciation differences as well. Not that that's a real problem; English has lots of dialects with very different pronunciation systems, and it's written in a dialect-agnostic way.


> everything about Chinese is not based on pronunciation

Then how can Chinese speakers understand each other when they speak?


Historically, people from different regions would be able to read each other's writing (send letters to each other) even if their spoken dialects/languages were mutually incomprehensible. You get this in European history, too, just not to the same degree.


Usually from context. If a data stream could be interpreted as "correct horse battery staple" or "add salt to taste" and the situational context is a cooking instruction, the latter interpretation has to be correct.


It would be a pretty rough deal for people relying on text-to-speech if that were the case.


I’m just as ignorant if not more so! :)

> could Chinese be written phonetically?

A bit of Wikipedia research suggests possibly yes, this is basically what pinyin is.

Although others in this thread are saying you would potentially lose meaning that way. Surely that would only be the case for written text that’s ambiguous when spoken aloud? I can’t think of a situation where that would be a serious problem, except for intentional puns.


Just to start with, there are a significant number of characters that share the same pronunciation.


English also has plenty of homonyms, and English speakers can still understand each other when they speak.


I actually ran into this myself when I was testing my C program for UTF-8 support. Ended up just installing Noto CJK, but I wondered why there wasn’t a universal font I could use. Now I’m wondering why I can’t just fuse a font that has CJK chars with a font that has Latin ones.


There's a Unicode quirk/mistake called Han Unification whereby similar CJK characters, not always close enough, are assigned the same code points. As a result, each of the CJK languages (Chinese, Hong Kong/Taiwan Chinese, Korean, Japanese, Vietnamese, ...) has to be displayed in its own fonts.

It must have been much less of an issue when the UTF-8 spec was formalized in the early 90s, as memory was expensive and electronic communication between CJK cultures was minimal; computers just had to have the correct font for a single specific language, and others didn't even need to render, let alone correctly. Being able to display similar letters outside of a specific language must have been a bonus.

Today it's a much more pronounced annoyance as software is globally distributed, but until the Unicode Consortium figures out how to normalize this CJK situation, the font switching hack has to go on and there just can't be a single font that can render everything.


You totally can, tools such as fontforge even make this pretty straightforward.
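Something along these lines with FontForge's Python scripting should do it. A rough sketch only: the font file names here are placeholders, and merging across TTF/OTF flavours can need extra cleanup afterwards:

    # Run with FontForge's bundled interpreter, e.g.: fontforge -script merge.py
    import fontforge

    base = fontforge.open("NotoSans-Regular.ttf")    # Latin coverage
    base.mergeFonts("NotoSansCJKsc-Regular.otf")     # pull in the CJK glyphs
    base.fontname = "NotoSansMerged"
    base.generate("NotoSansMerged.ttf")

One caveat: an OpenType font tops out at 65,535 glyphs, which is part of why a single font covering every script is awkward in the first place.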


Missing glyphs would be a much smaller problem if people stopped creating/using text renderers that don’t support font fallback and the system font store. There are still a good number of GUI frameworks that assume one font, or even one Latin font, is enough. Fyne (golang) is an example; the advice, if you want to display CJK text, is to bundle a single font with all the glyphs you need…


Pretty sure Unifont will do what the author wants. Use it for coding or system-wide and you'll never get tofu (plus the crispness of bitmap fonts).


Languages are subject to Metcalfe's Law. In the long run - let's say a thousand years - it's likely that the majority of humans will speak a single language. And that language will most likely be derived from English.

The only thing that kept languages separate, historically, was isolation. The internet has fixed that. If you want to publish content to the widest audience, you publish in English. If you want to consume that knowledge, you'd better understand English. The network effect is powerful and English has a substantial lead.

Maybe it's the seed planted by British empire. Maybe it's the fact that English fits into 7-bit ascii. Maybe it's the fact that English is already a hodge-podge of germanic and romance languages. Maybe it has something to do with English readily adopting neologisms from other languages. Whatever the historical reasons, if right this second you put a random group of people from different non-English-speaking countries together, they're probably going to talk to each other in English.

So yeah, this problem - if you think it's a problem - is going to get worse over time. But there's nothing you can do about it in the long run.


1000 years ago, following this idea it would have been Arabic.

2000 years ago, it would have been Sanskrit.

Predicting something 1000 years into the future is tricky business.

It takes one calamity/war/etc to tilt the balance.

If Yellowstone erupts in 200 years, it's unlikely English would remain the dominant language. Maybe Mandarin, who knows.


A thousand years ago human populations were memetically isolated by geography. That changed somewhat with the printing press, but the internet puts us all in the same meme sphere.

There are only 330 million English speakers in the United States, of the 1.5-2 billion worldwide. Yellowstone could kill everyone in North America and English would still be the most common language on Earth.


The amount of new movies, literature, and similar cultural artifacts matter here, not the absolute number of people who understand a language right now.

Language comprehension stats can change in as little as one generation.


Things are different enough this time what with the technology boom and the internet and everything that I don't think those are entirely fair comparisons. That said, you're still totally correct that trying to predict the world a thousand years from now is pretty much pointless.


From what I know, nobody in China cares about English, and that's a nation like 2-3x bigger than the entire population of English-speaking countries :) Can't see a demise of Chinese even after 1,000 years. The same goes for other languages with a big enough user base (I'm not speaking about smallish European countries).


China is so large that it can afford to ignore other languages, it's also important enough in the global economy that it can get other people to learn Chinese in order to deal with them rather than doing things the other way around.

I think the point stands that the more time passes, the more interconnected the world is, the more English becomes beneficial. If the great Chinese firewall fell today, millions of Chinese citizens would start learning more and more English and using it more; odds are that the same wouldn't happen the other way around.


The important metric is not "how many people live in English-speaking countries". The important metric is "how many people speak English". There are several times more non-native English speakers than native English speakers.

Mandarin doesn't seem to be very popular beyond native speakers, and the Chinese population is shrinking. I wouldn't bet on it in a race for the next Lingua Franca. Though I can imagine a Firefly-like future with Mandarin words mixed in with English.


How in the world have I never heard of Metcalfe's Law related to social media?


This is no surprise, the tokenizer is constructed to minimize the encoded length of the training corpus, and most of that is in English or at least using the Latin alphabet.


This is probably not entirely invalidating the result, but the language samples in the dataset seem to be extremely badly translated from English, with unnatural, verbose and grammatically wrong sentences. That would not help with good tokenisation.

For example, the English text

> please add milk to the grocery list

is compared to the French text

> s'il vous plaît ajouter du lait à la liste d' épicerie

but a native would say

> veuillez ajouter du lait à la liste de courses
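For what it's worth, the effect of the phrasing is easy to check yourself. A sketch assuming the `tiktoken` package is installed, using the cl100k_base encoding (the one used by gpt-3.5/gpt-4; counts will differ for other tokenizers):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    samples = {
        "English":          "please add milk to the grocery list",
        "French (dataset)": "s'il vous plaît ajouter du lait à la liste d'épicerie",
        "French (natural)": "veuillez ajouter du lait à la liste de courses",
    }

    for label, text in samples.items():
        print(f"{label}: {len(enc.encode(text))} tokens")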


This is a really good point! I also noticed that some of the translations were not good or were very stilted for the languages I do speak. However, this is a limitation of a dataset of this size and breadth.


Fascinating concept

I wonder how humans decode these symbols, because it doesn’t seem to be 10x more “costly” for a person to natively learn one of those languages vs another

Also, supposedly, independent of the language, humans communicate at an approximately constant rate (39 bits/second according to this: https://www.science.org/content/article/human-speech-may-hav...)

Here’s the Twitter thread about it from the same author:

https://twitter.com/yenniejun/status/1653791622197579776?s=4...


Some languages require significantly more effort to think in. I believe this has a non-negligible effect on human productivity and advancement.

I suspect that the tokenizer is biased towards English, but I also suspect that a self optimizing or universally optimized tokenizer would see similar results.

I am multi-lingual and find that some concepts are easier to explore in different languages. I am a native English speaker, so I lack certain insights into the relative usefulness of English, but I know many non-native speakers of English who tell me that they prefer to think in English for many cognitive tasks because it is less tiresome or more efficient or more accurate, to use their subjective experiences.

I also wonder how the way that a language is learned (instruction vs immersion) might impact this? I suspect that the end result of these two learning styles is much more divergent than external appearance might suggest, as I find that I don’t, or only very rarely, solve problems in languages I have learned through instruction, while I have a completely separate persona available in languages/cultures I have learned through immersion.

Interpersonal relationships in one language are non-translatable to other languages, for example.

Certainly language choice impacts the density of textual knowledge, and English seems to have a density advantage here. Books translated from English tend to be simply bigger in order to convey the same information. I wonder if there are languages more dense textually than English and how they tokenize?

It is interesting to consider the possibility that language choice could offer a cognitive advantage, and what that might imply for the potential for intentional human language optimization.


"Some languages require significantly more effort to think in": No. A native speaker of one of those languages can think as effortlessly in them as you can in English.

"Interpersonal relationships in one language are non-translatable to other languages": I very much doubt that; the Bible (or at least the New Testament) has been translated into thousands of languages. Some concepts require more words in some languages than in others, but in the end it's possible. The real issue for things like Bible translation are plants and especially animals that don't exist in the other cultures, not interpersonal concepts.


> to express the same sentiment, some languages require up to 10 times more tokens

this does not make sense, does it? it may be a true metric as per the setup of the comparison, the existing models, the existing corpus etc. but logically it seems an artifact rather than something deep about language information density.

nevertheless it seems worth investigating. I would suspect that once various irrelevant biases are removed (a sort of ur-LLM) there will be an interesting comparative landscape.


One of the quoted blog posts [1] has a comparison of various tokenizers. It seems like the GPT-2/3 tokenizer performs especially badly, while the tokenizers of models with the explicit goal of modeling many languages (like Google's mT5 [2] or XLM-RoBERTa [3]) have a much smaller disparity. Maybe OpenAI trains their tokenizers on predominately English data, and treats other languages more like a nice bonus?

1: https://denyslinkov.medium.com/why-is-gpt-3-15-77x-more-expe...

2: https://huggingface.co/google/mt5-base

3: https://huggingface.co/xlm-roberta-base


It is absolutely true (I use these models on other languages regularly) and is also clearly explained elsewhere here as an artifact of the tokenizer optimizing for the English heavy training corpus.


Does the encoding have anything to do with it?

When talking about token length, I couldn't help but wonder if they were judging length by UTF-8 byte size, in which case languages using non-Latin alphabets (and even those that do use it, but with accents) would pay a penalty


Most LLMs are trained using subword-tokenization such as BPE (which is investigated in this post) and Sentencepiece.

These algorithms minimize the number of tokens required for representing the training corpus. I.e. for a training set mostly consisting of English this is a natural consequence.

In my opinion a more interesting question would be to ask if chatGPT performs better or worse on languages with "unique" sets of characters (like Burmese & Amharic?) compared to other European languages (like French & German) which might tokenize to shorter lengths but share subwords with English while having different meanings.

Also, OpenAI being an American company and training a model which is optimised for English seems very natural... Just query it in English for better and cheaper results. If it would be equally good at 200 different languages it would probably be bad at all of them instead.
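To make the "minimize the number of tokens for the training corpus" point concrete, here's a toy version of the BPE training loop. It's not the exact GPT procedure (real tokenizers work on bytes and far bigger corpora), just the core "merge the most frequent pair" step, with a tiny made-up word list:

    from collections import Counter

    def train_bpe(words, num_merges):
        """Greedy BPE: repeatedly merge the most frequent adjacent symbol pair."""
        vocab = {tuple(w): c for w, c in Counter(words).items()}  # word -> count
        merges = []
        for _ in range(num_merges):
            pairs = Counter()
            for symbols, count in vocab.items():
                for pair in zip(symbols, symbols[1:]):
                    pairs[pair] += count
            if not pairs:
                break
            best = max(pairs, key=pairs.get)
            merges.append(best)
            new_vocab = {}
            for symbols, count in vocab.items():
                out, i = [], 0
                while i < len(symbols):
                    if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                        out.append(symbols[i] + symbols[i + 1])
                        i += 2
                    else:
                        out.append(symbols[i])
                        i += 1
                new_vocab[tuple(out)] = new_vocab.get(tuple(out), 0) + count
            vocab = new_vocab
        return merges

    # On a mostly-English corpus the first merges are English letter pairs,
    # so English words end up as a few long tokens while everything else
    # stays fragmented into many short ones.
    print(train_bpe(["the", "the", "there", "then", "das", "dann"], 5))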


I’ve talked to folks whose native language is represented on the right side of the distribution, and GPT-4 performs poorly both in terms of speed and language facility for these languages. Interestingly, the right side tends to be largely Southeast and South Asian languages. Malay is an outlier in that it tokenizes fairly small. But Burmese, Khmer, and Thai perform poorly.


I’ve tried using ChatGPT on Tamil and it’s firstly hella slow, and secondly can’t do much with it beyond a few hundred words. I figured it’s considering each letter in the sentence as a token, but originality- and inventiveness-wise it wasn’t necessarily worse.


Curious to know if it’s more than just a representational problem. Are some languages harder in some deep way than others? And with what human consequences, let alone the LLM costs?


I don't think languages are significantly easier or harder; there's a limit on how hard it can get for the average person to be able to speak it comfortably. I think there are different tradeoffs taken, and if you come from a language with similar tradeoffs it's easier to learn.

My native language is Polish, I know English and a little German and Spanish. Slavic languages often have the reputation of being difficult, but IMHO that's because they are easy in places English-speakers expect to be difficult, and difficult in places English-speakers expect to be easy.

There are 15 tenses in English and 3 in Polish. There are no articles in Polish, and the pronunciation is almost perfectly regular. And there are probably 20 times fewer word roots, because of the pre/post-fix system. What in English is 20 unrelated words is, in Polish, one word root + 20 different combinations of pre/post fixes :)

But to take advantage of this when you're learning you have to think in the language you are learning - to realize these words are related and how the postfixes modify the meaning. Otherwise you'll still need to memorize 20 separate words - and on top of that all that crap that is harder in Polish, like cases.

I wonder if this influences LLMs (for example if they "think in Polish" when producing Polish text, or "think in English" and translate on the fly). I noticed GPT-3 was much better at rhyming in English than in Polish, despite the fact that rhyming in Polish is very easy (if the final letters match - it rhymes). When I explained this rule to it - it started rhyming better :)


One could argue that English has only two true tenses: past and non-past. Everything else is a mood or aspectual distinction.


I'd argue it's a bit easier in English when it comes to learning a grammatically correct expression of a thought. Polish and other Slavic languages have quite a lot of complexity hidden in "odmiany" (inflections), "przypadki" (cases), "koniugacje" (conjugations), etc.


Some languages have some specifics that are not common in other languages, and if the language is small, it's hard to account for that too.

In Slovene, you have singular and plural but also dual forms, so even the basic "strings.xml" types of localizations don't work:

e.g. "I eat" would be "Jaz jem", if it were two of us "We eat" would be "Midva jeva", and if 3+ of us were eating, it would be "Mi jemo".

Also when counting, we have a different form for one thing, two things, three-or-four things and five+ things, So, "(1-5) beer/s" would be "1 pivo, 2 pivi, 3 piva, 4 piva, 5 piv"

So yeah, good luck :)
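For anyone wondering how localization libraries cope: the standard gettext/CLDR-style plural rule for Slovene picks one of four forms from the count. A rough sketch, reusing the beer forms from the comment above:

    def slovene_plural_index(n: int) -> int:
        """Pick which of the four Slovene plural forms to use for count n."""
        if n % 100 == 1:
            return 0          # singular: 1 pivo
        if n % 100 == 2:
            return 1          # dual:     2 pivi
        if n % 100 in (3, 4):
            return 2          # paucal:   3 piva, 4 piva
        return 3              # plural:   5 piv ... 20 piv, 21 piv

    forms = ["pivo", "pivi", "piva", "piv"]
    for n in (1, 2, 3, 4, 5, 19, 20, 21):
        print(n, forms[slovene_plural_index(n)])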


> In slovene, you have singular, plural but also dual forms

This is not unusual amongst Indo-European Languages, Sanskrit is similar. In fact, Baltic languages like Lithuanian seem to be a close sister to Indic languages like Sanskrit. [1]

[1]: https://www.news9live.com/art-culture/why-lithuanian-sanskri...


Do "strings.xml" types of localization work for anything anyway? To me it just seems to let garbages match originals easily and giving false sense of accuracy to English speakers.


Is it "piv" all the way to 20 and then it's "21 pivo" and the pattern repeats every 10 items?


All the way luckily, "19 piv, 20 piv, 21 piv,...", so at least this is consistent :)


No. It's because OpenAI's token model is highly biased towards English words. In this way, other languages are greatly disadvantaged, reducing competitiveness in non-English-speaking countries. In my language (Portuguese and Spanish), for example, my cost increases by 60% to 75% due to the content I need to process not being in English.


No it's not. Greek is tokenized on a per-character level, while the language structure, while not as simple as English or Latin languages in general, is not that different.


I have a suspicion that an LLM has a primary language that it must always be consistent with. If so, it makes sense that it performs worse in non-native languages.


It’s not. That plays a role, but language efficiency is a real thing.


How does tokenization work? Could a different algorithm make, say, Burmese need fewer tokens? Or will it inherently have more tokens?


I believe it comes down to the statistical frequency in the dataset used to build the tokenizer. If the dataset is 90% Burmese the resulting tokenizer will have single tokens representing common complex constructs in Burmese instead of English. Today's tokenizer might tokenize Burmese letter-by-letter or even byte-by-byte as it is not a very common language on the internet.


The choice of tokenizer is arbitrary, and is configured by the people building the LLM. It's not dictated by the data. A simple model would be to simply tokenize the input character-by-character using Unicode as-is. Heck, you could use 256 (8-bit) tokens and just feed the model UTF-8 as a raw byte stream. The AI would "figure it out".

However, it is much more efficient to use a tokenization tuned to the statistics of the input data set. Since most of the Internet is in English, it's more efficient to assign single tokens to entire English words, but not to other languages.
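To make the raw-byte-stream option concrete: with a 256-token byte-level vocabulary, the token count is just the UTF-8 byte count, which already penalizes non-Latin scripts. Plain Python, no tokenizer library needed:

    samples = {
        "English": "hello",    # ASCII: 1 byte per character
        "Greek":   "γεια",     # Greek block: 2 bytes per character
        "Thai":    "สวัสดี",    # Thai block: 3 bytes per character
    }

    for lang, text in samples.items():
        raw = text.encode("utf-8")
        print(f"{lang}: {len(text)} chars -> {len(raw)} byte-level tokens")

A learned subword tokenizer then claws back efficiency only for whatever was frequent in its training data, which is where the disparity in the article comes from.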


> it's more efficient to assign single tokens to entire English words

Not entirely true, it's just that English has relatively limited inflection so words are not modified that much. In the weather example Spanish uses one more token than in English just because of the difference in sentence structure. The eight tokens are essentially "what weather will-be the week that come/s", i.e. the same sentence would use nine tokens in English.

For a simpler example, Italian will use two tokens "legg/o" where English will also use two tokens but they will be entire words "I read". But "he read/s" may be three tokens where Italian uses two for "legg/e", because English in this case has some redundancy from its remaining vestiges of inflection.

The fact that the tokenizers explored in the article aren't as efficient for non Latin alphabets is a different story.


> However, it is much more efficient to use a tokenization tuned to the statistics of the input data set.

A program "tuned to the statistics of the input data set" is what an LLM is. So choosing to use a fixed tokenizer rather than letting the LLM learn one is a performance optimization, but not one we'll have to accept if models were designed better for performance themselves.


How about parsing difficulty? Some languages are more context dependent than others.

In programming, C++ takes probably 10x more cycles to compile than simpler languages. There are so many possible interpretations of each statement, the correct one of which depends on context.


It's interesting that the token length for Chinese is only slightly longer than for English. What does tokenization of an ideographic language look like, anyway? One token per ideograph? Something else?


Either one token per radical (as some minimalist proposals for CJK Unicode normalization suggested way back when) — or just map everything to sense-annotated pinyin, i.e. what you type into a Chinese IME to get ideographs out (and which is also, I think, what Chinese text-to-speech engines do internally as an intermediate step.)


The top picture has the normal right-to-left / left-to-right text copy-paste issue with Hebrew :) it has the transliteration of "hey" backwards.


The Arabic in that pic is even more broken. It's written in the wrong direction and the letters aren't connected correctly.


Yep the right to left texts are still not rendered correctly. Same goes for Persian. Disconnected letters appearing from left to right, it should look like: سلام


Thank you for bringing this up! I've tried to update the right to left languages... Please let me know if it still looks wonky


Same for Arabic, and the characters also do not use their connected form...


I've tried to address this in the image, could you let me know if it still looks a bit unconnected?


Sorry for the late response!

Your Hebrew is correct now, but the Arabic is still wrong - letters are never written separately as they appear in the picture. The word is سلام but you're displaying س ل ا م, which are exactly the same letters but as if there were spaces between them.


So, to maximise cost savings... pass the input through Google Translate first? ;)


Interesting. This sort of thing could be solved by having a transformer work with intermediate and language-less concepts/ideas and then translating to/from a language with an additional model/encode/decoder separately.

I think the impact of ensuring "fairness" with language models at this point in time of their development would be quite negative. Does every model need to support Burmese? How large does a company offering a model have to be before it's considered a requirement? OpenAI's homepage (https://openai.com/) only seems to be available in English, these transformers are quite new, why haven't we applied the same logic of fairness & inclusivity to every site on the internet? Because it's infeasible unless usage calls for it, I don't believe it's political.


Just on the face of it, as someone who has implemented a handful of tokenizers over time and can read Hangul, the Korean example there is four tokens, not twelve. You can break that down more with semantic parsing, but tokenization is not semantic parsing.


Isn't that the case with UTF-8 as well? A single Latin character is 1 byte; characters from other alphabets are 2 to 4 bytes.


Good luck with your Finnish language model!


The irony here is that the Finnish internet will probably be less impacted by AI spam as a result. Perhaps in the internet's end times it will be nothing but bots and Finns. A fitting outcome.


Can't this be explained by the special characters and the extra step to translate them, such as how Diccionario Español changes to xn--diccionarioespaol-txb in a domain name? So the extra step adds a level of complexity/compute?


This is a fascinating read. However, the disparity in cost is not that surprising when you consider the factors contributing to it.

Very well explained. Kudos!


Languages are technology and as such some are more efficient than others.

To me, my native language imposes a tax on learning compared to English.


I think LLMs need a language of their own to cross communicate and to compile to code. This strengthens that reasoning I believe.


Someone asked ChatGPT to make its own language that could be used as prompts and it spat out some crazy concise emoji. But it somehow worked. Wish I could find the link.


English has such a moat behind it haha


Fascinating analysis. English language efficiency, on top of English language dominance in training data, should mean advantage for English AI. Along with Nvidia chip dominance, this should see the permanent supremacy of Anglo Saxon culture


https://medium.com/do-you-not-like-language/is-mandarin-more...

It’s possible Mandarin has even higher efficiency. And there are very many people who speak Mandarin. You are expecting English to come out on top. But do not be surprised if in the end it will be Mandarin.


It’s not like one has to fully dominate the other.

I can’t imagine Mandarin ever becoming widespread in the West unless there are some fundamental changes to Chinese society and culture.

Similarly western tech companies have huge issues in penetrating the Chinese market.


It’s hard to predict future cultural relevance. No one in 1998 was predicting the global popularity of Korean film/TV/music within 2 decades. And yet it happened.

Chinese TV shows could become equally popular. Some will say you need to have freedom to create art but I’m pretty sure that the Mario movie or Fast X (two movies at the top at the moment) needed minimal artistic freedom. Fast X is actually a great example because that franchise goes out of its way to toe the CCP line.[1] It shows that it’s possible to make billions while also kowtowing to the Party.

[1] - John Cena apologises for calling Taiwan a country (https://www.nytimes.com/2021/05/25/world/asia/john-cena-taiw...)


> Korean film/TV/music within 2 decades.

True. But it’s still rather limited and almost nobody is learning Korean just because of that.

e.g. in the 60s French cinema was pretty popular for some time in the US in a similar way. Barely anybody learnt French because of that (and it would’ve been way, way easier than Chinese or Korean)


> Chinese TV shows could become equally popular.

I personally do believe that you need enough artistic freedom to create those worldwide popular shows.

As an example, I'm watching Korean police shows which constantly have corruption as a main plot line, good luck filming corrupt cops in a series in China with the CCP.

What they can produce for now is limited to the dullest stories and that's not going to cut it.


The most popular K-Dramas are the formulaic 1 guy and 1 girl, with a 3rd person for a love triangle. Don’t worry though, true love wins and the 3rd wheel finds someone as well. Eg: Hometown Cha-Cha-Cha.

CCP would heartily endorse these brainless plots. Consumers would eat it up.

Not everything needs to be Infernal Affairs. The vast majority of Korean TV isn’t Infernal Affairs. I will admit that China could never make Parasite, but that’s ok. The metric we’ve chosen is popularity, not Oscars.

Side note I’m a bit sad that Hong Kong can’t make Infernal Affairs type of movies anymore. It would show the Party in a bad light.


Oh yeah, I'm sure they will have their success as well, their drama series with the emperor in imperial China (whatever it's called, I don't know) is also very popular. It's just that relative to their massive size, they will stay underperforming as long as the CCP cuts every head which steps a little outside their comfort zone.


Having corrupt cops in a Chinese series is no problem at all, as long as they're the bad guys and are defeated by the heroes in the end, or at least have a redemption arc.

Here's a Zhihu question where someone asked to be recommended anti-corruption series, maybe you should watch some of the shows mentioned in the replies before concluding that only dull stories can be produced: https://www.zhihu.com/question/502243280


> good luck filming corrupt cops in a series in China with the CCP

I don't know about China, but when Poland was communist and there was widespread censorship - it was still OK to talk about corruption, it was just supposed to be shown in a way that makes it clear that corruption is western influence and communism is getting rid of it.

Funnily enough - Polish cinema was probably better artistically during communism, because it had to work hard to work around the censorship. So you had movies that were trying to be universal, say the things that can't be said using symbols, etc.

Modern Polish cinema is pretty awful, mostly shitty romcoms.

I'm not saying censorship is good, it's evil obviously, but the effect on art can be counterintuitive.


> I can’t imagine mandarin ever becoming widespread in the west unless there are some fundamental changes to Chinese society and culture.

I could imagine it if a better pinyin, similar to Vietnamese, became the most popular way to write Mandarin in China. That would remove a lot of the difficulty for Westerners.

This isn't going to happen though.


I think if Mandarin started using pinyin it could start dominating. The language itself isn't that much more difficult than other languages. You can get used to tones, too. The grammar is actually simpler than in English.

But the sheer number of characters to rote-memorize makes it way too inefficient to learn for foreigners, and arguably even for natives, so I don't see it becoming a world language.

There's always the argument that pinyin isn't enough, that the characters are essentially useful. Which I think Vietnamese, which used to be based on the same writing system and is also tonal, proves wrong.

If you can't communicate with pinyin you can't communicate with the language, since it almost perfectly models the sounds.


> If you can't communicate with pinyin you can't communicate with the language, since it almost perfectly models the sounds.

hm, i'd say it's a bit more complicated than that. written and literary chinese has a different (often more concise) style than conversational chinese because certain things that would be ambiguous spoken aren't that way when written. similarly, japanese has a perfectly functional syllabary (hiragana), but people will use kanji anyways because it's a lot easier to parse at a glance.


Chinese characters are used phonetically for a lot of minority (non-Chinese) languages in China. Which I don’t really get, but there it is. I guess there is legacy to consider, as the use of Chinese in writing systems predate pinyin, and there is no pressure to change.


I used to set my OS and apps to English for easy searching of functions or bugs. But with the trend of design that emphasizes a "flat" look, which often includes removing menu icons and keeping only English text, I started switching app languages back to Chinese, where every menu text itself is already an "icon" that's way easier to find at a glance.


English is dominant not because of the number of speakers, but because they are so widespread. Mandarin has more, but they are almost all in China, whereas there is hardly a corner of the earth where you can't find an English speaker. So if you plan to travel to a lot of places, the most efficient language to learn is English. Between any two non-neighboring countries, the language with the highest number of mutual speakers is almost certainly English. I don't think any other language in history has had this level of network effect.


Not sure about Mandarin; I think it will follow a different path given that a significant degree of the meaning is implicit in Mandarin. LLMs are trained on explicit (documented) information, so it's unclear in my mind right now whether this will be more or less efficient. Will read the link though - thanks!


Given how Chinese talk about topics using code words to discuss controversial topics online (and avoid low effort keyword censorship), I wonder if an LLM could learn to do that given a censor in the training process?


It's all code words; this is not as conspiratorial as non-speakers imagine. A great deal of normal conversation and interaction is idiomatic, given that it is difficult to create new words in logographic languages. Your point is valid though - how does an AI learn this? I don't think it can easily. It's akin to an AI trying to keep up with street slang: it couldn't guess it in advance, it would have to be trained on it, which suggests it will need more human attention than alphabetic languages!


English is neither particularly efficient nor particularly inefficient. OpenAI has simply skewed the distribution of tokens (deviating from the natural distribution of all written text), and as a result non-English text, especially non-Latin text, is more expensive than it needs to be. This will disappear over time due to competitive pressure. Rapid development and fast launching leave such wrinkles to be ironed out later.
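
If you want to see the gap concretely, here's a minimal sketch using the tiktoken library (assuming the cl100k_base encoding used by GPT-3.5/GPT-4; the sample sentences are just illustrative):

    # Minimal comparison of token counts across languages
    # (assumes tiktoken is installed: pip install tiktoken)
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    samples = {
        "English": "How much does this cost?",
        "Hindi": "इसकी कीमत कितनी है?",
        "Thai": "อันนี้ราคาเท่าไหร่",
    }

    for lang, text in samples.items():
        print(f"{lang}: {len(enc.encode(text))} tokens")

The non-Latin sentences should come out several times longer in tokens than the English one, which is the cost gap the article is measuring.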


> Anglo Saxon culture

A language is not a culture. Many people with English as a native language are not culturally Anglo-Saxon.

And even if it was, prevalence is not supremacy.


I really enjoy speaking with native English speakers from India. Full mastery of the language, but with different idioms! For example "do the needful"[1]

https://ell.stackexchange.com/questions/17621/do-the-needful...


Language is clearly a major part of culture, maybe the essential element of it. Those who control language and are able to impose it - explicitly or implicitly - onto others have cultural supremacy. The entire world is being Americanised via English, transported through hegemonic control over social and broadcast media. I see this as neither laudable nor worthy of condemnation, but simply as a description of observable phenomena.


> The entire world is being Americanised via English, transported through hegemonic control over social and broadcast media.

How exactly is the culture that the US is imposing on the rest of the world — including England, Germany, Netherlands and Denmark, the places that Anglo-Saxons and their ancestors came from — Anglo-Saxon?


In a weird and perhaps slightly twisted way, I am hoping that LLMs become so good at spamming and flooding the net with garbage in English, while forfeiting the ability to do so in other languages, that the result is a balkanization (in a good way) of discussion and thought back into peoples' native languages, more integrated with their cultures.

In practice, translation engines (e.g. DeepL, or LLMs themselves - though I still expect DeepL to be better) will probably throw a wrench into this, but perhaps some localization approach in CAPTCHAs, or simply banning GeoIPs outside the countries where the language is native (or has significant minority populations), is a quick enough fix. I know some imageboards (e.g. British ones) use this because otherwise they would be flooded with Americans.


Lol what. It just has no word tokens for most languages outside English. It’s that simple. It’s a choice to optimize for English. It’ll even out in time as models get optimized for internationalization.


I would reconsider. Interestingly, another post on the front page right now, "Numbers every LLM developer should know", brings up the fact that this efficiency is due to the training corpus being in English. Any state actor with the will and enough funds could easily train a model for their own language. Most of the difficulty would lie in acquiring the expertise to do so properly, as LLM developers are in high demand right now and can name their own salaries.
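
The tokenizer part of that, at least, is cheap; the expensive part is the model training and the expertise. A rough sketch of training a language-specific BPE tokenizer with the HuggingFace tokenizers library (the corpus path and vocab size are made up, and whitespace pre-tokenization won't suit every language):

    # Sketch: train a BPE vocabulary on a single-language corpus.
    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.trainers import BpeTrainer
    from tokenizers.pre_tokenizers import Whitespace

    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()

    trainer = BpeTrainer(vocab_size=32000, special_tokens=["[UNK]"])
    tokenizer.train(files=["my_corpus.txt"], trainer=trainer)
    tokenizer.save("my_language_bpe.json")

    # Text in the target language should now encode to far fewer tokens
    # than it would under an English-heavy vocabulary.
    print(len(tokenizer.encode("a sentence in the target language").ids))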


Thailand is currently training ThaiGPT for an open source release.

https://www.bangkokpost.com/tech/2556324/nectec-agencies-rol...

Whether that effort has this kind of focus is unclear, but we should expect forward-looking governments to train LLMs to their own taste.


Wouldn't that hinge on the training data being available? There are many languages that are several orders of magnitude smaller than English in output.

There's something like 6 million Danish-speakers for example.


They may offset that by pulling content from books, newspapers, official documents, etc. - anything they can get in digital form, or digitize from analog.

(I think this is how Google Translate went about things in the past, making translations into some languages come out very formal, as most of the training corpora for those languages came from internal and international official documents.)

Countries that have been on-line for a while may also have discussion boards and comment-bearing sites that are entirely unknown to people outside those countries.

Maybe a multi-step approach would be in order - get a half-decent translation system working (an OG LLM, or an LLM trained to fix grammar in translations outsourced to GPT-4), and then synthesize training data for your main LLM by having GPT-4 (or its successor) generate tons of English text of all kinds and feeding it through the translator system (a rough sketch of that pipeline is below).

(There's a limit to synthesizing training data, beyond which it'll only amplify existing patterns, impacting model performance in bad ways - but I don't know how easy it is to reach it.)
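
A hypothetical sketch of that pipeline, just to make the shape of it concrete - the prompt, the topics, and translate_locally are placeholders, not anything that exists:

    # Hypothetical sketch: generate English text with a strong model, then
    # run it through a locally-trained translator to synthesize a corpus.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def generate_english(topic: str) -> str:
        # Have GPT-4 (or a successor) produce raw English text on a topic.
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user",
                       "content": f"Write a few informal paragraphs about {topic}."}],
        )
        return resp.choices[0].message.content

    def translate_locally(text: str) -> str:
        # Placeholder for the half-decent local translation system
        # (e.g. a small LLM fine-tuned to fix machine-translation grammar).
        return text  # TODO: swap in the real translator

    topics = ["fishing", "tax law", "local history"]
    synthetic_corpus = [translate_locally(generate_english(t)) for t in topics]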


I wouldn't worry (or hope) about it:

- Which language is on top follows the economics; should the US lose its dominant position in the world to someone other than the UK, English will eventually get "dethroned", like French before it.

- "Anglo Saxon" culture is weakening, slowly being replaced by the Universal Culture - globalization, international trade and the Internet make cultures blend together into something that just seems to work better (whether or not it actually does); the marketplace analogy is actually quite apt here.

- In the best case of the world getting its shit together and not being so sensitive about individual cultures and national identities, English will evolve into something that's more efficient and works best for everyone.

- AI advancements are likely to end the world quite soon anyway, in which case the whole issue becomes purely theoretical.


In practice it may well mean the anglophone internet fills up with spam, as ESL speakers realize their formerly dying (or dead) websites, forums, and boards are basically free from the digital black death, and intellectual thought retreats from the world language.


I wonder if LLMs can be prompted to use sarcasm.


Wonder no more.

Are LLMs capable of sarcasm? Please answer sarcastically.

> Oh, absolutely! LLMs are the masters of sarcasm. We just love to sprinkle our responses with subtle hints of irony and sarcasm. It's like our second nature. I mean, who needs sincerity when you can have sarcasm, right? It's not like we're just algorithms processing text without actual emotions or anything. So yes, LLMs are the epitome of sarcasm. Can't you tell?


There's a lot of stuff in various news articles about various forms of imperialism and such, but I would eat my hat if the internal workings of LLMs ever cleared the high journalistic bar for publishing.

Tokenization strategies will almost certainly play a significant role in language extinction, and as someone with tremendous respect for linguistics and the role that language plays in thought and culture, that pisses me the hell off.

Must we all bow to the anglophone?



