Chinese Twitter users live in a density 2x to 8x their English counterparts. (pugs.blogs.com)
35 points by audreyt on Oct 11, 2009 | 12 comments



Michael Mitzenmacher, at Harvard, had a paper in the 2003 IEEE Data Compression Conference that gave empirical evidence that translations compressed to roughly similar sizes (using the Bible and the EU texts), but had wildly varying sizes uncompressed; this correlates well with linguistic theories.

ftp://ftp.deas.harvard.edu/techreports/tr-12-02.ps.gz
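For anyone who wants to try this kind of comparison themselves, here is a minimal sketch. It assumes you already have the same text in several languages as Python strings. The standard-library zlib stands in for whatever compressors the paper actually used, and the function names are made up for illustration.

```python
import zlib


def sizes(text):
    """Return (raw UTF-8 byte count, zlib-compressed byte count) for a text."""
    raw = text.encode("utf-8")
    return len(raw), len(zlib.compress(raw, 9))


def compare(translations):
    """Map each language name to its (raw, compressed) sizes.

    `translations` is a hypothetical dict such as {"en": ..., "zh": ...}
    holding the same document in each language.
    """
    return {lang: sizes(text) for lang, text in translations.items()}
```

This only means something on long parallel texts. On a few sentences the compressor's fixed overhead dominates and the numbers tell you nothing.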



This is one of the reasons that texting is so darn popular in China, and email is not. Texting is fast, and you can pack so much info into a single SMS that you almost never need to send anything longer.

The same goes for books and essays. Chinese books and magazines are often shorter, just because the information density is so high. It's a neat feature of the language.

However, sometimes it can be demoralizing when you spend all evening writing something and realize that you've only produced a single page of text.


> This is one of the reasons that texting is so darn popular in China, and email is not. Texting is fast

I don't see why texting in Chinese should be faster than in English.

Don't you need to push a similar number of bits through the numeric keypad? I'd imagine that to be the limiting factor.

i.e. English texters need to send more chars, but they're choosing from fewer chars so need fewer keypresses to select each one. Chinese texters need to send fewer chars, but I imagine they need to make more keypresses to select each char.


The better Chinese input methods achieve ~3 keystrokes per character on handheld devices. [1] IIRC average word length in English is around 5 or 6, vs. around 2 for Chinese. So the input time should be about the same.
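The back-of-envelope arithmetic, as a sketch. The one-keypress-per-letter figure for English is my assumption (roughly what predictive input gives you when the word is in the dictionary), and spaces and candidate selection are ignored.

```python
def keystrokes_per_word(chars_per_word, keystrokes_per_char):
    """Rough keystroke cost of one word; ignores spaces and candidate picking."""
    return chars_per_word * keystrokes_per_char


# Chinese: about 2 characters per word at about 3 keystrokes per character.
chinese = keystrokes_per_word(2, 3)

# English: 5 to 6 letters per word, ASSUMING about 1 keypress per letter.
english_low = keystrokes_per_word(5, 1)
english_high = keystrokes_per_word(6, 1)

print(chinese, english_low, english_high)  # prints: 6 5 6
```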

The real bottleneck isn't the actual typing, but thinking of something to type --- and it's a lot easier to think of how to say something verbosely rather than concisely!

[1] http://www.pascal-man.com/navigation/faq-java-browser/2009_S...


You're right that Chinese and English can be comparable in input time, provided that you're using something like T9 for the English.

However, you lose a ton of speed the moment that you hit a word that isn't in your T9 dictionary. Chinese input methods don't have that problem. So, the best case for English usually equals the average case for Chinese.

Of course, since texting is the primary communication mode for many young Chinese, they can usually text blindingly fast in any language.


Yes, you need 1-4 keypresses per char, but the language itself is significantly more compressed than English. The equivalent of an English word is 1 to 2 chars, with more frequent words usually being one character, just as frequent English words tend to be shorter. The number of "words" you use to express the same idea is also generally smaller, primarily because the language leaves a lot more to context than English does (e.g., there is no 'the', 'a', or 'an', and subjects are often dropped - it is left to the speakers to fill in the blanks).


OK, you need 1-3 keypresses for English.

Ah, of course, it's tree navigation.

The number of keypresses required to enter the character is O(log(number-of-chars-in-alphabet)).

So having lots of characters is a win for input speed. Thanks.
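A quick sketch of the tree-navigation bound. It assumes every character is equally likely and that each keypress picks one of `keys` branches. The 10-key keypad and the 5000-character set are illustrative numbers of mine, not measurements.

```python
def min_keypresses(alphabet_size, keys):
    """Smallest depth d such that keys**d >= alphabet_size.

    Integer arithmetic only, so there is no floating-point log rounding.
    """
    if alphabet_size < 1 or keys < 2:
        raise ValueError("need alphabet_size >= 1 and keys >= 2")
    depth, reachable = 0, 1
    while reachable < alphabet_size:
        reachable *= keys
        depth += 1
    return depth


print(min_keypresses(26, 10))    # prints: 2
print(min_keypresses(5000, 10))  # prints: 4
```

So under these assumptions the character set grows by a factor of almost 200 while the keypresses per character only double.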


A few reasons why texting is popular in China:

1. SMS is cheaper than calling.

2. China is LOUD. There are times I don't hear a phone call because there are so many people and cellphones around.

3. And yes, you can fit as much info in a text as in an email.


If you are willing to do a bit of encoding/decoding, then you can map 2 or 3 Latin-1 characters onto one Unicode code point, and then tweet with that instead.

My understanding of UTF-8 indicates that you can actually represent any number as one character, but somewhere in the xterm / firefox / twitter pipeline, that gets fucked up. I think I have some code on github for this, actually:

http://gist.github.com/191446

The idea is to pack any UTF-8 string into one character. It works for about 3 or 4 ASCII characters, but I think this is a Perl bug rather than some fundamental limitation. Patches welcome.
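For illustration, here is a minimal Python sketch of the two-characters-per-code-point version. This is not the Perl from the gist. The plane offsets and function names are my own choices, picked so the result never lands in the surrogate range, and I have not checked how Twitter counts such characters.

```python
PAIR_BASE = 0x10000    # plane 1: one code point per pair of Latin-1 bytes
SINGLE_BASE = 0x20000  # plane 2: a lone trailing byte when the length is odd


def pack(text):
    """Pack a Latin-1 string into roughly half as many code points."""
    data = text.encode("latin-1")
    out = []
    for i in range(0, len(data) - 1, 2):
        out.append(chr(PAIR_BASE + (data[i] << 8 | data[i + 1])))
    if len(data) % 2:
        out.append(chr(SINGLE_BASE + data[-1]))
    return "".join(out)


def unpack(packed):
    """Inverse of pack(); assumes the input was produced by pack()."""
    data = bytearray()
    for ch in packed:
        cp = ord(ch)
        if cp >= SINGLE_BASE:
            data.append(cp - SINGLE_BASE)
        else:
            value = cp - PAIR_BASE
            data += bytes((value >> 8, value & 0xFF))
    return data.decode("latin-1")


message = "hello world!"
assert unpack(pack(message)) == message
print(len(message), len(pack(message)))  # prints: 12 6
```

Two bytes per code point is as far as this goes for arbitrary input. Unicode stops at U+10FFFF, which is 1,114,112 code points, while three Latin-1 bytes need 16,777,216 distinct values.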

(As an aside, I am always pleased when I get to use the (>>=) operator in Perl. And yes, I do pronounce it "bind" and not "right-shift-equals" ;)


UTF-8 encodes any Unicode character using between 1 and 4 bytes. Also, some byte sequences aren't valid UTF-8. I don't think it's a Perl bug.
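Easy to check from Python. A small sketch, with helper names that are mine:

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 uses for one (non-surrogate) code point."""
    return len(chr(code_point).encode("utf-8"))


def is_valid_utf8(data):
    """True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


for cp in (0x41, 0xE9, 0x4E2D, 0x1F600):
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s)")  # 1, 2, 3 and 4 bytes

print(is_valid_utf8(b"\xff"))      # prints: False (0xFF never appears in UTF-8)
print(is_valid_utf8(b"\xc0\x80"))  # prints: False (overlong encoding)
```

Also, chr(0x110000) raises ValueError, because code points stop at U+10FFFF. That ceiling is why an arbitrary number cannot become a single character.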


Kaifu Lee made the same observation in a talk. Chinese news headlines usually carry enough information that Google News in Chinese doesn't need the excerpts that its English counterpart shows. A tweet in Chinese could be an essay.



