Curious how Twitter detects when it's Chinese or Japanese to restrict the length...

darkengine · on Nov 8, 2017

They've made CJK characters (including fullwidth characters of any kind, with no exception for Latin letters or Arabic numerals) count as 2 characters. In a mixed alphanumeric/CJK tweet, the CJK part simply counts double toward your character limit.

kurthr · on Nov 8, 2017

Interesting... of course 70 characters of Mandarin would be the equivalent of over 400 characters of English. I suppose that, plus Google Translate, would be another, even less readable way to get around the limit... since Chinese speakers by and large just use WeChat.

lifthrasiir · on Nov 8, 2017

It is a brain-dead solution: any Unicode scalar value not matching /[\u0000-\u10ff\u2000-\u200d\u2010-\u201f\u2032-\u2037]/ doubles the cost. [1] The primary range ends at U+10FF because it conveniently excludes virtually all CJK characters (Hangul starts at U+1100) with relatively low error rates. Yet, it's still brain-dead.

[1] https://twitter.com/FakeUnicode/status/928030981805588480 (used to be /[\u0000-\u10ff]/ during the test period)

akanet · on Nov 8, 2017

Try it! I wonder, too.