The technical reason is that it's a stupid idea. Not all encodings can be safely round-tripped through UTF-8, which means you can end up losing some data. (Consider http://homepage1.nifty.com/nomenclator/perl/ShiftJIS-CP932-M... as a quick example: "Actually, 7915 characters in CP-932 must be mapped to 7517 characters in Unicode. There are 398 non-round-trip mappings.")
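You can see the scale of it by walking the CP-932 byte space and checking which sequences survive a trip through Unicode and back. A rough Ruby sketch, using whatever Windows-31J tables your interpreter ships with (the exact count depends on the mapping table, so treat the number as illustrative rather than authoritative):

    # Decode every candidate two-byte CP-932 sequence to UTF-8, re-encode it,
    # and record the ones that come back as different bytes.
    lossy = []
    (0x81..0xFC).each do |lead|
      (0x40..0xFC).each do |trail|
        raw = [lead, trail].pack('C*').force_encoding('Windows-31J')
        next unless raw.valid_encoding?
        begin
          lossy << raw unless raw.encode('UTF-8').encode('Windows-31J') == raw
        rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
          next  # structurally valid but unassigned in this table
        end
      end
    end
    puts "#{lossy.size} byte sequences do not round-trip"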
General question: Why didn't the UTF-8 boys and girls make it safe in the first place? This doesn't sound like rocket science. "This character maps to that character, this character to that one." I don't understand how we have Unicode snowmen, but we can't safely round-trip characters.
Mostly, though, it's because some of these characters are overloaded. If you've got a Windows system, go into the DOS window and type "chcp 932" (you may need the Japanese language files installed). When you type '\', you'll get '¥' (making "C:\Program Files\" look like "C:¥Program Files¥").
In the systems where what became CP932 was first used, the backslash wasn't needed for Japanese text, so that code point was used to encode the yen symbol instead. Other systems kept the backslash and encoded the yen sign at a different point. When JIS unified the existing Japanese code pages, it couldn't very well go back in time to change all that old data, so it merged the two encodings in many places. The result: there's only one Unicode code point for the yen glyph ¥, but this one encoding has two different characters for it.
This is the most blatant example of a problem with Unicode transcoding, but as far as I know, it's not the only one.
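If you want to poke at it yourself, here's a minimal Ruby sketch of the 0x5C situation, assuming the Windows-31J/CP932 tables Ruby ships with (what it prints depends on that table, so take the output as illustrative):

    # The "yen" byte 0x5C is a plain backslash as far as Unicode is concerned.
    backslash = "\x5C".force_encoding('Windows-31J').encode('UTF-8')
    puts format('0x5C decodes to U+%04X', backslash.ord)

    # Going the other way: what happens to the two Unicode yen signs?
    ["\u00A5", "\uFFE5"].each do |yen|
      begin
        bytes = yen.encode('Windows-31J').unpack('H*').first
        puts format('U+%04X encodes to 0x%s', yen.ord, bytes)
      rescue Encoding::UndefinedConversionError
        puts format('U+%04X has no mapping in this table', yen.ord)
      end
    end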
There isn't a good idea that a standards body hasn't fucked up. Maybe it's because these problems are harder than they seem. Or maybe they're just plain screwed up. Or maybe it's Adobe.
That page says there are duplicates in CP-932 because of different vendor variants. If those characters are otherwise entirely identical, calling it a single encoding seems wrong. Wouldn't you just have a CP-932-IBM and a CP-932-NEC encoding?
Because someone unified two similar encodings into CP-932/ShiftJIS, and that someone wasn't IBM or NEC (both of whom built competing systems and had made different encoding choices, though their choices were mostly compatible).
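You can watch the merge in the mapping tables, too: group the valid CP-932 byte sequences by the Unicode character they decode to, and the vendor duplicates fall out. Another rough Ruby sketch against the Windows-31J tables (illustrative only; the counts depend on the table):

    # Characters reachable from more than one CP-932 byte sequence are the
    # merged vendor duplicates; only one byte sequence can win on the way back.
    by_char = Hash.new { |h, k| h[k] = [] }
    (0x81..0xFC).each do |lead|
      (0x40..0xFC).each do |trail|
        raw = [lead, trail].pack('C*').force_encoding('Windows-31J')
        next unless raw.valid_encoding?
        begin
          by_char[raw.encode('UTF-8')] << raw.unpack('H*').first
        rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
          next
        end
      end
    end
    dupes = by_char.select { |_, seqs| seqs.size > 1 }
    puts "#{dupes.size} characters have more than one CP-932 byte sequence"
    dupes.first(5).each { |char, seqs| puts "#{char}: 0x#{seqs.join(', 0x')}" }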
Rules for dealing with legacy encodings:
1. They make no sense.
2. If you think they make sense, remember that you weren't there so refer to rule 1.
Are these encodings common enough that they're worth supporting in the basic String class? Surely another class could be provided to handle these edge cases; they don't have to be natively handled.
There's no need for any data loss to occur: the String class would simply not support converting from non-round-trippable encodings.
If they're not natively handled, you can't regex them.
If they're not natively handled, you can't convert any numbers that might be in them to numeric values.
Yes, they're common enough (especially in Japan), and encoding support has to be baked in deeply if you really want everything a Rubyist expects to just work.
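On the regex point above: because Ruby strings carry their encoding around with them, a pattern and a string in the same legacy encoding can be matched directly, with no lossy detour through UTF-8. A minimal sketch, assuming Windows-31J:

    # Match a CP-932 string against a CP-932 regexp without transcoding.
    sjis    = "カタカナ123".encode('Windows-31J')
    pattern = Regexp.new('カナ'.encode('Windows-31J'))

    puts sjis.encoding            # Windows-31J
    puts(sjis =~ pattern)         # position of the match, nil if none
    puts sjis[/[0-9]+/].to_i      # ASCII-only patterns still work: 123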
Practically speaking, what data are you talking about? Which characters don't map? The mapping here looks pretty complete, except for "DBCS LEAD BYTE". Is that a problem? The other non-round-trip mappings look like cases where different vendors gave different meanings to the same character. That's not Unicode's fault.
If there are ~398 byte combinations that can be translated in one direction but not back, you're still losing data. (The big one that always gets pointed out? '\' and '¥'.)
"(The big one that always gets pointed out? '\' and '¥’.)”
The point there, though, is that Japanese DON'T want \ and ¥ to map properly. They want \ and ¥ to be considered the same. So Unicode isn't losing information, it's forcing a distinction the Japanese don't want to be able to make.
Loss of text data is bad.