A rant about Ruby 1.9 String encoding (github.com/candlerb)
86 points by dboyd on March 2, 2010 | 52 comments



I know nothing about the author, but there are some statements made that suggest that the author hasn't had to deal with the wild-and-woolly reality of encodings out there in a lot of extant data. One only wishes that all data were UTF-8.

What Ruby 1.9 gets absolutely right is that its String implementation is completely encoding agnostic (by which I specifically mean that it doesn't force your data to be encoded in a particular way). There are encodings for which there is no safe UTF-8 roundtrip (you can convert the data to UTF-8 cleanly, but when you convert back from UTF-8 to that encoding, you won't get the original input back; you'll get slightly different output).

Rubyists in Japan don't have the luxury of dealing with Unicode all the time; they still get lots of data in ShiftJIS and other encodings. (The same is true of Rubyists elsewhere, but since US-ASCII is a proper subset of UTF-8, most folks don't know the difference; Win1252 is a pain in the ass, though.) If you have to do ANY work with older data formats, you curse languages that force you to use UTF-8 all the time instead of letting you work with the native data.

Most developers don't think about i18n nearly enough in any case; there's a lot more to worry about that simply using Unicode doesn't solve for you. Even the developers of Ruby have to worry about the fact that LATIN SMALL LETTER E WITH ACUTE (U+00E9) is the same as LATIN SMALL LETTER E (U+0065) followed by COMBINING ACUTE ACCENT (U+0301); it doesn't begin to address the capitalization of 'ß' ('SS', which isn't necessarily reversible) or that in Turkish 'ı' capitalizes to 'I', but 'i' capitalizes to 'İ'. Don't EVEN get me started on number formatting...
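
To make the é example concrete, here's a tiny sketch (assuming a Ruby 1.9 interpreter, since that's the topic): the two spellings are different byte sequences, and plain String comparison doesn't normalize them for you.

  # A minimal sketch: composed vs. decomposed "é" are different strings,
  # because String equality compares bytes, not normalized text.
  composed   = "\u00E9"    # LATIN SMALL LETTER E WITH ACUTE
  decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
  composed == decomposed   # => false
  composed.bytes.to_a      # => [195, 169]
  decomposed.bytes.to_a    # => [101, 204, 129]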

EDIT: Added the last paragraph.


The roundtrip thing is an edge-case that doesn't really justify inflicting the non-deterministic pain on everyone. Python 3 and Java have taken the 'one true internal encoding' path and while hardly free of warts, it's an approach that is practically saner. The alternative is making some people's hell everyone's hell, forever.


"Hardly free of warts" doesn't even begin to cover the pain that's dealt with if you have to deal with these external encodings.

And, if you've got loads of data in an encoding that doesn't roundtrip, it's hardly an edge case.

Ruby's implementation is supposed to be such that if you want UTF-8 support and know that your (text) inputs and outputs are always going to be UTF-8, you never have to do anything differently than you did in Ruby 1.8. If it isn't working that way, then I think there's a bug.


All good points. However, there are more issues with dealing with strings in the wild than just round-tripping between encodings.

Many of the issues I've dealt with when data mining involved mixed encodings within the same document, documents labelled with the wrong encoding in the metadata, and documents with no encoding information. There's only so much you can do as far as sniffing character sets and languages to avoid mojibake and other, more subtle, problems.

For my purposes, converting to Unicode and losing round-tripping is only a minor concern, whereas dealing with non-Unicode encodings is often a source of major problems.

So, personally, having worked both in languages that deal with strings by converting them internally to Unicode, and ones that treat them as encoding-tagged byte streams, I definitely favor the ones that deal with them as Unicode. But, my purposes aren't everyone's, and I'm not convinced there's a paradigm that would suit both usage patterns.


There is no need to make the internal encoding comply with UTF-8 or any other standardized encoding, since it is internal. The point is that the existing solution simply doesn't work out well.

I personally won't upgrade to 1.9 if they don't fix that. Even with simple code snippets, the Ruby 1.9 solution has caused too much pain to consider it an eligible option. I'd personally rather switch to Groovy or Python than to Ruby 1.9. The way Ruby 1.9 handles encodings sucks. Period.

BTW, the author has written on that subject several times and he knows it quite well.


I agree. I'd say a good 5% of development time on a new Ruby 1.9 project of mine has been spent dealing with strings. I've taken to the idea that as long as _everything_ is UTF-8, then I'll be okay, but good luck enforcing that! Especially since the whole world seems to default to ASCII, while actually _using_ multi-byte chars anyway. I think I've got it now, but no, not really. It just works on my development machine. On my server, where it actually counts, I'm getting garbage where I should be getting an accent. I've spent two days now just trying to figure out how to debug something like that!
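
The closest I've got to a debugging checklist is printing Ruby 1.9's per-process defaults, which come from the locale and so can differ between my machine and the server; a rough sketch, not a fix:

  # The usual suspects when accents only break on the server:
  p Encoding.default_external   # taken from the locale (LANG/LC_ALL) at startup
  p Encoding.default_internal   # nil unless you set it explicitly
  p __ENCODING__                # source encoding; US-ASCII without a magic comment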


Dealing with string encoding is sometimes a PITA. But I think as English speakers, we are usually sheltered from the problem because most programming languages are English centric. I'd like to hear opinions from people who don't speak English.

If you think the way 1.8 handles (or doesn't handle) encoding is just fine, try things you typically do, but with a different language.

For example:

  ruby -ryaml -e'p YAML.dump("こんにちは!")'

You might also try things like inserting in to a database, parsing documents, etc.

IMO, not having an encoding associated with some text sucks if you're a non-English speaker.


> I'd like to hear opinions from people who don't speak English.

I'm sure people who don't speak English will read that and answer you straight away ;)


This "rant" is just a part of an effort by the author to catalog the behavior of strings in ruby 1.9. You can see the whole project here: http://github.com/candlerb/string19

The main (runnable) documentation file is here: http://github.com/candlerb/string19/blob/master/string19.rb


It seems the most obvious solution is to store strings in a standard encoding (say UTF-8) and to always convert strings to it at the time of their creation. Is there a technical reason why Ruby doesn't do this?


The technical reason is that it's a stupid idea. Not all encodings can be safely round-tripped through UTF-8, which means you can end up losing some data. (Consider http://homepage1.nifty.com/nomenclator/perl/ShiftJIS-CP932-M... as a quick example: "Actually, 7915 characters in CP-932 must be mapped to 7517 characters in Unicode. There are 398 non-round-trip mappings.")

Loss of text data is bad.


General question: Why didn't the UTF-8 boys and girls make it safe in the first place? This doesn't sound like rocket science. "This character maps to that character, this character to that one." I don't understand how we have unicode snowmen, but we can't safely round trip characters.


Some of it has to do with Han Unification (http://en.wikipedia.org/wiki/Han_unification).

Mostly, though, it's because some of these characters are overloaded. If you've got a Windows system, go into the DOS window and type "chcp 932" (you may need the Japanese language files installed). When you type '\', you'll get '¥' (making "C:\Program Files\" look like "C:¥Program Files¥").

In the systems where what became CP932 was first used, the backslash wasn't needed for Japanese, so that code point was used to encode the yen symbol. Other systems used the backslash, so the yen was encoded at a different point. When JIS unified the existing Japanese code pages, it couldn't very well go back in time to change all that old data, so it merged the two encodings on many things. So, there's only one Unicode codepoint for the yen glyph ¥, but in this one encoding there are two different characters for it.
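
A hedged Ruby 1.9 sketch of the visible half of that story (assuming the converter follows the standard mapping tables): the 0x5C byte that Japanese fonts draw as '¥' comes out as a plain backslash, so the yen reading is gone once you're in Unicode.

  # Sketch: CP932 byte 0x5C, displayed as '¥' on Japanese systems,
  # transcodes to U+005C (backslash), not to a yen sign.
  b = "\x5C".force_encoding("Windows-31J")
  b.encode("UTF-8")   # => "\\"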

This is the most blatant example of a problem with Unicode transcoding, but as far as I know, it's not the only one.

See http://www.mail-archive.com/linux-utf8@nl.linux.org/msg02337... for what could be done, but probably won't.


There isn't a good idea that a standards body hasn't fucked up. Maybe it's because these problems are harder than they seem. Or maybe they're just plain screwed up. Or maybe it's Adobe.


That page says there are duplicates in CP-932 because of different vendor variants. If those characters are otherwise entirely identical, calling it a single encoding seems wrong. Wouldn't you just have a CP-932-IBM and a CP-932-NEC encoding?


Because someone unified two similar encodings into CP-932/ShiftJIS. That someone wasn't IBM or NEC (both of whom made competing systems and had made different encoding choices, but whose choices were mostly compatible).

Rules for dealing with legacy encodings: 1. They make no sense. 2. If you think they make sense, remember that you weren't there so refer to rule 1.


Are these encodings common enough that they're worth supporting in the basic String class? Surely another class could be provided to handle these edge cases; they don't have to be natively handled.

There's no need for any data loss to occur - the String class would merely not support converting from non round-trippable encodings.


If they're not natively handled, you can't regex them. If they're not natively handled, you can't convert any numbers that might be in them to numeric values.

Yes, they're common enough (especially in Japan) and encodings have to be baked deeply in if you really want to use everything that a Rubyist expects to be able to use.
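
A small sketch of what having encodings baked in buys you, with Shift_JIS data (the byte values are my recollection of 日本語 in Shift_JIS, so treat them as illustrative):

  # A Shift_JIS string and a Shift_JIS regexp match character-wise,
  # without converting either one to Unicode first.
  sjis = "\x93\xFA\x96\x7B\x8C\xEA".force_encoding("Shift_JIS")   # "日本語"
  re   = Regexp.new("\x96\x7B".force_encoding("Shift_JIS"))       # /本/
  sjis =~ re    # => 1 (a character offset, not a byte offset)
  sjis.length   # => 3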


Practically speaking, what data are you talking about? Which characters don't map? The mapping here looks pretty complete, except for "DBCS LEAD BYTE". Is that a problem? Other non-round trip mappings look like it's because different vendors have different meanings for that character. That's not Unicode's fault.

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOW...


If there are ~398 byte combinations that can be translated only one way, but not the other, you're still losing data. (The big one that always gets pointed out? '\' and '¥'.)


"(The big one that always gets pointed out? '\' and '¥’.)”

The point there though is that Japanese DON'T want \ and ¥ to map properly. They want \ and ¥ to be considered the same. So Unicode isn't losing information, it's forcing a distinction the Japanese don't want to be able to make.


Java stores all strings in UTF-16. It probably seemed like a good idea at the time but has led to problems later. There is now a huge body of broken code out there which assumes that each 16-bit char value represents a complete and separate Unicode character.


Um. Wasn't Java's original choice UCS-2 (same as NTFS originally)? UTF-16 has surrogate pairs, and a character takes one or two 16-bit values depending on whether it needs a surrogate. UCS-2 was a fixed 16-bit character size, long since deemed a mistake.


You're right. Java originally used UCS-2 and later switched to UTF-16. http://java.sun.com/docs/books/jls/first_edition/html/3.doc.... The problem is that dealing with surrogate pairs is so awkward that most developers don't even bother.


> Is there a technical reason why Ruby doesn't do this?

Han Unification. Gory details here: http://en.wikipedia.org/wiki/Han_unification


I'm not a Ruby-head, so maybe I don't have the necessary context to understand what's going on, but: As far as I'm concerned, the only time I care about encodings is when I'm trying to do I/O. I'll always have to choose an encoding when engaging in I/O, explicitly or implicitly. While I'm shuffling strings around in memory, I expect them to behave like sequences of Unicode code points. If the implementation wants to allow multiple internal representations for performance reasons (e.g. because reading UTF-8 from a file, converting it to UTF-16, processing it, and converting it back to UTF-8 to send it over the network is inefficient), then that's fine by me. I'll still want to tell it what encoding the byte stream I'm reading from is in, and what the stream I'm writing to expects. If I've understood the rant correctly, Ruby (1.9?) confuses byte streams with strings. That was a bad idea in C, and with the complexity that Unicode adds, it's an even worse idea nowadays.


I also thought the exact same thing, but I know that encoding stuff is so complex that I may be missing something. For what it's worth, I believe Cocoa does things this way: a string is an abstract object of Unicode code points; when it's time to read or write, you must choose an encoding.


The Java and .NET infrastructures make that choice for you too. I guess I wasn't there when Java first got started; it's possible that there was a backlash[1] against this decision by the people who are using encodings that aren't subsets of Unicode. I don't get to see many "enterprise" systems, but I get the impression the applications where it's important to use some kind of not-Unicode are sufficiently rare that they're best served by specialty libraries. You can't make everyone happy with one solution, but if a simple solution works for 99%, then I'd call that a roaring success.

[1] a backlash other than the OMG-2-bytes-per-ASCII-character hysteria, which is irrelevant for these purposes; they could just as well have chosen UTF-8


If I've understood the rant correctly, Ruby (1.9?) confuses byte streams with strings.

My understanding was just the opposite: now that strings are associated with encodings, he can no longer assume that a1 + a2 results in a string with the same encoding as a1 and a2, since a1 and a2 can have different encodings.
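
Right, and it's exactly the mixed case that blows up. A minimal sketch of the rule as I understand it (ASCII-only pieces concatenate with anything ASCII-compatible; two different non-ASCII strings don't):

  utf8   = "caf\u00E9"                              # "café" as UTF-8
  latin1 = "caf\xE9".force_encoding("ISO-8859-1")   # the same text as Latin-1 bytes
  both = utf8 + " au lait"
  both.encoding   # => #<Encoding:UTF-8>
  utf8 + latin1   # raises Encoding::CompatibilityError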


The implication is that the programmer must know what is used internally, though, because things will go horribly wrong otherwise. That's a sign of a very leaky abstraction: if you need to know the format and behaviour of the underlying byte stream, you'd probably be better off using the byte stream directly; the abstraction is just introducing uncertainty where it should be isolating you from such concerns.


That implication is also supposed to be incorrect for Ruby most of the time. It should be possible to work with a single encoding trivially (e.g., if all of your data is Shift_JIS or UTF-8); but rather than trying to (badly, and usually unsuccessfully) hide the encoding difficulties from the programmer when you've got mixed encodings, Ruby has chosen to be a little more up front about encodings.

The reality may be a bit different, but I recall seeing an email message from Matz on ruby-core last year suggesting that it was supposed to be trivially easy to work with one encoding (specifically mentioning UTF-8, but implying others).


Is there a technical reason why Ruby doesn't do this?

More likely it's political. Matz (the creator of Ruby), and many of its early contributors, are said not to like Unicode.


That doesn't jibe with what I know of Matz. Japan (and much of Asia) adopted Unicode much more slowly because both UTF-8 and UTF-16 are less efficient than many of the native encodings for representing much of what they need; there was definitely some cultural tone-deafness in the Unicode community when it did Han unification (which may be a good idea, but could have been handled better).

Matz (and the people he works with who use Ruby to get their jobs done) needs access to data that's Not Unicode. Painfully Not Unicode as in it doesn't necessarily round-trip.


The performance is better if you use an encoding with a fixed byte length per character.


No it isn't.


And performance... my non-UTF-8 1 GB file, whose lines I need to count, thanks Ruby 1.9 for not uselessly and forcibly making an in-memory UTF-8 copy of it.
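
In case it helps anyone else, a sketch of counting lines in 1.9 with no transcoding at all (the file name is made up): open the file in binary mode so the data stays tagged as raw bytes.

  # "rb" tags the data ASCII-8BIT, so each_line just splits on newline bytes;
  # no in-memory UTF-8 copy is involved.
  count = 0
  File.open("huge-file.dat", "rb") { |f| f.each_line { count += 1 } }
  puts count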


Having given up porting a medium-sized Rails application to Ruby 1.9 just last week, I thought I'd share some of my experience. Ruby 1.9's String implementation is fine, but it will require lots of changes to existing code. Rails 2.3.5 is definitely _not_ ready for Ruby 1.9 and UTF-8. For example, some regexps in helpers are not encoding-aware, Rack 1.0.1 (which 2.3.5 requires) doesn't play nicely with encodings, and there are many more small and annoying problems (template encoding, for example; the app I was porting mostly uses HAML, which supports Ruby 1.9's string encodings in its latest versions, but no luck for the few ERB templates).

All in all, this is a huge transition which will take a while to propagate through the whole Rails stack.


My sup (on Ruby 1.9) just crashed on me because I dared to try to write a mail with UTF-8 in it.

This is not a trivial screwup! This is the sort of screwup that should make everybody who's using that wretched platform think thrice before continuing to use it.


I believe that Python, Java, C#, Objective-C and Javascript all have the same basic approach to this problem. The Ruby way is better for handling some Japan-specific problems. But that's at the cost of making life harder and less predictable for everyone else.

It's a pretty straightforward tradeoff. Of course people who are not Japanese will naturally be upset to pay a cost in complexity for a feature of benefit primarily to programmers from a single country. Non-Japanese Ruby programmers will just have to decide whether their solidarity with Japanese programmers outweighs their personal and collective inconvenience.


Rants about strings and character sets that contain words of the following spirit are usually neither correct nor worthy of any further thought:

  > It's a +String+ for crying out loud!  What other
  > language requires you to understand this
  > level of complexity just to work with strings?!
Clearly the author lives in his ivory tower of English-language environments, where he is able to say that he "switched to UTF-8" without actually really having done so, because the parts of UTF-8 he uses work exactly the same as the ASCII he used before.

But the rest of the world works differently.

Data can appear in all kinds of encodings and can be required to be in various other encodings. Some of those can be converted into each other; some Japanese encodings (Ruby's creator is Japanese) can't be converted to a Unicode representation, for example.

Also, I'm often seeing the misunderstanding that "Unicode" is a string encoding. It's not. UTF-(8|16) is. Or UCS-2 (though that one is basically broken because it can't represent all of Unicode).

Nowadays, as a programming language, you have three options of handling strings:

1) pretend they are bytes.

This is what older languages have done and what Ruby 1.8 does. This of course means that your application has to keep track of encodings. Basically, for every string you keep in your application, you need to also keep track of what it is encoded in. When concatenating a string in encoding A to another string you already have in encoding B, you must do the conversion manually.

Additionally, because strings are bytes and the programming language doesn't care about encoding, you basically can't use any of the built-in string handling routines, because they assume each byte represents one character.

Of course, if you are one of these lucky English UTF-8 users, getting data in ASCII and English text in UTF-8, you can easily "switch" your application to UTF-8 while still pretending strings are bytes because, well, they are. For all intents and purposes, your UTF-8 is just ASCII called UTF-8.

This is what the author of the linked post wanted.
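
A sketch of the difference on a single literal (the 1.8 values are from memory; the 1.9 side assumes a UTF-8 source magic comment):

  # encoding: utf-8
  s = "こんにちは"   # five characters, fifteen UTF-8 bytes
  # Ruby 1.8 sees bytes:  s.length == 15, s[0] == 227 (the first byte)
  # Ruby 1.9 sees a UTF-8-tagged string:
  s.length     # => 5
  s[0]         # => "こ"
  s.bytesize   # => 15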

2) use an internal unicode representation

This is what Python 3 does and what I feel to be a very elegant solution if it works for you: A String is just a collection of Unicode code points. Strings don't worry about encoding. String operations don't worry about it. Only I/O worries about encoding. So whenever you get data from the outside, you need to know what encoding it is in and then you decode it to convert it to a string. Conversely, whenever you want to actually output one of these strings, you need to know in what encoding you need the data and then encode that sequence of Unicode code points to any of these encodings.

You will never be able to convert a bunch of bytes into a string or vice versa without going through some explicit encoding/decoding.

This of course has some overhead associated with it, as you always have to do the encoding and because operations on that internal collection of unicode code points might be slower than the simple array-of-byte-based approach.

And whenever you receive data in an encoding that cannot be represented with Unicode code points, and whenever you need to send out data in that encoding, you are screwed.

This is a deficiency in the Unicode standard. Unicode was specifically made so that it can be used to represent every encoding, but it turns out that it can't correctly represent some Japanese encodings.

3) Store an encoding with each string and expose the string's contents and the encoding

This is what Ruby 1.9 does. It combines methods 1 and 2: it allows you to choose whatever internal encoding you need, it allows you to convert from one encoding to another, and it removes the need to externally keep track of every string's encoding.

You can still use the language's string library functions, because they are aware of the encoding and usually do the right thing (minus, of course, bugs).

As this method is independent of the (broken?) Unicode standard, you would never get into the situation where just reading data in some encoding makes you unable to write the same data back in the same encoding; in this case, you would just create a string tagged with the problematic encoding and do your work on that.

Nothing prevents the author of the linked post from using Ruby 1.9's facilities to do exactly what Python 3 does (of course, again, ignoring the Unicode issue) by internally keeping all strings in, say, UTF-16. You would transcode all incoming and outgoing data to and from that encoding. You would do all string operations on that application-internal representation.
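
A sketch of that setup, with UTF-8 standing in for the internal encoding and made-up file names, assuming 1.9's I/O transcoding works as documented:

  # Emulate the Python-3-style "one internal representation" by transcoding at
  # the I/O boundary; string operations then only ever see UTF-8.
  Encoding.default_internal = Encoding::UTF_8
  text = File.open("legacy.txt", "r:Shift_JIS") { |f| f.read }
  text.encoding   # => #<Encoding:UTF-8>
  File.open("out.txt", "w:Shift_JIS") { |f| f.write(text) }   # transcoded on write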

A language throwing an exception when you concatenate a Latin-1 string to a UTF-8 string is a good thing! You see: once that concatenation has happened by accident, it's really hard to detect and fix.

At least it's fixable, though, because not every Latin-1 string is also a valid UTF-8 string. But if it so happens that you concatenate, say, Latin-1 and Latin-8 by accident, then you are really screwed and there's no way to find out where the Latin-1 ends and the Latin-8 begins.

In today's small world, you want that exception to be thrown.

Conclusion

What I find really amazing about this complicated problem of character encoding is the fact that nobody feels it's complicated, because it usually just works - especially method 1 described above, which has constantly been used in years past and is also very convenient to work with.

Also, it still works.

Until your application leaves your country and gets used in countries where people don't speak ASCII (or Latin1). Then all these interesting problems arise.

Until then, you are annoyed by every one of the methods I described except method 1.

Then, you will understand what great service Python 3 has done for you and you'll switch to Python 3 which has very clear rules and seems to work for you.

And then you'll have to deal with the Japanese encoding problem, and you'll have to use binary bytes all over the place and stop using strings altogether, because just reading input data destroys it.

And then you might finally see the light and begin to care for the seemingly complicated method 3.

Sorry for the novel, but character encodings are a pet-peeve of mine.


I liked your response a lot. Thank you. And I was almost persuaded by it. In fact I was persuaded for about 5 minutes after reading it. But at the last second, a thought occurred to me: if there's a deficiency in Unicode that prevents its use for Japanese, isn't the right solution to just fix that deficiency?

I mean, Unicode is meant to be an abstract representation of glyphs, separate from any encoding, that works for all of Earth's languages. It's tailor-made to be a programming language's internal representation of a string. This is its raison d'être.

So it seems to me that #2 is definitely The Right Way™ and that if there's some problem with Unicode that has kept Ruby from adopting it, they should have worked on fixing it, rather than breaking Ruby. OK, "break" is probably too strong a word for the state of Ruby 1.9. And in the real world, fixing an international politicized standard like Unicode is probably impossible. So I can see that this pragmatic solution might have been the only one available. But still, it seems wrong to me.

Out of curiosity, what exactly is the deficiency in Unicode that caused Matz to go with option 3? I presume there are epic flamewars all over the internet about this issue, but I just haven't been paying close enough attention.


It seems like this would be a problem on Python as well, no?


Python "cheats" by converting everything to Unicode internally. It seems like a simple solution, but it's not a solution since not everything can be converted safely back to the original encoding.


Python 3 doesn't use an internal encoding, it uses Unicode. Python 2 can use Unicode strings if you specify them. I'm not familiar enough with Ruby to compare the two, but I don't see a problem with how Python handles it. You sometimes need to know about your string encoding, which is just a fact of modern programming.

[EDIT] I think I missed my point slightly. Python 3 doesn't change the encoding of strings, it decodes them to Unicode. You can encode the string back to the original encoding without loss.


If a CP932 '\' is interpreted as '¥', but is exported as CP932 '¥', there's data loss. Unless Python 3 keeps the original data around when it converts to Unicode (probably UTF-8 or UTF-16), there will be data loss in those cases. It's unavoidable.


UTF-\d is not unicode. Unicode isn't an encoding, it's the decoded representation of a character encoding.

    [edit] clarified example
    >>> s = u'\xa5' # shiftjis decoding of \
    >>> print s.encode('shiftjis')
    \
    >>> print s.encode('utf-8')
    ¥


He has posted some alternatives too

http://github.com/candlerb/string19/blob/master/alternatives...

The first suggestion seems like the logical solution to me; however, I don't need to deal with this stuff on a day-to-day basis...


Extrapolating, the same rant applies to all duck-typed languages: the number of possible outcomes for a = b + c explodes depending on the types and contents of a, b, and c; therefore his first assumption about a one-dimensional space is incorrect.

This is why most C++ teams prohibit their members from overloading operators.


Only if your static types for strings include a separate type per encoding. Most languages, including the duck-typed ones, just use a single encoding internally, UTF-8 for example. So the String type is always compatible. For Ruby, though, it sounds like the string type could be in any encoding. That, to me, sounds dangerous. It's not caused by the language being duck-typed; it's caused by the language having the same type with multiple behaviours. Ruby made strings into a minefield for 99% of programmers and safer for 10%.

Not sure that's a good ratio.


Precisely. This sounds like an argument for the superiority of static typing more than anything.


AOL. 1.9's encoding support feels thoroughly wrong to me.


"even Ruby's creators, who are extremely bright people"

There's a hidden "it turns out that" in this sentence.



