
I know nothing about the author, but there are some statements made that suggest that the author hasn't had to deal with the wild-and-woolly reality of encodings out there in a lot of extant data. One only wishes that all data were UTF-8.

What Ruby 1.9 gets absolutely right is that its String implementation is completely encoding agnostic (by which I specifically mean that it doesn't force your data to be encoded in a particular way). There are encodings for which there is no safe UTF-8 roundtrip (you can successfully convert the data to UTF-8 nicely, but when you convert back from UTF-8 to that encoding, you won't get the original input back; you'll get a slightly different output).
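To make "encoding agnostic" concrete, here is a minimal sketch of how a Ruby 1.9+ String carries its bytes plus an encoding tag. The sample text is my own choice, not from any real data set, and it happens to roundtrip cleanly; the troublesome encodings mentioned above are exactly the ones where the last comparison would fail.

```ruby
# A Ruby 1.9+ String is a byte sequence plus an encoding tag.
utf8 = "\u65E5\u672C\u8A9E"               # three kanji, stored as UTF-8
sjis = utf8.encode(Encoding::Shift_JIS)   # transcode: the bytes change

puts utf8.encoding   # the tag on the original string
puts sjis.encoding   # the tag on the transcoded copy
puts utf8.bytesize   # 3 bytes per character in UTF-8
puts sjis.bytesize   # 2 bytes per character in Shift_JIS
puts sjis.length     # still counted in characters, not bytes

# force_encoding only relabels; it never touches the bytes.
binary = sjis.dup.force_encoding(Encoding::BINARY)
same_bytes = (binary.bytes == sjis.bytes)

# For this particular text the roundtrip is lossless.
round_trip_ok = (sjis.encode(Encoding::UTF_8) == utf8)
puts same_bytes
puts round_trip_ok
```

The distinction that matters is encode (changes bytes, keeps meaning) versus force_encoding (keeps bytes, changes interpretation).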

Rubyists in Japan don't have the luxury of dealing with Unicode all the time; they still get lots of data in ShiftJIS and other encodings. (The same is true of Rubyists elsewhere, but since US-ASCII is a proper subset of UTF-8, most folks don't know the difference; Win1252 is a pain in the ass, though.) If you have to do ANY work with older data formats, you curse languages that force you to use UTF-8 all the time instead of letting you work with the native data.

Most developers don't think about i18n nearly enough in any case; there's a lot more to worry about that simply using Unicode doesn't solve for you. Even the developers of Ruby have to worry about the fact that LATIN SMALL LETTER E WITH ACUTE (U+00E9) is canonically equivalent to LATIN SMALL LETTER E (U+0065) followed by COMBINING ACUTE ACCENT (U+0301), even though the two are different code point sequences. Using Unicode by itself doesn't begin to address the capitalization of 'ß' ('SS', which isn't necessarily reversible) or the fact that in Turkish 'ı' capitalizes to 'I', but 'i' capitalizes to 'İ'. Don't EVEN get me started on number formatting...
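A quick sketch of those three cases. Note the assumption: this relies on String#unicode_normalize and full Unicode case mapping, which arrived in later Ruby releases (2.2 and 2.4 respectively), so it does not run on the 1.9 being discussed here.

```ruby
# Assumes Ruby 2.4 or later; Ruby 1.9 has neither of these features.
precomposed = "\u00E9"    # LATIN SMALL LETTER E WITH ACUTE
decomposed  = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# Plain comparison looks at code points, so the two forms differ.
naive_equal = (precomposed == decomposed)

# After normalizing both sides to NFC they compare equal.
nfc_equal = (precomposed.unicode_normalize(:nfc) ==
             decomposed.unicode_normalize(:nfc))

# Sharp s upcases to two letters, and downcasing does not bring it back.
sharp_up   = "\u00DF".upcase
sharp_back = sharp_up.downcase

# Turkic case mapping is opt-in; the default mapping ignores locale.
dotless_up = "\u0131".upcase(:turkic)   # dotless i
dotted_up  = "i".upcase(:turkic)        # dotted i, Turkic rules
default_up = "i".upcase                 # dotted i, default rules

puts naive_equal
puts nfc_equal
puts sharp_up, sharp_back
puts dotless_up, dotted_up, default_up
```

None of this depends on which encoding the bytes are stored in; these are properties of the text itself.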

EDIT: Added the last paragraph.




The roundtrip thing is an edge-case that doesn't really justify inflicting the non-deterministic pain on everyone. Python 3 and Java have taken the 'one true internal encoding' path and while hardly free of warts, it's an approach that is practically saner. The alternative is making some people's hell everyone's hell, forever.


"Hardly free of warts" doesn't even begin to cover the pain that's dealt with if you have to deal with these external encodings.

And, if you've got loads of data in an encoding that doesn't roundtrip, it's hardly an edge case.

Ruby's implementation is supposed to be such that if you want UTF-8 support and know that your (text) inputs and outputs are always going to be UTF-8, you never have to think anything differently than you did in Ruby 1.8. If it isn't working that way, then I think there's a bug.


All good points. However, there are more issues with dealing with strings in the wild than just round-tripping between encodings.

Many of the issues I've dealt with when data mining involved mixed encodings within the same document, documents labelled with the wrong encoding in the metadata, and documents with no encoding information. There's only so much you can do as far as sniffing character sets and languages to avoid mojibake and other, more subtle, problems.
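For the mislabelled or unlabelled case, the crudest form of sniffing looks roughly like the sketch below. The decode_to_utf8 helper and the choice of Windows-1252 as the fallback are my own assumptions for illustration, and this does nothing for documents that mix encodings internally.

```ruby
# Hypothetical helper: trust UTF-8 if the bytes validate, otherwise
# assume a single-byte fallback encoding (Windows-1252 by default).
def decode_to_utf8(bytes, fallback = Encoding::Windows_1252)
  candidate = bytes.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?

  # Bytes the fallback leaves unassigned become replacement characters
  # instead of raising.
  bytes.dup.force_encoding(fallback)
       .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end

# Curly quotes and an accented e as Windows-1252 bytes, with no label.
raw     = "\x93caf\xE9\x94".b
decoded = decode_to_utf8(raw)

# Bytes that are already valid UTF-8 pass through unchanged.
already_utf8 = decode_to_utf8("caf\u00E9".b)

puts decoded.encoding
puts decoded.valid_encoding?
```

The weakness is obvious: validation only tells you the bytes could be UTF-8, not that they are, and a wrong fallback guess silently produces mojibake rather than an error.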

For my purposes, converting to Unicode and losing round-tripping is only a minor concern, whereas dealing with non-Unicode encodings is often a source of major problems.

So, personally, having worked both in languages that deal with strings by converting them internally to Unicode, and ones that treat them as encoding-tagged byte streams, I definitely favor the ones that deal with them as Unicode. But, my purposes aren't everyone's, and I'm not convinced there's a paradigm that would suit both usage patterns.


There is no need to make an internal encoding comply with UTF-8 or any other standardized encoding, since it is internal. The point is that the existing solution simply doesn't work out well.

I personally won't upgrade to 1.9 if they don't fix that. Even with simple code snippets, the Ruby 1.9 solution has caused too much pain for me to consider it an eligible option. I would personally rather switch to Groovy or Python than to Ruby 1.9. The way Ruby 1.9 handles encodings sucks. Period.

BTW the author has written on that subject several times and he knows it quite well.



