I know nothing about the author, but there are some statements made that suggest...

pvg · on March 2, 2010

The roundtrip thing is an edge case that doesn't really justify inflicting the non-deterministic pain on everyone. Python 3 and Java have taken the 'one true internal encoding' path, and while hardly free of warts, it's an approach that is saner in practice. The alternative is making some people's hell everyone's hell, forever.

halostatue · on March 2, 2010

"Hardly free of warts" doesn't even begin to cover the pain that's dealt with if you have to deal with these external encodings.

And, if you've got loads of data in an encoding that doesn't roundtrip, it's hardly an edge case.

Ruby's implementation is supposed to be such that if you want UTF-8 support and know that your (text) inputs and outputs are always going to be UTF-8, you never have to think anything differently than you did in Ruby 1.8. If it isn't working that way, then I think there's a bug.
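To make the "UTF-8 everywhere" case concrete, here is a minimal sketch, assuming Ruby 1.9 or later. The helper name `describe` is made up for illustration and is not part of Ruby. It just reports the encoding tag Ruby attaches to a string and how it counts characters versus bytes.

```ruby
# Hypothetical helper (not a Ruby built-in): summarise how Ruby sees a string.
def describe(str)
  {
    encoding: str.encoding.name, # the tag Ruby 1.9+ attaches to every String
    chars: str.length,           # counted in characters
    bytes: str.bytesize,         # counted in raw bytes
    valid: str.valid_encoding?   # do the bytes match the tag?
  }
end

# A literal containing a \u escape is always tagged UTF-8.
info = describe("caf\u00e9")
puts info.inspect
# encoding is "UTF-8", chars is 4, bytes is 5, valid is true
```

As long as every input and output really is UTF-8, the tag never changes and nothing else needs attention. The surprises start only when a string with a different tag enters the program.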

wooster · on March 2, 2010

All good points. However, there are more issues with dealing with strings in the wild than just round-tripping between encodings.

Many of the issues I've dealt with when data mining involved mixed encodings within the same document, documents labelled with the wrong encoding in the metadata, and documents with no encoding information. There's only so much you can do as far as sniffing character sets and languages to avoid mojibake and other, more subtle, problems.
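The mislabelled-document case can be sketched roughly like this in Ruby 1.9+. Both helper names (`misread_as_latin1`, `to_utf8`) and the choice of ISO-8859-1 as the fallback are assumptions for illustration. The fallback is a guess, not real charset detection, and the sketch does not handle mixed encodings within one document.

```ruby
# Hypothetical helper: show the mojibake produced when UTF-8 bytes are
# wrongly labelled as ISO-8859-1 and then converted to UTF-8.
def misread_as_latin1(utf8_text)
  utf8_text.b.force_encoding("ISO-8859-1").encode("UTF-8")
end

# Hypothetical helper: trust the claimed label only if the bytes are valid
# under it; otherwise fall back to a guessed single-byte encoding.
def to_utf8(bytes, claimed = "UTF-8", fallback = "ISO-8859-1")
  labelled = bytes.dup.force_encoding(claimed)
  return labelled.encode("UTF-8") if labelled.valid_encoding?

  # The label was wrong. Assumption: the data is really in the fallback.
  bytes.dup.force_encoding(fallback).encode("UTF-8")
end

puts misread_as_latin1("caf\u00e9") # the e-acute turns into two characters
puts to_utf8("caf\xE9".b)           # Latin-1 bytes labelled UTF-8 are recovered
```

Note that `valid_encoding?` only catches labels that produce invalid byte sequences. Text in one single-byte encoding labelled as another will usually pass the check and still come out wrong.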

For my purposes, converting to Unicode and losing round-tripping is only a minor concern, whereas dealing with non-Unicode encodings is often a source of major problems.

So, personally, having worked both in languages that deal with strings by converting them internally to Unicode, and ones that treat them as encoding-tagged byte streams, I definitely favor the ones that deal with them as Unicode. But, my purposes aren't everyone's, and I'm not convinced there's a paradigm that would suit both usage patterns.
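For anyone who hasn't hit it, the difference between the two models shows up when strings with different tags meet. A rough sketch, assuming Ruby 1.9+; `try_concat` is a made-up helper, and transcoding to UTF-8 at the boundary is only one way to imitate the "convert to Unicode internally" approach.

```ruby
# Hypothetical helper: concatenate two strings, or report that their
# encoding tags are incompatible.
def try_concat(a, b)
  a + b
rescue Encoding::CompatibilityError
  :incompatible
end

latin1 = "caf\u00e9".encode("ISO-8859-1") # tagged ISO-8859-1, non-ASCII content
utf8   = " na\u00efve"                    # tagged UTF-8, non-ASCII content

# Encoding-tagged byte streams: mixing the two fails at runtime.
p try_concat(latin1, utf8)

# "One internal encoding" style: transcode at the boundary, then mix freely.
p try_concat(latin1.encode("UTF-8"), utf8)
```

The first call fails only because both strings contain non-ASCII characters. With pure ASCII content the same code works, which is why this kind of error tends to appear late and depend on the data.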

xtho · on March 3, 2010

There is no need to make an internal encoding comply with UTF-8 or any other standardized encoding, since it is internal. The point is that the existing solution simply doesn't work well.

I personally won't upgrade to 1.9 if they don't fix that. Even with simple code snippets, the Ruby 1.9 solution has caused too much pain for me to consider it a viable option. I would personally rather switch to Groovy or Python than to Ruby 1.9. The way Ruby 1.9 handles encodings sucks. Period.

BTW the author has written on that subject several times and he knows it quite well.