I know it's a trade-off, but IMHO it's a very poor one.
That's premature optimisation. The API is forever. This decision sacrificed easy internationalisation and data correctness for a minor performance benefit in the current implementation.
It's a big deal, because node.js isn't merely encoding-ignorant (like PHP), it actually removes the high bits. If you forget to specify an encoding somewhere, your text will be malformed.
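A quick sketch of that mangling, using the current Buffer API (older Node spells it new Buffer(), but the behaviour is the same):

    // UTF-8 bytes for "café": 63 61 66 c3 a9
    var buf = Buffer.from('café', 'utf8');

    // Decoding those bytes as 'ascii' drops the high bit of each byte,
    // so the two bytes of "é" become two unrelated ASCII characters.
    console.log(buf.toString('ascii')); // "cafC)" (the accent is lost for good)
    console.log(buf.toString('utf8'));  // "café"  (round-trips correctly)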
ASCII is 7-bit (encoded in 8 bits - the high bit is ignored) and UTF-8 takes 8 bits for most characters, but can take 16+ bits for some characters.
Node is built for massive scalability in applications that (mostly) pass text from one source to another. Thus, having to convert the encoding of every string that passes through node can be a bottleneck.
It should be noted that "most" here presumably means "most characters in an average English or western/central European language text" as out of the ~2^21 (~2 million) Unicode code points, only 128 are represented using 8 bits in UTF-8.
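For a rough feel of the spread (just an illustrative snippet):

    // Only the 128 ASCII code points fit in a single UTF-8 byte;
    // everything else takes two to four bytes.
    console.log(Buffer.byteLength('a', 'utf8')); // 1 (ASCII)
    console.log(Buffer.byteLength('é', 'utf8')); // 2 (Latin-1 supplement)
    console.log(Buffer.byteLength('€', 'utf8')); // 3 (BMP above U+07FF)
    console.log(Buffer.byteLength('𐍈', 'utf8')); // 4 (outside the BMP)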
It doesn't matter. Whenever ASCII is an option, UTF-8 is optimal too.
ASCII is only an option for plain English with poor typography and no way to handle foreign names and addresses (e.g. LinkedIn made the horrible mistake of using Latin-1 initially; I still have contacts with &xxxx; visible in their names).
I think node.js should use UTF-8 by default, and require users to consciously switch bottleneck parts of their apps to ASCII.
I wasn't stating my opinion in my last post, just facts/clarifications.
But yes, I agree that UTF-8 would be a better default than ASCII unless someone provides hard evidence that encoding/decoding is a severe performance bottleneck in most real applications. (Even then, I'd default to the correct option, not the fastest.)
Felix doesn't seem to realize that JavaScript already has native functions for this. All of his code can be simplified to decodeURIComponent(escape(utf8ByteString)).
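For anyone who hasn't seen that trick: escape() percent-encodes each byte of the byte string, and decodeURIComponent() then reads those %XX escapes back as UTF-8. Roughly:

    // A JS "byte string": each char code is one raw UTF-8 byte (here, "café").
    var utf8ByteString = '\x63\x61\x66\xc3\xa9';

    // escape() -> "caf%C3%A9", then decodeURIComponent() decodes that as UTF-8.
    var decoded = decodeURIComponent(escape(utf8ByteString));
    console.log(decoded); // "café"

    // The reverse direction works the same way:
    var encoded = unescape(encodeURIComponent('café'));
    console.log(encoded === utf8ByteString); // true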
It's not the size, it's the work needed to decode and encode data.
Node.js is really, really good at shunting I/O around - it's ideal for writing things like proxies and file upload handlers. With ASCII, the bytes that come in are the bytes that go out again. If you're dealing with UTF-8 and Unicode strings, then every time some data comes in you need to decode it as UTF-8, pass the Unicode string around within Node, then encode it back to bytes before you send it off again.
That makes a lot of sense for a web framework like Django (in fact it's what Django does) but Node is more of an I/O toolkit, so that performance overhead isn't welcome unless it's explicitly needed.
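Here's a hypothetical proxy-ish sketch of the difference (example.org and the handler shape are made up): in the first form the chunks stay as raw bytes; in the second, every chunk gets decoded coming in and re-encoded going out.

    var http = require('http');

    // Pass-through: chunks stay as raw Buffers, no decode/encode work at all.
    http.createServer(function (req, res) {
      var upstream = http.request(
        { host: 'example.org', path: req.url, method: req.method, headers: req.headers },
        function (proxied) {
          res.writeHead(proxied.statusCode, proxied.headers);
          proxied.pipe(res); // bytes in, bytes out
        }
      );
      req.pipe(upstream);
    }).listen(8080);

    // The "strings everywhere" style pays for a UTF-8 decode on every chunk
    // in and an encode on every chunk out:
    //   proxied.setEncoding('utf8');
    //   proxied.on('data', function (chunk) { res.write(chunk, 'utf8'); });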
Modern CPUs are constrained by memory speed, and the amount of computation you do on each byte doesn't matter that much. Node.js already takes the hit by copying memory to convert UTF-16 to ASCII.
That should really be in big red text in the docs, considering it actually destroys bits. The API also seems inconsistent: net.Stream writes are encoded as ASCII by default, but plain writable streams default to utf8: stream.write(string, encoding='utf8', [fd])
If you want to keep track of hot, fresh node-y goodness independently of the Ubuntu release cycle (as I do), then please enjoy my nodejs PPA builds.
They're built for lucid, but run fine on maverick, and like everything else in that PPA, are used in production (thus I have an incentive to maintain them well).
https://launchpad.net/~jdub/+archive/ppa
Enjoy!
(Note: I build a static version of node, against the internal copies of the libraries it ships with, rather than the dynamic build used by the main Debian and Ubuntu node packages. I really only do this to avoid maintaining those libraries in my PPA as well, and ryah keeps up with their updates anyway.)