
IRC messages have CRLF message delimiters (and ASCII space field delimiters) and no quoting mechanism in the protocol. They're delivered over a long-lived synchronized TCP stream. Does it just happen that no 8-bit sequence people normally want to send on IRC ever manages to collide with 0D:0Ah?

I haven't seen unicode messages on IRC channels, but I don't spend much time on IRC anymore, and so this is interesting new information for me --- but there's more to being 8-bit clean than simply supporting internationalized character sets.



> Does it just happen that no 8-bit sequence people normally want to send on IRC ever manages to collide with 0D:0Ah

This is not possible on a technical level, having nothing to do with IRC itself, but instead written into the encoding design of UTF8.

First of all "7 bit" physical communication never really existed in the age of TCP - the protocol has always moved 8 bits at a time around. The "7 bit" era refers to nobody actually agreeing what codepoints within x80 ~ xFF actually mean. This is even partially true today - not everything has agreed on speaking UTF8 (hi Win32 APIs).

On to the actual point of why neither 0x0D nor 0x0A will ever "manage to collide":

In a single-byte encoding (called codepages, https://en.wikipedia.org/wiki/Code_page#Noteworthy_code_page...) 0x0D always means just that, as pretty much all ASCII-derived codepages do... well, respect ASCII ( note - this does not touch on the horror of EBCDIC, which is alive and well today (2015) too ).

In the case of UTF8 any continuation byte can only carry values in the range of \x80 ~ \xBF, and any leading byte can carry values in the range \xC0 ~ \xF7. Every byte of a multi-byte sequence therefore has its high bit set. So no matter how you slice and dice things, the resulting UTF8 will have every ASCII character meaning itself (this includes \x0D and \x0A ), and the only ambiguity when mistakenly treated as any single-byte encoding would be in the "what do we do with the upper half of the 8-bit range" part ( \x80 ~ \xFF ). More info here: https://en.wikipedia.org/wiki/UTF-8#Description
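If you'd rather check than take my word for it, here is a rough Python sketch that brute-forces the claim. The helper names are made up for illustration, and it leans on Python's built-in "utf-8" codec rather than decoding bit patterns by hand:

```python
# Sketch: check that UTF-8 never emits a byte below 0x80 (so never
# 0x0D or 0x0A) when encoding a non-ASCII code point.
# Function names are illustrative only.

def utf8_has_ascii_byte(cp: int) -> bool:
    """True if the UTF-8 encoding of code point cp contains any byte < 0x80."""
    return any(b < 0x80 for b in chr(cp).encode("utf-8"))

def non_ascii_code_points():
    """Yield every code point above ASCII, skipping the UTF-16 surrogate
    range, which is not encodable as UTF-8."""
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue
        yield cp

# Collect any code point whose encoding would collide with ASCII.
offenders = [cp for cp in non_ascii_code_points() if utf8_has_ascii_byte(cp)]
print(len(offenders))
```

The list comes back empty: the only way to get a 0x0D or 0x0A byte in a UTF8 stream is to encode an actual CR or LF.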

True, other multibyte encodings are not so convenient: for example \N{MALAYALAM LETTER UU} ( U+0D0A, http://graphemica.com/%E0%B4%8A ) looks really bad CRLF-wise in both UTF16/UCS2 and UTF32.
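To make the contrast concrete, a small Python sketch that encodes that one code point three ways. The variable names are mine, and I'm using the big-endian codecs so no BOM gets prepended:

```python
# Sketch: encode U+0D0A under UTF-8, UTF-16 and UTF-32 and look for a
# raw CR LF byte pair in the output. Variable names are illustrative.
ch = chr(0x0D0A)

utf8 = ch.encode("utf-8")
utf16_be = ch.encode("utf-16-be")  # big-endian, no BOM
utf32_be = ch.encode("utf-32-be")  # big-endian, no BOM

for name, data in (("utf-8", utf8), ("utf-16-be", utf16_be), ("utf-32-be", utf32_be)):
    # b"\r\n" is the IRC message delimiter, bytes 0x0D 0x0A.
    print(name, data.hex(), b"\r\n" in data)
```

Only the UTF8 form is free of the delimiter. In the little-endian variants the two bytes are swapped, so they don't form CR LF in order, but each one still shows up as a bare control byte.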

But this is why UTF8 "won" for all intents and purposes. And this is also why "escaping" is not necessary under virtually any modern environment, so IRC lacking any such mechanism is not really relevant.

( No opinion worth sharing on the rest of the article/discussion ;)


Yeah, IRC is generally UTF-8 with some Windows-1252 mixed in.


Shift-JIS is still the most common encoding for Japanese channels (but not being able to embed 0D0A isn't an issue there either).



