Byte order marks are wrong and not part of 'clean' 8-bit handling. All applicati...

jefffan241 · on Dec 14, 2016

I wasn't disagreeing with anything; I was just saying how I handle it with generated CSVs. I do not disagree in the slightest that byte order marks are wrong and not part of clean 8-bit (UTF-8) handling. BUT, when you have a client and you need to generate a CSV for them with characters that are only valid in UTF-8, and that client's program will only open the file correctly if you add the byte order mark, then you add the byte order mark.
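
For what it's worth, here is a minimal sketch of that in Python, assuming the consumer only needs the three BOM bytes at the start of the file. The function name `csv_bytes_with_bom` and the sample rows are made up for illustration; `utf-8-sig` is the standard-library codec that prepends the BOM.

```python
import csv
import io


def csv_bytes_with_bom(rows):
    """Serialize rows to CSV and return UTF-8 bytes prefixed with a BOM."""
    buf = io.StringIO(newline="")  # let the csv module control line endings
    csv.writer(buf).writerows(rows)
    # The "utf-8-sig" codec prepends the bytes EF BB BF when encoding.
    return buf.getvalue().encode("utf-8-sig")


# Hypothetical sample data with non-ASCII characters.
data = csv_bytes_with_bom([["name", "city"], ["Zoë", "Kraków"]])
print(data[:3])  # b'\xef\xbb\xbf'
```

Dropping the BOM again for well-behaved consumers is just a matter of encoding with `utf-8` instead.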

joncrocks · on Dec 14, 2016

Which, ASCII or UTF-8?

If you have a file without a BOM, you have to pick one.

As every 8-bit combination is an ASCII character of some kind, you can interpret every UTF-8 character as a combination of ASCII characters. And what you output will be different to what was input (unless you restrict yourself to single-byte UTF-8 characters).

Without some other way of indicating the encoding format of a file, a BOM can be a tool to indicate "It's probably encoded using UTF-X".

caf · on Dec 14, 2016

What? No. ASCII is a 7-bit encoding: only bytes with the top-bit zero are valid ASCII, and all of those bytes represent exactly the same character in UTF-8. UTF-8 is a strict superset of ASCII and this is not by accident.
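
A quick way to convince yourself, sketched in Python (the helper name `is_ascii` and the sample strings are mine, not anything standard): bytes below 0x80 decode identically as ASCII and as UTF-8, and any non-ASCII character encodes to bytes with the top bit set.

```python
def is_ascii(data: bytes) -> bool:
    """True if every byte has the top bit clear, i.e. is valid 7-bit ASCII."""
    return all(b < 0x80 for b in data)


plain = b"plain 7-bit text"
# Pure ASCII bytes mean the same thing under both decoders.
same = plain.decode("ascii") == plain.decode("utf-8")

# A non-ASCII character becomes a multi-byte sequence of high-bit bytes.
accented = "é".encode("utf-8")  # two bytes: C3 A9
print(is_ascii(plain), same, is_ascii(accented))
```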

jcranmer · on Dec 14, 2016

UTF-8 has many nice properties. One of the nicest is that most random binary strings are not valid UTF-8. In fact, the structure of UTF-8 strings is such that, if a file parses as UTF-8 without error, then it is almost certainly UTF-8.
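
As a rough sketch of that check in Python (the function name `looks_like_utf8` is made up, and "parses without error" here just means a strict decode succeeds):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8 without error."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


# The same word in two encodings: only the UTF-8 bytes pass the check.
print(looks_like_utf8("Kraków".encode("utf-8")))
print(looks_like_utf8("Kraków".encode("cp1252")))
```

Note that pure ASCII input passes too, which is fine for the reason given in the next paragraph.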

If it's merely ASCII, it doesn't matter. Nearly every charset contains all valid ASCII texts as a strict subset. UTF-7, UTF-16, UTF-32, and EBCDIC are the major counterexamples, and UTF-7 and EBCDIC aren't going to come up unless you're actually expecting them to. (Technically, ISO-2022 charsets can introduce non-ASCII characters without use of high bit characters, since they switch modes using the ASCII ESC character as part of the sequence. In practice, ESC isn't going to come up in most ASCII text and ISO-2022-JP (the only one still in major use) will frequently use 8-bit characters anyways).

The only useful purpose of a BOM is to distinguish between UTF-16LE and UTF-16BE, and even then it's discouraged in favor of actually maintaining labels (or not storing in UTF-16 in the first place). You can detect UTF-8 in practice without a BOM quite easily, and it's only Microsoft who feels obliged to need them.
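
To illustrate the UTF-16 case, a small Python sketch (the helper `sniff_utf16_bom` is hypothetical, and it deliberately ignores UTF-32, whose little-endian BOM starts with the same two bytes):

```python
import codecs


def sniff_utf16_bom(data: bytes):
    """Return 'utf-16-le' or 'utf-16-be' based on a leading BOM, else None.

    Assumption: the input is not UTF-32, whose LE BOM also begins FF FE.
    """
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "utf-16-be"
    return None


le = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
be = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
print(sniff_utf16_bom(le), sniff_utf16_bom(be), sniff_utf16_bom(b"hi"))
```

Python's generic `utf-16` codec does the same thing internally when decoding: it reads the BOM to pick the byte order and strips it from the result.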

LukeShu · on Dec 14, 2016

As caf said, ASCII is a 7-bit encoding.

However, the question "which?" can still apply. There are many encodings that are a superset of ASCII. UTF-8 is a superset of ASCII, but so are ISO-8859-X (for any "X"), Windows-1252, and many others.
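
A small Python sketch of why "which?" matters (the helper `decode_each` and the sample bytes are made up for illustration): the ASCII part of the input comes out the same everywhere, but the high bytes do not.

```python
def decode_each(raw: bytes, encodings):
    """Decode raw under each encoding; map failures to None."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# "caf" is ASCII; 0xE9 and 0x80 are where the encodings disagree.
raw = b"caf\xe9 \x80"
results = decode_each(raw, ["utf-8", "iso-8859-1", "cp1252"])
for name, text in results.items():
    print(name, repr(text))
```

Here the bytes are not valid UTF-8 at all, ISO-8859-1 reads 0x80 as a C1 control character, and Windows-1252 reads it as the euro sign.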

joncrocks · on Dec 14, 2016

Gah, yeah, you're right, I was thinking of CP-1252.

When I've had problems with this in the past, it's been around Windows machines, which love their own encoding formats.