CSV is a great example of this phenomenon. There is a "spec" RFC4180 and there are tools that generate CSV files that do not technically conform to the spec. One such tool is Excel. For most users, Excel is doing the right thing. Blaming Excel for not handling CSV files according to the spec is passing the buck. The CSV tools that are worth using generally take great pains to work with Excel files at the cost of some ideological purity.

IMHO it's a reflection of the software developers involved. The best tools, the ones we turn to time and time again, generally just work.




In fairness, usage of CSV-like formats pre-dates the CSV RFC by almost 30 years; the RFC was authored in 2005 specifically to try to formalize a de-facto spec:

> Surprisingly, while this format is very common, it has never been formally documented. [1]

[1] https://tools.ietf.org/html/rfc4180#section-1


> For most users, Excel is doing the right thing. Blaming Excel for not handling CSV files according to the spec is passing the buck.

No. Excel is wrong when it comes to CSV.

Paste a Unicode string into Excel, e.g. Beijing in Simplified Chinese (北京市). Now Save As Windows CSV as beijing.csv. Close the file. Open beijing.csv. The cell now reads `___` (on Excel for Mac 2011; maybe they've since deigned to fix it).

Excel just outputs bad data.
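
For anyone curious about the mechanism, here's a sketch in Python (the `mac_roman` codec here is an assumption, standing in for whatever legacy codepage Excel for Mac 2011 actually writes):

    # Sketch: forcing Unicode text through a legacy single-byte codepage.
    text = "北京市"  # Beijing in Simplified Chinese

    # None of these characters exist in Mac OS Roman, so the encoder can
    # only substitute placeholders -- the original data is unrecoverable.
    print(text.encode("mac_roman", errors="replace"))  # b'???'

    # A UTF-8 round trip, by contrast, is lossless.
    assert text.encode("utf-8").decode("utf-8") == text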


I don't know how you'd do it in Excel, but if you generate a CSV with UTF-8 data you can add a byte order mark[0] at the start of the file and it will render correctly.

Once you add that, Excel will open the file with UTF-8 encoding (if you use the UTF-8 byte order mark, obviously). I haven't tried with other UTF-* encodings.

Again, I don't know how to tell Excel to add that, though :/ I've only had to deal with Arabic in generated CSVs. There's a sketch of the generation side below.

[0] https://en.wikipedia.org/wiki/Byte_order_mark
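
For the generation side, a minimal sketch in Python (file name and contents are just examples); the `utf-8-sig` codec writes the three BOM bytes EF BB BF automatically:

    import csv

    # "utf-8-sig" prepends the UTF-8 BOM (EF BB BF), which is the hint
    # Excel needs to open the file as UTF-8 rather than a legacy codepage.
    with open("cities.csv", "w", encoding="utf-8-sig", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["city", "name"])
        writer.writerow(["Beijing", "北京市"])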


Byte order marks are wrong and not part of 'clean' 8-bit handling.

All applications and operating systems should assume files WITHOUT a BOM are either ASCII, or the superset thereof, UTF-8.

Please remember to say WHY you disagree if you do.


I wasn't disagreeing with anything; I was just saying how I handle it with generated CSVs. I don't disagree in the slightest that byte order marks are wrong and not part of clean 8-bit handling. BUT when you have a client, and you need to generate a CSV for them with characters that are only valid in UTF-8, and that client's program will only open the file correctly if you add the byte order mark, then you add the byte order mark.


Which, ASCII or UTF-8?

If you have a file without a BOM, you have to pick one.

As every 8 bit combination is an ASCII character of some kind, you can interpret every UTF-8 character as a combination of ASCII characters. And what you output will be different to what was input (unless you restrict yourself to single-byte UTF-8 characters).

Without some other way of indicating the encoding format of a file, a BOM can be a tool to indicate "It's probably encoded using UTF-X".


What? No. ASCII is a 7-bit encoding: only bytes with the top-bit zero are valid ASCII, and all of those bytes represent exactly the same character in UTF-8. UTF-8 is a strict superset of ASCII and this is not by accident.
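
A two-line check makes this concrete (a Python sketch, not from the thread):

    # All 128 ASCII bytes decode to the same characters under both codecs:
    ascii_bytes = bytes(range(128))
    assert ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")

    # ...and any byte with the top bit set is simply not ASCII:
    try:
        b"\x80".decode("ascii")
    except UnicodeDecodeError:
        print("0x80 is not ASCII")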


UTF-8 has many nice properties. One of the nicest is that most random binary strings are not valid UTF-8. In fact, the structure of UTF-8 strings is such that, if a file parses as UTF-8 without error, then it is almost certainly UTF-8.

If it's merely ASCII, it doesn't matter. Nearly every charset contains all valid ASCII texts as a strict subset. UTF-7, UTF-16, UTF-32, and EBCDIC are the major counterexamples, and UTF-7 and EBCDIC aren't going to come up unless you're actually expecting them to. (Technically, ISO-2022 charsets can introduce non-ASCII characters without use of high bit characters, since they switch modes using the ASCII ESC character as part of the sequence. In practice, ESC isn't going to come up in most ASCII text and ISO-2022-JP (the only one still in major use) will frequently use 8-bit characters anyways).

The only useful purpose of a BOM is to distinguish between UTF-16LE and UTF-16BE, and even then it's discouraged in favor of actually maintaining labels (or not storing in UTF-16 in the first place). You can detect UTF-8 in practice without a BOM quite easily, and it's only Microsoft who feels obliged to use them.
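
For illustration, detection can be as simple as attempting a decode (a Python sketch; `looks_like_utf8` is a hypothetical helper, not a standard function):

    def looks_like_utf8(data: bytes) -> bool:
        """If the bytes decode cleanly as UTF-8, they almost certainly
        are UTF-8 -- random binary rarely passes validation."""
        try:
            data.decode("utf-8")
            return True
        except UnicodeDecodeError:
            return False

    assert looks_like_utf8("北京市".encode("utf-8"))
    # UTF-16 output starts with 0xFF or 0xFE, neither of which can
    # appear anywhere in valid UTF-8:
    assert not looks_like_utf8("北京市".encode("utf-16"))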


As caf said, ASCII is a 7-bit encoding.

However, the question "which?" can still apply. There are many encodings that are a superset of ASCII. UTF-8 is a superset of ASCII, but so are ISO-8859-X (for any "X"), Windows-1252, and many others.


Gah, yeah, you're right, I was thinking of CP-1252.

When I've had problems with this in the past, it's been around Windows machines, which love their own encoding formats.


Excel for Mac can't process the BOM correctly.


I did not know that. I'm on Linux and the client is on Windows, so I've only had experience with it on those two OSes. Good to know.


IIRC, it was not fixed at release for Excel for Mac 2016, but there is now an option to Save As "CSV UTF-8 (Comma Separated)".


That's amazing. Finally.

I don't suppose I should get my hopes up too much that that option is going to be more prominent than the terrible default of saving as Mac OS Roman, right? (Whoever decided that Excel on OS X should export CSVs in an obsolete encoding for Mac OS Classic must have been trying to hurt Mac users.)


I tried it out on Office 2016/365/15.25 whatever on OS X. It works when writing data out. That's great news! And they also put it high up in the possible formats to output.

Unfortunately, opening the UTF-8 file again doesn't work: "北京市" comes back garbled. You have to make a new workbook and then import the CSV, specifying UTF-8 CSV. It's pretty messed up, but at least it's possible now.


You can make it work with Unicode (see e.g. https://gist.github.com/mnarayan01/61bedce2b55e258d8f8c); it's just an enormous PITA.
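
For reference, one widely used workaround (an assumption about the general approach, not necessarily what that gist does) is to hand Excel UTF-16 with a BOM and tab delimiters instead of a comma-separated UTF-8 file:

    import csv

    # Sketch: UTF-16 with BOM, tab-delimited. Excel versions that choke
    # on UTF-8 CSVs have generally opened this as Unicode.
    rows = [["city", "name"], ["Beijing", "北京市"]]
    with open("cities.txt", "w", encoding="utf-16", newline="") as f:
        csv.writer(f, delimiter="\t").writerows(rows)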



