
The issue is CSV is not compatible with CSV. It's not possible to write a spec that covers all CSV files in the world. CSV means things that are mutually incompatible in the less common cases, and the only way to really parse them correctly is to know which variant generated it. But you can't even tag that variant in the file by your criteria, as existing CSV parsers won't understand it.


Couldn't tools that read CSV files scan them first and see which variant best matches the file? The questions arise about which end of line character(s) are used and how double-quotes and commas are handled. There can't be that many ways to escape them, and there are three sets of end of line characters mentioned - just see which one is used (i.e. don't assume only \n if you run into \r\n or \r alone).
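For the line-ending part, a scan like that could look roughly like the sketch below. The function names are made up for illustration, and it deliberately ignores newlines that sit inside quoted fields, so it is only a first guess at which terminator a file uses.

```python
from collections import Counter
from typing import Optional


def count_line_endings(data: bytes) -> Counter:
    """Count CRLF, lone CR, and lone LF terminators in raw bytes."""
    counts: Counter = Counter()
    i = 0
    while i < len(data):
        if data[i:i + 2] == b"\r\n":
            counts["\r\n"] += 1
            i += 2  # consume both bytes so the LF is not counted twice
        elif data[i:i + 1] == b"\r":
            counts["\r"] += 1
            i += 1
        elif data[i:i + 1] == b"\n":
            counts["\n"] += 1
            i += 1
        else:
            i += 1
    return counts


def guess_line_ending(data: bytes) -> Optional[str]:
    """Return the most frequent terminator, or None if there is none."""
    counts = count_line_endings(data)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    print(repr(guess_line_ending(b"a,b\r\nc,d\r\n")))
```

If more than one key in the counter is non-zero, the file mixes terminators, and picking the most common one is already a judgment call.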

The assumption that most software uses is that the import file will be in the same variant of the format as what that tool exports. That seems to be more of a problem than anything else.


> Couldn't tools that read CSV files scan them first and see which variant best matches the file?

Sure, and they sometimes do that if they have to ingest CSVs whose origin they don't control (although not every system implementor cares enough to do it).

But that's still just a bunch of shitty fallible heuristics which would not be necessary if the format was not so horrible.


It also doesn't prevent a human or other system doing:

    cat input1.csv input2.csv > output.csv

resulting in a single file containing multiple formats.

Also, what variant is this:

    1,5,Here is a string "" that does stuff,2021-1-1
What is the value of the third column?

Is this a CSV file without quoting? Then it's

    Here is a string "" that does stuff
Or is it a CSV file with double quote escaping? Then it's

    Here is a string " that does stuff
This is fundamentally undecidable without knowledge of what the format is.
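To make the two readings concrete, here is a small sketch with two toy parsers. Both functions are hypothetical and deliberately simplified (neither handles quoted fields that contain commas); they only model the two interpretations of the line above.

```python
LINE = '1,5,Here is a string "" that does stuff,2021-1-1'


def parse_no_quoting(line: str) -> list:
    # Variant 1 (hypothetical): quote characters have no special meaning.
    return line.split(",")


def parse_doubled_quote_escape(line: str) -> list:
    # Variant 2 (hypothetical): a doubled quote stands for one literal quote.
    return [field.replace('""', '"') for field in line.split(",")]


if __name__ == "__main__":
    print(parse_no_quoting(LINE)[2])
    print(parse_doubled_quote_escape(LINE)[2])
```

Both parsers accept the line without complaint and return four columns, so nothing in the input tells you which result is the intended one.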

You can decide to just assume RFC compliant CSVs in the event of ambiguity, but then you absolutely will get bugs from users with non-RFC compliant CSV files.


That's true. You could scan the file and see if there are any other types of double quote escaping happening, but if there aren't any, that wouldn't help either. It's also negated by the point about multiple formats in the same file.

So, yeah. Can't really be done without making too many assumptions that will break later.


>Couldn't tools that read CSV files scan them first and see which variant best matches the file?

Yes. And my software does that. But it is always going to be a guess which the user needs to be able to override.
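As an illustration of "guess, but let the user override", here is a minimal sketch. It is not how any particular product works; `guess_delimiter`, its candidate list, and its scoring rule are all assumptions made up for this example.

```python
from typing import Optional


def guess_delimiter(sample: str,
                    candidates: str = ",;\t|",
                    override: Optional[str] = None) -> str:
    """Guess a delimiter from a text sample; an explicit override always wins."""
    if override is not None:
        return override
    lines = [ln for ln in sample.splitlines() if ln]
    best, best_score = candidates[0], -1
    for delim in candidates:
        counts = [ln.count(delim) for ln in lines]
        # A candidate only scores if it appears the same non-zero
        # number of times on every line of the sample.
        if counts and counts[0] > 0 and len(set(counts)) == 1:
            score = counts[0]
        else:
            score = 0
        if score > best_score:
            best, best_score = delim, score
    return best


if __name__ == "__main__":
    # Semicolon-delimited data with decimal commas fools the heuristic.
    tricky = "1,5;2,5;3,5\n4,0;5,0;6,0\n"
    print(repr(guess_delimiter(tricky)))
    print(repr(guess_delimiter(tricky, override=";")))
```

In the tricky sample the comma appears three times per line and the semicolon only twice, so the heuristic picks the comma even though the data was meant to be semicolon-delimited. That is the kind of case where the override is the only way out.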


By this argument UTF8 can't exist. And yet here it is.

PS: I never said 100% forward/backward compatible with all variants at the same time and without any noticeable artifact. I meant compatible in a non-blocking way.


What are you talking about? UTF8 is a single well-defined specification, and detecting that data is definitely not UTF8 is trivial.
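For what it's worth, the "definitely not UTF8" check really is a few lines in most languages. A minimal sketch in Python, where `is_valid_utf8` is just a name picked for this example:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return False if the bytes cannot be UTF-8; True if they decode cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(is_valid_utf8("h\u00e9llo".encode("utf-8")))  # UTF-8 encoded text
    print(is_valid_utf8(b"caf\xe9"))                    # Latin-1 encoded text
```

Note the asymmetry: a failure proves the data is not UTF-8, while a success only means it could be.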


And yet it is forward/backward compatible with ASCII and non-blocking against all its ill-defined variants.


ASCII was well defined; CSV was not. So the UTF-8 designers could take the high bit, which they knew was unused per the ASCII spec, and use it to encode the extra UTF-8 information.

Also, UTF-8/ASCII compatibility is unidirectional. A tool that only understands ASCII is going to print nonsense when it encounters emoji or whatever in UTF-8. Even the idea that tools that only understand ASCII won't mangle UTF-8 is limited. Sure, dumb passthroughs are fine, but if the tool manipulates the text at all, you're out of luck: what does it mean to uppercase the first byte of a flag emoji?
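A quick sketch of the difference between passthrough and manipulation. The "ASCII-era tool" here is a hypothetical function that treats one byte as one character; the sample string is just an example.

```python
# "naive" with a diaeresis, followed by a flag emoji made of two code points.
text = "na\u00efve \U0001F1EB\U0001F1F7"
raw = text.encode("utf-8")


def ascii_style_truncate(data: bytes, n: int) -> bytes:
    """Hypothetical ASCII-era tool: assumes one byte is one character."""
    return data[:n]


def decodes_as_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Every byte of a non-ASCII character has the high bit set,
# so none of them can be mistaken for an ASCII character.
high_bit_bytes = [b for b in raw if b >= 0x80]

# Dumb passthrough is safe: the bytes round-trip unchanged.
assert raw.decode("utf-8") == text

# Truncating after 3 bytes cuts the two-byte encoding of U+00EF in half.
cut = ascii_style_truncate(raw, 3)

if __name__ == "__main__":
    print(len(raw), len(high_bit_bytes))
    print(cut, decodes_as_utf8(cut))
```

The truncated result is no longer valid UTF-8 at all, which is a worse outcome than the tool merely printing something odd.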


To be fair, there is basically no way to manipulate arbitrary text at all without mangling it, UTF-8-aware or not. What does it mean to take the first 7 characters of a UTF-8 string which might contain combining characters and left-to-right special chars? What if the text uses special shaping chars, such as arranging hieroglyphs in cartouches? You basically need a text-rendering aware library to manipulate arbitrary strings.
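A small example of the "first N characters" problem, using only Python's standard `unicodedata` module. The string is an arbitrary example, and normalizing to NFC only helps in cases like this one where a precomposed form exists; it is not a general fix.

```python
import unicodedata

# An accented e written as "e" followed by U+0301 COMBINING ACUTE ACCENT.
s = "e\u0301tude"

# Slicing works on code points, so the accent is cut off its base letter.
first = s[:1]

# NFC composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", s)

if __name__ == "__main__":
    print(len(s), len(composed))
    print(first == "e", composed[:1] == "\u00e9")
```

The same visible word has six code points in one form and five in the other, so "the first character" depends on a representation the user never sees.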



