It makes me cry how much time has been invested in formats like CSV's, TSV's, et...

dev_dull · on Sept 1, 2019

We’ve discussed this at length and I’m squarely in the DO NOT camp. The drawbacks of using non-readable meta characters exceed the benefits:

1. TSV can be read and imported by almost everything.

2. People can add and adjust TSV files from any editor.

3. What’s the way to insert meta characters again in Vim? And now nano? Argh, I’ll just try to copy and paste it. Ugh, that doesn’t work.

Just use CSV/TSV, folks. For anything more complicated, reach for a better serialization format (JSON, YAML), not a better delimiter.

cbsmith · on Sept 1, 2019

2. More accurately, people can screw up TSV files from any editor. Have you not seen embedded spaces used instead of tabs, totally messing everyone up?

cbsmith · on Sept 1, 2019

Like a lot of things, #1 is only partly true, particularly if you have embedded tabs.

The number of times I've had to deal with parsing errors because of embedded carriage returns, commas, tabs, etc., sometimes costing millions of dollars is just... upsetting.

It's like JSON... we say everything can parse it, but really what we've got is some approximation that will come back to haunt us... and once you've done all the work to make sure everything is precise and correct, you'd really have saved time if you'd just used the tools we already had in the first place.
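
For illustration, here is a minimal Python sketch of the embedded-tab and embedded-newline failure described above. The record is hypothetical, and the "naive" reader is an assumption about how ad hoc TSV code is often written, not a claim about any particular tool.

```python
import csv
import io

# Hypothetical record: the second field contains an embedded tab and newline.
row = ["42", "line one\tstill field two\nsecond line", "ok"]

# Naive writer and reader: join on tabs, then split on lines and tabs.
naive = "\t".join(row) + "\n"
naive_rows = [line.split("\t") for line in naive.splitlines()]
# One record has silently become two rows with the wrong field counts.

# The csv module quotes fields that contain the delimiter or a newline,
# so the same record survives a round trip.
buf = io.StringIO(newline="")
csv.writer(buf, delimiter="\t", lineterminator="\n").writerow(row)
quoted = buf.getvalue()
parsed_rows = list(csv.reader(io.StringIO(quoted, newline=""), delimiter="\t"))
```

The quoted form only helps if every consumer of the file also understands the quoting rules, which is the part that tends to go wrong in practice.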

mruts · on Sept 1, 2019

Dealing with tabs and carriage returns has sometimes cost you millions of dollars? If so, that sounds like a great story and I would love to hear it!

jlgaddis · on Sept 1, 2019

Especially if (as it sounds) it's happened more than once!

Annatar · on Sept 1, 2019

Both JSON and YAML are very difficult to construct parsers for.

To make matters worse, both formats are full of pitfalls:

http://seriot.ch/parsing_json.php https://arp242.net/yaml-config.html

Using either of these, in my opinion, is extremely misguided.

chucksmash · on Sept 5, 2019

Can't find a source for it now, but one of the design goals of JSON iirc was that it should be easy to parse.

You should have a go at writing a JSON parser to convince yourself it isn't hard. Can manually lex it in less than 150 lines of Python.
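
As a rough sketch of what such a lexer can look like (this is one possible regex-based approach, not the commenter's code, and it only tokenizes; it does not check that the tokens form a valid document):

```python
import re

# One alternative per token kind; the inner groups are non-capturing so that
# m.lastgroup always names the top-level alternative that matched.
TOKEN_RE = re.compile(r'''
    (?P<ws>[ \t\n\r]+)
  | (?P<punct>[{}\[\]:,])
  | (?P<literal>true|false|null)
  | (?P<number>-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?)
  | (?P<string>"(?:[^"\\\x00-\x1f]|\\(?:["\\/bfnrt]|u[0-9a-fA-F]{4}))*")
''', re.VERBOSE)


def lex(text):
    """Yield (kind, lexeme) pairs; raise ValueError where no rule matches."""
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}: {text[pos]!r}")
        if m.lastgroup != "ws":  # whitespace between tokens is skipped
            yield m.lastgroup, m.group()
        pos = m.end()


tokens = list(lex('{"a": [1, -2.5e3, true, null], "b\\n": "x\\u00e9"}'))
```

A parser built on top of this still has to deal with the edge cases the linked article catalogs (nesting depth, duplicate keys, number ranges), so the lexer being short does not settle the argument either way.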

dev_dull · on Sept 1, 2019

The flip side to that is that json and yaml parsers exist in every language, and would be more than capable of replacing any logic you’d find in a CSV.

Annatar · on Sept 2, 2019

CSV's are a data structure. They do not contain any logic.

tinus_hn · on Sept 3, 2019

Just use these formats if you want to be stuck in the last century when internationalization was that odd thing you could easily afford to ignore. Otherwise, use a real format.

cbsmith · on Sept 1, 2019

3. If you don't know how to use your editor, maybe you should learn. If your editor can't insert even all the characters in basic ASCII, it's not an editor.

dev_dull · on Sept 1, 2019

Go ahead and give every engineer who comes across your meta-delimited file a little rundown on how it works. "Can you show me how to import it into google sheets?" "I need to email it to someone, can you show me how to change it?" "My IDE says I need a plugin to read it? Do you know anything about that? Can you help me set it up?"

They'll all agree how clever and useful the meta characters are of course, but only after you've given them your time in learning about it. No thanks no thanks no thanks. For me I'd rather deal with a little bit of serialization headache than a support headache.

cbsmith · on Sept 1, 2019

...and this is why we can't have nice things.

jefftk · on Sept 1, 2019

While that might have been a good idea if we'd started a while ago, at this point CSV/TSV are so much more established that using commas or tabs doesn't actually save pain.

zachrose · on Sept 1, 2019

But wouldn’t you have to account for the possibility that those separators exist in the content you're working with, putting you back at square one in terms of escaping?

kec · on Sept 1, 2019

That's entirely the point: 0x1C and 0x1E should never actually appear in "normal" text unless someone has explicitly put them there, which is not necessarily true of , \t or \n.
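
A minimal Python sketch of the idea, under some assumptions: it uses the unit separator (0x1F) between fields and the record separator (0x1E) between records, which is the conventional ASCII pairing rather than the exact pair named above, and the function names `dumps` and `loads` are made up for this example. Values containing a separator are rejected instead of escaped.

```python
US = "\x1f"  # ASCII unit separator, used here between fields (assumption)
RS = "\x1e"  # ASCII record separator, used here between records


def dumps(rows):
    """Serialize rows of strings; refuse values that contain a separator."""
    for row in rows:
        for field in row:
            if US in field or RS in field:
                raise ValueError(f"separator character inside value: {field!r}")
    return RS.join(US.join(row) for row in rows)


def loads(data):
    """Parse the format written by dumps; no quoting or escaping rules needed."""
    if not data:
        return []
    return [record.split(US) for record in data.split(RS)]


# Commas, tabs, and newlines in values need no special handling.
rows = [["id", "note"], ["1", "has, comma\tand tab\nand newline"]]
round_tripped = loads(dumps(rows))
```

Whether rejecting such values is acceptable depends on the data; if arbitrary bytes must be stored, some escaping scheme is needed again.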

xashor · on Sept 1, 2019

We took the wrong path quite some time ago. Most of the POSIX utils like sort, cut, grep only support \n (or sometimes \0) as row separator.

jlgaddis · on Sept 1, 2019

Are you aware of $IFS?

xashor · on Sept 1, 2019

I think $IFS is only used by the shell (with read and expansions), not by e.g. sort. It is so at least on my system.

dagenix · on Sept 1, 2019

But, there is nothing to stop your values from containing these characters. So, you still have to escape your input. And once you've done that, you might as well just use csv / tsv.

cbsmith · on Sept 4, 2019

Actually, the whole point of those characters is that they cannot be used in values.

perl4ever · on Sept 1, 2019

For some reason, in some circles, it seems to be semi standard to use þ (0xFE aka thorn) as the delimiter and a paragraph symbol (0x14 aka DC4 aka ^T) as the separator. The latter is not to be confused with 0xB6.

Anyway, these characters are presumably not going to occur in ordinary text.

PeterisP · on Sept 1, 2019

All of "upper ascii half" can occur in ordinary text in "pre-Unicode" encodings.

0xFE is a good example - you may get a customer or employee from Iceland with that character in their name (e.g. https://en.wikipedia.org/wiki/Haf%C3%BE%C3%B3r_J%C3%BAl%C3%A...), or data in Cyrillic cp1251 or koi8-r encoding where 0xFE also represents characters that you'll encounter in surnames, etc.
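
A quick way to check this is to decode the single byte 0xFE under each encoding with Python's standard codecs. This is only a sketch; the variable names are arbitrary.

```python
raw = b"\xfe"

# The same byte is a different letter depending on the single-byte encoding.
decoded = {enc: raw.decode(enc) for enc in ("latin-1", "cp1251", "koi8-r")}
# latin-1 gives the thorn; cp1251 and koi8-r each give a Cyrillic letter.

# The Icelandic name from the link above contains that byte once encoded.
name_bytes = "Hafþór".encode("latin-1")
contains_delimiter = b"\xfe" in name_bytes

# In UTF-8, by contrast, a 0xFE byte is never valid.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
```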

perl4ever · on Sept 2, 2019

It would be escaped in that case.

mdellavo · on Sept 1, 2019

Using CSV with tools like grep and join is very fast for large datasets.

lgas · on Sept 1, 2019

Using the reserved separator characters wouldn't change that.

perl4ever · on Sept 1, 2019

I always used cut, sort, and uniq a lot too.

jstimpfle · on Sept 1, 2019

How do you do (for example) column names or types with ASCII only?

How do you extend ASCII?

kps · on Sept 1, 2019

ESC is the ASCII character for code extension.

ISO 2022 aka ECMA-35¹ standardizes how to use it in general, but the only thing that really caught on is a subset of the terminal control extensions (ISO 6429 aka ECMA-48² aka ‘ANSI’). In the hypothetical alternate universe where people use ASCII-based structured data, ISO 2022 would have standard sequences for such metadata.

If you were rolling your own today, you'd probably wrap the metadata in Application Program Command (ESC _ … ESC \) or Start Of String (ESC X … ESC \) or one of the private-use sequences.

¹ https://www.ecma-international.org/publications/standards/Ec...

² https://www.ecma-international.org/publications/standards/Ec...
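
A hypothetical sketch of the "roll your own" option in Python, wrapping metadata in an Application Program Command sequence (ESC _ … ESC \). The metadata syntax (`columns=id:int,name:str`), the helper names, and the choice of unit/record separators for the payload are all invented for this example; no standard defines them.

```python
import re

ESC = "\x1b"
APC = ESC + "_"   # Application Program Command introducer
ST = ESC + "\\"   # String Terminator

# Non-greedy match of everything between APC and the next ST.
APC_RE = re.compile(re.escape(APC) + r"(.*?)" + re.escape(ST), re.DOTALL)


def wrap_metadata(meta):
    """Wrap a metadata string in APC ... ST; ESC inside it is not allowed."""
    if ESC in meta:
        raise ValueError("metadata must not contain ESC")
    return APC + meta + ST


def split_metadata(data):
    """Return (list of metadata strings, data with the APC sequences removed)."""
    return APC_RE.findall(data), APC_RE.sub("", data)


# Hypothetical header carrying column names and types, followed by two records.
stream = wrap_metadata("columns=id:int,name:str") + "1\x1fAda\x1e2\x1fGrace"
meta, payload = split_metadata(stream)
```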