Hacker News

The thing that puzzles me the most is that people use _C_SV at all: separation by comma, or by any other member of the printable subset of ASCII, in the first place. What this essentially boils down to is ambiguous in-band signalling and a contextual grammar.

ASCII has addressed the problem of separating entries ever since its creation: separator control codes. There are:

    0x01 SOH "Start of Heading"
    0x02 STX "Start of Text"
    0x03 ETX "End of Text"
    0x04 EOT "End of Transmission"
    0x1C FS  "File Separator"
    0x1D GS  "Group Separator"
    0x1E RS  "Record Separator"
    0x1F US  "Unit Separator"

You can use those just fine for exchanging data as you would using CSV, but without the ambiguities of separation characters and the need to quote strings. Heck, if payload data is limited to the subset of ASCII/UTF-8 without control codes, you can just dump anything without any need for escaping or quoting.

So my suggestion is simple. Don't use CSV or "P"SV (printable separated values). Use ASV (ASCII separated values).
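As a sketch of what this would look like in practice, here is a minimal ASV round-trip in Python. The function names, and the choice of RS for records and US for fields, are my own illustration, not an established library or spec:

```python
# Minimal sketch of "ASCII separated values" (ASV): records joined with
# the RS (0x1E) control code, fields within a record with US (0x1F).
RS = "\x1e"  # Record Separator
US = "\x1f"  # Unit Separator

def dump_asv(rows):
    """Serialize rows (lists of strings) to an ASV string.

    No quoting or escaping layer is needed, as long as the payload
    itself contains no ASCII control codes.
    """
    return RS.join(US.join(fields) for fields in rows)

def load_asv(text):
    """Parse an ASV string back into lists of string fields."""
    return [record.split(US) for record in text.split(RS)]

rows = [["name", "note"], ["Smith, John", 'said "hi"']]
# Commas and quotes in the payload pass through untouched.
assert load_asv(dump_asv(rows)) == rows
```

Because the separators can never occur in printable payload text, there is no quoting state to track while parsing.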



This comes up every single time someone mentions CSV. Without fail. The bottom line is that CSV is human readable and writable in plain text. If you start using fancy ASCII characters, then it becomes neither, because our text editors don't support them.


Let's send patches to text editors so they render fancy ASCII characters? I also find it amusing that "fancy ASCII characters" is even a statement that can make sense; there are only 128 ASCII characters!


Sure. Send patches. Until it becomes easy to read and write fancy ASCII characters in text editors, my criticism stands.

I predict this will never happen under the principle of "good enough to use, bad enough to complain on HN." My point is that people keep suggesting ASCII delimiters as if it's some obvious solution that solves the entire problem, but it doesn't. People should instead be suggesting that text editors better support ASCII delimiters so that we can use them roughly as easily as we use commas today.


This might solve the "readable" part of the problem, but not the "writable" part of the problem.


Explain to me why humans have to be the ones to write proper syntax and escaping, instead of their editors. Do you do this for other formats, like RTF? What is this obsession with hand-coding complicated formats :)


> Do you do this for other formats, like RTF?

No, but I do it for formats like HTML, CSS, JSON. My editor can assist me, but I don't need it. To a lesser degree, the same is true of Java, though I do admit to leaning much more heavily on my IDE for that.

> What is this obsession with hand-coding complicated formats :)

Well, part of it is that they're not all that complicated unless you're doing something fancy. Another part is almost certainly our (we as in developers) pride in being able to use nothing but vi installed on a decade-old operating system to get things done.


Why? What benefit do you get from coding it by hand? The only one I can think of is being able to type some esoteric incantations in a textbox.

Downsides include:

XSS and malicious injection - users should be using markdown or provided an actual contentEditable HTML editor instead of a textarea. Like in GMail.

Syntax errors. How many times have you typed some JSON by hand and realized you forgot to balance some braces or add a comma, or remove a comma at the end of an object?

Complexity: not that complicated, really? Basic C, HTML, and CSS are not complicated; the advanced stuff is. Forgetting a brace or a semicolon torpedoes the whole document. SQL is not complicated either. Does that mean you want to write SQL by hand for production code?
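The syntax-error point above is easy to demonstrate: a trailing comma that is trivial to type by hand is a hard error for a strict JSON parser, while serializer-produced output is always well formed. A small Python illustration:

```python
import json

# A trailing comma after the last member, easy to leave in when editing
# by hand, is rejected outright by a strict parser.
try:
    json.loads('{"a": 1, "b": 2,}')
except json.JSONDecodeError as err:
    print("rejected:", err)

# The same data written by the serializer, not by hand, is always valid.
print(json.dumps({"a": 1, "b": 2}))
```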

Why don't you install a simple extension to vi, like an HTML format adapter? I'm a developer too. But let me tell you what it sounds like when you say you want to use nothing but vi: it's like saying you want to build robots using nothing but a hammer and some nails, and that nothing requiring further abstraction or advancement can be built if you can't do the same with a hammer.

And also keep in mind that not everyone is a developer. The fact that you like to keep esoteric syntax rules in your head for HTML, XML, CSS, Javascript, C++ and so on doesn't mean EVERYONE should have to. When it comes to CSV, it's not even a real standard. You have to keep in your head all the cross-platform quirks, like the \r\n garbage, similar to how web developers need to keep in their heads all the browser quirks and workarounds.

All this... for some pride thing of being able to write text based stuff by hand.

I brought up RTF for a reason. No sane person should want to type .rtf or .docx by hand. So why HTML?


If you're calling CSV complicated, I really want to see your example of a simple format.


CSV is complicated in the same way the DOM and Javascript are complicated:

1) there are so many differences between browsers that you have to keep them all in mind when asking people to send you csv files, or generating them, etc. Such as for example \n vs \r vs \r\n and escaping them.

2) You have to keep in your head escape rules and exceptions, and balancing quotes and other delimiters.

3) The whole thing doesn't look human readable or easily navigable for a document of any serious complexity.

And what's the upside? If more people just used Excel or another spreadsheet program to edit these files, you won't face ANY of these issues. They would eventually converge on a standard format, like they did with HTML.

Disclaimer: I wrote a CSV parser


> there are so many differences between browsers that you have to keep them all in mind

Compatibility horrors from people violating the standard will appear no matter what format you use. That's not fair to blame on the format.

> You have to keep in your head escape rules and exceptions, and balancing quotes and other delimiters.

CSV itself has newlines, commas, and quotation marks as its special characters. That's extremely minimal. The only extra thing to keep in your head is "is this field quoted or not".

What set of escapes and delimiters could be simpler? Would you rather reserve certain characters, and abandon the idea of holding "just text" values?
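For what it's worth, those minimal rules are exactly what RFC 4180 describes, and what Python's stdlib csv module implements: a field is quoted only when it contains the delimiter, a double quote, or a line break, and embedded double quotes are doubled. A quick round-trip sketch:

```python
import csv
import io

# RFC 4180 quoting in practice: only fields containing a comma, a quote,
# or a newline get quoted; embedded quotes are doubled.
fields = ['plain', 'has,comma', 'has "quote"', 'two\nlines']

buf = io.StringIO()
csv.writer(buf).writerow(fields)
encoded = buf.getvalue()

# The reader undoes the quoting transparently, including the newline
# embedded inside a quoted field.
decoded = next(csv.reader(io.StringIO(encoded)))
assert decoded == fields
```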

> The whole thing doesn't look human readable or easily navigable for a document of any serious complexity.

> And what's the upside? If more people just used Excel or another spreadsheet program to edit these files, you won't face ANY of these issues. They would eventually converge on a standard format, like they did with HTML.

This sounds like you're arguing for a more complex format! I'm confused.

So again, what is a format that you call simple?


> What set of escapes and delimiters could be simpler? Would you rather reserve certain characters, and abandon the idea of holding "just text" values?

There's a lot of design room here.

https://www.lua.org/pil/2.4.html http://prog21.dadgum.com/172.html

I can no longer remember where I saw it (thought it was lua), but I heard the idea to use =[[ ... ]]= and =[[[ ... ]]]= as wrappers (a bit like the qq operator from perl). They can be nested and don't interfere, so =[[[ abc =[[ ]]= ]]]= is a legitimate string.
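For the record, Lua itself spells these long brackets [=[ ... ]=], with the equals signs between the brackets, but the key idea is the same either way: pick a nesting level whose closing delimiter does not occur in the payload, so no escaping is ever needed. A Python sketch of that idea (the function name is illustrative):

```python
# Lua-style "long bracket" quoting: the level-n delimiters are [=[ ... ]=]
# with n equals signs. Choose the smallest level whose closing delimiter
# is absent from the payload, so the payload never needs escaping.
def wrap_long_bracket(s):
    level = 0
    while "]" + "=" * level + "]" in s:
        level += 1
    eq = "=" * level
    return f"[{eq}[{s}]{eq}]"

# A payload containing "]]" gets wrapped at level 1: [=[ ... ]=]
print(wrap_long_bracket("already has ]] inside"))
```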


Except when the compatibility horror comes from Excel violating the standard, but only if you happen to be located in the wrong hemisphere (hello, semicolons). So even if everyone just used Excel, you would still face this issue. Now what?


I am not arguing for another format. I am arguing for using a program to edit these files, to avoid syntax errors and other crap that arises when people do manual stuff that doesn't need to be done manually. And sure the format can stand to be slightly more complex, who cares if you're not editing it by hand.

Ascii text format is simple.

Anything where you have arbitrarily complex structure, why not use a program to edit it? What is the downside of using the right tool for the job? Your text editor is a program. Why tunnel through text and manually edit stuff?


Does any system (modern or not) actually use `\r` for anything? Because I'm not aware of it being used alone. I'm not even sure why software today still differentiates between them instead of treating any of `\r`, `\n`, `\r\n` as a single line break. I can maybe see it being useful, e.g. in a word processor, to differentiate a manual line break from a paragraph break, but that's not a plain text format; in the vast majority of cases, treating them as the same character in parsers shouldn't* cause any issues.

* shouldn't ≠ doesn't
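Treating the three flavors as one break is indeed trivial in a parser; a minimal normalization sketch in Python (note the regex must try \r\n before a lone \r, or a CRLF pair would become two breaks):

```python
import re

# Normalize every line-break flavor (\r\n, \r, \n) to a single \n.
# The \r\n alternative must come first so a CRLF pair is consumed
# as one break, not split into two.
def normalize_newlines(text):
    return re.sub(r"\r\n|\r|\n", "\n", text)

assert normalize_newlines("a\r\nb\rc\nd") == "a\nb\nc\nd"
```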


IIRC, classic Mac OS used `\r` for line endings. macOS doesn't any more and instead uses `\n`.


Honestly, I dispute that it is "human readable". It is sort of legible, but incredibly inconvenient to read or manually write. It might be slightly more convenient than tabular files such as ACH or DDF.


More convenient than JSON, more consistent than YAML.


Consistent? What do you mean, "consistent"? Sometimes it's comma separated, sometimes it's semicolon separated (depending on the user's locale), sometimes it's separated by tabs (because it's a _C_SV file, yeah, no biggie), no content encoding hint (Unicode? Latin-1? Win-1251? Win-1252? Nobody knows), not to mention you've written this comment under an article that shows just about the least consistent behavior ever. (Line breaks? Ahahahaha!)

The only consistent thing about CSV is its ubiquity; other than that, it's a hairy, inconsistent mess that appears simple. (Source: having parsed millions of blobs that all identified themselves as CSV, despite being almost completely different in structure.)


> Sometimes it's comma separated, sometimes it's semicolon separated (depending on the user's locale), sometimes it's separated by tabs

CSV is comma separated. [1]

Valid YAML

    foo: bar baz
Invalid YAML

    foo: "bar" baz
Valid YAML

    foo: "bar baz"
Invalid YAML

    foo: "bar baz
Valid YAML

    foo: bar baz"
[1] https://tools.ietf.org/html/rfc4180


You would think so, but people are dumb. I've seen tab-delimited files named .csv instead of .tsv, and I've also seen the semicolon delimiter a few times, though I can't recall where. I think Excel actually pops up a prompt when importing to confirm the delimiter in some cases?

From your link, it's quite clear that you should not assume any particular CSV file to follow any particular rules.

> Interoperability considerations: Due to lack of a single specification, there are considerable differences among implementations. Implementors should "be conservative in what you do, be liberal in what you accept from others" (RFC 793 [8]) when processing CSV files. An attempt at a common definition can be found in Section 2....

> Published specification: While numerous private specifications exist for various programs and systems, there is no single "master" specification for this format. An attempt at a common definition can be found in Section 2.

Section 2 states:

> This section documents the format that seems to be followed by most implementations:
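Given that no particular delimiter can be assumed, one pragmatic approach is to guess it from a sample of the file, which is what Python's stdlib csv.Sniffer does (a sketch; the sample data is made up):

```python
import csv

# "CSV" in the wild may be comma-, semicolon-, or tab-separated, so many
# tools guess the delimiter from a sample rather than assuming one.
# csv.Sniffer performs exactly this kind of heuristic detection.
sample = "name;city;year\nSmith;Berlin;1999\nJones;Oslo;2001\n"
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t")
print("detected delimiter:", repr(dialect.delimiter))
```

Heuristics like this can still misfire on pathological inputs, which is itself a symptom of the underspecified format.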


"All theory, dear friend, is gray, but the golden tree of life springs ever green." -Goethe

If CSV were indeed always comma-separated, my hair would be at least 5% less gray. Alas, most programs emit semicolon-separated "CSV" in some locales (MS Office, LibreOffice, you-name-it-they-got-it).

Of course, I understand that your academic position "if it chokes the RFC-compliant parser, it's not a True CSV and should be sent to /dev/null" tautologically exists - but for some reason, users tend to object to such treatment (especially when they have no useful tools that would emit your One True Format for them).

TL;DR: there is no single standard fitting all the things that call themselves "CSV".


You seem like the perfect person to ask: what is a format that is close to the (apparent) simplicity of CSV, but is actually consistent?


I am so sorry.

In other words, as soon as you start exchanging data, you'll get something that is complex, broken, or (most common case) both. Existence of a simple, consistent general format has not been conclusively proven impossible, but I have yet to see one in practice.

(Of course, everybody and their dog have cooked up simple data schemes, yes, but those are a) domain-specific, and b) not in widespread use.)


The article kind of addresses this. There are millions of spreadsheets and applications out in the wild that use CSV to communicate.

Sure, if you're building some kind of system where you need to ingest data from one application into another application you control, then using a different interchange format like ASV is an option. But then people tend to use more powerful formats like JSON or XML.


> There are millions of spreadsheets and applications out in the wild that use CSV to communicate.

That, and data in CSV format is human readable in any old text editor or even a word processor, which many use as a quick sanity check to make sure their data looks sane. A lot of editors will not display the ASCII control characters at all, so the fields on the line get mashed together, or may even reject the file as containing what it considers unexpected characters.


More to the point, it's easy to export from Excel, which is a massive pain to load in any non-.NET language.

While it's great to hope to use a well defined transport for machine-to-machine communication, it's exhausting to explain anything beyond CSV to Bob from sales.


Give me a version of every standard text editor that can let me display and edit these ASV files when I just need to quickly hack something, and sure, I'll use it. CSV is directly editable in any text editor and manipulable by standard text processing tools, that's one of its key advantages.


I cannot remember how often, when I worked in 'enterprise software', we were sent CSV files by companies that were completely broken after someone 'simply edited' them with a 'standard editor'. Well over a thousand times over the years, for sure.

Worse; most 'non computer people' cannot get them imported into a spreadsheet properly (for whatever reason; usually it just puts everything in one field or column, people curse and give up), so they have to edit them in Notepad or worse, in MS Word and then send them back.

Not really seeing the beauty I guess.


I distinctly remember receiving a .rtf file attachment from a user who was "sending me a CSV" and I had one of those daydream fantasies about tossing the computer out of the window and walking out.


How about Vim?

:help digraph

:help digraph-table

Feel free to implement mappings for quickly accessing these digraphs. Those pesky F<n> keys are perfect for this. Easy to reach, gets the job done.


For vim/evil, the following works.

    ASCII Name          - Vim Insert    - Visual Repr
    --------------------------------------------------
    Start of Heading    - Ctrl-v Ctrl-a - ^A
    Start of Text       - Ctrl-v Ctrl-b - ^B
    End of Text         - Ctrl-v Ctrl-c - ^C
    End of Transmission - Ctrl-v Ctrl-d - ^D
    File Separator      - Ctrl-v Ctrl-\ - ^\
    Group Separator     - Ctrl-v Ctrl-] - ^]
    Record Separator    - Ctrl-v Ctrl-^ - ^^
    Unit Separator      - Ctrl-v Ctrl-_ - ^_
For insertion you can also always just use Ctrl-v DDD or Ctrl-v xHH, where DDD is the three-digit decimal value and HH is the two-digit hex value of the ASCII code.


Vim and Emacs can. If your editor can't, maybe it should get with the (54 year old) program.


For some context in case anyone was curious, Wikipedia believes that both vi and Emacs were (originally) written in 1976, which is 41 years ago. Unix dates from ~1969--1973 depending on your definition, which is 44--48 years ago.

Accordingly, the reference to "54 year old" appears to be to the first standard as well as first commercial use of ASCII, in 1963.

That first ASCII standard from 1963 specified eight "separators" simply named S0 through S7 at codes 0x18--0x1F. The 1965 update reused the first four for other purposes (eg cancel and escape) and labeled 0x1C--0x1F with the more descriptive names we now know.


The number of devs who can competently use Vim or Emacs is already on the low side. The number of non-devs that can do it is far lower. The file needs to be editable by _any_ user with a text editor, not just a developer versed in Vim/Emacs.

I know the response will be "they'll just open it in Excel anyway" which is true in most cases, but I frequently have clients that want to download an export, modify it real quick with a text editor (many use Notepad++ for this), and then reupload it. They're usually doing massive find/replaces on the data and then reuploading into the system and a simple text editor is a lot better for this than Excel.


> The number of devs who can competently use Vim or Emacs is already on the low side. The number of non-devs that can do it is far lower. The file needs to be editable by _any_ user with a text editor, not just a developer versed in Vim/Emacs.

Emacs defaults to a menu + toolbar, and the arrow/pageup/pagedown/home/end keys are functional. There are buttons with icons and labels for new document, open document, find in document, and copy/cut/paste. It's friendlier to use than something like Notepad++, gedit, or Kate out of the box for simple editing. For more complicated stuff, there are menus and extensive documentation. The narrative that emacs is impossible to use for anyone but the programming elite doesn't fit the default experience.

If you define "competently use" as "can extensively configure beyond the default state", then I'd argue that very few recent developers outside of vim/emacs users have ever done so. How many people have you met who have written C# to extend Visual Studio, or some Java to extend Eclipse/IntelliJ? Even with things like Atom, how many Atom users are writing Javascript packages versus using the packages someone else wrote?

If you define "competently use" as "be able to edit and debug in $x language", I'd argue the menu-driven approach in emacs is just as valid as the menu-driven approach in any other random GUI editor. The difference is that emacs can be customized and has decades of documentation and examples to pull from for any conceivable scenario. Want to interact with your editing environment with foot pedals, talk on that fancy new chat interface, interface with a serial port to pull sensor data, or control a personal massager? You can find someone who has done it and documented it for emacs.


> The number of devs who can competently use Vim or Emacs is already on the low side.

Someone who cannot competently use either vim or emacs is not a developer.

> The number of non-devs that can do it is far lower.

The emacs paper talked about departmental secretaries using — and extending — emacs. Human beings are far cleverer than we like to think.


CS PhD and professional software engineer here. I can edit files very slowly and inefficiently in both vim and emacs but it would be a mess to use them for real work.

Some people never learned either of the editors. So what? The physical act of writing text was never the hard part of software engineering.


I don't think this necessarily addresses the security vulnerabilities in the article, which involve abusing the application reading the CSV, not the file format itself.

If Excel decides that text between Start of Text and End of Text that begins with a "=" is a formula, then you're in the same spot.


I am happy when I see I can get data via CSV over the other delivery methods people use. My local school board prints out all their data and then scans it into a PDF, ugh. One vendor deliberately made the data available only in a mangled ASCII format that took 600+ lines of code to clean up.

I use CSV all the time when I am working with R. My data can come in the form of CSV, XLS, or PDF. Which would you want to work with?

I can easily look at the data. I never touch my incoming data and my output is in reports, but CSV can be the easiest way to get data into a computer.


Actually, XLS is not bad to work with if you have a library for it. And it's well defined, unlike CSV. As far as I know, there's no way to make a CSV file that will open and display nicely in all popular versions of Excel / Google Docs / OpenOffice, especially across language settings. A well-formed XLS file will just work.


If a dev is going to use a weirdo non-CSV data interchange format, they would just use XLSX or JSON or etc etc etc.

"ASV" is only a viable option if you then also use your time machine to go back 40 years and make everyone start using it then.


This might create some integration-related hiccups with XML, as most ASCII control characters are forbidden per the XML 1.0/1.1 specs.
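Concretely, XML 1.0 permits only #x9 (tab), #xA (LF), and #xD (CR) from the C0 control range, so the ASCII separators 0x1C-0x1F cannot be smuggled into a well-formed document even as character references. A small Python check (the helper name is illustrative):

```python
import re
from xml.etree import ElementTree

# Characters from the C0 range that XML 1.0 forbids entirely
# (everything below 0x20 except tab, LF, and CR).
XML10_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def xml_safe(text):
    return XML10_ILLEGAL.search(text) is None

assert xml_safe("tab\tand\nnewline are fine")
assert not xml_safe("unit\x1fseparator is not")

# A real parser agrees: expat rejects a Unit Separator in element content.
try:
    ElementTree.fromstring("<row>a\x1fb</row>")
except ElementTree.ParseError as err:
    print("rejected:", err)
```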



