CSV looks deceptively simple. It is far too easy to just write(','.join(whatever)), which sort of works, until it doesn’t, and then someone, sometimes I, has to sort out the resulting mess. PLEASE use a proper CSV library (Python comes with a CSV module in the standard library), or at least implement the entire format according to the RFC from the outset, even if you think you won’t need it!
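For example, a minimal sketch with the standard-library csv module (file name is just illustrative); it handles the quoting, embedded commas, and embedded newlines that a bare ','.join() silently breaks on:

import csv

rows = [["id", "comment"], [1, 'says "hi", twice\non two lines']]

# writing: the writer quotes fields containing delimiters, quotes or newlines
with open("out.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

# reading: the reader undoes that quoting for you
with open("out.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        print(row)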
Oh yes. CSVs are deceptively challenging, especially if your use case is converting Excel files to CSV. Excel will happily convert a worksheet to CSV, but it's a naive conversion. Headers that start on line 3, multi-line headers, inconsistent column counts, etc. It adds up really quickly!
I've also run into issues where I wrote some code that worked with CSV input and told users they could just export their data from Excel. It turns out Excel doesn't export UTF-8 by default; we had some weird issues until we figured that out.
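A small defensive-reading sketch for that situation (the encodings and fallback order are assumptions, not a universal fix): try strict UTF-8 first, including the BOM variant Excel sometimes writes, then fall back to Windows-1252.

import csv

def read_excel_csv(path):
    # try UTF-8 (with or without BOM) first, then Excel's legacy cp1252
    for enc in ("utf-8-sig", "cp1252"):
        try:
            with open(path, newline="", encoding=enc) as f:
                return list(csv.reader(f))
        except UnicodeDecodeError:
            continue
    raise ValueError(f"could not decode {path}")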
Another issue is passing those CSV files across international borders - a CSV file that works in the UK (commas as separators and a decimal point) may not be readable in Germany (semicolons as separators and decimal commas) without some configuration.
This is a matter of developer education. The correct way to create and parse CSV files is to use a third-party library. They can get complicated: a field in a CSV can contain commas and quotes, and in some cases a single field can contain a line feed, so you'll need to ensure the parser you use supports that. That would allow an entire CSV file to be embedded inside a single field of another CSV file. At a minimum, a parser must support Excel's default parsing logic.
But, if you pick the right parser and generator, then you're ok with using it.
Right, but if you're picking CSV, you likely expect to interoperate with a provider that's not yourself. And then there's no way a parser can handle all CSV formats in the wild.
Yeah, but this is less of a problem if it's an internal API. You can stick to a stricter subset of the standard, and/or only handle the types of column values that you actually need
Still probably worth using a library, but it isn't a source of problems in my experience
As much as I like and use CSV for database work, it has a problem with being poorly specified. The most common problems are when processing CSVs produced elsewhere which might not enclose text fields with quotes and thus have issues with data that includes commas and multi-line data.
There is a spec (RFC 4180 [1]) but it's definitely not widely followed. Worse, for a lot of data there are no problems for potentially years, until your numbers get too big or the first time a quote or comma gets into the data.
In my experience one of the biggest barriers I run into -- and the primary reason I hate using CSV -- is Microsoft Excel. It misinterprets numbers as dates, it converts big numeric identifiers to exponents, and more. Even merely opening an RFC4180-compliant file and saving it changes the data, and even Excel itself will often have a different misinterpretation of the file.
If humans never used Excel for CSV, it would be a viable format. At the same time in most cases where humans aren't in the loop (machine-to-machine communications), there's better formats. You could spec "RFC4180 CSV" and hope no developer just sees the "CSV" and assumes they understand. Or specify something like a JSON streaming format and avoid a whole lot of headache.
I avoided CSV for quite a while because I had excel-vs-CSV compatibility concerns like this.
However, when I tested this for myself a few years back, Excel output to my surprise was rfc4180 or darn near it (it might use CRLF rather than LF?) It emitted commas and quotes the same way as the rfc for all the test cases I checked.
That said, I agree with you Excel is problematic as an input source. Usually the problems are the humans who touch the data in excel, but what I’ve found is the automation problems tend to be with Excel parsing and interpreting incoming data (before it goes to CSV.) Exponents, trimming leading zeros, etc. as you say. But if the data is confirmed good in excel before being emitted, the CSV it emits is decent.
Yes, I feel like this would've been more helpful generalized as "Consider DSV" (delimiter-separated values) than CSV specifically, because of the interop issues that often come up. I'd have also mentioned using Parquet.
I generally find Avro to be a better replacement for CSV than Parquet. It’s a better drop in for the typical CSV use case of “process this file row by row”.
Avro is a row-based format that supports schema evolution, but unfortunately it is not a very popular format in general and there's much less support for it than Parquet. My team evaluated several formats (ORC, Parquet, Avro) but landed on Parquet due to columnar performance and wider support, though you're right, it's less performant for row-wise operations (but not by that much). If you really want row-wise performance, better to go with sqlite than Avro.
Parquet has the opposite problem of CSV though. It's so complex to work with, that unless you're specifically in data science, it's both unheard of and unusable.
To read a parquet file in Python, you need Apache Arrow and Pandas. And literally the second result for "parquet python libraries" is an article titled "How To Read Parquet Files In Python Without a Distributed Cluster".
I remember dealing with Parquet files for a job a while back and this same question came up: Why isn't there a simpler way, for when you're not in the data science stack and you just need to convert a parquet file to csv/json/read rows? Is it a limitation of the format itself?
Not really. Depends on your use case but most of the time you’re trading off disk space for a specialized efficient library.
Pandas and Arrow are dependencies like any other. Pandas is like a DSL for working with tabular data, much like numpy is a DSL for working with arrays and numerical algebra. No one working with linear algebra will insist on using the Python standard library built ins.
If you’re distributing a smallish Python app that only needs to read and manipulate smallish amounts of data, then I agree there are easier solves like SQLite.
But if you’re doing consulting work and dealing with large tabular datasets and need to do SQL type window functions and aggregations then Parquet is a better fit and the disk space required for adding a Pandas dependency is trivial. If one is using Anaconda, Pandas is batteries included. It really depends on what is being optimized for.
> It's so complex to work with, that unless you're specifically in data science, it's both unheard of and unusable.
FWIW, in my experience at a "data analytics platform" company, it's reasonably popular for data-heavy workflows since Parquet is well-defined, and file sizes (especially as the amount of data grows) are a fraction of their CSV equivalents.
> Is it a limitation of the format itself?
I don't think so. In other languages, you can generally read/write Parquet files without a ton of dependencies (e.g. https://github.com/xitongsys/parquet-go).
> To read a parquet file in Python, you need Apache Arrow and Pandas.
Or DuckDB.
import duckdb
df = duckdb.query("select * from 'a.parquet'").to_df()
Want to look inside a Parquet file? Use Visidata.
vd a.parquet
> I remember dealing with Parquet files for a job a while back and this same question came up: Why isn't there a simpler way, for when you're not in the data science stack and you just need to convert a parquet file to csv/json/read rows? Is it a limitation of the format itself?
Do you consider Pandas a "data science" stack? To me, it's just a library like any other that makes it easy to work with tabular data. Even for CSV, there is csvreader (usually not a good idea to deal with CSV by hand). Outputting to CSV is literally a one liner in Pandas or DuckDB.
import pandas as pd
# output to CSV
pd.read_parquet("a.parquet").to_csv("a.csv")
# output to JSON (choose from any number of orientations)
pd.read_parquet("a.parquet").to_json(orient="table")
# read rows
for row in pd.read_parquet("a.parquet").itertuples():
    print(row)
I want to use parquet more frequently, but it creates new problems that do not exist if I dump to CSV. Last I looked, there were not any good GUIs that would let someone quickly browse the data. Now it is just a blob lacking introspection. CSV has issues, but it is universal.
A bit round-about, but the slick way I discovered is to take a detour through DuckDB. DuckDB offers parquet bindings which you can link through a kind of foreign data interface and then query through SQL. Using this, you can then just browse parquet files through DBeaver or your IDE of choice. Hardly an out of the box solution I can offer to a random collaborator, but fantastic for your savvy analyst.
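A minimal sketch of the Python side of that detour (read_parquet is a standard DuckDB table function; the file name is just an example):

import duckdb

con = duckdb.connect()  # in-memory database
# query the parquet file directly through DuckDB's parquet reader
df = con.execute("SELECT * FROM read_parquet('data.parquet') LIMIT 10").fetchdf()
print(df)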
That is interesting to hear. Parquet input and output is on the wishlist for our Easy Data Transform software (currently we support CSV, Excel, XML, JSON and a few others). Anyone have any experience integrating Parquet read/write into a C++ application?
if you choose pipe ok, now you have to make sure nobody typed a pipe into the input field or spreadsheet, and you cannot store unix commands
if you choose tab, ok, now people will get confused when they try to edit the text file to replace tabs with spaces, and now you have trouble putting code snippets into data fields because they have tabs.
this is the problem and it's why xml/json exist.
in my particular domain, tab separated works pretty well but in a general context of the world at large, i feel like JSON has reasons it exists.
Well the obvious solution would be ASCII 0x1D (Group Separator)! Except, no one actually uses those ASCII characters. Kind of bums me out that UNIX basically skipped out on them.
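A toy sketch of what using those control codes could look like (purely illustrative; nothing downstream will know how to open the result, which is exactly the adoption problem):

# ASCII unit / record / group separators
US, RS, GS = "\x1f", "\x1e", "\x1d"

rows = [["id", "name"], ["1", "Doe, Jane"]]
blob = RS.join(US.join(fields) for fields in rows)

parsed = [record.split(US) for record in blob.split(RS)]
print(parsed)  # [['id', 'name'], ['1', 'Doe, Jane']]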
Both pipe and tab are infinitely better for so-called human-readable data compared to comma. Comma doesn’t even work well for numbers since some locales use comma as the decimal separator. And a data format can’t be “human-readable” if you’re not allowed to write numbers in the way that you’re used to write them.
In the context of an API, so long as you're using a "real" CSV library, a lot of those inconsistencies do not appear. Problems happen when you have to interface with humans and desktop software (i.e. Excel) which has its own rules.
Yes, CSV is superior to JSON for tabular data, but it has its own issues. One issue is that the standard is not consistently applied. Another is that the approach to escaping makes it hard to parse a CSV file with multiple threads: you have to parse the entire file before you can be sure which " characters escape other characters.
Until you try to either cram it into Excel, work with different encodings or pass it around different software platforms (even your lang vs JS), no difference.
CSV has been abused a lot to make it work on conflicting use-cases; JSON handles a lot of the mishaps that happened with delimiter-separated record formats, like newlines or bring-your-own character encoding.
> The most common problems are when processing CSVs produced elsewhere [...]
The limitations of CSV are certainly worth considering and, in the instances you mentioned, it may not be worth using CSV. (If you are going to be using a more complex parser anyway, you may as well use a format that is better defined and where you are less likely to encounter edge cases.) That being said, there remain many cases where CSV is far more efficient and far less error prone.
Which works until you have a quote in your data. It also prevents chunking out the file to process in parallel since you could have new lines between the quotes.
What I've found to work well is to just % encode your delimiter, the new line character, and the '%' character. Basically every language has utilities for this.
Doesn't solve the issue with accepting outside files though. You have to be pessimistic with those regardless.
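A rough sketch of that % encoding approach, assuming a pipe delimiter (the exact escape set is whatever your format needs; decoding here just reuses urllib's unquote):

from urllib.parse import unquote

def encode_field(value, delim="|"):
    value = value.replace("%", "%25")                   # escape '%' first so decoding is unambiguous
    value = value.replace(delim, f"%{ord(delim):02X}")  # the delimiter itself
    return value.replace("\n", "%0A")                   # embedded newlines

line = "|".join(encode_field(v) for v in ["a|b", "100%", "two\nlines"])
print([unquote(f) for f in line.split("|")])  # round-trips cleanly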
With gzip on web server the difference is not important at all.
CSV in general is problematic as there is no standard (RFC 4180 is not one). In certain contexts this surely can be a good solution, but it's definitely not good in the general scenario.
As Wikipedia puts it, "CSV is widely used to refer to a large family of formats that differ in many ways". If there's a canonical standard, it appears to be RFC4180: https://www.rfc-editor.org/rfc/rfc4180
It appears to be, but it's not. I have not found a single program so far that conforms to this RFC and nothing else.
From the RFC itself:
Status of This Memo
This memo provides information for the Internet community. It does
not specify an Internet standard of any kind. Distribution of this
memo is unlimited.
> I have not found a single program so far that conforms to this RFC and nothing else.
Wouldn't that be impossible, given that parsers have to accept all kind of bizarro CSV flavors? Maybe more importantly, do you know of a single program or single CSV library that doesn't support reading or writing CSV as defined by the RFC?
An "Internet Standard" is just a designation that has been given to an RFC that has been blessed in a certain way. See https://www.rfc-editor.org/ for more details, but the set of designations is:
* Uncategorised
* Historic
* Experimental
* Informational
* Best Current Practice
* Proposed Standard
* Draft Standard
* Internet Standard
Being an "Internet Standard" is kinda special, but not especially so. For example, IMAP4, originally specified in RFC-3501 in March 2003, updated many times since, and revised in RFC-9051 in August 2021, is still a "Proposed Standard" without an STD designation, nearly 20 years and dozens of interoperable implementations later.
"Rough consensus and running code" is how things get done.
RFC-4180 is plenty good enough a "standard" for people to decide to interoperate over. They just have to decide to do so.
(Note also that HTML5 is not an "Internet Standard" according to the IETF et al. The last version to get an RFC was HTML 2 in RFC-1866, designated "Historic". And interoperability was an issue for a while with later versions of HTML during the "Best viewed in Internet Explorer/Netscape Navigator" wars. To get interoperability like we eventually did, you don't need an "Internet Standard"; you just need implementers who want to interoperate, and are willing to favour it over lock-in, and even over strict backwards-compatibility.)
(Also, the "and nothing else" clause in your comment confuses me. Why not support other formats/variants also? "Be liberal in what you accept" is certainly something that you probably want to avoid if you're designing a new format/protocol that no-one else is using yet, but if you're working with a decades-old format that was traditionally poorly-specified, with millions of documents out in the wild, it's probably the best way to allow existing users to move forward.)
That was my first thought: JSON is highly amenable to compression, due to the repetition this blog is complaining about. It's a good lesson for junior devs: if you find yourself thinking about saving bits and bytes with custom protocols, you need to pull out of the rabbit hole and find the existing solution to your problem.
Sure, for a local data file or something where it's nice to be human-readable-ish, CSV can be a better choice than JSON (assuming you use a library for all the edge cases and string escapes.) If you really want a super-small and fast serialization, that's what protobuf is for.
CSV is also great for importing external data into documents. My text editor, KeenWrite[0], includes an R engine and a CSV-to-Markdown function[1]. This means you can write the following in a plain text R Markdown document:
`r#csv2md('filename.csv')`
The editor will convert Markdown to XHTML in the preview panel (in real time), then ConTeXt can typeset the XHTML into a PDF file in various styles.[2][3] This avoids spending time fighting with table formatting/consistency in certain word processors while storing the data in a machine-friendly format. (Thereby upholding the DRY principle because the data can have a single source of truth, as opposed to copying data into documents, which could go stale/diverge.)
Using JSON would be possible, but it's not as easy to convert into a Markdown table.
JSON lines looks kinda interesting, but the newline-delimited thing seems weird.
It seems to me that you could write a JSON streaming parser that, if the outer element is an Array, reads and outputs/processes one Array element at a time as its JSON value. Yeah, you can't get the array length, and if there's a parse error somewhere down the line then you have to figure out how to deal with that (a non-streaming parser would have rejected the whole input), but that's kind of inherent in using a streaming parser. The upside is that you can work with any valid JSON.
Sure, if you're interoperating with shell tools, and don't have `jq` available, newline-delimited JSON might be helpful. But on the other hand, just install `jq`, dummy!
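For what it's worth, that kind of stream-over-an-array parser already exists for Python in the third-party ijson package; a minimal sketch (the file name and the process() handler are hypothetical):

import ijson  # third-party streaming JSON parser

with open("big.json", "rb") as f:
    # yields each element of the top-level array without loading the whole file
    for record in ijson.items(f, "item"):
        process(record)  # hypothetical per-record handler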
Excel compatibility really sux, and Excel is the most used one by far. You really can't double-click it effectively, as everything will be shown as a generic type, so you have to mess around with the import wizard, which is also half-baked. I have to create tutorials for that for each service using it.
This style also works well with jsonlines a sibling comment mentioned. Of course my favorite is MessagePack (or CBOR) using similar styles. MsgPack can be as small as gzipped JSON. :)
I think one of the issues is data types. JSON has them, CSV doesn't, so this means your program needs to be aware of which columns are which data type and do the conversion where needed.
It's similar to JSON Vs INI files for config files.
On a different note, I wouldn't nest JSON in a CSV column. I'd delimit with a pipe or something, then split the string on that. Much simpler if you're in control of the data.
How often is this a concern in practice? It’s a question I’ve been thinking about a bunch and the answer I keep coming back to is that most of the time, encoding the type in the data exchange format isn’t actually very useful. So I would be interested in use cases where it is.
Are you just trusting that the types coming in are going to be the correct ones? What happens if someone sends you `{"foo": "10"}` instead of `{"foo": 10}`? Do you validate with a schema up front? In which case your code already needs to know what types it expects and can convert them. Or are you letting the incorrect types run through your system until they hit a type error at runtime somewhere?
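For reference, the validate-up-front option is only a few lines with something like the jsonschema package (a minimal sketch, not a recommendation of any particular library):

import jsonschema  # third-party JSON Schema validator

schema = {
    "type": "object",
    "properties": {"foo": {"type": "integer"}},
    "required": ["foo"],
}

jsonschema.validate({"foo": 10}, schema)    # passes
jsonschema.validate({"foo": "10"}, schema)  # raises ValidationError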
No idea really, but if you're using a JSON parsing library then that is going to automatically convert for the data types. Which, provided you trust the data, saves you a job.
protobuf is terrible! Now you have to rely on Google-scale tools to generate code for whatever language(s) you want to read or write the data in, and this becomes quite the encumbrance.
I mean point well taken, but, as they acknowledged in the post themselves, CSV isn't suitable when you have a nested structure. And you almost always have/need a nested structure, no?
Relational databases have worked fine for decades without nested structures. The simple trick is to take the nested structure out of the entity and into its own table.
Unless I misunderstood something, I'm not sure I understand the relevance here. I assumed we were talking about sending data to clients. In such cases, you do not send database tables. Instead, you send rich, fully hydrated objects which are the result of joining those tables. The serialized representation can be backed by the relational model, but at some point you have to put those together to send something useful to the client. My only point is that CSV is unsuitable for this task in many/most cases.
That may be a simple trick for the db, but not when your paradigm involves importing files - imagine telling users "instead of giving us JSON, please give us 75 CSV files".
This scenario might be more common than you think -- spreadsheets still reign supreme, and often 75 csv files is how the users have the data to begin with.
(Incidentally, my day job is building a spreadsheet importer.)
I always thought CSV was just fine, until I had to ingest and export a bunch of CSV in my last project.
The big problem is that CSV is not well defined and it's so deceptively simple that many don't bother to adhere to the spec that does exist.
Just a few idiosyncrasies I found:
Inconsistent character encoding. If you open or save a CSV with Excel it will assume a Windows-1252 encoding. Since browsers deal exclusively with UTF-8, this gets really messy.
The CSV I got didn't actually use a comma as a delimiter but a semicolon.
Everyone seems to have conflicting opinions about whether strings should have quotes and if so, which ones.
The CSV I had to deal with also came with a decimal comma, which screwed up even more stuff.
My advice: stay away from CSV as an exchange format. Use something that is well defined.
I'm definitely in the "Just use JSON for most things" camp, but I'm wondering, why would you ever choose CSV for interfacing microservices over protobuf?
Isn't protobuf basically CSV but with good libraries at the interface point and standards around how to deserialize the streams?
I usually prefer a binary encoding. More efficient on the wire, easier to parse and generate, and with no ambiguity. We have 2 control codes given to us by the teletype era that have the perfect meaning for this kind of data: the record separator (0x1E) and the unit separator (0x1F).
The size complaint is overblown, as repeated fields are compressed away.
As other folks rightfully commented, csv is a mine field. One should assume every CSV file is broken in some way. They also don't enumerate any of the downsides of CSV.
What people should consider is using formats like Avro or Parquet that carry their schema with them, so the data can be loaded and analyzed without having to manually deal with column meaning.
If somebody asked me to support this format after you'd left the company I'd quit on the spot.
This frankenformat is 100% premature optimization: non-standardised, requires custom parsers (which are potentially inefficient and may negate the network performance gains, from having to parse both JSON and CSV) and is potentially very difficult to maintain and debug (no syntax highlighters or REST-client-like posting tools).
Just either use gRPC or JSON with regular network-level gzip encoding.
A constrained format based on JSONL, with each record being a tuple of number/string/bool/null, could be better defined than CSV and looks almost like it. The benefit being that almost any JSON library could work with it, or could be made to work one line at a time, and it can be parallelized since newlines only appear as the record delimiter.
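A minimal sketch of a reader for that constrained format (the file name and strictness checks are illustrative):

import json

def read_rows(path):
    allowed = (int, float, str, bool, type(None))
    with open(path, encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            # reject anything that isn't a flat tuple of scalars
            if not (isinstance(row, list) and all(isinstance(v, allowed) for v in row)):
                raise ValueError(f"not a flat record: {line!r}")
            yield row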
It often surprises me that Parquet is not widely known outside of data engineering circles. Most software developers are still mucking around with CSV for large tabular data, which is absolutely the wrong format. Better developers use sqlite, which is less wrong but still wrong. Postgres is closer to the right answer, but for very large, typed tabular data, Parquet is the way to go.
Parquet is a columnar format that is compressed, typed, efficient for columnar queries, append-friendly (though not rewritable), and is a natively supported format for Apache Spark. Parquet libraries are now widely available for most languages (didn't used to be the case, but now they are). I query Parquet files with DuckDB in Python and it blazes through GBs of data in seconds.
At work, the canonical format is TSV (tab separated values) which despite being human-readable, is huge, inefficient to query and does not support data types.
When I have to work with large TSV files (10GB or larger), I first convert them into 500MB Parquet files. The latter are faster, smaller and less prone to type errors. Because columnar formats like Parquet are indexed, I can do complex operations like joins, window functions, and aggregations on them in a performant way, while any similar operation on TSV files will trigger a table scan each time.
I recently ran a Spark job on a very large TSV file which took over 8 hours and timed out. A Spark job on the same data represented in Parquet completed in 5 minutes.
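A rough sketch of that TSV-to-Parquet conversion in pandas (file names and chunk size are illustrative, and to_parquet needs a parquet engine such as pyarrow installed):

import pandas as pd

# stream the big TSV in chunks and write each chunk as its own parquet part
reader = pd.read_csv("big.tsv", sep="\t", chunksize=5_000_000)
for i, chunk in enumerate(reader):
    chunk.to_parquet(f"part-{i:04d}.parquet", index=False)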
I cannot count how many times CSV "format" has caused problems for me..
In my country the decimal separator is a comma instead of a period.
This causes problems when importing and exporting with this "format".
Just a few weeks ago I had fun times working with an API returning CSV in an unknown encoding. Hopefully they will never make changes (you cannot always trust headers). Ah, and I do love it when a CSV is missing headers and someone adds data into the middle.
Of course some of these issues can be avoided by doing the things "right". Sadly you cannot trust this in real life. People write ugly structures in JSON, but at least you can validate results..
No. Just no. The amount of times I've had issues with CSVs exported from a non-US locale is insane. They use semi-colon as separator, as for some weird reason they use the comma as the decimal point.
Then there's the issue of encoding, as that is also not the same across locales. Then you get a CSV with the BOM characters up front or some French accents represented as ? because of incorrect encoding parsing / saving.
At least JSON doesn't have any of these things. Standardized strings, and standardized number format.
You could have the advantages of both worlds by having one json object per line. You could stream process, and you could structure more complex objects and have consistent escaping.
You could do that, but you could also end up with JSON that is not streamable. You can't guarantee how large a JSON object will be, but you can make a good guess at how large a CSV row will probably be.
I quite like CSVs. I've used them to great effect at maybe every job I've ever had. xsv, sqlite, and Excel/LibreOffice provide useful tooling on top of them.
I see a lot of complaining about "no standard" in this thread, but the way I've used them, it's been fine. I just use Haskell's cassava. If humans produce them with Excel/LibreOffice, I never have issues on the ingestion end.
Another alternative is a streaming JSON format like JSONL (newline-delimited JSON). You can parse one record/line at a time, but still have the structure and named fields of JSON.
I worked at a company where we did this for some endpoints and it worked great. Our client app had to request enormous time-series datasets and using CSV cut a significant percentage off of the payload size. I recommend it if you have similar constraints
i had a lot of fun exploring the performance ceiling of csv and csv like formats. turns out binary encoding of size prefixed byte arrays is fast[1].
csv is just a sequence of 2d byte arrays. probably avoid if dealing with heterogeneous external data. possibly use if dealing with homogeneous internal data.
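A toy sketch of what size-prefixed byte arrays can look like (illustrative framing only, not the linked benchmark's exact format): each field is a little-endian u32 length followed by the raw bytes.

import struct

def write_row(out, fields):
    # fields is a list of bytes objects
    for field in fields:
        out.write(struct.pack("<I", len(field)))
        out.write(field)

def read_fields(buf):
    pos = 0
    while pos < len(buf):
        (n,) = struct.unpack_from("<I", buf, pos)
        yield buf[pos + 4 : pos + 4 + n]
        pos += 4 + n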
Delimited formats' performance can be exceptional; they can also be phenomenally terse and avoid the string tarpits of CSV and TSV if you just use these Unicode characters: ␟ ␞ ␝.
Those are not the unit/record/group separator characters! Those are the graphical symbols for the unit/record/group separator codes. The actual unit/record/group separator codes are in ASCII, as 'tremon' writes in a sibling comment.
I guess I shouldn't post comments when I'm terribly hungover. The fact remains, use those separators; most developers don't even seem aware of them to any degree.
You may also simply add a format specification and return either CSV or JSON depending on the need or the context. Most languages have what they need to return either without much trouble.