CSV looks deceptively simple. It is far too easy to just write(','.join(whatever)), which sort of works, until it doesn’t, and then someone, sometimes I, has to sort out the resulting mess. PLEASE use a proper CSV library (Python comes with a CSV module in the standard library), or at least implement the entire format according to the RFC from the outset, even if you think you won’t need it!
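For example, a minimal sketch with the standard-library csv module (file name is just illustrative); it handles the quoting, embedded commas, and embedded newlines that a bare ','.join() silently breaks on:

import csv

rows = [["id", "comment"], [1, 'says "hi", twice\non two lines']]

# writing: the writer quotes fields containing delimiters, quotes or newlines
with open("out.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

# reading: the reader undoes that quoting for you
with open("out.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        print(row)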
Oh yes. CSVs are deceptively challenging, especially if your use case is converting Excel files to CSV. Excel will happily convert a worksheet to CSV, but it's a naive conversion. Headers that start on line 3, multi-line headers, inconsistent column counts, etc. It adds up really quickly!
I've also run into issues where I wrote some code that worked with CSV input and told users they could just export their data from Excel. It turns out Excel doesn't export UTF-8 by default; we had some weird issues until we figured that out.
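A small defensive-reading sketch for that situation (the encodings and fallback order are assumptions, not a universal fix): try strict UTF-8 first, including the BOM variant Excel sometimes writes, then fall back to Windows-1252.

import csv

def read_excel_csv(path):
    # try UTF-8 (with or without BOM) first, then Excel's legacy cp1252
    for enc in ("utf-8-sig", "cp1252"):
        try:
            with open(path, newline="", encoding=enc) as f:
                return list(csv.reader(f))
        except UnicodeDecodeError:
            continue
    raise ValueError(f"could not decode {path}")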
Another issue is passing those CSV files across international borders - a CSV file that works in the UK (commas as separators and a decimal point) may not be readable in Germany (semicolons as separators and decimal commas) without some configuration.
This is a matter of developer education. The correct way to create and parse CSV files is to use a third-party library. They can get complicated: a field in a CSV can contain commas and quotes, and in some cases a single field can contain a line feed, so you'll need to ensure the parser you use supports that. That would allow an entire CSV file to be embedded inside a single field of another CSV file. At a minimum, a parser must support Excel's default parsing logic.
But, if you pick the right parser and generator, then you're ok with using it.
Right, but if you're picking CSV, you likely expect to interoperate with a provider that's not yourself. And then there's no way a parser can handle all CSV formats in the wild.
Yeah, but this is less of a problem if it's an internal API. You can stick to a stricter subset of the standard, and/or only handle the types of column values that you actually need
Still probably worth using a library, but it isn't a source of problems in my experience
As much as I like and use CSV for database work, it has a problem with being poorly specified. The most common problems are when processing CSVs produced elsewhere which might not enclose text fields with quotes and thus have issues with data that includes commas and multi-line data.
There is a spec (RFC 4180 [1]) but it's definitely not widely followed. Worse, for a lot of data there are no problems for potentially years, until your numbers get too big or the first time a quote or comma gets into the data.
In my experience one of the biggest barriers I run into -- and the primary reason I hate using CSV -- is Microsoft Excel. It misinterprets numbers as dates, it converts big numeric identifiers to exponents, and more. Even merely opening an RFC4180-compliant file and saving it changes the data, and even Excel itself will often have a different misinterpretation of the file.
If humans never used Excel for CSV, it would be a viable format. At the same time in most cases where humans aren't in the loop (machine-to-machine communications), there's better formats. You could spec "RFC4180 CSV" and hope no developer just sees the "CSV" and assumes they understand. Or specify something like a JSON streaming format and avoid a whole lot of headache.
I avoided CSV for quite a while because I had excel-vs-CSV compatibility concerns like this.
However, when I tested this for myself a few years back, Excel output to my surprise was rfc4180 or darn near it (it might use CRLF rather than LF?) It emitted commas and quotes the same way as the rfc for all the test cases I checked.
That said, I agree with you Excel is problematic as an input source. Usually the problems are the humans who touch the data in excel, but what I’ve found is the automation problems tend to be with Excel parsing and interpreting incoming data (before it goes to CSV.) Exponents, trimming leading zeros, etc. as you say. But if the data is confirmed good in excel before being emitted, the CSV it emits is decent.
Yes, I feel like this would've been more helpful generalized as "Consider DSV" (delimiter-separated values) than CSV specifically, because of the interop issues that often come up. I'd have also mentioned using Parquet.
I generally find Avro to be a better replacement for CSV than Parquet. It’s a better drop in for the typical CSV use case of “process this file row by row”.
Avro is a row-based format that supports schema evolution, but unfortunately it is not a very popular format in general and there's much less support for it than Parquet. My team evaluated several formats (ORC, Parquet, Avro) but landed on Parquet due to columnar performance and wider support, though you're right, it's less performant for row-wise operations (but not by that much). If you really want row-wise performance, better to go with sqlite than Avro.
Parquet has the opposite problem of CSV though. It's so complex to work with, that unless you're specifically in data science, it's both unheard of and unusable.
To read a parquet file in Python, you need Apache Arrow and Pandas. And literally the second result for "parquet python libraries" is an article titled "How To Read Parquet Files In Python Without a Distributed Cluster".
I remember dealing with Parquet files for a job a while back and this same question came up: Why isn't there a simpler way, for when you're not in the data science stack and you just need to convert a parquet file to csv/json/read rows? Is it a limitation of the format itself?
Not really. Depends on your use case but most of the time you’re trading off disk space for a specialized efficient library.
Pandas and Arrow are dependencies like any other. Pandas is like a DSL for working with tabular data, much like numpy is a DSL for working with arrays and numerical algebra. No one working with linear algebra will insist on using the Python standard library built ins.
If you’re distributing a smallish Python app that only needs to read and manipulate smallish amounts of data, then I agree there are easier solves like SQLite.
But if you’re doing consulting work and dealing with large tabular datasets and need to do SQL type window functions and aggregations then Parquet is a better fit and the disk space required for adding a Pandas dependency is trivial. If one is using Anaconda, Pandas is batteries included. It really depends on what is being optimized for.
> It's so complex to work with, that unless you're specifically in data science, it's both unheard of and unusable.
FWIW, in my experience at a "data analytics platform" company, it's reasonably popular for data-heavy workflows since Parquet is well-defined, and file sizes (especially as the amount of data grows) are a fraction of their CSV equivalents.
> Is it a limitation of the format itself?
I don't think so. In other languages, you can generally read/write Parquet files without a ton of dependencies (e.g. https://github.com/xitongsys/parquet-go).
> To read a parquet file in Python, you need Apache Arrow and Pandas.
Or DuckDB.
import duckdb
df = duckdb.query("select * from 'a.parquet'").to_df()
Want to look inside a Parquet file? Use Visidata.
vd a.parquet
> I remember dealing with Parquet files for a job a while back and this same question came up: Why isn't there a simpler way, for when you're not in the data science stack and you just need to convert a parquet file to csv/json/read rows? Is it a limitation of the format itself?
Do you consider Pandas a "data science" stack? To me, it's just a library like any other that makes it easy to work with tabular data. Even for CSV, there is csvreader (usually not a good idea to deal with CSV by hand). Outputting to CSV is literally a one liner in Pandas or DuckDB.
import pandas as pd
# output to CSV
pd.read_parquet("a.parquet").to_csv("a.csv")
# output to JSON (choose from any number of orientations)
pd.read_parquet("a.parquet").to_json(orient="table")
# read rows
for row in pd.read_parquet("a.parquet").itertuples():
    print(row)
I want to use parquet more frequently, but it creates new problems that do not exist if I dump to CSV. Last I looked, there were not any good GUIs that would let someone quickly browse the data. Now it is just a blob lacking introspection. CSV has issues, but it is universal.
A bit round-about, but the slick way I discovered is to take a detour through DuckDB. DuckDB offers parquet bindings which you can link through a kind of foreign data interface and then query through SQL. Using this, you can then just browse parquet files through DBeaver or your IDE of choice. Hardly an out of the box solution I can offer to a random collaborator, but fantastic for your savvy analyst.
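A minimal sketch of the Python side of that detour (read_parquet is a standard DuckDB table function; the file name is just an example):

import duckdb

con = duckdb.connect()  # in-memory database
# query the parquet file directly through DuckDB's parquet reader
df = con.execute("SELECT * FROM read_parquet('data.parquet') LIMIT 10").fetchdf()
print(df)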
That is interesting to hear. Parquet input and output is on the wishlist for our Easy Data Transform software (currently we support CSV, Excel, XML, JSON and a few others). Anyone have any experience integrating Parquet read/write into a C++ application?
if you choose pipe ok, now you have to make sure nobody typed a pipe into the input field or spreadsheet, and you cannot store unix commands
if you choose tab, ok, now people will get confused when they try to edit the text file to replace tabs with spaces, and now you have trouble putting code snippets into data fields because they have tabs.
this is the problem and it's why xml/json exist.
in my particular domain, tab separated works pretty well but in a general context of the world at large, i feel like JSON has reasons it exists.
Well the obvious solution would be ASCII 0x1D (Group Separator)! Except, no one actually uses those ASCII characters. Kind of bums me out that UNIX basically skipped out on them.
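A toy sketch of what using those control codes could look like (purely illustrative; nothing downstream will know how to open the result, which is exactly the adoption problem):

# ASCII unit / record / group separators
US, RS, GS = "\x1f", "\x1e", "\x1d"

rows = [["id", "name"], ["1", "Doe, Jane"]]
blob = RS.join(US.join(fields) for fields in rows)

parsed = [record.split(US) for record in blob.split(RS)]
print(parsed)  # [['id', 'name'], ['1', 'Doe, Jane']]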
Both pipe and tab are infinitely better for so-called human-readable data compared to comma. Comma doesn’t even work well for numbers since some locales use comma as the decimal separator. And a data format can’t be “human-readable” if you’re not allowed to write numbers in the way that you’re used to write them.
In the context of an API, so long as you're using a "real" CSV library, a lot of those inconsistencies do not appear. Problems happen when you have to interface with humans and desktop software (i.e. Excel) which has its own rules.
Yes, CSV is superior to JSON for tabular data, but it has its own issues. One issue is that the standard is not consistently applied. Another is that the approach to escaping makes it hard to parse a CSV file with multiple threads: you have to parse the entire file before you can be sure which " characters escape other characters.
Until you try to either cram it into Excel, work with different encodings or pass it around different software platforms (even your lang vs JS), no difference.
CSV has been abused a lot to make it work on conflicting use-cases; JSON handles a lot of the mishaps that happened with delimiter-separated record formats, like newlines or bring-your-own character encoding.
> The most common problems are when processing CSVs produced elsewhere [...]
The limitations of CSV are certainly worth considering and, in the instances you mentioned, it may not be worth using CSV. (If you are going to be using a more complex parser anyway, you may as well use a format that is better defined and where you are less likely to encounter edge cases.) That being said, there remain many cases where CSV is far more efficient and far less error prone.
Which works until you have a quote in your data. It also prevents chunking out the file to process in parallel since you could have new lines between the quotes.
What I've found to work well is to just % encode your delimiter, the new line character, and the '%' character. Basically every language has utilities for this.
Doesn't solve the issue with accepting outside files though. You have to be pessimistic with those regardless.
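A rough sketch of that % encoding approach, assuming a pipe delimiter (the exact escape set is whatever your format needs; decoding here just reuses urllib's unquote):

from urllib.parse import unquote

def encode_field(value, delim="|"):
    value = value.replace("%", "%25")                   # escape '%' first so decoding is unambiguous
    value = value.replace(delim, f"%{ord(delim):02X}")  # the delimiter itself
    return value.replace("\n", "%0A")                   # embedded newlines

line = "|".join(encode_field(v) for v in ["a|b", "100%", "two\nlines"])
print([unquote(f) for f in line.split("|")])  # round-trips cleanly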
With gzip on web server the difference is not important at all.
CSV in general is problematic as there is no standard (RFC 4180 is not one). In certain contexts this surely can be a good solution, but it's definitely not good in the general scenario.
As Wikipedia puts it, "CSV is widely used to refer to a large family of formats that differ in many ways". If there's a canonical standard, it appears to be RFC4180: https://www.rfc-editor.org/rfc/rfc4180
It appears to be, but it's not. I have not found a single program so far that conforms to this RFC and nothing else.
From the RFC itself:
Status of This Memo
This memo provides information for the Internet community. It does
not specify an Internet standard of any kind. Distribution of this
memo is unlimited.
> I have not found a single program so far that conforms to this RFC and nothing else.
Wouldn't that be impossible, given that parsers have to accept all kind of bizarro CSV flavors? Maybe more importantly, do you know of a single program or single CSV library that doesn't support reading or writing CSV as defined by the RFC?
An "Internet Standard" is just a designation that has been given to an RFC that has been blessed in a certain way. See https://www.rfc-editor.org/ for more details, but the set of designations is:
* Uncategorised
* Historic
* Experimental
* Informational
* Best Current Practice
* Proposed Standard
* Draft Standard
* Internet Standard
Being an "Internet Standard" is kinda special, but not especially so. For example, IMAP4, originally specified in RFC-3501 in March 2003, updated many times since, and revised in RFC-9051 in August 2021, is still a "Proposed Standard" without an STD designation, nearly 20 years and dozens of interoperable implementations later.
"Rough consensus and running code" is how things get done.
RFC-4180 is plenty good enough a "standard" for people to decide to interoperate over. They just have to decide to do so.
(Note also that HTML5 is not an "Internet Standard" according to the IETF et al. The last version to get an RFC was HTML 2 in RFC-1866, designated "Historic". And interoperability was an issue for a while with later versions of HTML during the "Best viewed in Internet Explorer/Netscape Navigator" wars. To get interoperability like we eventually did, you don't need an "Internet Standard"; you just need implementers who want to interoperate, and are willing to favour it over lock-in, and even over strict backwards-compatibility.)
(Also, the "and nothing else" clause in your comment confuses me. Why not support other formats/variants also? "Be liberal in what you accept" is certainly something that you probably want to avoid if you're designing a new format/protocol that no-one else is using yet, but if you're working with a decades-old format that was traditionally poorly-specified, with millions of documents out in the wild, it's probably the best way to allow existing users to move forward.)
That was my first thought: JSON is highly amenable to compression, due to the repetition this blog is complaining about. It's a good lesson for junior devs: if you find yourself thinking about saving bits and bytes with custom protocols, you need to pull out of the rabbit hole and find the existing solution to your problem.
Sure, for a local data file or something where it's nice to be human-readable-ish, CSV can be a better choice than JSON (assuming you use a library for all the edge cases and string escapes.) If you really want a super-small and fast serialization, that's what protobuf is for.
CSV is also great for importing external data into documents. My text editor, KeenWrite[0], includes an R engine and a CSV-to-Markdown function[1]. This means you can write the following in a plain text R Markdown document:
`r#csv2md('filename.csv')`
The editor will convert Markdown to XHTML in the preview panel (in real time), then ConTeXt can typeset the XHTML into a PDF file in various styles.[2][3] This avoids spending time fighting with table formatting/consistency in certain word processors while storing the data in a machine-friendly format. (Thereby upholding the DRY principle because the data can have a single source of truth, as opposed to copying data into documents, which could go stale/diverge.)
Using JSON would be possible, but it's not as easy to convert into a Markdown table.
JSON lines looks kinda interesting, but the newline-delimited thing seems weird.
It seems to me that you could write a JSON streaming parser that, if the outer element is an Array, reads and outputs/processes one Array element at a time as its JSON value. Yeah, you can't get the array length, and if there's a parse error somewhere down the line then you have to figure out how to deal with that (a non-streaming parser would have rejected the whole input), but that's kind of inherent in using a streaming parser. The upside is that you can work with any valid JSON.
Sure, if you're interoperating with shell tools, and don't have `jq` available, newline-delimited JSON might be helpful. But on the other hand, just install `jq`, dummy!
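For what it's worth, that kind of stream-over-an-array parser already exists for Python in the third-party ijson package; a minimal sketch (the file name and the process() handler are hypothetical):

import ijson  # third-party streaming JSON parser

with open("big.json", "rb") as f:
    # yields each element of the top-level array without loading the whole file
    for record in ijson.items(f, "item"):
        process(record)  # hypothetical per-record handler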
Excel compatibility really sux, and Excel is the most used one by far. You really can't double-click it effectively, as everything will be shown as a generic type, so you have to mess around with the import wizard, which is also half-baked. I have to create tutorials for that for each service using it.
This style also works well with jsonlines a sibling comment mentioned. Of course my favorite is MessagePack (or CBOR) using similar styles. MsgPack can be as small as gzipped JSON. :)
I think one of the issues is data types. JSON has them, CSV doesn't, so this means your program needs to be aware of which columns are which data type and do the conversion where needed.
It's similar to JSON Vs INI files for config files.
On a different note, I wouldn't nest JSON in a CSV column. I'd delimit with a pipe or something, then split the string on that. Much simpler if you're in control of the data.
How often is this a concern in practice? It’s a question I’ve been thinking about a bunch and the answer I keep coming back to is that most of the time, encoding the type in the data exchange format isn’t actually very useful. So I would be interested in use cases where it is.
Are you just trusting that the types coming in are going to be the correct ones? What happens if someone sends you `{"foo": "10"}` instead of `{"foo": 10}`? Do you validate with a schema up front? In which case your code already needs to know what types it expects and can convert them. Or are you letting the incorrect types run through your system until they hit a type error at runtime somewhere?
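For reference, the validate-up-front option is only a few lines with something like the jsonschema package (a minimal sketch, not a recommendation of any particular library):

import jsonschema  # third-party JSON Schema validator

schema = {
    "type": "object",
    "properties": {"foo": {"type": "integer"}},
    "required": ["foo"],
}

jsonschema.validate({"foo": 10}, schema)    # passes
jsonschema.validate({"foo": "10"}, schema)  # raises ValidationError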
No idea really, but if you're using a JSON parsing library then that is going to automatically convert for the data types. Which, provided you trust the data, saves you a job.
protobuf is terrible! Now you have to rely on Google-scale tools to generate code for whatever language(s) you want to read or write the data in, and this becomes quite the encumbrance.
I mean point well taken, but, as they acknowledged in the post themselves, CSV isn't suitable when you have a nested structure. And you almost always have/need a nested structure, no?
Relational databases have worked fine for decades without nested structures. The simple trick is to take the nested structure out of the entity and into its own table.
Unless I misunderstood something, I'm not sure I understand the relevance here. I assumed we were talking about sending data to clients. In such cases, you do not send database tables. Instead, you send rich, fully hydrated objects which are the result of joining those tables. The serialized representation can be backed by the relational model, but at some point you have to put those together to send something useful to the client. My only point is that CSV is unsuitable for this task in many/most cases.
That may be a simple trick for the db, but not when your paradigm involves importing files - imagine telling users "instead of giving us JSON, please give us 75 CSV files".
This scenario might be more common than you think -- spreadsheets still reign supreme, and often 75 csv files is how the users have the data to begin with.
(Incidentally, my day job is building a spreadsheet importer.)
I always thought CSV was just fine, until I had to ingest and export a bunch of CSV in my last project.
The big problem is that CSV is not well defined and it's so deceptively simple that many don't bother to adhere to the spec that does exist.
Just a few idiosyncrasies I found:
Inconsistent character encoding. If you open or save a CSV with Excel it will assume a Windows-1252 encoding. Since browsers deal exclusively with UTF-8, this gets really messy.
The CSV I got didn't actually use a comma as a delimiter but a semicolon.
Everyone seems to have conflicting opinions about whether strings should have quotes and if so, which ones.
The CSV I had to deal with also came with a decimal comma, which screwed up even more stuff.
My advice: stay away from CSV as an exchange format. Use something that is well defined.
I'm definitely in the "Just use JSON for most things" camp, but I'm wondering, why would you ever choose CSV for interfacing microservices over protobuf?
Isn't protobuf basically CSV but with good libraries at the interface point and standards around how to deserialize the streams?
I usually prefer a binary encoding. More efficient on the wire, easier to parse and generate, and with no ambiguity. We have 2 control codes given to us by the teletype era that have the perfect meaning for this kind of data: the record separator (0x1E) and the unit separator (0x1F).
The size complaint is overblown, as repeated fields are compressed away.
As other folks rightfully commented, csv is a mine field. One should assume every CSV file is broken in some way. They also don't enumerate any of the downsides of CSV.
What people should consider is using formats like Avro or Parquet that carry their schema with them, so the data can be loaded and analyzed without having to manually deal with column meaning.
If somebody asked me to support this format after you'd left the company I'd quit on the spot.
This frankenformat is 100% premature optimization: non-standardised, requires custom parsers (which are potentially inefficient and may negate the network performance gains, from having to parse both JSON and CSV) and is potentially very difficult to maintain and debug (no syntax highlighters or REST-client-like posting tools).
Just either use gRPC or JSON with regular network-level gzip encoding.
A constrained format based on JSONL, with each record being a tuple of number/string/bool/null, could be better defined than CSV and looks almost like it. The benefit being that almost any JSON library could work with it, or could be made to work one line at a time, and it can be parallelized since newlines only appear as the record delimiter.
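A minimal sketch of a reader for that constrained format (the file name and strictness checks are illustrative):

import json

def read_rows(path):
    allowed = (int, float, str, bool, type(None))
    with open(path, encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            # reject anything that isn't a flat tuple of scalars
            if not (isinstance(row, list) and all(isinstance(v, allowed) for v in row)):
                raise ValueError(f"not a flat record: {line!r}")
            yield row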
It often surprises me that Parquet is not widely known outside of data engineering circles. Most software developers are still mucking around with CSV for large tabular data, which is absolutely the wrong format. Better developers use sqlite, which is less wrong but still wrong. Postgres is closer to the right answer, but for very large, typed tabular data, Parquet is the way to go.
Parquet is a columnar format that is compressed, typed, efficient for columnar queries, append-friendly (though not rewritable), and is a natively supported format for Apache Spark. Parquet libraries are now widely available for most languages (didn't used to be the case, but now they are). I query Parquet files with DuckDB in Python and it blazes through GBs of data in seconds.
At work, the canonical format is TSV (tab separated values) which despite being human-readable, is huge, inefficient to query and does not support data types.
When I have to work with large TSV files (10GB or larger), I first convert them into 500MB Parquet files. The latter are faster, smaller and less prone to type errors. Because columnar formats like Parquet are indexed, I can do complex operations like joins, window functions, and aggregations on them in a performant way, while any similar operation on TSV files will trigger a table scan each time.
I recently ran a Spark job on a very large TSV file which took over 8 hours and timed out. A Spark job on the same data represented in Parquet completed in 5 minutes.
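A rough sketch of that TSV-to-Parquet conversion in pandas (file names and chunk size are illustrative, and to_parquet needs a parquet engine such as pyarrow installed):

import pandas as pd

# stream the big TSV in chunks and write each chunk as its own parquet part
reader = pd.read_csv("big.tsv", sep="\t", chunksize=5_000_000)
for i, chunk in enumerate(reader):
    chunk.to_parquet(f"part-{i:04d}.parquet", index=False)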
I cannot count how many times CSV "format" has caused problems for me..
In my country the decimal separator is a comma instead of a period.
This causes problems when importing and exporting with this "format".
Just a few weeks ago I had fun times working with an API returning CSV in an unknown encoding. Hopefully they will never make changes (you cannot always trust headers). Ah, and I do love it when a CSV is missing headers and someone adds data into the middle.
Of course some of these issues can be avoided by doing the things "right". Sadly you cannot trust this in real life. People write ugly structures in JSON, but at least you can validate results..
No. Just no. The amount of times I've had issues with CSVs exported from a non-US locale is insane. They use semi-colon as separator, as for some weird reason they use the comma as the decimal point.
Then there's the issue of encoding, as that is also not the same across locales. Then you get a CSV with the BOM characters up front or some French accents represented as ? because of incorrect encoding parsing / saving.
At least JSON doesn't have any of these things. Standardized strings, and standardized number format.
You could have the advantages of both worlds by having one json object per line. You could stream process, and you could structure more complex objects and have consistent escaping.
You could do that, but you could also end up with JSON that is not streamable. You can't guarantee how large a JSON object will be, but you can make a good guess at how large a CSV row will probably be.
I quite like CSVs. I've used them to great effect at maybe every job I've ever had. xsv, sqlite, and Excel/LibreOffice provide useful tooling on top of them.
I see a lot of complaining about "no standard" in this thread, but the way I've used them, it's been fine. I just use Haskell's cassava. If humans produce them with Excel/LibreOffice, I never have issues on the ingestion end.
Another alternative is a streaming JSON format like JSONL (newline-delimited JSON). You can parse one record/line at a time, but still have the structure and named fields of JSON.
I worked at a company where we did this for some endpoints and it worked great. Our client app had to request enormous time-series datasets and using CSV cut a significant percentage off of the payload size. I recommend it if you have similar constraints
i had a lot of fun exploring the performance ceiling of csv and csv like formats. turns out binary encoding of size prefixed byte arrays is fast[1].
csv is just a sequence of 2d byte arrays. probably avoid if dealing with heterogeneous external data. possibly use if dealing with homogeneous internal data.
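A toy sketch of what size-prefixed byte arrays can look like (illustrative framing only, not the linked benchmark's exact format): each field is a little-endian u32 length followed by the raw bytes.

import struct

def write_row(out, fields):
    # fields is a list of bytes objects
    for field in fields:
        out.write(struct.pack("<I", len(field)))
        out.write(field)

def read_fields(buf):
    pos = 0
    while pos < len(buf):
        (n,) = struct.unpack_from("<I", buf, pos)
        yield buf[pos + 4 : pos + 4 + n]
        pos += 4 + n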
Delimited formats' performance can be exceptional; they can also be phenomenally terse and avoid the string tarpits of CSV and TSV if you just use these Unicode characters: ␟ ␞ ␝.
Those are not the unit/record/group separator characters! Those are the graphical symbols for the unit/record/group separator codes. The actual unit/record/group separator codes are in ASCII, as 'tremon' writes in a sibling comment.
I guess I shouldn't post comments when I'm terribly hungover. The fact remains, use those separators; most developers don't even seem aware of them to any degree.
You may also simply add a format specification and return either CSV or JSON depending on the need or the context. Most languages have what they need to return either without much trouble.