
I deal with gig-sized CSVs all the time and don’t have any performance issues. These aren’t huge files, but they’re decent sized; most are just a few megs, with thousands to millions of records.

CSV is not very performant, but it doesn’t matter for these use cases.

I’ll also add that I’m not working with the CSVs themselves; they’re just I/O, so any memory issues are handled by the load process. I certainly don’t use CSVs for my internal processes, just for when someone sends me data or I have to send it back to them.

That being said, my workstation is pretty big and can handle tens of gigs of CSV before I care. But that’s usually just for dev or debugging, and anything that sticks around will be working with data in a proper store (usually Parquet distributed across nodes).




That may be your experience, but it's certainly not a universal one (and apparently not the author's, either). In my experience, it's pretty easy to have CSVs (or Parquet files, or whatever) that are tens or hundreds of GBs in size. The space savings from a more modern file format are significant, as is the convenience of being able to specify and download/open only a subset of rows or columns over the network. Most of us don't have workstations with 50 GB of RAM, because it's far more cost-effective to use a cloud VM if you only occasionally need that much memory.

That being said, the real point here is that folks blindly use CSVs for internal-facing processes even though there's no particular reason to, and CSVs have plenty of drawbacks. If you're just building some kind of ETL pipeline, why wouldn't you use Parquet? It isn't as if you're opening stuff in Excel.
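For what it's worth, here's a minimal sketch of the swap, assuming pandas with pyarrow installed (the file and column names are made up):

    import pandas as pd

    # Same logical ETL step, two formats.
    df = pd.read_csv("events.csv")        # parses every byte, loads every column
    df.to_parquet("events.parquet")       # columnar, compressed binary

    # Parquet lets you read only the columns you need, which is
    # what makes the "subset over the network" point above work:
    subset = pd.read_parquet("events.parquet", columns=["user_id", "ts"])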


The author is giving universal advice to all friends.

It would be a different story if the title were “friends in certain circumstances shouldn’t let friends in certain circumstances export to CSV.”

Even a laptop with 8 GB of RAM can open a gig-sized CSV.
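And if it can't hold the whole thing at once, streaming works too. A minimal sketch assuming pandas (file name and chunk size made up); peak memory stays well below the file size:

    import pandas as pd

    # Process a multi-gig CSV without ever holding it all in RAM.
    total_rows = 0
    for chunk in pd.read_csv("big_export.csv", chunksize=1_000_000):
        total_rows += len(chunk)  # or any per-chunk aggregation
    print(total_rows)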

Of course the internals of your ETL will use some efficient data structure, but you’d still want to export as CSV at some point to get data to other people. Or you want your friends to export CSV to get data to you.


If I run a simulation workload, it's pretty easy to generate gigabytes of data per second. CSV encoding adds huge overhead in both space and time, so saving trajectories to disk for later analysis can easily become the bottleneck.
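Roughly, a float64 is 8 bytes in binary but 20+ characters of decimal text that has to be formatted on write and parsed on read. A toy illustration with NumPy (file names made up; exact ratios depend on the format string):

    import numpy as np

    traj = np.random.rand(1_000_000, 6)          # stand-in for one trajectory

    np.save("traj.npy", traj)                    # raw binary dump, ~48 MB
    np.savetxt("traj.csv", traj, delimiter=",")  # text: ~3x larger, much slower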

I've run into many other situations where CSV was the bottleneck.

I still would default to CSV first in many situations because it's robust and easily inspected by hand.


> That being said, my workstation is pretty big and can handle tens of gigs of CSV before I care.

How much RAM do you have? What's the ratio of [smallest CSV file which bottlenecks]/[your RAM]?


My dev workstation has 96 GB. I don’t work with massive data files, so I’ve never really hit my limit. I think the biggest raw data file I’ve opened was 10-20 GB.



