
I deal with gig-sized CSVs all the time and don’t have any performance issues. These aren’t huge files, but they’re decent sized; most are just a few megs, with thousands to millions of records.

CSV is not very performant, but it doesn’t matter for these use cases.

I’ll also add that I’m not working with the CSVs themselves; they’re just I/O, so any memory issues are handled by the load process. I certainly don’t use CSVs for my internal processes, just for when someone sends me data or I have to send it back to them.

That being said, my workstation is pretty big and can handle tens of gigs of CSV before I care. But that’s usually just for dev or debugging, and anything that sticks around will be working with data in a proper store (usually Parquet distributed across nodes).




That may be your experience, but it's certainly not a universal one (and apparently not the author's, either). In my experience, it's pretty easy to have CSVs (or Parquet files, or whatever) that are tens or hundreds of GBs in size. The space savings from a more modern file format are significant, as is the convenience of being able to specify and download/open only a subset of rows or columns over the network. Most of us don't have workstations with 50 GB of RAM, because it's far more cost-effective to use a cloud VM if you only occasionally need that much memory.

That being said, the real point here is that folks blindly use CSVs for internal-facing processes even though there's no particular reason to, and CSVs have plenty of drawbacks. If you're just building some kind of ETL pipeline, why wouldn't you use Parquet? It isn't as if you're opening stuff in Excel.
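For what it's worth, here's a minimal sketch of the swap, assuming pandas with pyarrow installed (the file and column names are made up):

    import pandas as pd

    # Same logical ETL step, two formats.
    df = pd.read_csv("events.csv")        # parses every byte, loads every column
    df.to_parquet("events.parquet")       # columnar, compressed binary

    # Parquet lets you read only the columns you need, which is
    # what makes the "subset over the network" point above work:
    subset = pd.read_parquet("events.parquet", columns=["user_id", "ts"])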


The author is giving universal advice to all friends.

It would be a different story if the title were “friends in certain circumstances shouldn’t let friends in certain circumstances export to CSV.”

Even a laptop with 8 GB of RAM can open a gig-sized CSV.
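And if it can't hold the whole thing at once, streaming works too. A minimal sketch assuming pandas (file name and chunk size made up); peak memory stays well below the file size:

    import pandas as pd

    # Process a multi-gig CSV without ever holding it all in RAM.
    total_rows = 0
    for chunk in pd.read_csv("big_export.csv", chunksize=1_000_000):
        total_rows += len(chunk)  # or any per-chunk aggregation
    print(total_rows)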

Of course the internals of your ETL will use some efficient data structure, but you’d still want to export as CSV at some point to get data to other people. Or you want your friends to export CSV to get data to you.


If I run a simulation workload, it's pretty easy to generate gigabytes of data per second. CSV encoding adds huge overhead in both space and time, so saving trajectories to disk for later analysis can easily become the bottleneck.
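Roughly, a float64 is 8 bytes in binary but 20+ characters of decimal text that has to be formatted on write and parsed on read. A toy illustration with NumPy (file names made up; exact ratios depend on the format string):

    import numpy as np

    traj = np.random.rand(1_000_000, 6)          # stand-in for one trajectory

    np.save("traj.npy", traj)                    # raw binary dump, ~48 MB
    np.savetxt("traj.csv", traj, delimiter=",")  # text: ~3x larger, much slower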

I've run into many other situations where CSV was the bottleneck.

I still would default to CSV first in many situations because it's robust and easily inspected by hand.


> That being said, my workstation is pretty big and can handle tens of gigs of CSV before I care.

How much RAM do you have? What's the ratio of [smallest CSV file which bottlenecks]/[your RAM]?


My dev workstation has 96 GB. I don’t work with massive data files, so I’ve never really hit my limit. I think the biggest raw data file I’ve opened was 10-20 GB.



