
It's not possible to read ahead and chunk with 100% assurance it will always work, but libraries like pandas and R's data.table do a reasonable job of reading in the first X rows, doing some usually-correct type inference on the columns, and then chunking through the rest of the rows with those inferred types.
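
A minimal sketch of that sample-then-chunk pattern with pandas; the file name, sample size, and chunk size are all placeholders:

    import pandas as pd

    # Infer column dtypes from a sample of the first 1,000 rows.
    sample = pd.read_csv("data.csv", nrows=1000)
    dtypes = sample.dtypes.to_dict()

    # Stream the rest in chunks, pinning the inferred dtypes.
    # If a later row violates an inferred type (say, text appearing
    # in a column the sample suggested was numeric), read_csv raises.
    # That is exactly the no-100%-assurance failure mode.
    for chunk in pd.read_csv("data.csv", dtype=dtypes, chunksize=100_000):
        process(chunk)  # process() is a stand-in for your own per-chunk work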

For what it's worth, I totally agree something like compressed JSON lines is a better data exchange format, but part of why csv remains as universal and well supported as it is is that so many existing data-storage applications export to either csv or excel and that's about it. So any ETL system that can't strictly control the source of its input data has no choice but to support csv.
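
For illustration, compressed JSON lines takes nothing beyond the Python standard library; the file name and sample records here are made up:

    import gzip
    import json

    records = [{"id": 1, "price": 9.99}, {"id": 2, "price": None}]

    # One self-describing JSON object per line, gzip-compressed,
    # so types survive the round trip and readers can stream.
    with gzip.open("records.jsonl.gz", "wt", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

    # Reading back also streams, one record per line.
    with gzip.open("records.jsonl.gz", "rt", encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)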


