
It depends on what you're trying to optimise, but parquet is a very good all-round option. I've never really gotten into HDF, as it always felt like a good solution only if I moved everything over to it. It's great if your use case fits.

Feather is a layer on top of arrow and started as a proof of concept (so I'm not sure how heavily it's used now), while arrow is fast becoming the interchange format. It's laid out exactly as things will be in memory, which means zero copy when shuttling data from one place to another. I _think_ there is less support for feather, but that is likely changing as everything converges.

Parquet should be

* Faster to write

* Faster to read (even if you're reading the whole file, which actually isn't required, the format helps you read just sections of the columns you need)

* Smaller

* Better at handling actual floating points

than CSV, while having an actual standard behind it. Btw, be a little wary of letting pandas guess the column types for you if you're creating partitioned files.
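One way the type guessing can bite, as a small sketch: two chunks that are meant to end up as parts of the same dataset, where a missing value makes pandas infer a different dtype for one of them. The chunk and column names here are hypothetical, and casting to the nullable "Int64" dtype is just one possible fix:

```python
import pandas as pd

# Two hypothetical chunks destined for the same partitioned dataset.
chunk_a = pd.DataFrame({"image_id": [1, 2], "count": [3, 4]})
chunk_b = pd.DataFrame({"image_id": [3, 4], "count": [5, None]})

# The missing value makes pandas infer float64 for the second chunk only.
inferred_a = str(chunk_a["count"].dtype)  # int64
inferred_b = str(chunk_b["count"].dtype)  # float64

# Pin the type explicitly before writing so every file agrees.
chunk_a["count"] = chunk_a["count"].astype("Int64")
chunk_b["count"] = chunk_b["count"].astype("Int64")
```

If each chunk were written out with its inferred type, the files would disagree on the schema of "count", which is the kind of mismatch readers tend to trip over later.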

When you're working with pandas etc. (check out Dask), you can pretty much just swap out the reading and writing functions (read_csv/to_csv become read_parquet/to_parquet). You can also use pyarrow directly if you need to be very careful about column types.

For your use case you may want to explicitly use a single list-typed column for the features; I'm not sure whether that's better or worse than having so many columns. If a reader may want to find just the images where a small subset of features are > X, you might benefit from multiple columns, so that the reader only processes the data it needs.

Worth testing out, and I expect you should be able to try it in an afternoon if you're already working with pandas or similar. Just install pyarrow and use to_parquet. Things like dask (or straight pyarrow) can give you partitioned files as output too, if there's a useful column or columns to split on https://arrow.apache.org/docs/python/parquet.html#partitione...


