
The two I reach for first: Dask and SparkSQL

Dask is super easy and quick to learn. It provides similar features to Spark but can be somewhat easier for the Pandas crowd. There's also Modin/Ray for this, but I haven't tried it yet.

For very fast processing and ease of writing, SparkSQL is the tool I reach for. Start a single-node Spark instance (super easy), then interactively wrangle your data declaratively with SQL. Great for quick and dirty cleaning and aggregation of big-ish data.

If you're into Google Cloud, BigQuery is currently my top tool for quick and dirty processing, but you can do a lot more with your $5 per TB with a giant Compute Engine high-mem instance and Dask or SparkSQL.




Thanks for this. I did not know about Dask! Wow, this looks great. Love the web-based task visualizations: https://distributed.dask.org/en/latest/web.html


Check out the Dask Bag, it's my favorite feature. It helps you deal with non-tabular data that also might not be structured consistently: https://examples.dask.org/bag.html

Everybody I show it to likes it even more than working with data frames once they grok it.
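A small sketch of the kind of thing Bag handles well, assuming `dask` is installed. The JSON records are invented, and the synchronous scheduler is used only to keep the example in one process:

```python
import json
import dask.bag as db

# Hypothetical log lines: JSON records whose fields are not consistent.
lines = [
    '{"user": "ann", "tags": ["a", "b"]}',
    '{"user": "bob"}',
    '{"user": "ann", "tags": ["b"], "extra": 1}',
]

# Build a bag from the raw strings and parse each one lazily.
bag = db.from_sequence(lines, npartitions=2).map(json.loads)

# Pull out whatever tags exist, tolerating records that lack the field,
# then flatten the lists and count occurrences of each tag.
tag_counts = dict(
    bag.map(lambda rec: rec.get("tags", []))
    .flatten()
    .frequencies()
    .compute(scheduler="synchronous")
)
print(tag_counts)
```

With real files you would typically start from `db.read_text` on a glob of paths instead of `from_sequence`.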



