Dask is super easy and quick to learn provides similar features to spark but can be somewhat easier for the Pandas crowd. There's also Modin/Ray for this but I haven't tried it yet.
For very fast processing and ease of writing SparkSQL is the tool I reach for. Start a single node spark instance (super easy) then interactively wrangle ur data declaratively with SQL. Great for quick and dirty cleaning and aggregation of big-ish data.
If you're into google cloud BigQuery is currently my top tool for quick and dirty processing but u can do a lot more with ur 5$/1TB with a giant compute engine high mem instance and Dask or SparkSQL.
Check out the Dask Bag it’s my favorite feature, it helps you deal with non tabular data that also might not be structured consistently: https://examples.dask.org/bag.html
Everybody I show it to likes it even more than working with data frames once they grok it.
Dask is super easy and quick to learn provides similar features to spark but can be somewhat easier for the Pandas crowd. There's also Modin/Ray for this but I haven't tried it yet.
For very fast processing and ease of writing SparkSQL is the tool I reach for. Start a single node spark instance (super easy) then interactively wrangle ur data declaratively with SQL. Great for quick and dirty cleaning and aggregation of big-ish data.
If you're into google cloud BigQuery is currently my top tool for quick and dirty processing but u can do a lot more with ur 5$/1TB with a giant compute engine high mem instance and Dask or SparkSQL.