Those methods were always appealing to me, and I tried to build some home use app...

faizshah · on Jan 31, 2020

The two I reach for first: Dask and SparkSQL

Dask is super easy and quick to learn. It provides similar features to Spark but can be somewhat easier for the Pandas crowd. There's also Modin/Ray for this, but I haven't tried it yet.

For very fast processing and ease of writing, SparkSQL is the tool I reach for. Start a single-node Spark instance (super easy), then interactively wrangle your data declaratively with SQL. Great for quick and dirty cleaning and aggregation of big-ish data.

If you're into Google Cloud, BigQuery is currently my top tool for quick and dirty processing, but you can do a lot more with your $5 per 1 TB with a giant Compute Engine high-mem instance and Dask or SparkSQL.

heinrichhartman · on Jan 31, 2020

Thanks for this. I did not know about Dask! Wow, this looks great. Love the web-based task visualizations: https://distributed.dask.org/en/latest/web.html

faizshah · on Feb 1, 2020

Check out Dask Bag, it's my favorite feature. It helps you deal with non-tabular data that also might not be structured consistently: https://examples.dask.org/bag.html

Everybody I show it to likes it even more than working with data frames once they grok it.

bkq · on Jan 31, 2020

>How do you deal with words like "---" in your text that look like the match separator of grep?

For this one you can use "--" to signal the end of options, so that everything after it is treated as an argument rather than a flag.

    $ grep -rn -- -
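
The same end-of-options convention shows up outside grep. Here is a minimal sketch using Python's argparse, which also treats everything after "--" as positional. The parser name `minigrep` and its `-n` flag are made up for illustration and are not grep itself.

```python
import argparse

# Hypothetical grep-like parser; the program, flag, and argument names are
# illustrative assumptions, not the real grep interface.
parser = argparse.ArgumentParser(prog="minigrep")
parser.add_argument("-n", action="store_true", help="print line numbers")
parser.add_argument("pattern")

# Everything after "--" is taken as a positional argument,
# even when it starts with a dash.
args = parser.parse_args(["-n", "--", "---"])
print(args.n, args.pattern)
```

Without the "--", argparse would try to read "---" as an option and exit with a usage error, which is roughly the confusion the flag avoids in grep too.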

heinrichhartman · on Jan 31, 2020

Yeah, I should have made that more clear: I was talking about the sed expression on p17. It looks for "--" in stdin. This could also be a word. I realize now that the first tr before it removes all non a-zA-Z characters, so in this case it should not be an issue.

However, intermixing text with separators is not trivial. There are reasons we use JSON/XML for exchanging structured data.
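
A rough sketch of the problem, assuming "--" as the separator and a made-up pair of fields: a naive join/split loses the field boundaries as soon as a field contains the separator, while a JSON round trip keeps them.

```python
import json

SEPARATOR = "--"  # assumed separator, as in the sed discussion above

# Hypothetical record; the first field happens to contain the separator.
fields = ["well--known", "example"]

# Naive approach: join with the separator, then split on it again.
naive_round_trip = SEPARATOR.join(fields).split(SEPARATOR)

# JSON approach: the encoder quotes each field, so boundaries survive.
json_round_trip = json.loads(json.dumps(fields))

print(naive_round_trip)  # ['well', 'known', 'example']
print(json_round_trip)   # ['well--known', 'example']
```

Escaping the separator by hand would also work, but at that point you are halfway to reinventing a structured format.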