
Those methods were always appealing to me, and I tried to build some home-use applications with them (document management, even some bash CGI scripts), but they fall over quite soon:

* How do you deal with words like "---" in your text that look like the match separator of grep?

* What if your filenames contain spaces?

* Are the sed/awk/perl one liners really all that readable and correct?

* How to catch and report failure conditions ... in pipe steps?

This stuff is great for interactive use and one-off ETL, not for applications.
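For what it's worth, two of the questions above have partial answers in bash itself. The following is only a minimal sketch, assuming bash and an xargs that supports `-0` and `-P` (GNU and BSD extensions, not POSIX); the temporary files and variable names are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical scratch directory holding one file with a space in its name.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
printf 'one\ntwo\n' > "$workdir/my notes.txt"
printf 'three\n'    > "$workdir/plain.txt"

# Filenames with spaces: pass names NUL-delimited so whitespace is never split.
# -n 1 gives each wc exactly one file; -P 2 runs up to two wc processes at once.
line_total=$(find "$workdir" -type f -name '*.txt' -print0 \
  | xargs -0 -n 1 -P 2 wc -l \
  | awk '{ sum += $1 } END { print sum }')

# Failures in pipe steps: by default a pipeline reports only its last command.
false | cat
default_status=$?            # the failure of `false` is hidden here

# With pipefail, the pipeline fails if any step fails.
pipefail_status=0
set -o pipefail
false | cat || pipefail_status=$?
set +o pipefail

# PIPESTATUS holds the exit status of every step of the last pipeline.
true | false | true
step_statuses="${PIPESTATUS[*]}"

echo "lines=$line_total default=$default_status pipefail=$pipefail_status steps=$step_statuses"
```

This does not make the one-liners more readable, and reporting which step failed still takes manual bookkeeping, so it softens the point above rather than refuting it.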

Not sure what real alternatives are that give you:

- parallel execution

- seamless composition (like |)

- object passing not byte streams

- Quick to write.

Most of the time I switch to Python for this, but it does not give you sane parallelism. Sure, you can do this with Java + Akka, but that takes days to build out...

Any recommendations?




The two I reach for first: Dask and SparkSQL.

Dask is super easy and quick to learn. It provides similar features to Spark but can be somewhat easier for the Pandas crowd. There's also Modin/Ray for this, but I haven't tried it yet.

For very fast processing and ease of writing, SparkSQL is the tool I reach for. Start a single-node Spark instance (super easy), then interactively wrangle your data declaratively with SQL. Great for quick and dirty cleaning and aggregation of big-ish data.

If you're into Google Cloud, BigQuery is currently my top tool for quick and dirty processing, but you can do a lot more with your $5 per TB with a giant Compute Engine high-memory instance and Dask or SparkSQL.


Thanks for this. I did not know about Dask! Wow, this looks great. Love the web-based task visualizations: https://distributed.dask.org/en/latest/web.html


Check out Dask Bag, it's my favorite feature. It helps you deal with non-tabular data that also might not be structured consistently: https://examples.dask.org/bag.html

Everybody I show it to likes it even more than working with data frames once they grok it.


>How do you deal with words like "---" in your text that look like the match separator of grep?

For this one you can use the "--" flag to signal that everything after it should be treated as an argument rather than an option.

    $ grep -rn -- -
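To make that concrete for the "---" case, here is a small sketch against a throwaway file (the sample text and variable names are made up for illustration). It shows `--`, and also `-e`, `-F`, and `-x`, which are standard grep options for marking the pattern, matching it literally, and matching whole lines only.

```shell
#!/usr/bin/env bash
# Hypothetical sample text with a separator line and some dash-heavy content.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf 'intro\n---\na---b\nbody -- text\n' > "$sample"

# "--" ends option parsing, so '---' is taken as the pattern, not as an option.
count_dashdash=$(grep -c -- '---' "$sample")

# -e marks the next argument as the pattern; -F matches it as a literal string.
count_literal=$(grep -c -F -e '---' "$sample")

# -x additionally requires the whole line to match, i.e. a bare separator line.
count_exact=$(grep -c -x -F -e '---' "$sample")

echo "dashdash=$count_dashdash literal=$count_literal exact=$count_exact"
```

The last form is the only one that tells a separator line apart from text that merely contains the same characters.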


Yeah, I should have made that more clear: I was talking about the sed expression on p17. It looks for "--" in stdin. This could also be a word. I realize now that the tr just before it removes all non-a-zA-Z characters, so in this case it should not be an issue.

However intermixing text with separators is not trivial. There are reasons we use JSON/XML for exchanging structured data.



