Those methods were always appealing to me, and I tried to build some home-use applications with them (document management, even some bash CGI scripts), but they fall over quite soon:
* How do you deal with words like "---" in your text that look like the match separator of grep?
* What if your filenames contain spaces?
* Are the sed/awk/perl one liners really all that readable and correct?
* How to catch and report failure conditions ... in pipe steps?
This stuff is great for interactive use and one-off ETL, not for applications.
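To make the filename and failure-reporting points concrete, here is a minimal Python sketch of a pipeline runner. The names `run_step` and `run_pipeline` are made up for illustration. It assumes each step can buffer its whole input in memory, so unlike a real shell pipe it does not stream. Arguments are passed as a list, so a filename with spaces stays a single argument, and a failing step raises with its exit code and stderr.

```python
import subprocess
import sys


def run_step(args, stdin_bytes=b""):
    """Run one step; raise with exit code and stderr if it fails."""
    result = subprocess.run(args, input=stdin_bytes, capture_output=True)
    if result.returncode != 0:
        stderr = result.stderr.decode(errors="replace").strip()
        raise RuntimeError(
            f"step {args!r} failed with exit code {result.returncode}: {stderr}"
        )
    return result.stdout


def run_pipeline(steps, data=b""):
    """Feed the output of each step into the next one (buffered, not streamed)."""
    for args in steps:
        data = run_step(args, data)
    return data


# The child processes are small Python one-liners so the sketch runs anywhere.
echo_arg = [sys.executable, "-c",
            "import sys; sys.stdout.write(sys.argv[1])", "my file.txt"]
upper = [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.stdin.read().upper())"]

output = run_pipeline([echo_arg, upper])
```

The price is exactly what the list above complains about: it is no longer a one-liner, and you lose streaming unless you wire up `subprocess.Popen` objects yourself.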
Not sure what real alternatives are that give you:
- parallel execution
- seamless composition (like |)
- object passing not byte streams
- quick to write
Most of the time I switch to Python for this, but it does not give you sane parallelism.
Sure you can do this with Java + Akka, but this takes days to build out...
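For reference, this is roughly how far the Python standard library gets you on that wishlist. It is only a sketch: `Stage`, `pmap` and `where` are hypothetical helpers, not an existing library. Stages compose with `|` and pass Python objects, and `pmap` fans work out over a thread pool. Because of the GIL, threads only help with I/O-bound steps. CPU-bound work would need processes, which is where it stops being quick to write.

```python
from concurrent.futures import ThreadPoolExecutor


class Stage:
    """Wrap a function over a list of objects so stages compose with `|`."""

    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        # (a | b)(items) runs a first, then feeds its result to b.
        return Stage(lambda items: other.fn(self.fn(items)))

    def __call__(self, items):
        return self.fn(items)


def pmap(func, workers=4):
    """Stage applying `func` to every item on a thread pool, keeping order."""
    def run(items):
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # Exceptions raised in workers are re-raised here.
            return list(pool.map(func, items))
    return Stage(run)


def where(pred):
    """Stage keeping only the items for which `pred` is true."""
    return Stage(lambda items: [x for x in items if pred(x)])


def count_words(rec):
    return {**rec, "words": len(rec["text"].split())}


pipeline = pmap(count_words) | where(lambda rec: rec["words"] > 2)

records = [
    {"name": "a b.txt", "text": "one two three"},
    {"name": "c.txt", "text": "--"},
]
result = pipeline(records)
```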
Dask is super easy and quick to learn. It provides similar features to Spark but can be somewhat easier for the Pandas crowd. There's also Modin/Ray for this, but I haven't tried it yet.
For very fast processing and ease of writing, SparkSQL is the tool I reach for. Start a single-node Spark instance (super easy), then interactively wrangle your data declaratively with SQL. Great for quick and dirty cleaning and aggregation of big-ish data.
If you're into Google Cloud, BigQuery is currently my top tool for quick and dirty processing, but you can do a lot more for your $5/TB with a giant high-mem Compute Engine instance and Dask or SparkSQL.
Check out Dask Bag, it's my favorite feature. It helps you deal with non-tabular data that also might not be structured consistently: https://examples.dask.org/bag.html
Everybody I show it to likes it even more than working with data frames once they grok it.
Yeah, I should have made that more clear: I was talking about the sed expression on p17. It looks for "--" in stdin.
This could also be a word.
I realize now that the earlier tr step removes all non a-zA-Z characters, so in this case it should not be an issue.
However, intermixing text with separators is not trivial. There are reasons we use JSON/XML for exchanging structured data.
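A small Python sketch of the difference, with made-up records: joining texts with a "--" separator line breaks as soon as a text contains that line itself, while one JSON document per line survives because newlines inside the text get escaped.

```python
import json

records = [
    {"file": "my notes.txt", "text": "first line\n--\nsecond line"},
    {"file": "b.txt", "text": "plain"},
]

# Naive approach: separate the texts with a "--" line, then split again.
SEPARATOR = "\n--\n"
joined = SEPARATOR.join(rec["text"] for rec in records)
naive_pieces = joined.split(SEPARATOR)  # three pieces for two records


def dump_lines(recs):
    """Encode each record as one JSON document per line."""
    return "".join(json.dumps(rec) + "\n" for rec in recs)


def load_lines(stream):
    """Decode one JSON document per non-empty line."""
    return [json.loads(line) for line in stream.split("\n") if line]


encoded = dump_lines(records)
decoded = load_lines(encoded)  # round-trips to the original records
```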
Any recommendations?