Those methods were always appealing to me, and I tried to build some home-use applications with them (document management, even some bash CGI scripts), but they fall over quite soon:
* How do you deal with words like "---" in your text that look like the match separator of grep?
* What if your filenames contain spaces?
* Are the sed/awk/perl one liners really all that readable and correct?
* How do you catch and report failure conditions ... in individual pipe steps? (see the sketch below)
This stuff is great for interactive use and one-off ETL, not for applications.
I'm not sure what real alternatives there are that give you:
- parallel execution
- seamless composition (like |)
- object passing, not byte streams
- quick to write
Most of the time I switch to Python for this, but it does not give you sane parallelism.
Sure, you can do this with Java + Akka, but that takes days to build out... Any recommendations?
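The closest I ever got for the failure and spaces questions is bash strict mode plus null-delimited file names. A minimal sketch, with an invented directory and search term:

  #!/usr/bin/env bash
  # Fail on errors, unset variables, and failures anywhere in a pipeline.
  set -euo pipefail

  # Null-delimited output from find survives spaces and newlines in filenames;
  # -r (GNU xargs) skips running grep if find found nothing.
  find ./docs -name '*.txt' -print0 \
    | xargs -0 -r grep -l -- 'invoice' \
    | sort \
    || echo "pipeline failed with status $?" >&2

Even then, grep exiting 1 on "no matches" gets reported as a failure here, which is exactly the kind of edge case that makes these pipelines feel brittle once they become applications.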
Dask is super easy and quick to learn. It provides similar features to Spark but can be somewhat easier for the Pandas crowd. There's also Modin/Ray for this, but I haven't tried them yet.
For very fast processing and ease of writing, Spark SQL is the tool I reach for. Start a single-node Spark instance (super easy), then interactively wrangle your data declaratively with SQL. Great for quick-and-dirty cleaning and aggregation of big-ish data.
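Roughly what that looks like, assuming Spark is installed locally and a made-up events.csv:

  # Single-node Spark SQL shell; local[*] uses all cores on the box.
  spark-sql --master 'local[*]'

  # Then, at the spark-sql> prompt, something like:
  #   CREATE TEMPORARY VIEW events
  #     USING csv OPTIONS (path 'events.csv', header 'true', inferSchema 'true');
  #   SELECT user_id, count(*) AS n FROM events GROUP BY user_id ORDER BY n DESC LIMIT 20;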
If you're into Google Cloud, BigQuery is currently my top tool for quick-and-dirty processing, but you can do a lot more with your $5/TB by pairing a giant Compute Engine high-mem instance with Dask or Spark SQL.
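The quick-and-dirty BigQuery version from the CLI, against one of the public sample tables:

  # Standard SQL via the bq command-line tool (Cloud SDK):
  bq query --use_legacy_sql=false \
    'SELECT word, SUM(word_count) AS n
     FROM `bigquery-public-data.samples.shakespeare`
     GROUP BY word ORDER BY n DESC LIMIT 10'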
Check out Dask Bag, it's my favorite feature; it helps you deal with non-tabular data that might not even be structured consistently: https://examples.dask.org/bag.html
Everybody I show it to likes it even more than working with data frames once they grok it.
Yeah, I should have made that more clear: I was talking about the sed expression on p17. It looks for "--" in stdin.
This could also be a word.
I realize now that the tr earlier in the pipeline removes all non-a-zA-Z characters, so in this case it should not be an issue.
However, intermixing text with separators is not trivial. There are reasons we use JSON/XML for exchanging structured data.
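The slides are locked away, so this isn't the exact pipeline, but the tr step under discussion is the classic word-frequency idiom (notes.txt is a stand-in):

  # tr -c (complement) -s (squeeze) turns every run of non-letters into a
  # single newline, so a literal "--" in the text never reaches later stages.
  tr -cs 'a-zA-Z' '\n' < notes.txt \
    | tr 'A-Z' 'a-z' \
    | sort \
    | uniq -c \
    | sort -rn \
    | head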
For those interested in building their own local offline datasets/search engines, check out Kiwix and Zeal and understand how they work.
The code is open and there are a ton of already-created data dumps + indexes, so you don't have to spend time rebuilding a Wikipedia/Wikidata/Stack Overflow dump and index yourself.
I did build something like this (but even simpler) for the support website of my side project: https://support.filestash.app/. It's a PHP script calling grep and displaying the results. It's very hacky but good enough for its intended use case: searching through the entire IRC chat log.
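I don't know the exact command the script wraps, but it's presumably something in this spirit (the paths and the variable are made up):

  # Recursive, case-insensitive search over the log directory; the "--" stops
  # grep from treating a query that starts with "-" as an option.
  grep -rin --include='*.log' -- "$query" /var/irc-logs/ | head -n 50

The query obviously needs escaping before it ever reaches a shell, which is where most of the hackiness lives.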
I did something similar too. For my last move, I wrote a detailed list of which item was in which box. My original plan was to add a QR code to each box, so I could quickly see what's inside.
But once I was done, I realized I had it backwards. So I wrote a PHP page to grep the list and figure out which box a specific item was in.
The linked site on the slides is password-protected, and the Internet Archive is silent on it as well. Does anyone have a copy of the referenced materials?
I always enjoy demonstrating various combinations of cat, grep, uniq, sort, and cut to folks unfamiliar with the command line; data scientists in particular.
Even if you can't ship a bash script to production, they're great tools for ad-hoc exploration and validation.
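A typical ad-hoc pass, on a hypothetical orders.csv where field 1 is a date and field 3 is a country code:

  # Count 2020 orders per country; naive comma splitting, which breaks on
  # quoted fields (part of why this stays exploratory rather than production).
  tail -n +2 orders.csv \
    | grep '^2020-' \
    | cut -d',' -f3 \
    | sort \
    | uniq -c \
    | sort -rn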
There was a post earlier on using command line tools instead of Hadoop for quick data processing. This shows a non-trivial example of how you could implement a complex data pipeline and an overview of some of the commands you could learn if you’re interested.
Yeah, this is a pretty cool post; I never considered doing something like this in the shell. It seems silly that I forget how powerful the basic tooling in the native shell can be... sometimes I have a jackhammer and forget that a sledgehammer will do the job just fine.
Just for anyone else: you could probably end up in a similar corner with Perl, but in both cases it's likely a case of "holding it wrong". Ruby borrows heavily from Perl, which borrowed heavily from the shell with sed, awk, grep, cut and friends.
So this kind of thing should be quite doable in a short Ruby script, or a few short scripts, albeit written in "shell" style, with e.g. -n or -p (which wrap the code in "while gets ... end"; -p also prints $_ at the end of each iteration), probably along with -a (automatically split lines into $F).
It's in some senses an entirely different dialect of Ruby, though.
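A couple of throwaway examples of that style, against a hypothetical access.log (the field index is invented):

  # -n wraps the script in "while gets ... end", -a autosplits each line into $F.
  # Print the 7th field of lines containing " 500 ":
  ruby -nae 'puts $F[6] if $_.include?(" 500 ")' access.log

  # -p is -n plus an implicit print of $_ after each iteration:
  ruby -pe '$_ = $_.downcase' access.log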
Because there's a big trend towards overengineering. I checked the Azure Search solution, and my god it's complex. I ended up doing it in SQLite, and it's perfect for a small FTS document search engine. The article is similar: try to keep it simple; there are barely any cases where you'll actually use the real benefits of something like Azure Search.
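For anyone who hasn't tried it, the whole "small FTS document search engine" is roughly this much SQLite (assuming an FTS5-enabled build; table and content are made up):

  # Build the full-text index and add a document:
  sqlite3 docs.db "CREATE VIRTUAL TABLE docs USING fts5(title, body);"
  sqlite3 docs.db "INSERT INTO docs VALUES ('restore guide', 'how to back up and restore the service');"

  # Query it, best matches first:
  sqlite3 docs.db "SELECT title FROM docs WHERE docs MATCH 'backup OR restore' ORDER BY rank;"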