How to build a search engine with common Unix tools (2018) [pdf] (iaria.org)
197 points by faizshah on Jan 30, 2020 | 26 comments



Those methods were always appealing to me, and I tried to build some home-use applications with them (document management, even some bash CGI scripts), but they fall over quite quickly:

* How do you deal with words like "---" in your text that look like the match separator of grep?

* What if your filenames contain spaces?

* Are the sed/awk/perl one-liners really all that readable and correct?

* How do you catch and report failure conditions in individual pipeline steps?

This stuff is great for interactive use and one-off ETL, not for applications.

I'm not sure what real alternatives exist that give you:

- parallel execution

- seamless composition (like |)

- object passing, not byte streams

- quick to write

Most of the time I switch to Python for this, but it does not give you sane parallelism. Sure, you can do this with Java + Akka, but that takes days to build out...

Any recommendations?


The two I reach for first: Dask and SparkSQL

Dask is super easy and quick to learn. It provides similar features to Spark but can be somewhat easier for the Pandas crowd. There's also Modin/Ray for this, but I haven't tried them yet.
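
For a flavor of it, here's a minimal sketch (the file glob and column names are hypothetical): dask.dataframe mirrors the pandas API, builds a lazy task graph, and only does the parallel work when you call .compute():

    import dask.dataframe as dd

    # Hypothetical log files and columns, just to show the pandas-like API.
    df = dd.read_csv("access-*.csv")        # lazily treats many files as one frame
    top = (
        df[df.status == 200]                # familiar pandas-style filtering
        .groupby("url")["bytes"].sum()      # aggregation is planned, not yet run
        .nlargest(10)
    )
    print(top.compute())                    # .compute() triggers parallel execution

The same code runs unchanged on a cluster if you point a dask.distributed Client at it.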

For very fast processing and ease of writing, SparkSQL is the tool I reach for. Start a single-node Spark instance (super easy), then interactively wrangle your data declaratively with SQL. Great for quick-and-dirty cleaning and aggregation of big-ish data.
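
Roughly like this minimal local-mode sketch (the file name and schema are made up):

    from pyspark.sql import SparkSession

    # local[*] = a single-node Spark instance using all available cores.
    spark = SparkSession.builder.master("local[*]").appName("wrangle").getOrCreate()

    df = spark.read.csv("events.csv", header=True, inferSchema=True)
    df.createOrReplaceTempView("events")    # expose the frame to SQL

    # Declarative cleaning/aggregation, no hand-written loops.
    spark.sql("""
        SELECT user_id, COUNT(*) AS n
        FROM events
        WHERE ts >= '2020-01-01'
        GROUP BY user_id
        ORDER BY n DESC
        LIMIT 10
    """).show()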

If you're into Google Cloud, BigQuery is currently my top tool for quick-and-dirty processing, but you can do a lot more with your $5/TB by putting a giant Compute Engine high-mem instance behind Dask or SparkSQL.


Thanks for this. I did not know about Dask! Wow, this looks great. Love the web-based task visualizations: https://distributed.dask.org/en/latest/web.html


Check out Dask Bag; it's my favorite feature. It helps you deal with non-tabular data that also might not be structured consistently: https://examples.dask.org/bag.html

Everybody I show it to likes it even more than working with data frames once they grok it.
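
A small sketch of the Bag workflow on newline-delimited JSON, with an invented file name and record shape:

    import json
    import dask.bag as db

    # Hypothetical newline-delimited JSON with inconsistently shaped records.
    bag = db.read_text("events-*.jsonl").map(json.loads)

    top_users = (
        bag.filter(lambda r: "user" in r)    # skip records missing the field
           .map(lambda r: r["user"])
           .frequencies()                    # parallel word-count-style counting
           .topk(10, key=lambda kv: kv[1])   # ten most frequent users
    )
    print(top_users.compute())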


> How do you deal with words like "---" in your text that look like the match separator of grep?

For this one you can use "--" to signal the end of options, so everything after it is treated as an argument rather than a flag.

    $ grep -rn -- -    # "--" ends option parsing; the trailing "-" is the search pattern


Yeah, I should have made that clearer: I was talking about the sed expression on p. 17. It looks for "--" in stdin, which could also be a word. I realize now that the preceding tr removes all non-a-zA-Z characters, so in this case it should not be an issue.

However, intermixing text with separators is not trivial. There are reasons we use JSON/XML for exchanging structured data.
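
For example, here's a minimal sketch of the one-JSON-object-per-line approach (record fields invented): "--" or spaces are just data inside a JSON string, never framing:

    import json
    import sys

    # In-band separators break as soon as the data can contain the separator;
    # one JSON object per line sidesteps the problem entirely.
    records = [
        {"file": "notes.txt", "line": 3, "text": "foo -- bar"},  # "--" is just data
        {"file": "my file.txt", "line": 7, "text": "baz"},       # spaces are fine too
    ]
    for r in records:
        sys.stdout.write(json.dumps(r) + "\n")

    # A downstream pipe step can parse each line back without ambiguity:
    #   for line in sys.stdin: record = json.loads(line)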


I am planning to read through this more thoroughly, as something I really want is my own personal search engine. Ideally I need it to:

1. Store data locally for offline retrieval.

2. Support indexing big sites including Stack Overflow, Wikipedia, Reddit, news.ycombinator.com, Microsoft docs, and a bunch of other domains.

3. Make it easy to add a single URL to the index from the command line (and optionally a browser plugin). Only index that page; this would replace my bookmarks.

4. Optionally auto-store browser history for a custom period of time, purging entries when they expire.

Does anything like that exist?


For those interested in building their own local offline datasets/search engines, check out Kiwix and Zeal, and understand how they work.

The code is open and there are a ton of already-created data dumps + indexes. You don't have to spend time rebuilding a Wikipedia/Wikidata/Stack Overflow dump and index yourself.


Thanks for the pointers, I'll check those out as well.


I did build something like this (but even simpler) for the support website of my side project: https://support.filestash.app/. A PHP script calling grep and displaying the results. It's very hacky, but it's good enough for its intended use case: searching through the entire IRC chat log.


I did something similar too. For my last move, I wrote a detailed list of which item was in which box. My original plan was to add a QR code to each box, so I could quickly see what's inside.

But once I was done, I realized that I had it backwards, so I wrote a PHP page to grep the list and figure out which box a specific item was in.


I dropped you a mail about your search form :)


The linked site on the slides is password-protected, and the Internet Archive is silent on it as well. Does anyone have a copy of the referenced materials?


I've put out a request to his email address for access; if I hear anything, I will post back to this thread...



I always enjoy demonstrating various combinations of cat, grep, uniq, sort, and cut to folks unfamiliar with the command line, data scientists in particular.

Even if you can't ship a bash script to production, they're great tools for ad-hoc exploration and validation.



SQLite has full-text search capability: https://sqlite.org/fts5.html#overview_of_fts5
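
A minimal sketch of using it from Python's built-in sqlite3 module (table and column names invented; needs an SQLite build with FTS5 enabled, which is the default in recent releases):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
    con.executemany(
        "INSERT INTO docs (title, body) VALUES (?, ?)",
        [
            ("unix tools", "build a search engine with grep, sort and awk"),
            ("sqlite fts", "full-text search without running a server"),
        ],
    )
    # MATCH runs a full-text query; bm25() orders hits by relevance.
    for (title,) in con.execute(
        "SELECT title FROM docs WHERE docs MATCH ? ORDER BY bm25(docs)", ("search",)
    ):
        print(title)

bm25() is FTS5's built-in ranking function; lower scores rank better, so ascending order puts the best match first.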


Man... slides always feel out of context to me. I'd rather read a blog post than slides. These seem to cover the basic theory as well...


Why?


There was a post earlier on using command-line tools instead of Hadoop for quick data processing. This one shows a non-trivial example of how you could implement a complex data pipeline, and gives an overview of some of the commands you could learn if you're interested.


Yeah, this is a pretty cool post. I never considered doing something like this in the shell. It seems silly to me that I forget how powerful basic tooling in the native shell can be... sometimes I have a jackhammer and forget that a sledgehammer will do the job just fine.


I've been there: hundreds of lines into a Ruby program, then "oh yeah, cut, sort, grep".


Just for anyone else: you could probably end up in a similar corner with Perl, but in both cases it's likely a matter of "holding it wrong". Ruby borrows heavily from Perl, which borrowed heavily from the shell with sed, awk, grep, cut and friends.

So this kind of thing should be quite doable in a short Ruby script, or a few short scripts, albeit written in "shell" style: with e.g. -n or -p (which wrap the code in a "while gets ... end" loop; -p also prints $_ after each iteration), probably along with -a (which automatically splits each line into $F).

It's in some senses an entirely different dialect of Ruby, though.

Some examples here:

https://github.com/learnbyexample/Command-line-text-processi...


Because there's a big trend towards overengineering. I checked the Azure Search solution, and my god, it's complex. I ended up doing it in SQLite, and it's perfect for a small FTS document search engine. The article is similar: try to keep it simple; there are barely any cases where you'll need the real benefits of something like Azure Search.


> I ended up doing it in SQLite, and it's perfect for a small FTS document search engine.

It even has it as a feature!

    CREATE VIRTUAL TABLE something USING fts5(x, y, z)



