
Here's a probably unpopular opinion: pipes make things a bit slow. A native, pipeless program would be a good bit faster - incl. an ACID DB. Note that doing this in Python and expecting it to beat grep won't work, though...

The other thing is that Hadoop - and some others - are slow on big data (petabytes or more) vs. your own tools. They're necessary/used because of massive clustering (deploying 10x the hardware easily beats building your own, financially).

I suspect it's a general lack of understanding of the way computers work (hardware, OS, i.e. system architecture) vs. "why care, it works, and Python/Go/Java/etc. are easy for me, I don't need to know what happens under the hood".



> incl. an ACID DB.

Why would you want to use a database for this problem? The input data would take time to load into an ACID DB, and we're only interested in a single ternary value within that data. The output data is just a few lists of boolean values, so it has no reason to be in a database either.

This is a textbook stream processing problem. Adding a database creates more complexity for literally no benefit, assuming the requirements in the linked article were complete. I would be baffled to see a solution to this problem that was anything more than a stream processor, let alone one involving a database.
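
For what it's worth, the whole thing fits in a one-pass counter. A minimal sketch (the exact record format isn't shown here, so assume the win/loss/draw result is the last field of each line; the glob is a placeholder too):

    # sketch: tally the ternary result in a single streaming pass
    awk '{ counts[$NF]++ } END { for (r in counts) print r, counts[r] }' data/*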


If it really is just a one-shot with one simple-ish filter, I agree. But I often find myself incrementally building shell-pipeline tangles that are sped up massively by being replaced with SQLite. Once your processing pipeline is making liberal use of the sort/grep/cut/tee/uniq/tac/awk/join/paste suite of tools, things get slow. The tangle of Unix tools effectively does repeated full-table scans without the benefit of indexes, and is especially bad if you have to re-sort the data at different stages of the pipeline, e.g. on different columns, or need to split and then re-join columns in different stages of the pipeline. In that kind of scenario a database (at least SQLite, haven't tried a more "heavyweight" database) ends up being a win even for stream-processing tasks. You pay for a load/index step up front, but you more than get it back if the pipeline is nontrivial.
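
A rough sketch of what that load/index step buys you (file and column names here are made up, not from the article): load once, index once, and each later stage becomes an indexed query instead of another sort/join pass over the whole stream.

    # one-time load + index (hypothetical events.csv with columns user,ts,action)
    sqlite3 pipeline.db <<'SQL'
    .mode csv
    .import events.csv events
    CREATE INDEX idx_events_user ON events(user);
    CREATE INDEX idx_events_ts ON events(ts);
    SQL

    # a later "stage" is then just a query, without re-sorting the raw text
    sqlite3 pipeline.db "SELECT user, COUNT(*) AS n FROM events GROUP BY user ORDER BY n DESC LIMIT 10;"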


The interesting part is that it's still faster, not that it's the best-case solution. The main reason is that the data set fits in memory and is no slower to load (you need to read the data in all cases, duh; both the piped and the DB approach read it from disk exactly once, sequentially).

There is no locking issue, and you can be smart in the filtering steps (most DBs do some of that automagically anyway). You don't have that level of control with pipes: you are limited by each program's ability to process stdin, plus the additional locking.

This is exactly where knowing how things really work under the hood gives you an advantage over "but in theory...". You could reimplement a complete program, or even a set of programs, that would outperform both the DB and the piped example. But will you? No - you want the best balance between the fastest solution and the least amount of work.


In the final solution at the end of the article there are only two pipes:

1. A pipe to feed the file names into xargs for starting up parallel `mawk` processes.

2. A pipe to a final `mawk` process which aggregates the data from the parallel processes.

There's still some performance that could be gained by using a single process with threads and shared memory, but this is pretty good for something that can be whipped together quickly.
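
Roughly the shape of that two-pipe setup (a sketch, not the article's exact command; the glob and the /pattern/ filter are placeholders):

    # pipe 1: file names fan out to parallel mawk workers via xargs
    # pipe 2: one partial count per file flows into a final mawk that sums them
    find data/ -name '*.txt' -print0 \
      | xargs -0 -n1 -P "$(nproc)" mawk '/pattern/ { n++ } END { print n+0 }' \
      | mawk '{ total += $1 } END { print total }'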


Yeah, it's not bad. The final command basically leverages mawk for everything, which works out well since there are fewer pipes.

But in this case it's basically about replacing Hadoop with mawk. Which is indeed a good point as well - and incidentally also confirms my own comment =)



