> We’re talking about 31 million lines of data here (all lines with Swiss email addresses in the leak). On our workstation, reducing this to the distinct email addresses (approx. 3,3 million) took a mere 10.099 seconds (4 + 6 + .099).
That's just 700MB of data... On my 5-year-old mid-range PC this took 12 seconds with sort and uniq.
$ wc -l email30m
31000000 email30m
$ time sort email30m | uniq > email30m.uniq
real 0m11.498s
user 0m33.833s
sys 0m0.933s
$ wc -l email30m.uniq
2172320 email30m.uniq
This automatically split the task across all available CPU cores.
Using Unix utilities is so underrated. I love Unix programming for the simplicity and power you demonstrate well here. It's not as flashy and cool as writing some big program, but I'm constantly amazed that people churn through new web framework fads every 2 months while these 40-50 year old programs are still the most useful tools around.
I'm getting 2 minutes and 29.98 seconds with my 2017 MacBook Pro (8 cores, 16 GiB RAM). I'm using GNU tools; could you share more details on how you're getting 12 seconds?
$ time awk 'BEGIN{for(i=0; i<3100000; i+=1) for(y=0; y<10; y+= 1) print i "@gmail.com"}' > email30m.sorted
awk > email30m.sorted 8.29s user 0.82s system 97% cpu 9.352 total
$ ls -lh email30m.sorted
522M Mar 8 17:14 email30m.sorted
I didn't have your email30m dataset, so I fabricated one. Mine is smaller at 522M and has less entropy, since I'm assuming everyone uses gmail & uses numbers for usernames. :)
$ time shuf email30m.sorted > email30m
shuf email30m.sorted > email30m 15.58s user 1.69s system 98% cpu 17.561 total
I also shuffled the dataset, because running sort on already-sorted input would be cheating.
$ wc -l email30m
31000000 email30m
$ time sort email30m | uniq > email30m.uniq
sort email30m 873.55s user 4.29s system 585% cpu 2:29.98 total
uniq > email30m.uniq 4.10s user 0.34s system 2% cpu 2:29.98 total
$ wc -l email30m.uniq
3100000 email30m.uniq
I followed what you did, but it took way longer when running on my machine.
$ time sort email30m.sorted | uniq > email30m.sorted.uniq
sort email30m.sorted 413.75s user 3.66s system 468% cpu 1:29.04 total
uniq > email30m.sorted.uniq 4.11s user 0.40s system 5% cpu 1:29.04 total
And just out of curiosity, it takes 1 minute 29 seconds to sort & uniq the presorted dataset.
GNU sort generates a bunch of temporary files for large inputs, and for a lot of Linux folks /tmp is mounted as tmpfs (i.e. it's RAM/swap-backed), but other platforms might do something else, or just put the temporary files on disk, so that might be one explanation.
Or it could be the difference between qsort() from different libcs.
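If someone wants to test the tmpfs theory, a rough sketch (the -T and -S flags are standard GNU sort options; the /dev/shm path and the buffer size are just guesses for illustration, adjust to your machine):
$ df -hT /tmp    # check whether /tmp is tmpfs or sitting on disk
$ time sort -T /dev/shm -S 2G email30m | uniq > email30m.uniq    # spill to a tmpfs mount with a bigger in-RAM buffer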
> This automatically split the task across all available CPU cores.
GNU sort can already utilize multiple cores, and you can override its internal heuristics about whether that's beneficial for a particular input with --parallel=N. Anyway, if a multithreaded sort is available, what you should probably do is replace `sort | uniq` with just `sort -u` to save some time on pipe IO.
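Something like this, if you want to compare directly (a sketch only; --parallel and -u are documented GNU sort flags, nproc comes from coreutils):
$ time sort --parallel="$(nproc)" -u email30m > email30m.uniq    # one process, no pipe, explicit core count
$ time sort email30m | uniq > email30m.uniq                      # the original two-process pipeline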
You can actually get a lot just from grep (well, zgrep, which searches compressed files)
zgrep '@[^.]\+\.ch' /path/to/collections/**/*.gz
will spit out (almost) all Swiss email addresses. Sprinkle some grep -v, wc, and sort | uniq here and there and Bob's your uncle :-)
(I'm in no way trying to dismiss the efforts--my method won't get you as much detail, but is enough to find you and your friends who asked to be searched in the dumps)
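For the record, the kind of pipeline I mean, as a sketch only (the -o/-h flags and the 'noreply@' exclusion are my own additions for illustration, not exactly what I ran):
$ zgrep -oh '[^ :;]\+@[^ .]\+\.ch' /path/to/collections/**/*.gz | sort -u > swiss.uniq    # extract just the matches, dedupe
$ wc -l swiss.uniq
$ grep -v 'noreply@' swiss.uniq | wc -l    # drop addresses you don't care about, then count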
Previously on HN, a similar approach demonstrating that command-line tools can be hundreds of times faster than the fancy distributed approach (for certain classes of problems):
"Command-line tools can be faster than your Hadoop cluster"
It's not just about "raw" speed, but also about how much mental time you spend thinking about your setup. Even if you didn't know about zgrep already, that's at most minutes of thinking ahead vs. potentially hours or days setting up cloud providers, big toolchains, and whatever else just to get started.
Hadoop is intended to be used once you have exceeded terabyte scale and reached petabytes of data. That is when simply moving the data around becomes incredibly difficult due to network IO limitations.
Let me know when you have a solution for using command line tools for sorting through our 4PB of data in our Hadoop cluster!
Yes, I have difficulty imagining that the bottleneck here is anything other than reading the file from a spinning-rust disk. It would probably be faster to iterate over all the files line by line (i.e., sequentially) in a single thread than to spin up 40 threads and have them skip around all over the files, thrashing the disk.
I had this come up recently myself. An operation was performing slowly, so I split it up to use multiple cores in parallel. It got slower. Looking at the performance monitor, disk utilization was pegged at 100%, while CPU utilization was ~80% of one core. I went back to the old code: disk utilization was still pegged at 100%, but it had a marginally higher transfer rate because it was reading one file at a time rather than several all at once. I should have examined the performance characteristics more carefully in the beginning, but I didn't, because I assumed I was smarter than I am. Good lesson.
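On Linux, if you want to confirm which side you're saturating before refactoring, a couple of one-liners I'd reach for (both are from the sysstat package; run them in a second terminal while the job is going):
$ iostat -xz 1     # %util near 100 on the data disk means you're I/O bound
$ mpstat -P ALL 1  # per-core CPU usage; lots of idle cores means the CPU isn't the problem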
This is almost certainly the wrong tool for the job, unless the job is to learn how to use Spark.
The magic allure of Spark is when you have a team of 20-30 analysts and 10 data scientists working on a 4PB dataset stored on an 80-node cluster. Let me know when you have a solution for this without using big data tools like Hadoop, Spark, etc.
I mean for the problem at hand ... not for anything, ever :)
I think usually the argument for Spark and the like proceeds by just making the dataset bigger until whatever you're comparing becomes absurd (e.g. command line vs Spark). It was the same deal in the MapReduce days. Works like shit for 99% of datasets but it's the best thing ever when there are no other alternatives :)