> We’re talking about 31 million lines of data here (all lines with Swiss email addresses in the leak). On our workstation, reducing this to the distinct email addresses (approx. 3,3 million) took a mere 10.099 seconds (4 + 6 + .099).
That's just 700MB of data... On my 5-year-old mid-range PC this took 12 seconds with sort and uniq.
$ wc -l email30m
31000000 email30m
$ time sort email30m | uniq > email30m.uniq
real 0m11.498s
user 0m33.833s
sys 0m0.933s
$ wc -l email30m.uniq
2172320 email30m.uniq
This automatically split the task across all available CPU cores.
Using Unix utilities is so underrated. I love Unix programming for the simplicity and power you demonstrate well here. It's not as flashy and cool as writing some big program, but I'm constantly amazed that people churn through new web framework fads every 2 months while these 40-50 year old programs are still the most useful tools around.
I'm getting 2 minutes and 29.98 seconds with my 2017 MacBook Pro (8 cores, 16 GiB RAM). I'm using GNU tools; could you share more details on how you're getting 12 seconds?
$ time awk 'BEGIN{for(i=0; i<3100000; i+=1) for(y=0; y<10; y+= 1) print i "@gmail.com"}' > email30m.sorted
awk > email30m.sorted 8.29s user 0.82s system 97% cpu 9.352 total
$ ls -lh email30m.sorted
522M Mar 8 17:14 email30m.sorted
I didn't have your email30m dataset, so I fabricated one. Mine is smaller at 522M and has less entropy, since I'm assuming everyone uses gmail & uses numbers for usernames. :)
$ time shuf email30m.sorted > email30m
shuf email30m.sorted > email30m 15.58s user 1.69s system 98% cpu 17.561 total
I also shuffled the dataset, because running sort on already-sorted input would be cheating.
$ wc -l email30m
31000000 email30m
$ time sort email30m | uniq > email30m.uniq
sort email30m 873.55s user 4.29s system 585% cpu 2:29.98 total
uniq > email30m.uniq 4.10s user 0.34s system 2% cpu 2:29.98 total
$ wc -l email30m.uniq
3100000 email30m.uniq
I followed what you did, but it took way longer when running on my machine.
$ time sort email30m.sorted | uniq > email30m.sorted.uniq
sort email30m.sorted 413.75s user 3.66s system 468% cpu 1:29.04 total
uniq > email30m.sorted.uniq 4.11s user 0.40s system 5% cpu 1:29.04 total
And just out of curiosity, it takes 1 minute 29 seconds to sort & uniq the presorted dataset.
GNU sort generates a bunch of temporary files for large inputs, and for a lot of Linux folks /tmp is mounted as tmpfs (i.e. it's RAM/swap-backed), but other platforms might do something else, or just put the temporary files on disk, so that might be one explanation.
Or it could be the difference between qsort() from different libcs.
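If someone wants to test the tmpfs theory, a rough sketch (the -T and -S flags are standard GNU sort options; the /dev/shm path and the buffer size are just guesses for illustration, adjust to your machine):
$ df -hT /tmp    # check whether /tmp is tmpfs or sitting on disk
$ time sort -T /dev/shm -S 2G email30m | uniq > email30m.uniq    # spill to a tmpfs mount with a bigger in-RAM buffer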
> This automatically split the task across all available CPU cores.
GNU sort can already utilize multiple cores, and you can override its internal heuristics about whether that's beneficial for a particular input with --parallel=N. Anyway, if a multithreaded sort is available, what you should probably do is replace `sort | uniq` with just `sort -u` to save some time on pipe IO.
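Something like this, if you want to compare directly (a sketch only; --parallel and -u are documented GNU sort flags, nproc comes from coreutils):
$ time sort --parallel="$(nproc)" -u email30m > email30m.uniq    # one process, no pipe, explicit core count
$ time sort email30m | uniq > email30m.uniq                      # the original two-process pipeline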
You can actually get a lot just from grep (well, zgrep, which searches compressed files)
zgrep '@[^.]\+\.ch' /path/to/collections/**/*.gz
will spit out (almost) all Swiss email addresses. Sprinkle some grep -v, wc, and sort | uniq here and there and Bob's your uncle :-)
(I'm in no way trying to dismiss the efforts--my method won't get you as much detail, but is enough to find you and your friends who asked to be searched in the dumps)
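For the record, the kind of pipeline I mean, as a sketch only (the -o/-h flags and the 'noreply@' exclusion are my own additions for illustration, not exactly what I ran):
$ zgrep -oh '[^ :;]\+@[^ .]\+\.ch' /path/to/collections/**/*.gz | sort -u > swiss.uniq    # extract just the matches, dedupe
$ wc -l swiss.uniq
$ grep -v 'noreply@' swiss.uniq | wc -l    # drop addresses you don't care about, then count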
Previously on HN, a similar approach demonstrating that command-line tools can be hundreds of times faster than the fancy distributed approach (for certain classes of problems):
"Command-line tools can be faster than your Hadoop cluster"
It's not just about "raw" speed, but also about how much mental time you spend thinking about your setup. Even if you didn't know about zgrep already, that's at most minutes of thinking ahead vs. potentially hours or days setting up cloud providers, big toolchains, and whatever else just to get started.
Hadoop is intended to be used once you have exceeded terabyte scale and reached petabytes of data. That is when simply moving the data around becomes incredibly difficult due to network IO limitations.
Let me know when you have a solution for using command line tools for sorting through our 4PB of data in our Hadoop cluster!
Yes, I have difficulty imagining that the bottleneck here is anything other than reading the file from a spinning-rust disk. It would probably be faster to iterate over all the files line by line (i.e., sequentially) in a single thread than to spin up 40 threads and have them skip around all over the files, thrashing the disk.
I had this come up recently myself. An operation was performing slowly, so I split it up to use multiple cores in parallel. It got slower. Looking at the performance monitor, disk utilization was pegged at 100%, while CPU utilization was ~80% of one core. I went back to the old code: disk utilization was still pegged at 100%, but it had a marginally higher transfer rate because it was reading one file at a time rather than several all at once. I should have examined the performance characteristics more carefully in the beginning, but I didn't, because I assumed I was smarter than I am. Good lesson.
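On Linux, if you want to confirm which side you're saturating before refactoring, a couple of one-liners I'd reach for (both are from the sysstat package; run them in a second terminal while the job is going):
$ iostat -xz 1     # %util near 100 on the data disk means you're I/O bound
$ mpstat -P ALL 1  # per-core CPU usage; lots of idle cores means the CPU isn't the problem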
This is almost certainly the wrong tool for the job, unless the job is to learn how to use Spark.
The magic allure of Spark is when you have a team of 20-30 analysts and 10 data scientists working on a 4PB dataset stored on an 80-node cluster. Let me know when you have a solution for this without using big data tools like Hadoop, Spark, etc.
I mean for the problem at hand ... not for anything, ever :)
I think usually the argument for Spark and the like proceeds by just making the dataset bigger until whatever you're comparing becomes absurd (e.g. command line vs Spark). It was the same deal in the MapReduce days. Works like shit for 99% of datasets but it's the best thing ever when there are no other alternatives :)