I'm certain I'm re-inventing the wheel with this approach, but I obviously haven't been researching in the right places, since I hadn't run across this approach before I cobbled it together.
I'm expecting an education on what I'm missing from the HN community!
The "external sorting" chapter of TAoCP might also be interesting in this context (5.4 in Vol. 3), even though Knuth thought that "the once-crucial topic of patterns for tape merging has become of limited relevance to current needs" (p. 251) due to the rapid development of disks in the 80s and 90s, so that his exposition might "be their last grand appearance before they accept a final curtain call." Little did he know that disk would soon become the new tape...
To his credit, he ended the introduction to the chapter with the remark "for all we know now, these techniques may well become crucial again."
I've had a similar problem when dealing with a relatively large data set (in a consumer PC context) that really really needed random access. I was using sorted text files with indexes (flat array of line start offsets), but eventually I redesigned to use a (Patricia) trie, so that I could store all the data in memory.
Crunching the data down in size to make it both small and cheap to extract from was then the challenge. I started out with sorted text files on the order of 550MB, and ended up with memory dumps of efficiently packed tries on the order of 95MB, with further scope for compression possible through Huffman encoding of Markov chains (encoding letter transition probabilities as paths through a Huffman binary tree), which I didn't need to implement because I had already achieved my goals.
The trivial parallelization available through sorting massive text files is hard to beat, though, especially as you can write ad-hoc bash scripts to do work with sort, uniq, sed, etc.
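For instance, here's the kind of one-off frequency count you can bang out in a minute (input data invented for illustration):

```shell
# Made-up input: one word per line; count distinct values, most frequent first.
printf 'apple\nbanana\napple\ncherry\napple\n' > /tmp/words.txt

# sort groups duplicates, uniq -c counts each group, sort -rn ranks the counts.
sort /tmp/words.txt | uniq -c | sort -rn
```

Every stage streams, so nothing has to fit in memory, and each file in a batch can go through its own pipeline in parallel.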
That said, their ideas are quite straightforward and you could look them up quickly (I've forgotten the blog I was thinking of, but also check http://www.vldb.org/).
Yes, you are reinventing the wheel. This kind of approach has been used for decades in disk drive controllers: you sort track accesses into ascending or descending order to minimize seek distance. It's called the "elevator algorithm".
This is combined with tagged command queuing, where multiple requests can be accepted from the host at once. The greater your tag depth, the more insight the controller gets into future seeks.
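A toy illustration of the idea (track numbers invented): servicing queued requests in sorted order rather than arrival order turns scattered seeks into one monotone sweep.

```shell
# Pending track requests, in arrival order (made-up values).
# Arrival-order servicing would seek 87 -> 12 -> 55 -> 30 -> 64.
printf '87\n12\n55\n30\n64\n' > /tmp/requests

# One ascending "elevator" sweep services the same requests with
# strictly increasing head position: 12, 30, 55, 64, 87.
sort -n /tmp/requests
```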
Random access is always slower than linear reading. Even if you aren't going to disk, you can avoid blowing out the processor cache and having to fetch from main memory. RAM is a cache for disk, and L2 is a cache for RAM, &c. What you are doing is pretty normal "old-school" Unix programming.
It is a good point that classical DBMSs aren't always the right tool, but the methods a DBMS uses to perform queries are always good to know. The author here implemented a sort-merge join, which is one of the classic join algorithms. For a good overview of the trade-offs between the various join and sort algorithms, see "Principles of Database & Knowledge-Base Systems Vol. 2" by Jeff Ullman. The first chapter in the book is the one you want. It is dated, but therefore cheap if you get it used.
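Incidentally, the sort-merge join is already sitting in coreutils as sort(1) plus join(1); a toy sketch with made-up files:

```shell
# Hypothetical data: user id -> name, and user id -> purchased item.
printf '2 bob\n1 alice\n3 carol\n'       > /tmp/users.txt
printf '1 widget\n3 gizmo\n1 sprocket\n' > /tmp/orders.txt

# Sort-merge join: sort both inputs on the join key, then merge in one pass.
sort -k1,1 /tmp/users.txt  > /tmp/users.sorted
sort -k1,1 /tmp/orders.txt > /tmp/orders.sorted
join -1 1 -2 1 /tmp/users.sorted /tmp/orders.sorted
# prints:
#   1 alice sprocket
#   1 alice widget
#   3 carol gizmo
```

The join step never backtracks, so it streams through both files regardless of their size; only the sorts need temporary space.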
Take this post with a grain of salt, since I have the zeal of a recently saved sinner, but you should try using Hive and Hadoop for this sort of thing.
We recently switched from a workflow that is very similar to the one you describe to using Hive with Amazon's elastic map reduce. Hive presents a SQL-like layer of abstraction over exactly this sort of thing. Instead of doing the sorting and merging by hand, you simply write it as a series of joins. It's like writing SQL, except the actual implementation works almost exactly like what you're doing.
Integrating simple Ruby scripts for JSON processing was also trivial.
Elastic MapReduce also had near-zero infrastructure and management overhead for us (besides the 10% Amazon charges for the machine instances). We use S3 for all data input and output, which is perfect for us.
Even when running on a single machine, using Hive was a big win in terms of development time, and performance of the jobs seemed only slightly slower than using Unix utilities on big text files. It's almost a bonus that we can also scale out to dozens of machines for a huge speedup: a job that took several hours on a single machine finished in less than five minutes, at a cost of only a few hours of total EC2 machine time. Cheap and easy!
But for Hadoop, you need several boxes plus the admin overhead of setting it up and administering it. If you have one big-iron box, this approach works well; I have worked this style many times.
You can even combine this approach with split and parallel make to do a cheap single-machine parallel sort. I use a little script that generates a Makefile, runnable with make -j N, which splits an input file, sorts the parts, and then merges them with sort -m. It's proved to be quite handy.
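A rough sketch of the same split/sort/merge trick, using shell job control in place of the generated Makefile (file names and data invented):

```shell
# Made-up input to sort.
dir=$(mktemp -d)
printf 'pear\nkiwi\nfig\ndate\nplum\napple\n' > "$dir/big.txt"

split -l 2 "$dir/big.txt" "$dir/part."   # split into fixed-size chunks
for f in "$dir"/part.*; do               # sort each chunk concurrently
  sort "$f" -o "$f.sorted" &
done
wait                                     # let all the sorters finish
sort -m "$dir"/part.*.sorted > "$dir/big.sorted"   # cheap merge of sorted runs
```

The merge step is the same one an external sort does internally; sort -m just exposes it so you can supply pre-sorted runs yourself.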
I'm actually storing the data as text files containing JSON in the file system, one per fan page. I normally use Mongo, Redis, Tokyo, or MySQL for this sort of thing, but since I'd already done all of the processing they normally help me with as operations on disk files, I thought I'd try sticking with the low-tech theme.
Is it really faster to write to a bunch of text files, sort them into a new, bigger text file, and then do the insert? Seems like a lot of extra steps, all involving a lot of reading and writing...
It depends: for example, on the balance between reading and writing, on how stream-oriented your processing is, on how much random access between different records you need, etc.
This approach can be easily augmented too. For example, doing a binary search for a particular line in a text file when you don't have all the lines in RAM is somewhat tedious; it can be made much easier by creating a simple index for the file, consisting of a flat array of the file offset of every line start. That flat array can be stored in a file also; then, both the total number of lines and the contents of a line at any given index are trivial to retrieve.
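A minimal sketch of such an index, assuming plain \n line endings and single-byte characters (file names invented):

```shell
# Hypothetical data file, one record per line.
printf 'alpha\nbravo\ncharlie\ndelta\n' > /tmp/data.txt

# Build the index: the byte offset of each line start, one entry per line.
# (The +1 accounts for the \n terminator of each line.)
awk 'BEGIN { off = 0 } { print off; off += length($0) + 1 }' \
  /tmp/data.txt > /tmp/data.idx

# Total number of lines = number of index entries.
wc -l < /tmp/data.idx

# Fetch line 3 without scanning: look up its offset, seek, read one line.
off=$(sed -n '3p' /tmp/data.idx)
tail -c +"$((off + 1))" /tmp/data.txt | head -n 1   # prints "charlie"
```

A real index would store fixed-width binary offsets so the index itself is random-access too, but text offsets show the shape of the idea.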
If you have to handle a small number of updates while still handling lots of reads, then you can use a two-layer approach. Keep a cache of all pending updates in memory in an efficient manner (e.g. hash table), and look up the cache before falling back to the disk; and when writing, both update the cache and write out to an update log, which can be sorted and included in the main store later, when it makes sense.
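A minimal sketch of that two-layer scheme over tab-separated "key, value" lines; all file names are made up, the compaction step assumes GNU tac/sort, and a real implementation would keep the overlay in an in-memory hash table rather than grepping a log:

```shell
tab=$(printf '\t')
printf 'apple\t1\ncherry\t3\n' > /tmp/main.sorted   # sorted main store
: > /tmp/updates.log                                # empty overlay log

# Writes go to the overlay only.
put() { printf '%s\t%s\n' "$1" "$2" >> /tmp/updates.log; }

# Reads check the overlay first (newest entry wins), then the main store.
get() {
  hit=$(grep "^$1$tab" /tmp/updates.log | tail -n 1)
  if [ -n "$hit" ]; then printf '%s\n' "$hit"
  else grep "^$1$tab" /tmp/main.sorted
  fi
}

# When it makes sense, fold the overlay into the main store. The reversed
# log sorts stably ahead of the old entries, so the newest value per key wins.
merge_overlay() {
  tac /tmp/updates.log | sort -s -k1,1 - /tmp/main.sorted \
    | awk -F'\t' '!seen[$1]++' > /tmp/main.new &&
  mv /tmp/main.new /tmp/main.sorted && : > /tmp/updates.log
}

put apple 9
get apple    # overlay wins with the new value 9
get cherry   # falls through to the main store's value 3
```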