I'm certain I'm re-inventing the wheel with this approach, but I obviously haven't been researching in the right places, since I hadn't run across this approach before I cobbled it together.
I'm expecting an education on what I'm missing from the HN community!
The "external sorting" chapter of TAoCP might also be interesting in this context (5.4 in Vol. 3), even though Knuth thought that "the once-crucial topic of patterns for tape merging has become of limited relevance to current needs" (p. 251) due to the rapid development of disks in the 80s and 90s, so that his exposition might "be their last grand appearance before they accept a final curtain call." Little did he know that disk would soon become the new tape...
To his credit, he ended the introduction to the chapter with the remark "for all we know now, these techniques may well become crucial again."
I've had a similar problem when dealing with a relatively large data set (in a consumer PC context) that really really needed random access. I was using sorted text files with indexes (flat array of line start offsets), but eventually I redesigned to use a (Patricia) trie, so that I could store all the data in memory.
Crunching the data down in size to make it both small and cheap to extract from was then the challenge. I started out with sorted text files on the order of 550MB, and ended up with memory dumps of efficiently packed tries on the order of 95MB, with further scope for compression possible through Huffman encoding of Markov chains (encoding letter transition probabilities as paths through a Huffman binary tree), which I didn't need to implement because I had already achieved my goals.
The trivial parallelization available through sorting massive text files is hard to beat, though, especially as you can write ad-hoc bash scripts to do work with sort, uniq, sed, etc.
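For instance, here's the kind of one-off frequency count you can bang out in a minute (input data invented for illustration):

```shell
# Made-up input: one word per line; count distinct values, most frequent first.
printf 'apple\nbanana\napple\ncherry\napple\n' > /tmp/words.txt

# sort groups duplicates, uniq -c counts each group, sort -rn ranks the counts.
sort /tmp/words.txt | uniq -c | sort -rn
```

Every stage streams, so nothing has to fit in memory, and each file in a batch can go through its own pipeline in parallel.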
That said, their ideas are quite straightforward and you could look them up quickly (I've forgotten the blog I was thinking of, but also check http://www.vldb.org/).
Yes, you are reinventing the wheel. This kind of approach has been used for decades in disk drive controllers: you sort track accesses into ascending or descending order to minimize seek distance. It's called the "elevator algorithm".
This is combined with tagged command queuing, where multiple requests can be accepted from the host at once. The greater your tag depth, the more insight the controller gets into future seeks.
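A toy illustration of the idea (track numbers invented): servicing queued requests in sorted order rather than arrival order turns scattered seeks into one monotone sweep.

```shell
# Pending track requests, in arrival order (made-up values).
# Arrival-order servicing would seek 87 -> 12 -> 55 -> 30 -> 64.
printf '87\n12\n55\n30\n64\n' > /tmp/requests

# One ascending "elevator" sweep services the same requests with
# strictly increasing head position: 12, 30, 55, 64, 87.
sort -n /tmp/requests
```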
Random access is always slower than linear reading. Even if you aren't going to disk, you can avoid blowing out the processor cache and having to fetch from main memory. RAM is a cache for disk, and L2 is a cache for RAM, &c. What you are doing is pretty normal "old-school" Unix programming.
It is a good point that classical DBMSs aren't always the right tool, but the methods a DBMS uses to perform queries are always good to know. The author here implemented a sort-merge join, which is one of the classic join algorithms. For a good overview of the trade-offs between the various join and sort algorithms, see "Principles of Database & Knowledge-Base Systems Vol. 2" by Jeff Ullman. The first chapter in the book is the one you want. It is dated, but therefore cheap if you get it used.
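Incidentally, the sort-merge join is already sitting in coreutils as sort(1) plus join(1); a toy sketch with made-up files:

```shell
# Hypothetical data: user id -> name, and user id -> purchased item.
printf '2 bob\n1 alice\n3 carol\n'       > /tmp/users.txt
printf '1 widget\n3 gizmo\n1 sprocket\n' > /tmp/orders.txt

# Sort-merge join: sort both inputs on the join key, then merge in one pass.
sort -k1,1 /tmp/users.txt  > /tmp/users.sorted
sort -k1,1 /tmp/orders.txt > /tmp/orders.sorted
join -1 1 -2 1 /tmp/users.sorted /tmp/orders.sorted
# prints:
#   1 alice sprocket
#   1 alice widget
#   3 carol gizmo
```

The join step never backtracks, so it streams through both files regardless of their size; only the sorts need temporary space.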
Take this post with a grain of salt, since I have the zeal of a recently saved sinner, but you should try using Hive and Hadoop for this sort of thing.
We recently switched from a workflow that is very similar to the one you describe to using Hive with Amazon's elastic map reduce. Hive presents a SQL-like layer of abstraction over exactly this sort of thing. Instead of doing the sorting and merging by hand, you simply write it as a series of joins. It's like writing SQL, except the actual implementation works almost exactly like what you're doing.
Integrating simple Ruby scripts for JSON processing was also trivial.
Elastic MapReduce also had near-zero infrastructure and management overhead for us (besides the 10% Amazon charges for the machine instances). We use S3 for all data input and output, which is perfect for us.
Even when running on a single machine, using Hive was a big win in terms of development time, and performance of the jobs seemed only slightly slower than using Unix utilities on big text files. It's almost a bonus that we can also scale out to dozens of machines for a huge speedup: a job that took several hours on a single machine finished in less than five minutes, at a cost of only a few hours of total EC2 machine time. Cheap and easy!
But for Hadoop, you need several boxes plus the admin overhead of setting it up and administering it. If you have one big-iron box, this approach works well; I have worked this style many times.
You can even combine this approach with split and parallel make to do a cheap single-machine parallel sort. I use a little script that generates a Makefile, runnable with make -j N, which splits an input file, sorts the parts, and then merges them with sort -m. It's proved to be quite handy.
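A rough sketch of the same split/sort/merge trick, using shell job control in place of the generated Makefile (file names and data invented):

```shell
# Made-up input to sort.
dir=$(mktemp -d)
printf 'pear\nkiwi\nfig\ndate\nplum\napple\n' > "$dir/big.txt"

split -l 2 "$dir/big.txt" "$dir/part."   # split into fixed-size chunks
for f in "$dir"/part.*; do               # sort each chunk concurrently
  sort "$f" -o "$f.sorted" &
done
wait                                     # let all the sorters finish
sort -m "$dir"/part.*.sorted > "$dir/big.sorted"   # cheap merge of sorted runs
```

The merge step is the same one an external sort does internally; sort -m just exposes it so you can supply pre-sorted runs yourself.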
I'm actually storing the data as text files containing JSON in the file system, one per fan page. I normally use Mongo, Redis, Tokyo, or MySQL for this sort of thing, but since I'd already done all of the processing they normally help me with as operations on disk files, I thought I'd try sticking with the low-tech theme.
Is it really faster to write to a bunch of text files, sort them into a new, bigger text file, and then do the insert? Seems like a lot of extra steps, all involving a lot of reading and writing...
It depends: for example, on the balance between reading and writing, on how stream-oriented your processing is, on how much random access between different records you need, etc.
This approach can be easily augmented too. For example, doing a binary search for a particular line in a text file when you don't have all the lines in RAM is somewhat tedious; it can be made much easier by creating a simple index for the file, consisting of a flat array of the file offset of every line start. That flat array can be stored in a file also; then, both the total number of lines and the contents of a line at any given index are trivial to retrieve.
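A minimal sketch of such an index, assuming plain \n line endings and single-byte characters (file names invented):

```shell
# Hypothetical data file, one record per line.
printf 'alpha\nbravo\ncharlie\ndelta\n' > /tmp/data.txt

# Build the index: the byte offset of each line start, one entry per line.
# (The +1 accounts for the \n terminator of each line.)
awk 'BEGIN { off = 0 } { print off; off += length($0) + 1 }' \
  /tmp/data.txt > /tmp/data.idx

# Total number of lines = number of index entries.
wc -l < /tmp/data.idx

# Fetch line 3 without scanning: look up its offset, seek, read one line.
off=$(sed -n '3p' /tmp/data.idx)
tail -c +"$((off + 1))" /tmp/data.txt | head -n 1   # prints "charlie"
```

A real index would store fixed-width binary offsets so the index itself is random-access too, but text offsets show the shape of the idea.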
If you have to handle a small number of updates while still handling lots of reads, then you can use a two-layer approach. Keep a cache of all pending updates in memory in an efficient manner (e.g. hash table), and look up the cache before falling back to the disk; and when writing, both update the cache and write out to an update log, which can be sorted and included in the main store later, when it makes sense.
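A minimal sketch of that two-layer scheme over tab-separated "key, value" lines; all file names are made up, the compaction step assumes GNU tac/sort, and a real implementation would keep the overlay in an in-memory hash table rather than grepping a log:

```shell
tab=$(printf '\t')
printf 'apple\t1\ncherry\t3\n' > /tmp/main.sorted   # sorted main store
: > /tmp/updates.log                                # empty overlay log

# Writes go to the overlay only.
put() { printf '%s\t%s\n' "$1" "$2" >> /tmp/updates.log; }

# Reads check the overlay first (newest entry wins), then the main store.
get() {
  hit=$(grep "^$1$tab" /tmp/updates.log | tail -n 1)
  if [ -n "$hit" ]; then printf '%s\n' "$hit"
  else grep "^$1$tab" /tmp/main.sorted
  fi
}

# When it makes sense, fold the overlay into the main store. The reversed
# log sorts stably ahead of the old entries, so the newest value per key wins.
merge_overlay() {
  tac /tmp/updates.log | sort -s -k1,1 - /tmp/main.sorted \
    | awk -F'\t' '!seen[$1]++' > /tmp/main.new &&
  mv /tmp/main.new /tmp/main.sorted && : > /tmp/updates.log
}

put apple 9
get apple    # overlay wins with the new value 9
get cherry   # falls through to the main store's value 3
```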