For comparison, tineye.com (a reverse image search engine that I worked on a few years ago) queries well over 1 billion images (and thus rows of metadata) in under 0.5s (our goal was sub-250ms).
I also know that the size of the cluster is a fraction of 40 instances (especially if you don't include the subcluster of web crawlers).
By comparison, it doesn't seem super impressive. Tineye's metadata is stored in a sharded PostgreSQL database; the image index itself is a separate proprietary engine (also sharded).
Then again, we probably dealt with a smaller set of query types, so we could optimize those scenarios much more heavily. Nonetheless, from my experience it feels like they could do better with any of the options they've tried. It's important to be aware of where the bottleneck is. In our case, the IO of the EC2 storage disks was the bottleneck (which we mitigated by sharding). Once we moved the cluster off AWS and shoved some nice SSD drives into the machines, the performance increase was dramatic (half the shards for twice the speed, pretty much).
I'm saying that in terms of real time, if you have enough parallelism, you don't have to wait for n units, but rather log n units. As a function of CPU time, sure it's linear, but ignoring what parallelism gives you is just not being realistic, IMHO.
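To make the log n point concrete, here's a rough Python sketch of a tree reduction (purely illustrative, nobody's production code): each pass combines adjacent pairs, and with enough workers every pair in a pass runs concurrently, so the wall-clock depth is ceil(log2 n) even though the total work stays linear.

    import math

    def tree_reduce(values):
        # Each pass halves the number of partial sums; with enough parallelism,
        # one pass costs one unit of wall-clock time.
        rounds = 0
        while len(values) > 1:
            values = [values[i] + values[i + 1] if i + 1 < len(values) else values[i]
                      for i in range(0, len(values), 2)]
            rounds += 1
        return values[0], rounds

    total, rounds = tree_reduce(list(range(1_000_000)))
    assert total == sum(range(1_000_000))
    assert rounds == math.ceil(math.log2(1_000_000))  # 20 rounds, not 1,000,000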
As to the sibling comment, that's another good point. You say it's because it knows about the queries ahead of time, but you don't get excellent real-world performance without what an academic might consider cheating. Special-case your data structures to expected queries, where those queries are either empirically discovered and adjusted for manually, or learned automatically over time. A closed-minded approach that simply shuts down creative thinking by parroting out "linear time, no can do better, sorry" is poor form.
This doesn't change the fact that searching is at least as fast as aggregating, no matter how much you cheat. Besides, how is precomputing relevant to the discussion here? By that reasoning, I can convert any algorithm to O(1) time and O(1) space.
The topic is about how search was faster than aggregation, and I'm saying that searching is faster than aggregation generally. You can parallelize all you want, it's still going to be faster to search than to aggregate.
CouchDB does this. It stores precomputed sub-aggregates in the inner-nodes of its B-Tree, so that at query time, it is only summing a handful of numbers instead of all of them.
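For anyone curious what that looks like, here's a toy Python version of the same idea (an illustration of the technique, not CouchDB's actual B-tree code): inner nodes cache the sum of their subtree, so a full aggregate reads a single number and a range aggregate only combines O(log n) cached sums plus the few raw values at the edges of the range.

    class Node:
        def __init__(self, lo, hi, values):
            self.lo, self.hi = lo, hi
            if hi - lo <= 4:                           # small leaf block of raw values
                self.left = self.right = None
                self.leaf = values[lo:hi]
                self.total = sum(self.leaf)            # cached at build/update time
            else:
                mid = (lo + hi) // 2
                self.left = Node(lo, mid, values)
                self.right = Node(mid, hi, values)
                self.total = self.left.total + self.right.total  # inner node caches subtree sum

        def range_sum(self, lo, hi):
            if hi <= self.lo or lo >= self.hi:
                return 0                               # disjoint subtree: skip entirely
            if lo <= self.lo and hi >= self.hi:
                return self.total                      # fully covered: one cached number
            if self.left is None:                      # partially covered leaf: scan a few values
                return sum(self.leaf[max(lo, self.lo) - self.lo:min(hi, self.hi) - self.lo])
            return self.left.range_sum(lo, hi) + self.right.range_sum(lo, hi)

    values = list(range(1_000))
    root = Node(0, len(values), values)
    assert root.range_sum(0, 1_000) == sum(values)          # answered from the root's cache alone
    assert root.range_sum(10, 990) == sum(values[10:990])   # combines O(log n) cached sums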
Further extrapolating, m2.2xlarge has 8 pretty powerful 2.66GHz Nehalem Xeon cores, so that's 320 cores, or roughly 3.2 million "rows" (whatever that is) per second per core. Nothing to laugh at, but certainly not a revelation.
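Spelling out that arithmetic (numbers taken from the comments above; the per-core figure is approximate):

    instances = 40
    cores_per_instance = 8             # m2.2xlarge, per the comment above
    rows = 1_000_000_000
    seconds = 0.95                     # "950 milliseconds"

    total_cores = instances * cores_per_instance          # 320 cores
    rows_per_core_per_sec = rows / seconds / total_cores
    print(total_cores, round(rows_per_core_per_sec))       # 320, ~3.3 million rows/s per core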
Aren't metrics such as "seconds on an EC2 instance" of limited use, given that you get highly variable performance per instance depending on who else is using the actual hardware? Am I correct to assume that m2.2xlarge instances are shared like other instance types?
It's always tempting to build it yourself. Initially we built our own data store too, partly because it was a great geek challenge, and then went back to InnoDB and NoSQL solutions once we figured we couldn't compete with thousands of open-source devs and battle-hardened products.
Our hardware bill for our entire server cluster, which we own, is only slightly more than Druid's monthly cloud bill. [My company is the largest real-time analytics provider on the web]
Few people realize that Redis was created by Salvatore to provide a real-time analytics product - and then it grew into so much more. Also, InnoDB's clustered indexes are spectacular when it comes to answering questions like the ones the OP has posed.
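For context, the kind of thing Redis makes trivially easy is per-minute hit counters like the sketch below (key names and layout are just illustrative, not anyone's production schema; uses the standard redis-py client):

    import time
    import redis

    r = redis.Redis()

    def record_hit(site_id: str) -> None:
        minute = int(time.time()) // 60
        # One counter per site per minute; INCR is atomic, so many web servers
        # can write concurrently without coordination.
        r.incr(f"hits:{site_id}:{minute}")
        r.expire(f"hits:{site_id}:{minute}", 24 * 3600)   # keep a rolling day of data

    def hits_last_n_minutes(site_id: str, n: int = 5) -> int:
        now = int(time.time()) // 60
        keys = [f"hits:{site_id}:{m}" for m in range(now - n + 1, now + 1)]
        return sum(int(v or 0) for v in r.mget(keys))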
How's Redis' overall persistence/performance? I'm always wary of having it handle critical data in fear that it will just up and lose some (like MongoDB has done for me), but I love it otherwise. Have you had any bad experiences with using it in production?
On their home page CB currently claim 1,977,762 visitors across all sites. We see the following across our network:
397,727,080 visits per month.
1,002,792,862 impressions per month.
209,727,427 absolute uniques per month.
We use a third party to report this data to ensure objectivity when we're doing reports for outsiders. This is from the period March 30 to April 29 2011.
ChartBeat's numbers are concurrent users. No idea how that translates on the web, but I'd expect it's more than 200m uniques - we do 120ish off 100k-200k concurrents.
Disclaimer: I work at Endeca, which just launched Latitude, a mostly in-memory OLAP product. I work in the performance and scale group, and I've been experimenting with the speed of these things lately.
As an experiment, I wrote a simple, standalone program that takes a compressed database column and a list of which rows you care about, and does a group-by with count. (We're a column store, so every column is stored separately.) For 1 billion records, it took about 250 ms.
This is on a 4-socket Nehalem, so I had 32 cores and used 32 threads. If they're using Nehalems as well, and have 1 socket per box, they're using 320 cores. So they're about a factor of 40 slower than my simple experiment.
To be sure, theirs was a general system that presumably runs on their actual data, whereas mine was just an experiment. Still, it seems quite possible to get the same result with significantly less hardware.
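To give a feel for the shape of that experiment (not the actual code, which is compiled and multithreaded), here's a scaled-down, single-threaded NumPy stand-in: one dictionary-coded column, a row-selection mask, and a grouped count.

    import numpy as np

    n_rows = 10_000_000                        # scaled down from 1 billion
    column = np.random.randint(0, 50, size=n_rows).astype(np.int32)  # dictionary-coded column
    selected = np.random.rand(n_rows) < 0.5    # boolean "which rows you care about" mask

    # GROUP BY column, COUNT(*) over only the selected rows.
    counts = np.bincount(column[selected], minlength=50)
    print(counts[:5])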
Actually, I think this is what Vertica is for (VoltDB was a Vertica spin-off). Volt is intended for OLTP workloads whereas Vertica is designed for OLAP, especially giant star-schema workloads like the post described.
Vertica is legit; we tested it against a billion rows on a single server and it didn't blink an eye. It reads compressed data and optimizes compression based on the column's data type.
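Part of what makes this fast is that column stores can often aggregate the compressed representation directly. Here's a toy run-length-encoding example of the general idea (illustrating the technique, not Vertica's actual on-disk format):

    from itertools import groupby

    # A sorted, low-cardinality column compresses into a handful of runs.
    column = ["US"] * 400 + ["DE"] * 250 + ["FR"] * 350
    runs = [(value, len(list(group))) for value, group in groupby(column)]
    print(runs)                      # [('US', 400), ('DE', 250), ('FR', 350)]

    # GROUP BY country, COUNT(*) computed straight off the compressed runs,
    # without ever materializing the individual rows.
    counts = {value: length for value, length in runs}
    print(counts)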
For that EC2 price you could get a few years of Vertica licenses.
Vertica was C-Store, a column-oriented store for analysis. VoltDB was H-Store, an in-memory database for very fast OLTP. (From testing VoltDB, it screams, and scales nearly linearly.)
I'm not sure HStore/VoltDB was a "spin off" of Vertica - they are quite different models fundamentally.
They use different architectures but there is a real organizational connection between the two products. I interviewed with the VoltDB team at Vertica's office.
SAP's analytic appliance HANA (formerly called Business Warehouse Accelerator) has been doing this for at least 5 years now. And people think SAP is old-fashioned...
It took them 40 servers to run 1B aggregations per second? Great --- Citrusleaf recently did 250M aggregations per second with one Amazon Large instance. And, oh, we're fully clustered and reliable, so you can snap in new servers with no maintenance.
I'd love to see a demo. Ever since I saw Hummingbird's demo of real-time traffic analysis I've wanted to write real-time analytics.
I'd love to see this one rolling (even a video)
I'll be looking forward to their "part two" where they go into the architecture. Are they really handling 1B hits/s? Or do they have historical data of a billion hits that they can parse in a second?
Lost after "Over the last twelve months, we tried and failed to achieve scale and speed with relational databases (Greenplum, InfoBright, MySQL) and NoSQL offerings (HBase)."
I know for a fact that Greenplum can easily cover this load. Did they actually try running all these things on the same size cluster with the same amount of RAM? Were they all tuned properly?
Qlikview's options for getting data out suck royally. You're basically forced to use their native Windows client. Their web stuff is too immature to be viable.
Doesn't make it interesting for me; OLAP queries are very easy to distribute.
Also, kdb+ achieves a "billion rows per second" scan on a similar dataset with just one or two instances (rather than 40), if you know what you are doing.
1. "Our 40-instance (m2.2xlarge) cluster can scan, filter, and aggregate 1 billion rows in 950 milliseconds."