FWIW, in 2011, Google wrote that they achieved a PB sort in 33 minutes on 8000 computers, vs. 234 minutes on 190 computers with 6080 cores reported by Spark here.
I'm just writing down exactly what they reported. They used different metrics.
Certainly it would be interesting to have an apples-to-apples comparison. But the computers aren't the only relevant variable -- we also need to know about the networking hardware.
It's interesting, but not earth-shattering. The "10x fewer nodes" means nothing; how powerful are the new nodes? What's the network? Do you use SSDs? etc. etc.
They also tuned their code to this specific problem:
"Exploiting Cache Locality: In the sort benchmark, each record is 100 bytes, where the sort key is the first 10 bytes. As we were profiling our sort program, we noticed the cache miss rate was high, because each comparison required an object pointer lookup that was random. ... Combining TimSort with our new layout to exploit cache locality, the CPU time for sorting was reduced by a factor of 5."
I would love to see MR and Spark compete on the exact same hardware configuration.
Since each node was handling 500GB of data (roughly), I think the disk speed may have been a more critical factor since each node had 244GB of memory. Their nodes used SSDs; the older nodes used spinning rust. The seek times alone will be a killer.
If only that were true -- the shuffle is typically seek-bound when the intermediate data doesn't fit into cache (plenty of papers show this pretty conclusively).
Doesn't that speak to his point? Either smaller memory and more nodes, or more memory and fewer nodes. Why not do apples to apples? (This feels like the benchmarketing going on in browsers, which, at this point, is largely meaningless.)
Edit: on the other hand, this is an endorsement of the current wave of "per node performance stinks, let's avoid rewriting software for an extra year or two by throwing SSDs at it." Great for hardware vendors!
No it doesn't. The old record used 2100 nodes so the entire data actually fit in memory. There shouldn't be much seek happening even in the MR 2100 case. In Spark's case, the data actually doesn't fit in memory.
Also, this was primarily network-bound. The old record had 2100 nodes with 10Gbps network.
Ah, so you're saying that since this is network-bound, you'd rather launch a few bigger nodes (those w/ SSDs -- over 1PB worth!) than many wimpy ones. If it were memory-bound, you wouldn't care how they're spread as long as it all stayed in memory.
Another way of looking at this is performance per watt or dollar. The r3.2 has 60GB, so comparing to that, Spark cost about the same ~$1.4K while giving a 3X speedup. (Or on a host that charges per minute, the same performance at a third of the cost.)
This is me not knowing the space: would MR (or more modern things like Tez) perform worse on this HW setup, or is this a reminder that hardware/config tuning matters?
Yes absolutely. If I can get a single machine with 200TB of SSDs, that'd have been great :)
But as soon as we have more than one node, having more nodes is better. We can actually demonstrate this quantitatively. We are required to replicate the output, which means 100 TB of data generates 100 TB of network traffic for replication, plus 100 * (N-1)/N TB for the shuffle, where N = number of nodes. That is 200 - 100/N TB of network traffic overall. Assuming each node has 1GB/s of network, you'd need (200 - 100/N) * 1000 / N seconds just to move the data across the network, i.e. the optimal number of nodes is the one that minimizes (200/N - 100/N^2).
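The arithmetic above can be sketched in a few lines (a toy model only, assuming the stated 100 TB input, replicated output, and 1 GB/s of network per node):

```python
# Toy model of the network-bound time derived above:
# replication moves 100 TB, the shuffle moves 100 * (N-1)/N TB,
# so total traffic is (200 - 100/N) TB across N GB/s of aggregate
# bandwidth, i.e. (200/N - 100/N^2) * 1000 seconds.

def network_seconds(n_nodes, data_tb=100, node_gb_per_s=1.0):
    total_tb = 2 * data_tb - data_tb / n_nodes        # 200 - 100/N
    aggregate_gb_per_s = n_nodes * node_gb_per_s      # N GB/s
    return total_tb * 1000 / aggregate_gb_per_s       # 1 TB = 1000 GB

for n in (10, 100, 206, 2100):
    print(n, round(network_seconds(n)))
```

The function is monotonically decreasing for N > 1, which matches the claim that once you have more than one node, adding nodes only helps -- in this network-only model, at least.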
The problem with last year's MR run was that it wasn't saturating the network at all. It had roughly 1.5GB/s/node of HDD bandwidth, while the overall network throughput was probably around 20MB/s/node on a 10Gbps network (I'm assuming only half of the time was spent doing network; if they were doing network the full time, then throughput was ~10MB/s/node).
Had we been using the same HDDs as last year's entry, our map phase would have slowed down by about 2x, and the reduce phase shouldn't change. This would mean a total run time of under 30 mins on ~200 nodes. Still way better.
As per the specification of this test, the data has to be committed on disk before it is considered sorted. So even if it all fits in memory, it has to be on disk before the end.
So you have 100TB of disk read, followed by 100TB of disk write, all on HDDs. That's about 100GB/node; and since Hadoop nodes are typically in RAID-6, each write has an associated read and write too.
This does not even include the intermediate files, which (depending on how the kernel parameters have been set), could have been written on disk. Typical dirty_background_ratio is 10; so after 6GB of dirty pages, pdflush will kick in and start writing to the spinning disk.
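For concreteness, the threshold mentioned above works out as follows (a rough sketch; the kernel actually computes the ratio against reclaimable memory, not total RAM):

```python
# vm.dirty_background_ratio is a percentage: once roughly that share
# of memory is dirty page-cache data, the kernel's flusher threads
# (pdflush in older kernels) start background writeback to disk.

def background_writeback_threshold_gb(mem_gb, dirty_background_ratio=10):
    return mem_gb * dirty_background_ratio / 100

# With ~60 GB of page-cache-eligible memory and the default ratio of
# 10, background writeback kicks in after about 6 GB of dirty pages:
print(background_writeback_threshold_gb(60))
```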
Yes, that's the key. The local IO on previously reported systems was HDDs, thus 300-600MB/s at best with 6 drives per machine, while these guys are getting 3.3GB/s. Getting the full performance -- 1.1GB/s -- out of a 10Gbps network: were they lucky, or is AWS really that good now? (A couple of years ago I was able to get only 400MB/s node-to-node on cluster compute nodes there.)
> I would love to see MR and Spark compete on the exact same hardware configuration.
You may find this benchmark [1] interesting to read.
It needs some updating (a lot has changed since February 2014), but it compares Shark (which uses Spark as its execution engine) to Hive (using Hadoop 1 MapReduce as its execution engine) and a number of other systems.
The benchmark is run on EC2 and is detailed in such a way that it should be independently verifiable. Hive and Shark are run on identically sized clusters, though I don't know if the other details of the configuration were identical.
I believe the network had become a bottleneck. As per the article:
> [O]ur Spark cluster was able to sustain ... 1.1 GB/s/node network activity during the reduce phase, saturating the 10Gbps link available on these machines.
If the network is the bottleneck it makes sense to reduce the number of nodes to reduce the network communications.
For the curious, the (max) price of those instances is $6.82/hr, so 206 * 6.82 * (23/60) = $538.55 -- if they did it with non-reserved instances in US East.
If they used reserved instances in US East, it drops to $181.
Obviously there are lots of costs involved beside the final perfect run, but it's an interesting ballpark.
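The arithmetic above, spelled out (using the quoted 2014 on-demand price of $6.82/hr and the 23-minute run):

```python
# Ballpark EC2 cost of the winning run: on-demand price is per
# instance-hour, so scale by the fraction of an hour used.
nodes = 206
on_demand_per_hour = 6.82    # USD, US East, 2014 pricing (as quoted above)
run_minutes = 23

cost = nodes * on_demand_per_hour * run_minutes / 60
print(round(cost, 2))
```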
One of the big positives of Spark is that its architecture is amenable to having workers run on spot instances, which are even cheaper than reserved instances.
Your post mentions "single root IO virtualization" as a factor in maximizing network performance. I am wondering what the impact of this was in your sorting. Do you have data for runs where you didn't enable this?
Hi Reynold! Do you have numbers / intuition for how previous versions of spark would have run? I'm upgrading (soon) from spark 0.8 to spark 1.1 and am curious to see the performance gains (especially w.r.t. shuffles)
It would help a little bit (maybe a few percent), but not much because the scheduling latency was relatively low for these tasks (the largest scheduling delay was ~10 secs, whereas each task takes minutes).
The strength of Hadoop isn't so much speed as the fact that it's been around a while and there is a pretty impressive, fairly mature set of projects that comprise the Hadoop ecosystem, from YARN to Hive, etc. There are still many issues to resolve, and this evolution will continue for decades to come.
The TB sort benchmark is pretty useless to me -- I am much more concerned with stability and a vibrant community (which means people, the software they write, and the institutions using Hadoop in production).
Last time I tinkered with Spark (over a year ago) it was so buggy as to be next to useless, but perhaps things have changed.
Still - the idea that there is some sort of a revolutionary new approach that is paradigm-shifting and is way better than anything before should be viewed with extreme skepticism.
The problem of distributed computing is not a simple one. I remember tinkering with the Linux kernel back in the mid nineties, and 20 years later it still has ways to go to improve.
Twenty years from now it might or might not be Hadoop that is the tool for this sort of thing, we don't know, but I will not take seriously anything or anyone who claims that the "next best thing" is here in 2014.
Spark requires Hadoop to run, so this whole Spark vs Hadoop debate makes no sense whatsoever.
There is a place for arguing about how effective Map/Reduce is, but it's been known for years that M/R is neither the only nor the best general-purpose model for solving all problems. More and more tools these days do not use M/R, Spark included, and Spark is certainly not the first tool to provide an alternative to M/R. AFAIK Google abandoned M/R years ago.
I just don't understand this constant boasting about Spark, it seems very suspicious to me.
This is not correct. Spark uses the Hadoop Input/Output API, but you don't need any Hadoop component installed to run Spark, not even HDFS.
You can -- and many companies do -- run Spark on Mesos or on Spark's standalone cluster manager, and use S3 as their storage layer.
> this whole Spark vs Hadoop debate makes no sense whatsoever
If we talk about Hadoop as an ecosystem of tools, then yes, it doesn't make sense to frame Spark as a competitor. Spark is part of that ecosystem.
But if we talk about Hadoop as Hadoop 1 MapReduce or as Hadoop 2 Tez, both of which are execution engines, then it very much makes sense to pit Spark against them as an alternative execution engine.
Granted, Hadoop 1 MapReduce is pretty old compared to Spark, and Tez is still under heavy development, but these are alternatives and not complements to Spark.
(Note: in Hadoop 2, MapReduce is just one framework running on top of YARN; Tez is a separate execution engine that also runs on YARN.)
> I just don't understand this constant boasting about Spark, it seems very suspicious to me.
Suspicious how?
I think Spark's elegant API, unified data processing model, and performance -- all of which are documented very well in demos and benchmarks online -- merit the excitement that you see in the "Big Data" community.
Actually Doug Cutting himself (who created Hadoop) tweeted about this. I guess Spark gets some of his blessing :)
As pointed out in the article multiple times, we are comparing with MR here. We are not comparing with Hadoop as an ecosystem. Spark plays nicely with Hadoop. As a matter of fact, this experiment ran on HDFS.
In terms of vibrant community, Spark is now the largest open source Big Data project by community/contributor count. More than 300 people have contributed code to the project.
Going off on a tangent here: this benchmark highlights the difficulty of sorting in general. Sorts are necessary for computing exact percentiles (such as the median). In practical applications, an approximate algorithm such as t-digest should suffice: you can return results in seconds, as opposed to running "chest thumping" benchmarks to prove a point. :)
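To illustrate the idea (a toy sketch using plain reservoir sampling as a stand-in; t-digest maintains a far more accurate compressed summary, especially in the tails):

```python
import random

def approx_median(stream, sample_size=1000, seed=42):
    """Estimate the median of a one-pass stream via reservoir sampling."""
    rng = random.Random(seed)
    reservoir = []
    for i, x in enumerate(stream):
        if i < sample_size:
            reservoir.append(x)
        else:
            # Replace a random slot with probability sample_size / (i + 1),
            # keeping the reservoir a uniform sample of the stream so far.
            j = rng.randrange(i + 1)
            if j < sample_size:
                reservoir[j] = x
    reservoir.sort()
    return reservoir[len(reservoir) // 2]

# Estimate the median of a million values without sorting them all:
print(approx_median(range(1_000_000)))
```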
http://googleresearch.blogspot.com/2011/09/sorting-petabytes...