It's interesting, but not earth-shattering. The "10x fewer nodes" means nothing;...

saryant · on Oct 10, 2014

The article says exactly what they ran on. EC2 i2.8xlarge instances which have 32 cores, 800GB SSD and 244GB RAM.

grier · on Oct 10, 2014

Also used the "Enhanced Networking" option on the instances which means single root I/O virtualization underneath.

discardorama · on Oct 10, 2014

I read that. But how does that compare with the nodes they're comparing against ("10x fewer nodes")?

rxin · on Oct 10, 2014

The old entry had 10Gb/s <full-duplex> (40 nodes/rack 160Gbps rack to spine. 2.5:1 subscription), 64GB of RAM, and 12 x 3TB SATA.

The network part is probably the most important one here, and both have comparable network.

discardorama · on Oct 10, 2014

Since each node was handling 500GB of data (roughly), I think the disk speed may have been a more critical factor since each node had 244GB of memory. Their nodes used SSDs; the older nodes used spinning rust. The seek times alone will be a killer.

rxin · on Oct 10, 2014

Not sure why you mentioned seek time. In large scale, distributed sorting, I/O is mostly sequential.

tlipcon · on Oct 11, 2014

If only that were true -- the shuffle is typically seek-bound when the intermediate data doesn't fit into cache (plenty of papers show this pretty conclusively).

rxin · on Oct 11, 2014

Hi Todd,

Except in the case of MR 2100 nodes the entire dataset fit in memory :)

lmeyerov · on Oct 11, 2014

Doesn't that speak to his point? Either smaller memory and more nodes, or more memory and less nodes. Why not do apples to apples? (This feels like the benchmarketing going on in browsers, which, at this point, is largely meaningless.)

Edit: on the other hand, this is an endorsement of the current wave of "per node performance stinks, let's avoid rewriting software for an extra year or two by throwing SSDs at it." Great for hardware vendors!

rxin · on Oct 11, 2014

No it doesn't. The old record used 2100 nodes so the entire data actually fit in memory. There shouldn't be much seek happening even in the MR 2100 case. In Spark's case, the data actually doesn't fit in memory.

Also this was primary network bound. The old record had 2100 nodes with 10Gbps network.

lmeyerov · on Oct 11, 2014

Ah, so you're saying as this is network bound, you want to launch much fewer wimpy nodes in exchange for a few bigger ones (those w/ SSDs -- over 1PB worth!). If it was memory-bound, you wouldn't care how they're spread as long as it was still in-memory.

Another way of looking at this is performance per watt or dollar. The r3.2 has 60GB, so comparing to that, Spark cost the same ~$1.4K while giving a 3X speedup. (Or on host that charges per minute, it'd be the same performance at 3X cheaper.)

This is me not knowing the space: would MR (or more modern things like Tez) perform worse on this HW setup, or is this a reminder that hardware/config tuning matters?

rxin · on Oct 11, 2014

Yes absolutely. If I can get a single machine with 200TB of SSDs, that'd have been great :)

But as soon as we have more than 1 node, then having more nodes is better. We can actually demonstrate this quantitatively. We are required to replicate the output, which means 100 TB of data would generate 100 TB of network for replication, and 100 * (N-1)/N TB of network for shuffle, where N = num nodes. That is the overall 200 - 100/N in network. Assuming each node has 1GB/s network, then you'd need (200 - 100/N) / N seconds just to transfer the data across network, i.e. the optimal number of nodes is the one that gives you the lowest (200/N - 100/N^2).

The problem is with last year's MR run was that it wasn't saturating the network at all. It had roughly 1.5GB/s/node HDDs, and the overall network throughput was probably around 20MB/s/node when they were using 10Gbps network (I'm assuming only half of the time is doing network. If they were doing network the full time, then network throughput was at ~10MB/s/node).

Had we been using the same HDDs last year's entry had, our map phase would slow down by about 2x, and the reduce phase shouldn't change. This would mean a total run time of less than 30 mins on ~200 nodes. Still way better.

discardorama · on Oct 11, 2014

As per the specification of this test, the data has to be committed on disk before it is considered sorted. So even if it all fits in memory, it has to be on disk before the end.

So you have 100TB of disk read, followed by 100TB of disk write, all on HDDs. That's about 100GB/node; and since Hadoop nodes are typically in RAID-6, each write has an associated read and write too.

This does not even include the intermediate files, which (depending on how the kernel parameters have been set), could have been written on disk. Typical dirty_background_ratio is 10; so after 6GB of dirty pages, pdflush will kick in and start writing to the spinning disk.

rxin · on Oct 13, 2014

Yes, but the final data is sequential only. We were discussing about random access, which only applies to the intermediate shuffle file.

Maybe you can email me offline. I can tell you more about the setup and how Spark / MapReduce works w.r.t. to it.

Twirrim · on Oct 11, 2014

3TB SATA would indicate spinning rust too, so slower storage. It's far from an apples to apples comparison.

discardorama · on Oct 10, 2014

Not just 800GB of SSD; 8x 800GB of SSD!

trhway · on Oct 10, 2014

yes, that's the key. The local IO on previous reported systems was HDDs, thus 300-600Mb/s at best with 6 drives per machine, while this guys are getting 3.3Gb/s. Getting full performance - 1.1Gb/s - of 10Gb network - were they lucky or AWS now is that good ? (couple years ago i was able to get only 400Mb/s node-to-node on cluster compute nodes there)

nchammas · on Oct 11, 2014

> I would love to see MR and Spark compete on the exact same hardware configuration.

You may find this benchmark [1] interesting to read.

It needs some updating (a lot has changed since February 2014), but it compares Shark (which uses Spark as its execution engine) to Hive (using Hadoop 1 MapReduce as its execution engine) and a number of other systems.

The benchmark is run on EC2 and is detailed in such a way that it should be independently verifiable. Hive and Shark are run on identically sized clusters, though I don't know if the other details of the configuration were identical.

[1] https://amplab.cs.berkeley.edu/benchmark/