Yes absolutely. If I can get a single machine with 200TB of SSDs, that'd have be...

Yes absolutely. If I can get a single machine with 200TB of SSDs, that'd have been great :)

But as soon as we have more than 1 node, then having more nodes is better. We can actually demonstrate this quantitatively. We are required to replicate the output, which means 100 TB of data would generate 100 TB of network for replication, and 100 * (N-1)/N TB of network for shuffle, where N = num nodes. That is the overall 200 - 100/N in network. Assuming each node has 1GB/s network, then you'd need (200 - 100/N) / N seconds just to transfer the data across network, i.e. the optimal number of nodes is the one that gives you the lowest (200/N - 100/N^2).

The problem is with last year's MR run was that it wasn't saturating the network at all. It had roughly 1.5GB/s/node HDDs, and the overall network throughput was probably around 20MB/s/node when they were using 10Gbps network (I'm assuming only half of the time is doing network. If they were doing network the full time, then network throughput was at ~10MB/s/node).

Had we been using the same HDDs last year's entry had, our map phase would slow down by about 2x, and the reduce phase shouldn't change. This would mean a total run time of less than 30 mins on ~200 nodes. Still way better.