
The 100 terabyte benchmark used 206 Spark nodes, compared with 2100 Hadoop nodes.

Going up to 1 petabyte, the Hadoop comparison grows to 3800 nodes, while the Spark benchmark actually reduced the node count to 190.
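
Just from those node counts, the amount of data each node handled differs a lot between the two Spark runs. A rough back-of-the-envelope in Scala, using only the figures above and treating 1 PB as 1000 TB:

    // Data handled per node in each Spark run, decimal units.
    val bytesPerNode100TB = 100e12 / 206     // ~485 GB per node
    val bytesPerNode1PB   = 1000e12 / 190    // ~5.3 TB per node
    println(f"100 TB run: ${bytesPerNode100TB / 1e12}%.2f TB/node")  // ~0.49
    println(f"1 PB run:   ${bytesPerNode1PB / 1e12}%.2f TB/node")    // ~5.26

So the 1 PB run pushed roughly ten times as much data through each node as the 100 TB run.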

Does Spark scale well beyond ~200 nodes, or does the network become the bottleneck?

In any case, it's an impressive result considering that they didn't use Spark's in-memory cache.
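
For what it's worth, a sort like this doesn't rely on the cache at all. Here's a minimal sketch of a Spark sort job with no cache()/persist() anywhere (illustrative only, not the benchmark's actual code; the paths and the 10-byte-key record layout are just assumptions), so the data simply streams through the sort-based shuffle:

    import org.apache.spark.{SparkConf, SparkContext}

    object SortSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("sort-sketch"))

        // Treat the first 10 bytes of each line as the key, the rest as payload.
        val records = sc.textFile("hdfs:///input")
          .map { line => val (key, value) = line.splitAt(10); (key, value) }

        // sortByKey drives a full shuffle; with nothing cached, the data goes
        // disk -> network -> disk instead of sitting in executor memory.
        records.sortByKey()
          .map { case (key, value) => key + value }
          .saveAsTextFile("hdfs:///output")

        sc.stop()
      }
    }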




I believe the network does become the bottleneck. As per the article:

> [O]ur Spark cluster was able to sustain ... 1.1 GB/s/node network activity during the reduce phase, saturating the 10Gbps link available on these machines.

If the network is the bottleneck, it makes sense to reduce the number of nodes so that less of the data has to travel over the network.
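
To put that quoted figure in perspective, 1.1 GB/s of per-node traffic is already most of a 10 Gbps link. Quick conversion (decimal units):

    // Convert 1.1 GB/s per node into line-rate terms.
    val perNodeBytesPerSec = 1.1e9                       // 1.1 GB/s
    val perNodeGbps        = perNodeBytesPerSec * 8 / 1e9
    println(f"$perNodeGbps%.1f Gbps of a 10 Gbps link")  // ~8.8 Gbps

Around 8.8 Gbps of application-level throughput on a 10 Gbps NIC is effectively saturation once protocol overhead is accounted for.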


The job actually scales very close to linearly: running it on 200 nodes roughly doubles the throughput of 100 nodes.


At that point it is mainly the cost of getting nodes from EC2; it becomes hard to get a huge number of i2.8xl instances.

Spark runs fine on thousands of nodes.



