Cluster scheduling is a huge area. I used to work on MPI clusters, and it is an art to balance CPU, bandwidth, and propagation time to pick the optimum number of processors for a particular algorithm.
Especially on commodity Ethernet-based MPI: it doesn't do broadcast, so shipping a Gb common dataset to 64 nodes can take a lot longer than actually doing the calculation.
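To put rough numbers on that, here is a back-of-the-envelope cost model, not a measurement. It assumes "a Gb" means 1 gigabit, a 1 Gbit/s link per node, and a made-up 100 microsecond per-message latency; the function names are hypothetical. It compares the root sending the dataset to each node in turn against a tree-style fan-out where every node that already holds the data forwards it on.

```python
import math

def linear_send_time(size_bits, nodes, bandwidth_bps, latency_s):
    """Root sends the full dataset to each of the other nodes, one after another."""
    return (nodes - 1) * (latency_s + size_bits / bandwidth_bps)

def tree_send_time(size_bits, nodes, bandwidth_bps, latency_s):
    """Tree fan-out: each round, every node holding the data forwards it to one more node."""
    rounds = math.ceil(math.log2(nodes))
    return rounds * (latency_s + size_bits / bandwidth_bps)

# Illustrative assumptions only, not figures from a real cluster.
SIZE_BITS = 1e9        # 1 gigabit dataset
NODES = 64
BANDWIDTH_BPS = 1e9    # 1 Gbit/s Ethernet link
LATENCY_S = 1e-4       # 100 microseconds per message

if __name__ == "__main__":
    linear = linear_send_time(SIZE_BITS, NODES, BANDWIDTH_BPS, LATENCY_S)
    tree = tree_send_time(SIZE_BITS, NODES, BANDWIDTH_BPS, LATENCY_S)
    print(f"one-by-one sends: {linear:.1f} s")
    print(f"tree fan-out:     {tree:.1f} s")
```

Under those assumptions the one-by-one case comes out at about 63 seconds and the tree at about 6, so if the calculation itself only takes a few seconds, distribution dominates either way. The model ignores switch contention and protocol overhead, so treat it as a lower bound.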
Strange -- I always just sort of assumed that since they are making big clusters, they could spend the extra $$ for a good multicast switch, and that MPI did IP multicast. (A quick googling shows me to be wrong...)