The Xilinx ISE toolset I am currently using (which Xilinx is deprecating in favor of the new Vivado toolset) parallelizes/multithreads poorly. As I understand it, the place and route algorithm is based on simulated annealing: you make a small random perturbation to the current layout configuration, measure whether the result is better or worse, and then either retain the new configuration or roll back. Occasionally retaining a worse configuration is what keeps the search from getting stuck in a local maximum, so the system gradually evolves toward a configuration that maximizes some objective function. It has traditionally been a challenge to parallelize this sequential algorithm through design partitioning because of placement and routing interactions between the partitions.
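To make the perturb / measure / keep-or-roll-back loop concrete, here is a minimal Python sketch of simulated annealing on a toy placement problem. It is only an illustration under my own assumptions, not what ISE actually does: the names `anneal_placement` and `wirelength`, the square grid of sites, the two-pin nets with Manhattan-distance cost, and the geometric cooling schedule are all hypothetical choices. It minimizes a cost, which is equivalent to maximizing the negated objective.

```python
import math
import random

def wirelength(pos, nets):
    """Total Manhattan length over all two-pin nets (lower is better)."""
    return sum(abs(pos[a][0] - pos[b][0]) + abs(pos[a][1] - pos[b][1])
               for a, b in nets)

def anneal_placement(n_cells, nets, grid, steps=20000,
                     t_start=5.0, t_end=0.01, seed=0):
    """Toy annealer: place n_cells on a grid x grid array of sites."""
    rng = random.Random(seed)
    # Random legal start. Entries at index >= n_cells are empty sites,
    # so swapping a cell with one of them moves the cell to a free site.
    pos = [(x, y) for x in range(grid) for y in range(grid)]
    rng.shuffle(pos)
    cost = wirelength(pos, nets)
    initial_cost = cost
    best_cost, best_pos = cost, list(pos)
    for step in range(steps):
        # Geometric cooling from t_start down towards t_end.
        t = t_start * (t_end / t_start) ** (step / steps)
        # Small random perturbation: swap the contents of two sites.
        i, j = rng.sample(range(len(pos)), 2)
        pos[i], pos[j] = pos[j], pos[i]
        new_cost = wirelength(pos, nets)
        delta = new_cost - cost
        # Always keep improvements; keep a worse layout with
        # probability exp(-delta / t), which shrinks as t cools.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            cost = new_cost
            if cost < best_cost:
                best_cost, best_pos = cost, list(pos)
        else:
            pos[i], pos[j] = pos[j], pos[i]  # roll back
    return best_pos[:n_cells], initial_cost, best_cost
```

Each iteration depends on the layout left behind by the previous one, which is why the loop is inherently sequential. If you split the cells into partitions and anneal them in separate threads, any net that crosses a partition boundary makes one thread's cost depend on moves made by another.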

In some flows you can do a coarse floorplan of your design and route the submodules separately and then stitch them together. I imagine this is how the very largest devices are implemented in manageable design iteration times.

I don't usually worry about that, though. Since my design is just many replicated tiles, I tend to run design iterations on only 4 or 16 processor elements to test the impact on clock period / timing slack. That usually takes 2-3 minutes per design spin. Only once in a while do I place and route the whole chip to confirm that some change doesn't impact timing closure.