Is programming speed really that bad for the ultra-high-end devices? Minutes? I don't remember it being that bad on the Amazon F1 when I ported a Xilinx build to the F1 SDK. (I didn't spend much time with our prior one, so I wouldn't know.) Of course, their programming strategy is extremely customized, but even for very high-utilization images it was only ever on the order of seconds. Vivado is absolutely, terribly slow though, no matter what you do or which device you use. (Not to mention if you want to use the ILA support over the internet...)
Also, for some designs you can mitigate the reconfiguration time issue by having two regions and draining the requests going to one of them before updating it. Most of the Xilinx tooling for OpenCL does this kind of thing by default (4-6 "opencl kernel" regions.) But of course it's not always an option to give up that much space...
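To make the two-region idea concrete, here is a rough host-side sketch in Python. Everything in it (`Region`, `Router`, `begin_drain`, `try_reconfigure`, the image labels) is a hypothetical name made up for illustration, not part of any Xilinx or F1 API, and the actual reconfiguration is reduced to swapping a string.

```python
class Region:
    """One reconfigurable region, as seen by the host-side dispatcher (hypothetical)."""

    def __init__(self, name, image="v1"):
        self.name = name
        self.image = image      # label for whatever is currently loaded
        self.accepting = True   # False while the region is being drained
        self.in_flight = 0      # requests dispatched but not yet completed


class Router:
    """Sends requests to regions and lets one region be drained and updated."""

    def __init__(self, regions):
        self.regions = list(regions)

    def dispatch(self):
        # Only regions that are not draining may receive new work.
        candidates = [r for r in self.regions if r.accepting]
        if not candidates:
            raise RuntimeError("no region is accepting requests")
        region = min(candidates, key=lambda r: r.in_flight)
        region.in_flight += 1
        return region

    def complete(self, region):
        region.in_flight -= 1

    def begin_drain(self, region):
        # Stop routing new requests here; outstanding ones keep running.
        region.accepting = False

    def try_reconfigure(self, region, new_image):
        # Refuse to touch the region until it is actually idle.
        if region.accepting or region.in_flight > 0:
            return False
        region.image = new_image  # stand-in for the real partial reconfiguration
        region.accepting = True
        return True
```

The cost is the one already mentioned: while one region is draining or being reloaded, the other has to carry all the traffic, and you pay for both in area.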
It depends on the programming interface. JTAG is bit-serial and rather slow, so it can take quite a while to load a large FPGA via JTAG. However, there are several other interfaces that can be used, including QSPI, dual QSPI, parallel flash, and a simple parallel interface driven by some other controller. These can run at many MHz and can load a configuration into a large FPGA in less than a second.
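As a back-of-the-envelope check, the load time is roughly the bitstream size divided by clock rate times bus width. The sketch below uses made-up numbers (a 100 Mbit bitstream, 10 MHz JTAG, 50 MHz for the others); they are assumptions for illustration only, not figures for any particular device, and protocol overhead and compression are ignored.

```python
def config_time_seconds(bitstream_bits, clock_hz, bus_width_bits=1):
    """Ideal time to shift a bitstream in: bits / (clock rate * bits per clock)."""
    return bitstream_bits / (clock_hz * bus_width_bits)


# Assumed, illustrative values only.
BITSTREAM_BITS = 100_000_000

interfaces = {
    # name: (clock in Hz, data bits transferred per clock)
    "JTAG": (10_000_000, 1),
    "QSPI": (50_000_000, 4),
    "dual QSPI": (50_000_000, 8),
    "16-bit parallel": (50_000_000, 16),
}

for name, (clock_hz, width) in interfaces.items():
    seconds = config_time_seconds(BITSTREAM_BITS, clock_hz, width)
    print(f"{name}: {seconds:.3f} s")
```

With these assumed numbers, JTAG comes out at 10 s and the wider interfaces at half a second or less, which is the gap described above. Real JTAG loads are often slower than the ideal figure because of adapter and software overhead.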