Yes, this is basically the idea. However, the solutions differ for training and inference. For training, I would recommend adding automatic checkpointing, and even considering model migration. For inference (which I think is the original concern), over-provisioning is the key, simply because it takes a long time to load the model. You also want to diversify your node types, etc.
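In case it helps, here is a minimal sketch of what I mean by automatic checkpointing, assuming PyTorch and a shared checkpoint directory; the paths, interval, and helper names are just placeholders, not a prescribed setup:

```python
# Minimal automatic-checkpointing sketch (assumes PyTorch; CKPT_DIR and
# SAVE_EVERY are illustrative values, not recommendations).
import os
import torch

CKPT_DIR = "/mnt/shared/checkpoints"   # shared storage so a replacement node can resume
SAVE_EVERY = 500                       # steps between checkpoints

def save_checkpoint(step, model, optimizer):
    path = os.path.join(CKPT_DIR, f"step_{step}.pt")
    tmp = path + ".tmp"
    torch.save({
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }, tmp)
    os.replace(tmp, path)              # atomic rename avoids half-written checkpoints

def load_latest_checkpoint(model, optimizer):
    ckpts = sorted(
        (f for f in os.listdir(CKPT_DIR) if f.endswith(".pt")),
        key=lambda f: int(f.split("_")[1].split(".")[0]),
    )
    if not ckpts:
        return 0                       # nothing saved yet, start from scratch
    state = torch.load(os.path.join(CKPT_DIR, ckpts[-1]))
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]

# In the train loop: resume from the latest checkpoint, then save periodically.
# start_step = load_latest_checkpoint(model, optimizer)
# for step in range(start_step, total_steps):
#     ...training step...
#     if step % SAVE_EVERY == 0:
#         save_checkpoint(step, model, optimizer)
```

The point is just that when a node dies, a replacement picks up from the last saved step instead of restarting the whole run; how you trigger the restart (orchestrator, job scheduler, etc.) is up to your setup.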