Oh, that explains it! I thought somehow the driver installation broke despite having made an AMI of the same instance type with drivers already installed, so I just run the install shell script every time and then it works. I figured maybe some hardware address was wrong or it was subtly different hardware, but it being time-based makes a lot more sense.
Absolutely, same here. Additionally, there seems to be very little consistency in GPU instance start-up times: I've had 30 seconds one moment and 5 minutes another. Can't say I've experienced 10 minutes, luckily.
Yeah, less than a few seconds with spot consistently would be nice, but I've never seen it. When I was handling autoscaling via the EC2 API + Python + nginx, the daemon I wrote pretty much had to have a while loop to continuously check connectivity via SSH (roughly like the sketch below) after a t3a.medium (with Ubuntu) was kicked off from `ec2.request_spot_instances`.
Perhaps I should have been using "Clear Linux 34640"
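For anyone curious, the daemon boiled down to something like this. A rough boto3 sketch, not the original code: the AMI ID, key name, and the `wait_for_ssh` helper are placeholders of mine.

```python
import socket
import time

import boto3  # assumes the standard boto3 EC2 client

ec2 = boto3.client("ec2")


def wait_for_ssh(host, port=22, timeout=600, interval=5):
    """Poll until the instance accepts TCP connections on the SSH port."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


# Request a spot instance roughly as described above (t3a.medium, Ubuntu AMI).
req = ec2.request_spot_instances(
    InstanceCount=1,
    LaunchSpecification={
        "ImageId": "ami-0123456789abcdef0",  # placeholder Ubuntu AMI
        "InstanceType": "t3a.medium",
        "KeyName": "my-key",                 # placeholder key pair
    },
)
req_id = req["SpotInstanceRequests"][0]["SpotInstanceRequestId"]

# Wait until the request is fulfilled, then look up the instance's public IP
# (assumes the instance actually gets a public address).
ec2.get_waiter("spot_instance_request_fulfilled").wait(SpotInstanceRequestIds=[req_id])
instance_id = ec2.describe_spot_instance_requests(
    SpotInstanceRequestIds=[req_id]
)["SpotInstanceRequests"][0]["InstanceId"]
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
ip = ec2.describe_instances(InstanceIds=[instance_id])[
    "Reservations"][0]["Instances"][0]["PublicIpAddress"]

# Even after the instance is "running", SSH can take a while to come up.
print("ssh reachable:", wait_for_ssh(ip))
```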
Probably depends on your spot price... We always set the maximum spot price at the price of "on-demand" instances, but due to how priorities work it may still sometimes not work when the data center in a given region is out of capacity for a given instance type.
TBH the bigger issue than boot time for AWS is the lack of physical resources to fulfill the demand - this is what most big players are struggling with.
No no, the spot price matches, the instance boots, you can SSH to it and do everything except use the GPU. E.g. you try to run your PyTorch NN training, but it freezes for 5-10 minutes, then it runs fine. If you start your training again, it runs immediately.
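If you want to see it for yourself, something like this shows the gap between the first and second GPU touch. Just a quick sketch that measures the symptom, not the driver-side cause, and it assumes PyTorch with CUDA is installed:

```python
import time

import torch


def timed_cuda_touch():
    """Allocate a tensor on the GPU and wait for the work to finish."""
    start = time.time()
    torch.randn(1024, 1024, device="cuda")  # first call triggers CUDA/driver init
    torch.cuda.synchronize()
    return time.time() - start


print(f"first GPU touch:  {timed_cuda_touch():.1f}s")   # can take minutes right after boot
print(f"second GPU touch: {timed_cuda_touch():.1f}s")   # typically well under a second
```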
"launch GPU spot instance" to "I can actually use the GPU, at least by running nvidia-smi".
I find that this can take up to 10 minutes and for the more expensive instances, this can mean non-negligible amount of money.
Great article BTW
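For what it's worth, the "can I actually use the GPU" check is basically retrying nvidia-smi until it succeeds and timing how long that takes. A rough sketch; the 5-second retry interval is an arbitrary choice of mine:

```python
import subprocess
import time

# Run this right after launching the instance: time from "now" until nvidia-smi succeeds.
start = time.time()
while True:
    result = subprocess.run(["nvidia-smi"], capture_output=True)
    if result.returncode == 0:
        break
    time.sleep(5)
print(f"GPU became usable after {time.time() - start:.0f}s")
```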