Oh, that explains it! I thought somehow the driver installation broke despite having made an AMI of the same instance type with drivers already installed, so I just run the install shell script every time and then it works. I figured maybe some hardware address was wrong or it was subtly different hardware, but it being time-based makes a lot more sense.
Absolutely, same here. Additionally, there seems to be very little consistency in GPU instance start-up times: I've had 30 seconds one moment and 5 minutes another. Can't say I've experienced 10 minutes, luckily.
Yeah, less than a few seconds with spot consistently would be nice, but I've never seen it. When I was handling autoscaling via the EC2 API + Python + nginx, the daemon I wrote pretty much had to have a while loop to continuously check connectivity via SSH (roughly like the sketch below) after a t3a.medium (with Ubuntu) was kicked off from `ec2.request_spot_instances`.
Perhaps I should have been using "Clear Linux 34640"
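For anyone curious, the daemon boiled down to something like this. A rough boto3 sketch, not the original code: the AMI ID, key name, and the `wait_for_ssh` helper are placeholders of mine.

```python
import socket
import time

import boto3  # assumes the standard boto3 EC2 client

ec2 = boto3.client("ec2")


def wait_for_ssh(host, port=22, timeout=600, interval=5):
    """Poll until the instance accepts TCP connections on the SSH port."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


# Request a spot instance roughly as described above (t3a.medium, Ubuntu AMI).
req = ec2.request_spot_instances(
    InstanceCount=1,
    LaunchSpecification={
        "ImageId": "ami-0123456789abcdef0",  # placeholder Ubuntu AMI
        "InstanceType": "t3a.medium",
        "KeyName": "my-key",                 # placeholder key pair
    },
)
req_id = req["SpotInstanceRequests"][0]["SpotInstanceRequestId"]

# Wait until the request is fulfilled, then look up the instance's public IP
# (assumes the instance actually gets a public address).
ec2.get_waiter("spot_instance_request_fulfilled").wait(SpotInstanceRequestIds=[req_id])
instance_id = ec2.describe_spot_instance_requests(
    SpotInstanceRequestIds=[req_id]
)["SpotInstanceRequests"][0]["InstanceId"]
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
ip = ec2.describe_instances(InstanceIds=[instance_id])[
    "Reservations"][0]["Instances"][0]["PublicIpAddress"]

# Even after the instance is "running", SSH can take a while to come up.
print("ssh reachable:", wait_for_ssh(ip))
```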
Probably depends on your spot price... We always set the maximum spot price at the price of "on-demand" instances, but due to how priorities work it may still sometimes not work when the data center in a given region is out of capacity for a given instance type.
TBH the bigger issue than boot time for AWS is the lack of physical resources to fulfill the demand - this is what most big players are struggling with.
No no, the spot price matches, the instance boots, you can SSH to it and do everything except use the GPU. E.g. you try to run your PyTorch NN training, but it freezes for 5-10 minutes, then it runs fine. If you start your training again, it runs immediately.
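If you want to see it for yourself, something like this shows the gap between the first and second GPU touch. Just a quick sketch that measures the symptom, not the driver-side cause, and it assumes PyTorch with CUDA is installed:

```python
import time

import torch


def timed_cuda_touch():
    """Allocate a tensor on the GPU and wait for the work to finish."""
    start = time.time()
    torch.randn(1024, 1024, device="cuda")  # first call triggers CUDA/driver init
    torch.cuda.synchronize()
    return time.time() - start


print(f"first GPU touch:  {timed_cuda_touch():.1f}s")   # can take minutes right after boot
print(f"second GPU touch: {timed_cuda_touch():.1f}s")   # typically well under a second
```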
"launch GPU spot instance" to "I can actually use the GPU, at least by running nvidia-smi".
I find that this can take up to 10 minutes and for the more expensive instances, this can mean non-negligible amount of money.
Great article BTW
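For what it's worth, the "can I actually use the GPU" check is basically retrying nvidia-smi until it succeeds and timing how long that takes. A rough sketch; the 5-second retry interval is an arbitrary choice of mine:

```python
import subprocess
import time

# Run this right after launching the instance: time from "now" until nvidia-smi succeeds.
start = time.time()
while True:
    result = subprocess.run(["nvidia-smi"], capture_output=True)
    if result.returncode == 0:
        break
    time.sleep(5)
print(f"GPU became usable after {time.time() - start:.0f}s")
```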