
It's hilarious that people are bashing GCP for having one compute instance go down, when the author acknowledges it's a rare event. On AWS I've got instances getting force-stopped or even straight-up disappearing all the time. 99.95% durability vs 99.999% is way different.

If they had the same architecture on AWS it would go down all the time, IME. AWS primitives are way less reliable than GCP's, according to AWS's own docs and my own experience.




The article doesn't seem to mention AWS, really. I also feel like the primary issue is the lack of communication and support, even for a large corporate partner.

Seems like they're moving to bare-metal, which has an obvious benefit of being able to tell your on-call engineer to fix the issue or die trying.


But in this case the answer from AWS would have been that that's their SLA, and you just need to be ready to handle an instance getting messed up from time to time, because it's guaranteed to happen.


EC2 [0] and GCP Compute [1] have the exact same SLA of 99.99%; dipping below that gets you a 10% refund, and dipping below 95% gets you a 100% refund.

[0] https://aws.amazon.com/compute/sla/

[1] https://cloud.google.com/compute/sla
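
To put those percentages in perspective, here's a rough back-of-the-envelope sketch of how much downtime each threshold allows over a 30-day month (an assumption for simplicity; the linked SLA pages define the exact measurement windows):

    # Allowed downtime per 30-day month at each SLA/refund threshold
    MONTH_SECONDS = 30 * 24 * 3600

    def allowed_downtime_minutes(uptime_pct):
        return MONTH_SECONDS * (1 - uptime_pct / 100) / 60

    for pct in (99.99, 99.5, 95.0):
        print(f"{pct}% uptime -> up to {allowed_downtime_minutes(pct):.0f} min/month down")

    # 99.99% -> ~4 min/month, 99.5% -> ~216 min/month, 95% -> ~2160 min/month (36 hours)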


By the links you shared, the instance-level SLA on AWS is 99.5%, while GCP's instance-level SLA is 99.99%. That's not the same.

> For each individual Amazon EC2 instance (“Single EC2 Instance”), AWS will use commercially reasonable efforts to make the Single EC2 Instance available with an Instance-Level Uptime Percentage of at least 99.5%

The underlying storage isn't the same either, and that matters more. EBS is 99.95% durable. Even standard zonal PDs on GCP are >99.99%, balanced PDs are >99.999%, and SSD PDs are >99.9999%.

Even if it were 99.99% (it's not on AWS), what's the point of your instance being 99.99% available if the underlying disks might disappear? That's something I've seen happen multiple times on AWS, and never once on GCP.
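
For a sense of scale, here's a quick sketch of what those durability figures imply at fleet size, under the simplifying assumption that each figure is an annual per-volume survival probability and failures are independent (the 1,000-volume fleet is hypothetical):

    # Expected volume losses per year for a hypothetical 1,000-volume fleet
    FLEET = 1000

    for name, durability in [("EBS (99.95%)",             0.9995),
                             ("GCP zonal PD (99.99%)",     0.9999),
                             ("GCP balanced PD (99.999%)", 0.99999),
                             ("GCP SSD PD (99.9999%)",     0.999999)]:
        print(f"{name}: ~{FLEET * (1 - durability):.3f} volumes lost/year")

    # 99.95%   -> ~0.5 lost/year per 1,000 volumes
    # 99.9999% -> ~0.001 lost/year per 1,000 volumes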


This is very different from my experience. In my years with AWS I’ve only had an instance get stopped once, for some weird AWS background reason that had nothing to do with my application. I don’t think I’ve ever had, or even heard of, an instance just disappearing.


By "disappear" I mean the instance failed hard and couldn't be restarted. It's just gone. Usually related to the EBS volume dying.

But yeah, usually when they die they can just be relaunched. Still, they die way more often on AWS than on GCP, and sometimes they just end up staying stopped. Until very recently AWS couldn't even live-migrate instances when the underlying hardware needed maintenance; you had to stop and relaunch them on your own. FFS, most decent hypervisors have had live migration for decades, and yet I still get "this instance will stop on x day..." emails. I should never see that. The cloud provider should keep the instance running forever. There's no excuse.


I don’t know why you’re getting downvotes. What you’re saying sounds true to me, and I work in the core of EC2.

I am guessing you’re using newer instance types if their reliability is still questionable. Or you have a huge fleet of instances so you see a steady rate of failures every year.

Our failure rate on the commonly used instance types is fairly low. We have several types of failures, and in some bad failure cases live migration isn’t possible and your instance won’t even be restarted.

AWS already asks people to expect failures and plan around this with multi-AZ deployments.
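
For illustration, a minimal boto3 sketch of that multi-AZ advice: launch the same workload into subnets in different AZs. The AMI and subnet IDs are placeholders, and a real deployment would more likely use an Auto Scaling group spanning the AZs:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Placeholder IDs -- substitute your own AMI and per-AZ subnets.
    AMI_ID = "ami-0123456789abcdef0"
    SUBNETS_BY_AZ = {
        "us-east-1a": "subnet-0aaaaaaaaaaaaaaaa",
        "us-east-1b": "subnet-0bbbbbbbbbbbbbbbb",
    }

    for az, subnet_id in SUBNETS_BY_AZ.items():
        ec2.run_instances(
            ImageId=AMI_ID,
            InstanceType="m5.large",
            MinCount=1,
            MaxCount=1,
            SubnetId=subnet_id,  # the subnet pins the instance to its AZ
            TagSpecifications=[{
                "ResourceType": "instance",
                "Tags": [{"Key": "az", "Value": az}],
            }],
        )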

If you want stability, sign an NDA with AWS and ask for fleet-wide reliability metrics for various instance types. There’s a surprisingly huge variance.


Same. 12+ years of using AWS and there’s been one case of a server (RDS) going down due to something outside of our control.

Restoring a snapshot got us back up and running quickly. If we were multi-AZ, we probably wouldn’t have noticed.


In general in the cloud, as somebody said, you should architect assuming everything fails all the time.


So why not give EC2 instances a 50% SLA? Have them all force-quit at some random interval between 2 and 200 hours, guaranteed. Have EBS volumes just corrupt your data every week. Why bother with SLAs at all when the solution is to buy more redundant resources?

Or how about having actually reliable primitives?

I don't disagree: if you need extreme reliability, build your infra to handle multi-AZ, even multi-region outages. But sometimes I'd rather have an instance just stay online, and reasonably expect it not to corrupt itself, instead of having to pay for it three times over. Hypervisor and storage technology can make that happen; it's already true on other clouds and has been true in the data center for decades.

I can have an instance on GCP with its block storage having 99.9999% durability. I can't do the same with gp3 on AWS, where the volume has a durability of 99.95%, without dealing with the complexity of clustering and all its headaches and costs. Why is that an unreasonable ask?



