Nines are not enough: meaningful metrics for clouds (acolyer.org)
115 points by feross on June 19, 2019 | hide | past | favorite | 31 comments



> List the good outcomes you want, and the bad outcomes to be avoided

I feel like sometimes you don't know the bad outcomes until they happen. E.g., years ago, my team had a fairly big outage/issue caused by S3 reads taking multiples of 60s. (That is, our latency graph had very large spikes right over 60s, 120s, 180s¹.) These were reads typically in the 10 KiB to 100 KiB range, sometimes as large as single-digit megabytes; i.e., they should take milliseconds, maybe a second, not 3 minutes. It took a significant amount of back-and-forth² because it either only affected our bucket, or nobody else noticed. But it slowed our processing down so much that we built up an incredible backlog. (Processing a file had previously taken <1s, and was now taking over three minutes in some cases, a 200x slowdown!)

This is still not covered by the S3 SLA.

Also had a different cloud provider where the "Create VM" API returned 200 OK. The VM never finished booting. The SLA was over the 200 OK, not the actual completion of the task. Basically, exactly the example in the article, with a real world "we're paying for this?" provider.

¹I'm simplifying. It was weirdly actually 70s, 130s, and 190s. I can only presume that's a 10s timeout and n 60s timeouts, somewhere.

²Woe unto the person who doesn't come to support with request IDs. My impression the whole time is "these guys can't see their own response latency?"
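For what it's worth, a minimal sketch of keeping those request IDs on hand, assuming boto3-style responses (the threshold, bucket, and key names are hypothetical; S3 returns its request IDs in `ResponseMetadata`):

```python
import time

def slow_read_report(resp, elapsed, threshold=10.0):
    """Format a support-ready log line for reads slower than `threshold` seconds.

    `resp` is a botocore-style response dict (e.g. from boto3's s3.get_object);
    its ResponseMetadata carries the request IDs support will ask for.
    """
    if elapsed <= threshold:
        return None
    meta = resp["ResponseMetadata"]
    return (f"slow read: {elapsed:.1f}s "
            f"request-id={meta['RequestId']} "
            f"x-amz-id-2={meta['HTTPHeaders'].get('x-amz-id-2', '?')}")

def timed_get(s3, bucket, key):
    """Wrap s3.get_object, timing the full body read and logging outliers."""
    start = time.monotonic()
    resp = s3.get_object(Bucket=bucket, Key=key)
    resp["Body"].read()
    line = slow_read_report(resp, time.monotonic() - start)
    if line:
        print(line)
    return resp
```

Logging these alongside your own latency numbers means every outlier on your graph maps to an ID support can actually search for.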


I did service work at a big corp for years, and we would've loved to take a problem like yours. I'd always want the tickets that came in with some interesting, visible pattern and a client who'd already done some analysis. Much more fun than spending 6 hours telling a room full of panicky executives how important their problem is to me, before getting to the 10 minutes it actually takes to solve it.


> ²Woe unto the person who doesn't come to support with request IDs. My impression the whole time is "these guys can't see their own response latency?"

Response latency is often only sampled, so if you observe rare outliers, support might not have any visibility by default.


> Woe unto the person who doesn't come to support with request IDs. My impression the whole time is "these guys can't see their own response latency?"

They can probably see their own latencies at a service level, but if this was only affecting a few tenants, their internal tooling might have security-related restrictions on "dig into the request characteristics of this tenant".

Some providers deal with this by making CSRs log a reason for each access that you need to be prepared to justify when audited, others just lock it down completely outside of particular request patterns (e.g. "provide a request ID"). Pros and cons on both sides really.


> ²Woe unto the person who doesn't come to support with request IDs. My impression the whole time is "these guys can't see their own response latency?"

They may see that there's a spike of high-latency requests for the system, as a whole, on their chart, but need a customer's explicit permission & particular request IDs to run a "Select from request_log where ..." query.

Generally speaking, seeing a spike on a graph doesn't necessarily give you leave to start digging through random customers' queries.


I think it's hilarious that everyone is responding to

>²Woe unto the person who doesn't come to support with request IDs. My impression the whole time is "these guys can't see their own response latency?"

... so I'm going to do the same. :)

It's also possible that while that latency is weird for your particular usage, it's totally expected for someone else's. That can make it very hard to find your problematic requests amidst the noise of everyone else's average day at the bitbucket.


I get those timeouts from DynamoDB.

The solution, according to support, is to set your own AWS client config that kills connections once they exceed the expected timeout.

The 60s is defined in the default client config.
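A sketch of that config, assuming boto3/botocore (the default read timeout there is 60 seconds, which lines up with the spikes; the 5-second values below are illustrative, not recommendations):

```python
# Override botocore's defaults so a hung read fails fast and gets retried,
# instead of stalling for the default 60s.
from botocore.config import Config

cfg = Config(
    connect_timeout=5,            # seconds to establish the connection
    read_timeout=5,               # seconds to wait on a blocked read (default: 60)
    retries={"max_attempts": 3},  # retry rather than hang on one bad connection
)

# Then pass it when building the client, e.g.:
# dynamodb = boto3.client("dynamodb", config=cfg)
```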


I can second the sentiment in this article about SLAs/SLEs being very hard to define. For example, my company originally agreed to an SLA with a customer regarding API response time. The problem was that the SLA was a flat 500ms and didn't take into account the nature of the possible queries. It's possible to request up to 18 months' worth of data, which is never going to return in 500ms.

I had to spend 6 hours analyzing data to figure out what factors of a query actually impacted response time. It required me re-learning advanced spreadsheet skills to find correlations in log data. We're now in the process of rewriting the agreement because this analysis was not done at contract time.

This is a topic that isn't really written about much.
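The spreadsheet analysis described above can be sketched in a few lines of Python; the log numbers here are hypothetical, just to show the shape of checking whether requested data range correlates with response time:

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical log extract: months of data requested vs. response time (ms).
months  = [1, 3, 6, 9, 12, 18]
latency = [120, 260, 480, 700, 950, 1400]

print(f"correlation(months, latency) = {pearson(months, latency):.3f}")
```

A coefficient near 1.0 would support writing the SLA as a function of query range rather than a flat 500ms.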


This is so true, and I think it's partially what the article was referring to as Customer Behavior Expectations. I.e., "the SLA is 500ms _if you don't always request 18 months of data_" or whatever.

Another favorite of mine is when the SLIs are written in terms of average rate, ignoring spikes. E.g., "100ms latency at 100 requests/sec". So the customer slams you with 6000 requests as fast as possible, sleeps 60 seconds, and repeats. That averages out to 100 TPS, so how come you're not hitting your latency SLA?!


Not for your particular case of a long reply, but response-time SLOs are usually measured in percentiles ("90% of responses within 500ms").
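A minimal sketch of checking such a percentile SLO against a latency sample (the 500ms threshold and 90% target are just the example's numbers):

```python
def meets_slo(latencies_ms, threshold_ms=500, target=0.90):
    """True if at least `target` fraction of responses came in under the threshold."""
    within = sum(1 for l in latencies_ms if l <= threshold_ms)
    return within / len(latencies_ms) >= target

# 9 of 10 responses are fast: the one slow outlier doesn't break the SLO.
print(meets_slo([100] * 9 + [2000]))   # one outlier allowed at 90%
print(meets_slo([100] * 8 + [2000] * 2))
```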


You should return the data in multiple pages, for example, or exempt that query from the latency SLA.


Altitude, extent, velocity, humidity, temperature, pressure, droplet size, electrical charge, acidity, pictorial similarity and fluffiness.


It only now dawned on me why Digital Ocean's VMs[1] are called droplets...

[1] Not 100% sure that's what they are.


They are VMs


No one has achieved two nines yet.

Embarrassing recent downtimes for Google and AWS.


I've been trying to convince my boss that 'nine fives' is a perfectly acceptable target.

Seriously, some systems only need to be up when people are actually using them. It doesn't matter if they don't work out of hours or over weekends.


> Seriously, some systems only need to be up when people are actually using them. It doesn't matter if they don't work out of hours or over weekends.

OTOH "out of hours" or "over weekends" is a very good time to make batch processes happen, so for some business services it might be better to not be up during hours, but reliably be up outside of them.

Another issue with that is when the service is used internationally / globally, or even just by 24/7 businesses, and then even "over the weekend" isn't necessarily a thing.


> An other issue with that is when the service is used internationally / globally

This also often happens prematurely. If the team in the US needs live data entered in Asia, then there's not much you can do; the system has to be global. But an often-overlooked option (today) is running multiple instances of the same software, with each only needing nine fives in its region. Even if you do need live data, it might be better to have another process shuffling it between international instances. It also helps with latency.


> I've been trying to convince my boss that 'nine fives' is a perfectly acceptable target.

Unless you can force people to use your crap or your competition is barely working, it's not really acceptable. A typical customer has pretty reliable internet connection during normal operations and only rare long outages. Meaning that any unavailability and unresponsiveness that lasts more than a few seconds will be pretty visible and annoying to customers. And assuming your competition can do three-four nines, you'd need like four-five nines to do better, which is actually five-six nines as a target.
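The arithmetic behind "how many nines" is just allowed downtime per year; a quick sketch:

```python
HOURS_PER_YEAR = 24 * 365.25

def downtime_hours_per_year(availability):
    """Allowed downtime per year for a given availability fraction."""
    return (1 - availability) * HOURS_PER_YEAR

for a in (0.999, 0.9999, 0.99999):
    print(f"{a} -> {downtime_hours_per_year(a):.2f} h/yr")
```

Three nines allows roughly 8.8 hours of downtime a year, four nines about 53 minutes, five nines about 5 minutes; the gap between each step is where the real cost lives.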


I think the parent was talking about being reliably on a specific ~50% of the time (and reliably off the rest of the time), not about outages a significant portion of the time when the customer would expect to use it. I'm not sure there are actually that many cases where that's great, but it's a different thing than what you're criticizing.

Edited to add: Wonderfully, when trying to post this I got "Sorry, we're not able to serve your requests this quickly."


Tell them to switch to OpenVMS. It was achieving in the 1980s-1990s what Linux-based clouds still aren't. Example listing lots of its availability-boosting technologies:

https://www.stanq.com/OpenVMS%20and%20High%20Availability.pd...

They can get an updated version from this company:

https://www.vmssoftware.com/

Even better, the hardware it runs on might be cheap on eBay. :)


My thought is that cloud providers could provide five 9s of service today, but not for $5/month.

For the most part companies have decided that high uptime guarantees aren't worth the cost. Or they start rolling their own systems to deal with it.


There's a comment along those lines in the Google SRE book, IIRC; the author was asking why it should matter to have a service with five nines if the user's environment (flaky WiFi, laptop rebooting, etc.) has far worse availability.


Unexpected downtime is, unfortunately, not planned to be 'out-of-hours' or over weekends. It's made more complicated by the fact that if the service is global, it's almost never entirely 'out-of-hours' for everyone.


Nine nines is great but hard. I always aim for eight eights.


In a year that’s forty days and forty nights…a nice figure if you’re into biblical allusions


I hate this assumption. 3am on a Sunday is often when I can finally get to things.


Or five nines in hex.


They do achieve 3 nines (>=0.999, <0.9999) availability for pretty much every service. 4 nines - not so much.


The only real SLA is the ability to vote with your feet and go to another provider that satisfies your business needs, or conversely to increase your investment with a particular cloud if you are satisfied with their performance. Providers know this, and they know that even if they wanted to, they can't capture all the possible values that are important to your particular business in numerical metrics. They also know that what is important to your business might not be important to another one, and therefore that it is hopeless to try to capture the entire gamut in a single set of metrics that are universally applicable. Any attempt would be futile, so it's best to stick to simpler metrics and let businesses and customers decide for themselves whether a given cloud meets their particular needs. SLA violation penalties hurt a little, but the real pain comes when you lose business.


Over the years I've found that a lot of the confusion surrounding computers can be cut through by seeing them as a kind of factory. An elegant system or piece of software looks like an elegant factory, while a clunky or over-designed system or program is like a Rube Goldberg machine.

What makes good sense in leasing unused factory capacity would make good sense in leasing cloud resources, eh?



