No, the cloud provides scalable infrastructure, but once you are in the 0.01% and have genuinely unusual usage patterns, you still need to know how to set up your infrastructure for your needs. The difference is that instead of writing and managing a scalable cache yourself, you build the layer that knows to pre-provision for that scale, and you talk with AWS to make sure the system has sufficient capacity.
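To make that concrete, here's roughly what such a pre-provisioning layer can look like - a minimal sketch with boto3, where the Auto Scaling group name and capacity figure are made-up placeholders:

```python
# Minimal sketch of a pre-provisioning layer: raise an Auto Scaling group's
# floor ahead of a predictable spike instead of waiting for reactive scaling.
# Assumes boto3 credentials are configured and the group's MaxSize is already
# at least expected_peak; the group name and number below are hypothetical.
import boto3

autoscaling = boto3.client("autoscaling")

def pre_provision(group_name: str, expected_peak: int) -> None:
    """Hold the fleet at its expected peak before the traffic arrives,
    e.g. the first workday after a holiday."""
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        MinSize=expected_peak,          # stop scale-in from undercutting the floor
        DesiredCapacity=expected_peak,  # start the instances now, not on demand
    )

# Hypothetical usage ahead of the post-holiday return to work.
pre_provision("web-fleet-asg", expected_peak=400)
```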
The cloud isn't some magic thing that solves all scaling problems; it's a tool that gives you strong primitives (and once you're a large enough customer, an active partner) to help you solve your scaling problems.
Slack knew how to set up their infrastructure. Nothing in the postmortem implies AWS was misconfigured. AWS spotted the problem and fixed it entirely on their side.
Nothing in this report suggests that Slack has unique usage patterns. Users returning to work after Christmas is not a phenomenon unique to Slack.
Their problems were:
1. The AWS infrastructure broke due to an event as predictable as the start of the year. That's on Amazon.
2. Their infrastructure is too complicated. Their auto-scaling, driven by bad heuristics, created chaos by shutting down machines whilst engineers were logged into them (a sketch of the primitive AWS offers for exactly this follows the list), and it's not as if this was even a good way to save money. Their separation of Slack into many different AWS accounts also created weird bottlenecks they had no way to understand or fix.
3. They were unable to diagnose the root cause and the outage ended when AWS noticed the problem and fixed their gateway system themselves.
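On point 2: AWS does have a primitive aimed at that failure mode - instance scale-in protection, which stops the Auto Scaling group from terminating specific instances when it shrinks the fleet. A minimal sketch, with the group and instance IDs as hypothetical placeholders (this shows the primitive, not anything Slack actually had in place):

```python
# Sketch: toggle scale-in protection on specific instances, e.g. while an
# engineer has a live session on them. The ASG will skip protected instances
# when scaling in. Group and instance IDs below are hypothetical.
import boto3

autoscaling = boto3.client("autoscaling")

def protect_from_scale_in(group_name: str, instance_ids: list[str], protect: bool = True) -> None:
    """Mark instances as protected (or unprotected) from Auto Scaling scale-in."""
    autoscaling.set_instance_protection(
        InstanceIds=instance_ids,
        AutoScalingGroupName=group_name,
        ProtectedFromScaleIn=protect,
    )

# Hypothetical usage: protect the box someone is SSH'd into.
protect_from_scale_in("provisioning-asg", ["i-0123456789abcdef0"])
```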
> The cloud isn't some magic thing that solves all scaling problems
In this case it actually created scaling problems where none needed to exist. AWS is expensive compared to dedicated machines in a colo. Part of the justification for that high cost is seamless scalability and the ability to 'flex'.
But Slack doesn't need the ability to flex here. Scaling down over the holidays and back up once people return to work just isn't that important for them - it's unlikely there was a large queue of jobs waiting to run on that spare hardware for a few days anyway. And it wasn't even a good way to save money: a massive outage certainly cost them far more than they'll ever save.
> The cloud isn't some magic thing that solves all scaling problems, it's a tool that gives you strong primitives (and once you're a large enough customer, an active partner) to help you solve your scaling problems.
I don't think anyone with any reasonable level of experience expects it to be a magic wand.
There are, though, some things in AWS (and surely in other cloud providers) where you get no useful signals or controls. They're entirely managed by the cloud provider, based on its own internal metrics and scaling behaviour.
Their load balancer services, for example, give you no indication of how heavily loaded they are behind the scenes - nor do you get to directly control how many load balancers there are or how big they are.
In some places you can hack around this by pre-warming the infrastructure with fake traffic - but that assumes you have the metrics, and the knowledge that you even need to do this.
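For what it's worth, the "fake traffic" hack usually amounts to nothing more than a sustained concurrent request loop against your own endpoint. A minimal sketch, with the URL, concurrency and duration all made up:

```python
# Minimal sketch of pre-warming via synthetic traffic: sustain concurrent
# requests against your own endpoint so managed load balancers and autoscaled
# backends grow before the real spike. All constants below are hypothetical.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TARGET_URL = "https://example.com/healthz"  # hypothetical warm-up endpoint
WORKERS = 50                                # concurrent fake "users"
DURATION = 15 * 60                          # seconds to sustain the load

def hammer(deadline: float) -> int:
    """Request the endpoint in a loop until the deadline; return the count."""
    sent = 0
    while time.time() < deadline:
        try:
            urllib.request.urlopen(TARGET_URL, timeout=5).read()
        except Exception:
            pass  # errors are expected while capacity catches up
        sent += 1
    return sent

deadline = time.time() + DURATION
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    total = sum(pool.map(hammer, [deadline] * WORKERS))
print(f"sent {total} warm-up requests across {WORKERS} workers")
```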
This applies to all sorts of things - there are hidden caps and other capacity limits all over AWS's platform that you don't know about until you hit them. There are even capacity limits you can know about, because they're publicly documented, but AWS lies and won't tell you the actual limit being applied to your account - the console and the documentation say one thing, but in reality the enforced limit is a lot lower.
If that capacity limit resulted in an outage, well, tough luck.
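You can at least pull the documented side of those limits programmatically through the Service Quotas API - though, per the above, the value it reports is the published number, not necessarily what's actually enforced on your account. A minimal sketch, using EC2 as an example service code:

```python
# Sketch: read the *documented* quotas for a service via the Service Quotas
# API. The values returned are the published limits, which (per the comment
# above) may not match what is actually enforced on your account.
import boto3

quotas = boto3.client("service-quotas")

def print_documented_quotas(service_code: str) -> None:
    """Print every documented quota (code, name, value) for a service."""
    paginator = quotas.get_paginator("list_service_quotas")
    for page in paginator.paginate(ServiceCode=service_code):
        for q in page["Quotas"]:
            print(f'{q["QuotaCode"]}  {q["QuotaName"]}: {q["Value"]}')

# Example: EC2 quotas as AWS documents them for this account/region.
print_documented_quotas("ec2")
```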