Slack’s Outage on January 4th 2021

dplgk · on Feb 1, 2021

> On January 4th, one of our Transit Gateways became overloaded. The TGWs are managed by AWS and are intended to scale transparently to us. However, Slack’s annual traffic pattern is a little unusual: Traffic is lower over the holidays, as everyone disconnects from work (good job on the work-life balance, Slack users!). On the first Monday back, client caches are cold and clients pull down more data than usual on their first connection to Slack. We go from our quietest time of the whole year to one of our biggest days quite literally overnight.

What's interesting is that when this happened, some HN comments suggested it was the return from holiday traffic that caused it. Others said, "nah, don't you think they know how to handle that by now?"

Turns out occam's razor applied here. The simplest answer was the correct one. Return-from-holiday traffic.

cett · on Feb 1, 2021

Though the nuance is Slack did know how to handle it, AWS didn't.

fipar · on Feb 1, 2021

I don't mean this ironically, but I think Slack did not actually know how to handle it: they outsourced the handling of this; they passed the buck.

This usually works well, under the rationale that "upstream provider does this for a living, so they must be better than us at this", but if you have too unique needs (or are just a bit "unlucky"), it can fail too.

All this to say that the cloud isn't magic. From a risk/error prevention point of view, it's not that different from writing software for a single local machine: not every programmer needs to know how to manually do memory management, it makes a lot more sense to rely on your OS and malloc (and friends) for this, but the caveat is that you do need to account for the fact that malloc may fail. In the cloud case, one can't just assume that you'll always be able to provision a new instance, scale up a service, etc. The cloud is like a utility company: normally very reliable, but they do fail too.

tw04 · on Feb 1, 2021

>I don't mean this ironically, but I think Slack did not actually know how to handle it: they outsourced the handling of this; they passed the buck.

Isn't that literally supposed to be the sales pitch for the cloud? Get away from the infrastructure as a whole so you can focus on code, and let the cloud providers wave their magic wand to enable scaling?

If you're saying now the story is: well rely on them to auto scale, until they don't - then why would I bother? Now you're telling me I need to go back to having infrastructure experts, which means I can save TON of money by going with a hosting provider that allows allocation of resources via API (which is basically all of them).

solidasparagus · on Feb 1, 2021

No, the cloud provides scalable infrastructure, but once you are in the 0.01% and you have very unique usage patterns, you still need to know how to set up your infrastructure for your needs. The difference is that instead of writing and managing a scalable cache, you just need to build the layer that knows to pre-provision for that scale/talk with AWS to make sure the system has sufficient capacity.

The cloud isn't some magic thing that solves all scaling problems, it's a tool that gives you strong primitives (and once you're a large enough customer, an active partner) to help you solve your scaling problems.

thu2111 · on Feb 1, 2021

This feels like AWS apologism.

Slack knew how to set up their infrastructure. Nothing in the postmortem implies AWS was misconfigured. AWS spotted the problem and fixed it entirely on their side.

Nothing in this report suggests that Slack has unique usage patterns. Users returning to work after Christmas is not a phenomenon unique to Slack.

Their problems were:

1. The AWS infrastructure broke due to an event as predictable as the start of the year. That's on Amazon.

2. Their infrastructure is too complicated. Their auto-scaling created chaos by shutting down machines whilst engineers were logged into them due to bad heuristics, although it's not like this was a good way to save money, and their separation of Slack into many different AWS accounts created weird bottlenecks they had no way to understand or fix.

3. They were unable to diagnose the root cause and the outage ended when AWS noticed the problem and fixed their gateway system themselves.

The cloud isn't some magic thing that solves all scaling problems

In this case it actually created scaling problems where none needed to exist. AWS is expensive compared to dedicated machines in a colo. Part of the justification for that high cost is seamless scalability and ability to 'flex'.

But Slack doesn't need the ability to flex here. Scaling down over the holidays and then back up once people returned to work just isn't that important for them - it's unlikely there were a large number of jobs queued up waiting to run on their spare hardware for a few days anyway. It just wasn't a good way to save money: a massive outage certainly cost them far more than they'll ever save.

gowld · on Feb 2, 2021

It wasn't scaling "back up". It was a huge spike as evertone refilled cache at the same time.

It's similar to Black Friday spikes Amazon handles themselves.

paranoidrobot · on Feb 2, 2021

> The cloud isn't some magic thing that solves all scaling problems, it's a tool that gives you strong primitives (and once you're a large enough customer, an active partner) to help you solve your scaling problems.

I don't think anyone who's got any reasonable level of experience is expecting that it's a magic wand.

There are, though some things in AWS (and for sure other cloud providers) where you get no useful signals or controls. It's entirely managed by the cloud provider, based on their own internal metrics and scaling behaviors.

Behind the scenes, their load balancer services don't give you indications of how heavily loaded they are - nor do you get to directly control how many/big those load balancers are.

In some parts you can hack around this by pre-warming infrastructure by generating fake traffic - but that assumes that you have those metrics and knowledge that you even need to do this.

This applies to all sorts of things - there's hidden caps and other capacity limits all over AWSs platform that you don't know about until you hit them. There's even capacity limits that you can know about, because they're publicly documented, but AWS lies and won't tell you the actual limit being applied to your account - the console and documentation says one thing, but in reality it's a lot lower.

If that capacity limit resulted in an outage, well, tough luck.

_skel · on Feb 1, 2021

If you are serious about reliability you always need infrastructure experts.

AWS is pretty good about documenting the limits of their systems, SLAs, how to configure them, etc. They don't just say you should wave a magic wand -- and even if they did say that, professional software engineers know better.

"a hosting provider that allows allocation of resources via API" is exactly what AWS is. Your infrastructure experts come into the picture because they need to know which resources to request, how to estimate the scale they need, and how to configure them properly. They should also be doing performance testing to see if the claimed performance really holds up.

remus · on Feb 1, 2021

> Isn't that literally supposed to be the sales pitch for the cloud? Get away from the infrastructure as a whole so you can focus on code, and let the cloud providers wave their magic wand to enable scaling?

Clearly there are limits even with the largest cloud providers. You'll have to engage a bit of critical thought in to whether you're going to get near those limits and what that might mean for your product. Obviously that's easier said than done, but you could argue that the cloud providers are still giving you reasonable value if you can pass the buck on a given issue for x years.

JMTQp8lwXL · on Feb 1, 2021

You have to know how to write code that fits into the cloud. You can't arbitrarily read/write to the file system, acting as if there's only one instance of the server running (if you plan to run hundreds or thousands). So even by waving the cloud 'magic wand', you still need to understand writing code in a cloud-friendly way. So in some sense, it's a shared responsibility between the vendor and engineering. You need to understand how to apply the tools being given to you.

tw04 · on Feb 1, 2021

Per the article, literally nothing in their code would have solved the issue. AWS was supposed to auto-scale TGWs and didn't.

>Our own serving systems scale quickly to meet these kinds of peaks in demand (and have always done so successfully after the holidays in previous years). However, our TGWs did not scale fast enough. During the incident, AWS engineers were alerted to our packet drops by their own internal monitoring, and increased our TGW capacity manually. By 10:40am PST that change had rolled out across all Availability Zones and our network returned to normal, as did our error rates and latency.

JMTQp8lwXL · on Feb 1, 2021

Correct, I was disputing the point that you can freely code without being mindful of the architecture even though the selling point of cloud providers is "focus on code, leave architecture to us". I'm not disputing in this case AWS was at fault: as the customer, Slack did everything right.

ncallaway · on Feb 1, 2021

> Isn't that literally supposed to be the sales pitch for the cloud?

Yes.

But a sales pitch is the most positive framing of the product possible. I wouldn't rely on the sales pitch when making the decision about how much you should depend on the cloud.

SilasX · on Feb 1, 2021

>This usually works well, under the rationale that "upstream provider does this for a living, so they must be better than us at this", but if you have too unique needs (or are just a bit "unlucky"), it can fail too.

Heh, a while ago I joked that one way to scale is to "make it somebody else's problem", with the proviso that you need to make sure that the someone else can handle the load. And then (due to the context) a commenter balked at the idea that a big player like YouTube would be unable handle the scaling of their core business.

https://news.ycombinator.com/item?id=23170685

(If they're really blaming it on AWS, it really takes guts to do it so publicly, I think.)

Johnny555 · on Feb 1, 2021

I think Slack did not actually know how to handle it: they outsourced the handling of this; they passed the buck

The issue was a transit gateway, a core network component. If they weren't in the cloud, this would have been a router, so they "outsourced" it in the same way an on-prem service outsources routing to Cisco. I guess the difference is they might have had better visibility into the Cisco router and known it was overloaded.

ak217 · on Feb 1, 2021

I don't think that's true. Slack seems to have their core online services split across a number of VPCs, and for some reason decided to use Transit Gateway to connect them. Transit Gateway is a special-purpose solution that is geared toward cross-region and on-prem to VPC connections in corporate networks, not to global high-traffic consumer products. It's the wrong tool for the job. Its architecture is antithetical to the other horizontally scalable AWS solutions. It introduces a single (up to) 50 gbps network hub that all inter-service traffic must go through. Native AWS architectures avoid such central hubs and provide a virtual routing fabric instead.

Slack could have chosen one of many other AWS design patterns such as VPC peering, transit VPC, IGW routing, or colocating more services in fewer VPCs (with more granular IAM role policies to separate operator privileges), to provide an automatically scaled network fabric to connect their services.

(This isn't to criticize Slack's engineering team. They have successfully scaled their service in a short time, and I'm happy with their product overall, and with their transparency in this report. But I think AWS has the world's biggest and most scalable network fabric - it's just a matter of knowing how to harness it.)

floatingatoll · on Feb 1, 2021

If your oldest request was queued 5+ seconds ago in a near-realtime system (such as Slack), CPU usage isn't your biggest problem.

Slack wrote an autoscaling implementation that ignored request queue depth and downsized their cluster based on CPU usage alone, so while they knew how to resolve it, I would not go so far as to say they knew how to prevent it. The mistake of ignoring the maxage of the request queue is perhaps the second most common blind spot in every Ops team I've ever worked with. No insult to my fellow Ops folks, but we've got to stop overlooking this.

nicoburns · on Feb 1, 2021

> The mistake of ignoring the maxage of the request queue is perhaps the second most common blind spot in every Ops team I've ever worked with. No insult to my fellow Ops folks, but we've got to stop overlooking this.

What's the first?

floatingatoll · on Feb 1, 2021

Non-randomized wallclock integers.

For example: “sleep 60 seconds”, “cron 0 * * * * command”, “X-Retry-After: 300”

Found in: recurring jobs, backoff algorithms, oauth tokens.

Found in: ops-created tasks, dev-released software.

encoderer · on Feb 1, 2021

I'm building something at Cronitor to help detect those hot-spots! If you want to learn more, email me: shane at cronitor.io

floatingatoll · on Feb 2, 2021

Tell us more here!

tempest_ · on Feb 1, 2021

Well Slack depended on the Cloud(tm).

It is a interesting though because a lot of the blog posts like "How we handled a 3000% traffic increase overnight!" boil down to "We turned up the AWS knob".

What happens when the AWS knob doesn't work?

buildawesome · on Feb 1, 2021

You do what Slack did and call the maker of the AWS Knob:tm:.

rrrrrrrrrrrryan · on Feb 1, 2021

Some of my co-workers came from active.com (a website that lets people register for marathons and events). The infrastructure had to handle massive spikes because registrations for big races would open all at once, so scalability was everything.

They explained to me that they'd intentionally slam the production website with external traffic a couple of times per year, at a scheduled time in the middle of the night. Like basically an order of magnitude greater than they'd every received in real life, just to try to find the breaking point. The production website would usually go down for a bit, but this was vastly better than the website going down when actual real users are trying to sign up for the Boston Marathon.

Slack probably should've anticipated this surge in traffic after the holidays, and if might have been able to run some better simulations and fire drills before it occurred.

bryan_w · on Feb 1, 2021

The problem you run into is that while you can load test your website with no problems, when running on shared infrastructure (AWS), you have to account for everyone's website being under load at the same time. That isn't as easy to test or find bottlenecks for.

t0mas88 · on Feb 1, 2021

Very good test. The guys at iracing.com should have done this before organising the e-sports Daytona 24 hours race last week, it was by far their largest event (boosted by Covid lockdown). It crashed their central scheduling service with a database deadlock. Classic case of a bug you only find under heavy load.

cle · on Feb 1, 2021

> During the incident, AWS engineers were alerted to our packet drops by their own internal monitoring, and increased our TGW capacity manually. By 10:40am PST that change had rolled out across all Availability Zones and our network returned to normal, as did our error rates and latency.

Sounds like AWS knew how to handle it too.

Given how AWS has responded to past events like this, I'd bet there's an internal post-mortem and they'll add mechanisms to fix this scaling bottleneck for everyone.

Although one thing I'm not clear on is if this was really an AWS issue or if Slack hit one of the documented limits of Transit Gateway (such as bandwidth), after which AWS started dropping packets. If that's the case then I don't see what AWS could have done here, other than perhaps have ways to monitor those limits, if they don't already. The details here are a bit fuzzy in the post.

oxfordmale · on Feb 1, 2021

If you hit yourself on the thumb while using a hammer, do you blame the hammer manufacturer or yourself? TGW limits are well documented.

0dmethz · on Feb 1, 2021

I mean, when the hammer manufacturer sells managed, auto scaling thumb-avoiding services you might rely on that.

If I understand correctly they didn't initially hit a TGW quota, it just didn't scale up fast enough.

sokoloff · on Feb 1, 2021

"Hey Boss, this system that our team selected and configured, behaved as documented but not in a way that protected our customers' experience.

It's Amazon's fault, not ours..."

If someone came to me with that, I'd educate them on how I saw it quite differently, politely but firmly.

0dmethz · on Feb 1, 2021

Unless I'm misunderstanding something the system did not perform as documented. It should have scaled, it didn't.

When a critical piece of infrastructure fails under massive load I'm not sure it it'll help much when you politely tell your engineers they fucked up for not anticipating it.

You learn lessons. Both Slack and AWS seem to have learnt lessons here.

sokoloff · on Feb 1, 2021

I agree with much of what you say, but if you change it to "It's Amazon's fault, not ours", that's where I diverge.

Slack did fuck up here, as evidenced by the outage and you seem to at least partially agree by the fact that Slack learned a lesson. Further, I think that "understanding how your system scales up from a low baseline to a high level of utilization (such as Black Friday/Cyber Monday for e-commerce, or special event launches, or a SuperBowl ad landing page)" is a standard, "par for the course" cloud engineering topic to be on top of nowadays.

jabart · on Feb 1, 2021

If you have a known increase in traffic at a certain date/time that auto-scaling (either EC2 or NLB/ALB or another service) can't handle fast enough you can let AWS know through your support contract to over-provision during that time or the scale up will take too long.

curun1r · on Feb 1, 2021

I remember going to a presentation by someone from FanDuel where he discussed something similar. Their usage patterns (heavy spike on NFL Sundays) caused similar problems with infrastructure that expected more gradual build-up. They engineered for it with synthetic traffic in advance of their expected spike to ensure their infrastructure was warm.

TL;DR it’s still your responsibility to understand the limitations of your infrastructure decisions and engineer your systems accordingly.

kortilla · on Feb 1, 2021

No, using AWS does not absolve you of that responsibility. The game of paying a vendor to be an engineer is that you have to have strategies to test this kind of stuff.

Slack didn’t know how to handle it, they paid AWS hoping the product did what it said on the tin. They didn’t test for this case and got bit.

They have millions of clients they could have coordinated to load test this stuff by picking some time to disable the cache and fallback to cache if it failed.

ignoramous · on Feb 1, 2021

True, but with any managed service, hidden limits and a cloud provider's own engineering (or lack thereof) may come back to bite the top 0.1% (the whales).

One approach to solve problems of scale is to trim down scale and bound it across multiple disparate silos that do not absolutely interact with each other at all, under any circumstances, except for making quick, constant-time, scale-independent decisions, may be.

In short, do things that don't need scale.

m463 · on Feb 1, 2021

AWS might have had back-to-work traffic in lots of domains simultaneously.

Or maybe their monitoring and response staff was just coming back online.

_skel · on Feb 1, 2021

Lots of other services that use AWS didn't go down the same day -- because they provisioned enough AWS capacity.

helper · on Feb 1, 2021

Where's the button to provision more TGW capacity?

coldcode · on Feb 1, 2021

I actually thought something in AWS was a cause but did not know anything about how TGW works internally.

coldcode · on Feb 1, 2021

I actually thought something in AWS was a cause but did not know about these internal systems.

polote · on Feb 1, 2021

But what HN predicted was wrong https://news.ycombinator.com/item?id=25632346

"My bet is that this incident is caused by a big release after a post-holiday "code freeze". "

hnlmorg · on Feb 1, 2021

HN comments suggested a broad plethora of things. I’m not surprised some happened on the right cause.

floatingatoll · on Feb 1, 2021

That's very kind of you to remember :)

thunderbong · on Feb 1, 2021

But Slack has been around for longer than a year, right? Shouldn't they have noticed this happening earlier?

I mean, considering Slack is mostly used as a workplace chat mechanism, they should have faced this kind of a scenario previously and had a solution for this by now.

delfaras · on Feb 1, 2021

Yeah but this year the number of people working from home that would connect to slack directly at the beginning of their work day must be much much larger than the other years

steve_adams_86 · on Feb 1, 2021

Another small but potentially relevant detail: Not many people vacationed, so more people would have returned to work at standard times. Many people travel over holidays for example (usually) but this time around in many places it wasn't even an option. Other people extend their holidays to relax more, but I don't know any people interested in staycations in their house. We've had enough of it.

gowld · on Feb 2, 2021

But also smaller because much fewer people went on a long traveling vacation away from laptop.

kparaju · on Feb 1, 2021

Some lessons I took from this retro:

- Disable autoscaling if appropriate during outage. For example if the web server is degraded, it's probably best to make sure that the backends don't autoscale down.

- Panic mode in Envoy is amazing!

- Ability to quickly scale your services is important, but that metric should also take into account how quickly the underlying infrastructure can scale. Your pods could spin up in 15 seconds but k8s nodes will not!

jrockway · on Feb 1, 2021

The thing that always worries me about cloud systems are the hidden dependencies in your cloud provider that work until they don't. They typically don't output logs and metrics, so you have no choice to pray that someone looks at your support ticket and clicks their internal system's "fix it for this customer" button.

I'll also say that I'm interested in ubiquitous mTLS so that you don't have to isolate teams with VPCs and opaque proxies. I don't think we have widely-available technology around yet that eliminates the need for what Slack seems to have here, but trusting the network has always seemed like a bad idea to me, and this shows how a workaround can go wrong. (Of course, to avoid issues like the confused deputy problem, which Slack suffered from, you need some service to issue certs to applications as they scale up that will be accepted by services that it is allowed to talk to and rejected by all other services. In that case, this postmortem would have said "we scaled up our web frontends, but the service that issues them certificates to talk to the backend exploded in a big ball of fire, so we were down." Ya just can't win ;)

bengale · on Feb 1, 2021

Experienced something similar with mongo atlas today. Our primary node went down and the cluster didn’t failover to either of the secondaries. We got to sit with our production environment completely offline while staring at two completely functional nodes that we had no ability to use. Even when we managed to get hold of support they also seemed unable to trigger a failover and basically told us to wait for the primary node to come back up. It took 90 minutes in the end and has definitely made us rethink about the future and the control we’ve given over.

gen220 · on Feb 1, 2021

Some customers have a hard requirement that their slack instances be behind a unique VPC. Other customers are easier to sell to if you sprinkle some “you’ll get your own closed network” on top of the offer, if security is something they’ve been burned by in the past.

I agree with you the mTLS is the future. It exists within many companies internally (as a VPC alternative!) and works great. There’s some problems around the certificate issuer being a central point of failure, but these are known problems with well-understood solutions.

I think there’s mostly a non-technical barrier to be overcome here, where the non-technical executives need to understand that closed network != better security. mTLS’s time in the sun will only come when the aforementioned sales pitch is less effective (or even counterproductive!) for Enterprise Inc., I think.

ViViDboarder · on Feb 1, 2021

mTLS appears to work great between servers, but I’ve been unable to get my iPhone to authenticate with a web server via Safari using mTLS. Even after installing the cert, it never presents it.

I wish it were better supported though.

danw1979 · on Feb 1, 2021

So many fails due to in-band control and monitoring are laid bare, followed by this absolute chestnut -

> We’ve also set ourselves a reminder (a Slack reminder, of course) to request a preemptive upscaling of our TGWs at the end of the next holiday season.

Thaxll · on Feb 1, 2021

Surprised that there is just a few metrics available for TGW: https://docs.aws.amazon.com/vpc/latest/tgw/transit-gateway-c...

Probably the only way to see a problem is if you have a flat line for bandwidth, but as the article suggested they had packet drop wich does not appear on the cloudwatch metrics, aws should add those metrics imo

miyuru · on Feb 1, 2021

some TGW limits are documented here: https://aws.amazon.com/transit-gateway/faqs/#Performance_and...

keyle · on Feb 1, 2021

Didn't we just read a story about the exact same issue?

Traffic picked up heavily on some website or app, AWS didn't auto-scale fast enough or at all and the very systems that are designed to be elastic just tumbled down to a grinding halt?

keyle · on Feb 2, 2021

Update: it was Advent Of Code 2020, where he reported the exact same issue. The AWS auto scaling framework rocked a pooper when the site exploded on release day.

jeffbee · on Feb 1, 2021

Why don't they mention what seems like a clear lesson: control traffic has to be prioritized using IP DSCP bits, or else your control systems can't recover from widespread frame drop events. Does AWS TGW not support DSCP?

bovermyer · on Feb 1, 2021

I really enjoy reading these write-ups, even if the causal incident is not something I enjoy.

ignoramous · on Feb 1, 2021

Not at all fun when you're actively involved in mitigating these, though. Pretty rough and sometimes scars you for life.

cle · on Feb 1, 2021

I work on critical services, similar to this. While nobody likes the overall business impact of events like this, I love being in the trenches when things like this happen. I enjoy the pressure of having to use anything I can to mitigate as quickly as possible.

I used to work in professional kitchens before software, and it feels a lot like the pressure of a really busy night as a line cook. Some people love it.

bovermyer · on Feb 1, 2021

I've been in the middle of events like these and had to write my share of postmortem documents.

Hearing about others' similar experiences makes me feel a connection to them, and often teaches me something.

fullstop · on Feb 1, 2021

I was kind of surprised to see that they are using Apache's threaded workers and not nginx.

dijit · on Feb 1, 2021

Apache mod_php is still much faster than php-fpm, and since slack uses a lot of PHP on the backend it makes a lot of sense for them.

saurik · on Feb 1, 2021

I loved this article where someone started with the goal of writing an article about how much faster nginx was, but then discovered the opposite (and for some reason didn't change the title of the article, which is hilarious)... both because it showed the author "cared", but also because it showed that people just assume what amounts to marketing myths (such as that Apache and mod_php are ancient tech vs. the more modern php-fpm stack) before bothering to verify anything.

https://www.eschrade.com/page/why-is-fastcgi-w-nginx-so-much...

fullstop · on Feb 1, 2021

When I moved things from Apache -> nginx years ago, I did it not because it was faster but because the resource requirements of nginx were so much more predictable under load.

cholmon · on Feb 1, 2021

Are you saying that Slack uses mod_php? According to a Slack Engineering blog post[0] from 9 months ago, Slack has been using Hack/HHVM since 2016 in place of PHP. My understanding is that HHVM can only be run over FastCGI, unless there's a mod_hhvm that I'm unaware of.

[0]: https://slack.engineering/hacklang-at-slack-a-better-php/

kmavm · on Feb 1, 2021

This is accurate: Slack is exclusively using Hack/HHVM for its application servers.

HHVM has an embedded web server (the folly project's Proxygen), and can directly terminate HTTP/HTTPS itself. Facebook uses it in this way. If you want to bring your own webserver, though, FastCGI is the most practical way to do so with HHVM.

user5994461 · on Feb 1, 2021

mod_xyz plugins were deprecated long ago because they are unstable. If you do a comparison you should compare fastcgi versus fastcgi, that is the standard way to run web applications. Running with mod should be faster because it's running the interpreter directly into the apache process but it's also making apache unstable.

mod_python was abandoned around a decade ago. It's crashing on python 2.7.

mod_perl was dropped in 2012 with the release of apache 2.4. It was kicked out of the project but continues to exist as a separate project (not sure if it works at all).

Denvercoder9 · on Feb 1, 2021

That's not the reason, as mod_php only supports the prefork MPM, and not the threaded MPMs.

johannes1234321 · on Feb 1, 2021

Threaded MPMs and PHP work quite well, I fixed a few bugs in that space some time ago.

There however might be issues in some of the millions libraries PHP potentially links in an called from it's extensions, and those sometimes at emit thread safe, but finding that and finding bypasses isn't easy ...

The other issue is that it's often slower (while there recently were changes from custom thread local storage to more modern one) and if there is a crash (i.e. Recursion stack overflow ...) it affects all requests in that process, not only the one.

gscho · on Feb 1, 2021

> our dashboarding and alerting service became unavailable.

Sounds like the monitoring system needs a monitoring system.

jrockway · on Feb 1, 2021

It is quite awkward that the output of "working" and "completely broken" alerting systems have the same visible effect -- no alerts.

For Prometheus users, I wrote alertmanager-status to let a third-party "website up?" monitoring server check your alertmanager: https://github.com/jrockway/alertmanager-status

(I also wrote one of the main Google Fiber monitoring systems back when I was at Google. We spent quite a bit of time on monitoring monitoring, because whenever there was an actual incident people would ask us "is this real, or just the monitoring system being down?" Previous monitoring systems were flaky so people were kind of conditioned to ignore the improved system -- so we had to have a lot of dashboards to show them that there was really an ongoing issue.)

TonyTrapp · on Feb 1, 2021

"Who monitors the monitors?"

sargun · on Feb 1, 2021

I wonder why Slack uses TGW instead of VPC peering.

tikkabhuna · on Feb 1, 2021

The "I've just done my Solutions Architect exam" answer would be that TGW simplifies the topology by having a central hub, rather than each VPC having to peer with all the other VPCs.

I wonder how many VPCs people have before transitioning over to TGW.

saurik · on Feb 1, 2021

I miss EC2 Classic :/. It always feels like the entire world of VPCs must have come from the armies of network engineers who felt like if the world didn't support all of the complexity they had designed to fix a problem EC2 no longer had--the tyranny of cables and hubs and devices acting as routers--that maybe they would be out of a job or something, and so rather than design hierarchical security groups Amazon just brought back in every feature of network administration I had been happily prepared to never have to think about every again :(.

whatisthiseven · on Feb 1, 2021

Agreed, I always thought VPC and all that complexity was a big step backwards. My org is moving from a largely managed network into AWS, and now we have to configure the whole network and external gateways ourselves? What engineer wants to do this?

VPCs are virtual, but I don't need VPCs, I need the entire network layer virtualized and abstracted. As you suggested,just grouping devices in a single network and saying "let them all talk to each other, let this one talk to that one over this port/IP" should be all I describe. Let AWS figure out CIDR, routing, gateways, etc.

gen220 · on Feb 1, 2021

People use it as a (imo lazy) form of enforcing access control. If two services aren’t in the same VPC, they can’t talk to each other. It theoretically limits the damage of a rogue node.

Of course, it also creates a ton of overhead and complexity, because you still have to wire all your VPCs together to implement things like monitoring and log aggregation, for example.

As other people have suggested, the better solution (imo) is to have all your traffic be encrypted with mTLS, and enforce your ACLs with certs instead of network accessibility.

whatisthiseven · on Feb 1, 2021

Sure, but we also do the things you suggested.

However, if you are relying on defense in depth for security, then having them be network separate helps prevent internal DDoS attacks, whether malicious or not.

Enforcing security across the entire network layer has many positives. But I don't want to be messing with the lower levels, and those lower levels all have the same security models and solutions as one another, at least if you view them at a high level.

VPCs have value as a security and availability solution, I just don't want to have to configure it to get what could be an automatic benefit.

BillinghamJ · on Feb 1, 2021

Generally inclined to agree, but to be fair you can operate a VPC in exactly the same way as EC2 Classic - give everything public IPs, public subnets and ignore the internal IPs. Pretty sure those are the defaults too

Lammy · on Feb 1, 2021

My assumption is it's an IPv4 address exhaustion thing too.

mandis · on Feb 2, 2021

Amazon and IPv4 exhaustion?

https://www.techradar.com/in/news/amazon-has-hoarded-billion...

jen20 · on Feb 1, 2021

It's more to do with what entries you would put in a routing table to get EC2 Classic over a DirectConnect, no?

schoolornot · on Feb 1, 2021

Exams these days are a cross between pre-sales training and AWS dogma. Fundamentally all AWS services share the same primitives. It would be great if AWS could take incidents like these, provide some guidance on how to avoid them, and then add 1 or 2 questions to the exam. It would give some credence to the exams which are now basically crammed material.

miyuru · on Feb 1, 2021

this article from 4 months ago explains it.

https://slack.engineering/building-the-next-evolution-of-clo...

hanikesn · on Feb 1, 2021

Interesting, I was wondering whether using a shared VPC would have been the better solution, but it turns out they use a shared VPC per region and peer them via TGW. IMHO it'd be worth peering those regions individually, to get rid of that potential bottleneck. Of course you loose quite a few interesting features.

mbyio · on Feb 1, 2021

Yeah this stuck out to me too. Adding an extra hop and point of failure isn't acceptable IMO.

grumple · on Feb 2, 2021

Automated scaling has been a persistent problem for me, especially if I try to scale on simple metrics, or even worse (in Slack's case) on metrics that could potentially compete. The situations in which multiple metrics could compete are sometimes difficult to conceive, but it will always happen if you aren't performing something more sophisticated than "up if metric > value, down if < value" for multiple metrics. I think you've got to combine these somehow into a custom metric and scale just on one metric. I'm totally unsurprised to see that autoscaling failed for both Slack and AWS in this case.

I think you really have to look at metric-based autoscaling and say: is it worth the X% savings per month? Or would I rather avoid the occasional severe headaches caused by autoscaling messing up my day? Obviously this depends on company scale and how much your load varies. I'd rather have an excess of capacity than any impact on users.

tobobo · on Feb 1, 2021

The big takeaway for me here is that this “provisioning service” had enough internal dependencies that they couldn’t bring up new nodes. Seems like the worst thing possible during a big traffic spike.

throwdbaaway · on Feb 2, 2021

> We run a service, aptly named ‘provision-service’, which does exactly what it says on the tin. It is responsible for configuring and testing new instances, and performing various infrastructural housekeeping tasks. Provision-service needs to talk to other internal Slack systems and to some AWS APIs.

The "configuring and testing new instances" part also sounds very fishy to me. Configuration should be done when creating the image and launch template, while testing should be the job of the load balancing layer. Why do we need a separate "provision-service" to piece everything together?

DrJones1098 · on Feb 1, 2021

Wow. The trail leads back to AWS. Wasn't there a number of other companies that were down around that same time or was that a different time?

jjtheblunt · on Feb 1, 2021

Is this the recent event you refer to? https://aws.amazon.com/message/11201/

DrJones1098 · on Feb 1, 2021

Ah yes, that was the one I was referring to. Looks like a different event.

conradfr · on Feb 1, 2021

Does AWS compensate you in cases like this?

dewey · on Feb 1, 2021

It depends: https://aws.amazon.com/legal/service-level-agreements/

dandigangi · on Feb 1, 2021

They do. I can't remember where the documentation is but there's a clause where they payout as long as they aren't meeting their SLAs.

Who knows the different ways they may be able to get out of that. I assume this wasn't one of those times.

ianrw · on Feb 1, 2021

I wonder if you can pre-warm TGWs like you can ELB? It would be annoying to have to have AWS prewarm a bunch of you stuff, but it's better than it going down.

plaidfuji · on Feb 2, 2021

Maybe I’m being naive, but what I’m curious about is: how did their whole team communicate through this triage? I assume not Slack?

allannienhuis · on Feb 2, 2021

our emergency backup for slack is zoom. Horrible UX for group chats, but everyone already has it installed and it's quick and simple to set up a new room for each team. For temporary use you can put up with a fair bit of annoying behaviour or lack of features.

nhoughto · on Feb 1, 2021

Never seen Transit Gateway, I assume they wouldn't have this problem if it was just single VPC or done via VPC peering?

johnnymonster · on Feb 1, 2021

TL;DR outage caused by traffic spike from people returning to work after the holiday.

jeffrallen · on Feb 1, 2021

tldr: "we added complexity into our system to make it safer and that complexity blew us up. Also: we cannot scale up our system without everything working well, so we couldn't fix our stuff. Also: we were flying blind, probably because of the failing complexity that was supposed to protect us."

I am really not impressed... with the state of IT. I could not have done better, but isn't it too bad that we've built these towers of sand that keep knocking each other over?

bobthebuilders · on Feb 2, 2021

The thing is anyone can build a system that can scale out to Slack's level given enough machines and money. What's harder is scaling out to that level and not burning gobs of cash.

It's similar to the whole buildings a long time ago last much longer than those today. Its true in the literal sense, but it ignores the fact that we've gotten at reducing the cost of stuff like skyscrapers and bridges.

In our pursuit of efficiency, we do things like JIT delivery, dropshipping, scaling, building to the minimum spec. Sometimes, we get it wrong and it comes tumbling down (covid, HN hug of death, earthquakes).