This is a great writeup, and very broadly applicable. I don't see any "this would only apply at Google" in here.
> COMMUNICATION CHANNELS! AND BACKUP CHANNELS!! AND BACKUPS FOR THOSE BACKUP CHANNELS!!!
Yes! At Netflix, when we picked vendors for systems that we used during an outage, we always had to make sure they were not on AWS. At reddit we had a server in the office running a backup IRC server in case the main one we used was unavailable.
And I think Google has a backup IRC server on AWS, but that might just be apocryphal.
Always good to make sure you have a backup side channel that has as little to do with your infrastructure as possible.
God, I feel like such an idiot. All this time I’ve been making fun of Google for having Google Talk, Hangouts, Allo, Duo, Messages, Spaces, Wave, Buzz, Plus, and Meet, I’d never realized that it’s simply a necessary SRE measure at their scale.
The ultimate in "this would only apply at Google", was having to establish backup communication side channels that weren't Google dependent. While on call as an SRE for Google on the Internet traffic team, we would naturally default to Google Meet for communications during an incident, but the question is, what would we do if Google Meet is down? It's a critical-path team, which means that if I got paged because my team's system is down, it's not improbable that Google Meet (along with Google.com) was down because of it.
We needed to have extra layers of backup communications because the first three layers of systems were all also Google properties. That is to say, email wouldn't work because Gmail is Google, our work cell phones wouldn't work because they're Google Fi, and my home's ISP was Google Fiber/Webpass.
All of which is to confirm that, yes, Google has a backup IRC server for communication. I won't say where, but it's explicitly totally off Google infrastructure for that very reason.
> The ultimate in "this would only apply at Google", was having to establish backup communication side channels that weren't Google dependent.
Not sure this is an "only at google" thing. In a past life, I ran Engineering for an App/SMS messaging application. We definitely used an IRC channel as well, since the nature of our outages would mean messaging channels could be down. It's also why we didn't rely solely on SMS for our on-call alerting.
I wouldn't say this is a Google-only problem. Really, all of these problems of redundancy and recovery were considered by militaries well before private companies had sufficiently global operations and communications networks to start thinking about them too. In the Army Recon Course, they taught us to bounce signals off the ionosphere using hand-held radios if we needed to communicate with an otherwise cut-off remote unit too far past the earth's curvature horizon to get line of sight. And, of course, every unit is drilled in contingency plans to the point that it can operate independently when all communication gets cut off and continue to push the overall mission forward even in the absence of further coordination. That does, for what it's worth, present the potential for problems of its own when the mission changes and a disconnected unit doesn't know it, as well explored in Doctor Strangelove.
It’s still batshit insane to me that Google Fi had any infrastructure overlap with Gmail. When I was there these grand unified base layer systems everything else was built on top of were a point of pride. From a risk perspective though it was sheer stupidity in retrospect.
As Google continues to atrophy and suffers attrition of the original people that built those systems, the probability creeps up more and more that someone will one day cause a catastrophic global outage of systems that should share absolutely no dependencies but do because “it scales”.
If Google made autopilot for airplanes, it would all be on a centralized SaaS with white papers published in academic conferences about the elaborate systems designed to ensure server crashes don't impact it. Nobody would ever complain about how all of those guarantees depend on Chubby, Bigtable, etc.
At my not Google job we talk about "what happens if a meteor hits a DC".
We agree that it's so rare that, as long as there are buttons we can push to recover within a reasonable timeframe, it's an acceptable risk; we don't need a fully automatic way to recover from that.
However, your SRE team needs a way to recover without intervention, which is why there is talk of backups.
BTW, even using different cloud providers isn't necessarily enough to avoid a DC outage. No amount of redundancy can fully protect you from one, short of a ton of services that intentionally slice off access to the DC, which introduces the risk of that slicing-off happening accidentally, which is its own risk.
But that’s incredibly stupid. Focusing on a meteor hitting a DC is an intentionally dumb way to eclipse the much more likely risks: thousands of other things can wipe out a DC.
I’ve been involved in such discussions, and on the surface they seem reasonable, but it turns into an easy reason to write off the DC being wiped out.
In comparison there are far more likely reasons for a DC to be wiped out effectively permanently. The data centers at the base of WTC 1 and 2 are good examples. The myriad targeted attacks on the power grid are also prime examples of attacks that would cripple a data center for weeks if it were in the crosshairs.
The Cascadia subduction zone has a much higher probability of wiping out all of them in the PNW simultaneously than a meteor hitting a single one.
Looking at it from a different perspective, there are other teams at Google who understand how sensitive their data centers are, and they act appropriately. The list of data center locations is not public. A motivated group with long range rifles purchased from Walmart could wipe out a Google data center.
AFAIR Google just ran IRC on their corp network, which was completely separate from prod, so I wouldn’t be surprised if it was in a small server room in the office somewhere.
> very broadly applicable. I don't see any "this would only apply at Google" in here.
One thing I haven’t even heard of anyone else doing was production panic rooms - a secure room with backup vpn to prod
When I was there, the main IRC ran in prod. But it was intentionally a low-dependency system, an actual IRC server instead of something ridiculous like gIRC or IRC-over-stubby.
I think it had a corp DNS label, but I'm fuzzy on that. Yes, it could've been a prod instance, which would mean you'd need to go to the panic room in case corp was down, but maybe that was the intention.
Correct. At an old job we did zero trust corp on a different AWS region and account. The admin site was a different zero trust zone in prod region/account and was supposed to eventually become another AWS account in another region (for cost purposes).
I can’t say if any of this was ideal but it did work unobtrusively.
Way back when, for a while, our local (Google) office's internet access ran off the same physical lines as the local prod datacenter traffic. So, any time there was a datacenter traffic outage of any kind, our office was also out. There weren't a lot of outages of that variant, but we knew immediately when one was happening. It's not particularly fun to have all of your access go out concurrently with a prod outage.
Before divestiture we had two groups splitting responsibility for IT, so our DevOps team (bleh) only had partial control of things. We were then running in heterogeneous mode: one on-prem data center plus cloud services.
One day a production SAN went wonky, taking out our on prem data center. ... and also Atlassian with it. No Jira, no Confluence. Possibly no CI/CD as well. No carefully curated runbooks to recover. Just tribal knowledge.
People. Were. Furious. And rightfully so. The 'IT' team lost control of a bunch of things in that little incident. Who puts customer facing and infrastructure eggs into the same basket like that?
Also make sure your backup channel can scale. Getting a flood of 10,000+ folks over to a dinky IRC server can knock it over easily. Throttling new joiners isn't a panacea either, since there might be someone critical who needs to get onto a channel, and throttling can complicate that.
Maybe I'm naive, but I would imagine that any raspberry pi could run an IRC server with 10,000 users...
Surely 10,000 users, each on average receiving perhaps five 50-byte messages per second (that's a very busy chat room!), is a total bandwidth of 2.5 megabytes per second.
And the CPU time to shuffle messages from one TCP connection to another, or encrypt/decrypt 2.5 megabytes per second should be small. There is no complex parsing involved - it is literally just "forward this 50 byte message into this list of TCP connections".
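To sanity-check that arithmetic with the same assumed numbers (10,000 users, five 50-byte messages per second each; purely illustrative):

    # Back-of-the-envelope check of the fan-out bandwidth estimate above.
    # Assumed numbers from the comment: 10,000 connected users, each receiving
    # about five 50-byte messages per second (a very busy set of channels).
    users = 10_000
    msgs_per_user_per_sec = 5
    bytes_per_msg = 50

    outbound_bytes_per_sec = users * msgs_per_user_per_sec * bytes_per_msg
    print(f"{outbound_bytes_per_sec / 1e6:.1f} MB/s outbound")  # 2.5 MB/s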
If they're all internal/authed users, you can hopefully assume nobody is deliberately DoSing the server with millions of super long messages into busy channels too.
Let's not forget IRC channels can be split between servers too - don't wanna complicate the backup system too much but this is in the original design of IRC.
I started getting phone calls at 5AM regarding servers going down. I immediately went to email to connect with other members of the team and found my inbox full of thousands of emails from the monitoring service notifying me of the outage. Those were the days :)
Something I hope to eventually hear is the solution to the full cold start problem. Most giant custom-stack companies have circular dependencies on core infrastructure. Software-defined networking needs some software running to start routing packets again, diskless machines need some storage to boot from, authentication services need access to storage to start handing out service credentials to bootstrap secure authz, etc.
It's currently handled by running many independent regions so that data centers can be brought up from fully dark by bootstrapping them from existing infra. I haven't heard of anyone bringing the stack up from a full power-off situation. Even when Facebook completely broke its production network a couple years ago the machines stayed on and running and had some internal connectivity.
This matters to everyone because while cloud resources are great at automatic restarts and fault recovery there's no guarantee that AWS, GCP, and friends would come back up after, e.g., a massive solar storm that knocks out the grid worldwide for long enough to run the generators down.
My guess is that there are some dedicated small DCs with exceptional backup power and the ability to be fully isolated from grid surges (flywheel transformers or similar).
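One toy way to reason about the cold-start problem is to write down the bootstrap dependencies as a graph and look for cycles; anything on a cycle can't come up from fully dark without a special bootstrap path. The services and edges below are invented for illustration, not anyone's real stack:

    from collections import defaultdict

    # Hypothetical bootstrap dependencies: service -> services it needs up first.
    deps = {
        "network":      ["config_store"],  # SDN needs config before it can route packets
        "storage":      ["network"],       # diskless machines boot over the network
        "auth":         ["storage"],       # credential service reads keys from storage
        "config_store": ["auth"],          # config store only serves authenticated clients
    }

    def find_cycle(deps):
        """Return one dependency cycle if there is one, else None (simple DFS)."""
        WHITE, GRAY, BLACK = 0, 1, 2
        color = defaultdict(int)
        stack = []

        def dfs(node):
            color[node] = GRAY
            stack.append(node)
            for dep in deps.get(node, []):
                if color[dep] == GRAY:              # back edge: we found a cycle
                    return stack[stack.index(dep):] + [dep]
                if color[dep] == WHITE:
                    cycle = dfs(dep)
                    if cycle:
                        return cycle
            stack.pop()
            color[node] = BLACK
            return None

        for node in list(deps):
            if color[node] == WHITE:
                cycle = dfs(node)
                if cycle:
                    return cycle
        return None

    # ['network', 'config_store', 'auth', 'storage', 'network'] -- no cold-start order exists
    print(find_cycle(deps))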
When I was in Google SRE we had monitoring and enforcement of permitted and forbidden RPC peers, such that a system that attempted to use another system would fail or send alerts. This was useful at the top of the stack to keep track of dependencies silently added by library authors, and at the low levels to ensure the things at the bottom of the stack were really at the bottom. We also did virtual automated cluster turn-up and turn-down, to make sure our documented procedures did not get out of date, and in my 6 years in SRE I saw that procedure fall from 90 days to under an hour. We also regularly exercised the scratch restarts of things like global encryption key management, which involves a physical object. The annual DiRT exercise also tried to make sure that no person, team, or office was necessary to the continuing function of systems.
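I don't know the internal mechanism, but the effect described is roughly what a peer allow/deny check wrapped around RPC handlers gives you; the decorator, peer names, and policy below are invented for illustration:

    import logging

    # Hypothetical policy for one service: expected callers and explicitly banned ones.
    ALLOWED_PEERS = {"frontend", "batch-pipeline"}
    FORBIDDEN_PEERS = {"experimental-indexer"}

    def enforce_peer_policy(handler):
        """Fail forbidden callers outright; alert on callers nobody declared."""
        def wrapped(request, peer_identity):
            if peer_identity in FORBIDDEN_PEERS:
                raise PermissionError(f"RPCs from {peer_identity} are forbidden")
            if peer_identity not in ALLOWED_PEERS:
                # A silently added dependency: don't break the caller, but surface it.
                logging.warning("unexpected RPC peer %s; declare it or remove the call",
                                peer_identity)
            return handler(request, peer_identity)
        return wrapped

    @enforce_peer_policy
    def lookup(request, peer_identity):
        return {"ok": True}

    print(lookup({"key": "x"}, "frontend"))          # allowed
    print(lookup({"key": "x"}, "new-ml-service"))    # allowed, but logged as unexpected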
The solution to this has to be done earlier, but it's simple: start a habit of destroying and recreating everything. If you wait to start doing this, it's very painful. If you start doing it at the very beginning, you quickly get used to it, and breaking changes and weird dependencies are caught early.
You can even do this with hardware. It changes how you architect things, to deal with shit getting unplugged or reset. You end up requiring more automation, version control and change management, which speeds up and simplifies overall work, in addition to preventing and quickly fixing outages. It's a big culture shift.
My SaaS app is small even compared to some desktop apps. But at least once a year, I try to reboot it from scratch.
Of course this is way easier for a nano-scale app, but I love the feeling of knowing that it can be started from any server in the world in less than 10 minutes (including copying the data).
I also make sure there are 0 errors even with a clean slate database.
For some reason I can't understand, this gives me some joy.
That really doesn't address GP's concerns at all, which are only concerns for the hyperscale cloud providers. Because they use their global infrastructure to turn on a data center that has been down, we don't know if they have a way to restart the entire planet. It is impossible to test this, because that would cause a global outage that could possibly be permanent.
The problem is that you can never really destroy and recreate everything. At the very least you want to preserve your user data, but you probably also don't want to regularly go offline. IIRC Google used to have a dependency when starting up a new datacenter: it needed a DNS request so some core services could find each other and start up. This could be served by any other Google datacenter. But if all datacenters went offline at the same time, it wouldn't be able to start. So just destroying and recreating full datacenters won't find this. You literally would have to take all of Google offline at the same time to find this by testing.
Are you saying they can bring a new data center online without any connectivity to the rest of their infrastructure? GP isn't concerned about turning on one data center, they are concerned about turning them all on at the same time, and that can never be tested.
If you practice bringing a new datacenter online without any connectivity on the existing deployment, and you practice then joining two disjoint "clouds", then you've pretty much covered your bases.
Are you making a rate limiting/ddos argument about "turning them all on at the same time"?
The power grid guys claim to have cold start plans locked and loaded, but I'm not convinced they would work. Anyone seen an after-action report saying how well a real grid cold start went? It would also be interesting to know which grid has had the most cold starts: in a perfect world, they'd be good at it by now. Bet it's in the Caribbean or Africa. But it's funny: small grid cold starts (i.e. an isolated island with one diesel generator and some solar) are so easy they probably wouldn't make good case studies.
It's clear that the Internet itself could not be cold started the way AC grids can; there are simply too many ASes. (Think about what AS means for a second to understand why a coordinated, rehearsed cold start is not possible.)
Not just cold starts. Similar struggles if your Infra-as-code deployments depend on your CI/CD pipelines, conveniently running in the same environment as the one having an outage that you need to push new configs to.
Hyperscalers have several days’ worth of diesel to power generators after the batteries are used up. I’m pretty sure fueling trucks would be routed there should there be a power outage longer than 1-2 days.
For much much more on this, I'm most of the way through Google's book _Building Secure and Reliable Systems_, which is a proper textbook (not light reading). It's a pretty interesting book! A lot of what it says is just common sense, but as the saying goes, "common sense" is an oxymoron; it's felt useful to have refreshed my knowledge of the whole thing at once.
Recently, I've heard of several companies folding up SRE and moving individuals to their SWE teams. Rumors are that LinkedIn, Adobe, and Robinhood have done this.
This made me think: is SRE a byproduct of a bubble economy of easy money? Why not operate without the significant added expense of SRE teams?
I wonder if SRE will be around 10 years from now, much like Sys Admins and QA testers have mostly disappeared. Instead, many of those functions are performed by software development teams.
Eh it’s not the same thing. (I’m very full stack with intermittent devops/sre experience).
Full stack means you write code running on back end and front end. 99% of the time the code you write on the FE interfaces with your other code for the BE. It’s pretty coherent and feedback loops are similar.
Devops/SRE on the other hand is very different, and I agree we shouldn’t expect software developers to be mixing in SRE in their day to day. The skills, tools, mindset, feedback loop, and stress levels are too different.
If you’re not doing simple monoliths then you need a dedicated devops/SRE team.
If you can be good at front and back end and keep up with both of them simultaneously, that's great, but:
- you spend more time keeping up with both of those sectors compared to a dedicated front end or back end position
- you context switch more often than dedicated positions
- you spend more time getting good at both of those things
- you remove some of the communication overhead there would be with two positions
You are definitely not being compensated for that extra work and benefit to the business given that full stack salaries are close to front end and back end position salaries.
SRE is not a byproduct of a bubble economy. I believe Google has had SREs since the very beginning. But I think the rest of the point still stands. These days with devops, the skill set needed for devs has indeed expanded to have significant overlap with SREs. I expect companies to downsize their SRE teams and distribute responsibilities to devs.
A second major reason is automation. If you read the linked site long enough you'll find that in the early days of Google, SREs did plenty of manual work like deployments while manually watching graphs. They were indispensable then simply because even Google didn't have enough automation in their systems! You can read the story of Sisyphus https://www.usenix.org/sites/default/files/conference/protec... to get a sense of how Google's initial failure to adopt standardized automation ensured job security for SREs.
Pedantically, Google didn’t have SREs at the beginning. I asked a very early SRE, Lucas, (https://www.nytimes.com/2002/11/28/technology/postcards-from... and https://hackernoon.com/this-is-going-to-be-huge-google-found...), and he said that in the early days, outages would be really distracting to "the devs like Jeff and Sanjay", and he and a few others ended up forming SRE to handle site reliability more formally during the early days of growth, when Google got a reputation for being fast and scalable and nearly always up.
Lucas helped make one of my favorite Google Historical Artefacts, a crayon chart of search volume. They had to continuously rescale the graph in powers of ten due to exponential growth.
I miss pre-IPO Google and the Internet of that time.
> “These days with devops the skill set needed for devs have indeed expanded to have significant overlap with SREs”
Respectfully disagree on this. SRE is a huge complex realm unto itself. Just understanding how all the cloud components and environments and role systems work together is multiple training courses, let alone how to reliably deploy and run in them.
But modern approaches to dev require the SWEs to understand and model the operation of their software, and in fact program in terms of it — “writing infrastructure” rather than just code.
Lambda functions, for example: you have to understand their performance and scalability characteristics — in turn requiring knowledge of things like the latency added by crossing the boundary between a managed shared service cluster and a VPC — in order to understand how and where to factor things into individual deployable functions.
Alright, how about expecting devs to repackage their entire until-that-point-SaaS stack into an "appliance" (Kubernetes Helm chart), containing SWE-written resource manifests that define the application's scaling characteristics across arbitrarily-shaped k8s clusters they won't get to see in advance, using only node taints; memory limits for layers of their stack they've never even seen run full-bore before; health checks that multiplex back up to a central monitoring platform; safely-revertible multiphase upgrade rollout behavior that never decreases availability; and so forth;
...and then those same devs being expected to directly debug the behavior of this "appliance" in a client environment (think: someone consuming the "appliance" through the Amazon Marketplace, where this launches the workload into an EKS cluster in the customer's own VPC, with the customer in control of defining that cluster's node pools);
...where this can involve, for example, figuring out that a seemingly-innocent bounded-size Redis cache deployment, needs 10x its steady-state memory, when booting from a persisted AOF file... for some godforsaken reason.
The idea of ops people who wrote code for deployment and monitoring and had responsibility for incident management and change control existed before Google gave it a name.
Source: I was one at WebTV in 1996, and I worked with people who did it at Xerox PARC and General Magic long before then.
I never got why SRE existed. (SRE has been my title...) The job responsibilities, caring about monitoring, logging, performance, and metrics of applications, are all things a qualified developer should be doing. Offloading caring about operating the software someone writes to someone else just seems illogical to me. Put the SWEs on call. If SWEs think the best way to do something is manual, have them do it themselves, then fire them for being terrible engineers. All these tedious interviews and a SWE doesn't know how the computer they are programming works? It's insane. All that schooling, and things like how the OS works, which is part of an undergrad curriculum, gets offloaded to a career and title mostly made up of self-taught sysadmin people? Every good SWE I've known knew how the OS, computer, and network work.
> if SRE will be around 10 years from now,
Other tasks that SRE typically does now (generalized automation, providing dev tools, improving dev experience) are being moved to "platform" and teams with those names. I expect it to change significantly.
Oddly, the call to put the SWEs in the on-call rotation was one of the original goals of site reliability engineering as an institutional discipline. The idea at conception was that SREs were expensive, and only after product teams got their act together could they truly justify the cost of full-time reliability engineering support.
It's only in the past 10 years (reasonable people may disagree on that figure) that being a site reliability engineer came to mean being something other than a professional cranky jackass.
What I care about as an SRE is not graphs or performance or even whether my pager stays silent (though, that would be nice). No, I want the product teams to have good enough tools (and, crucially, the knowledge behind them) to keep hitting their goals.
Sometimes, frankly, the monitoring and performance get in the way of that.
> Other tasks that SRE typically does now (generalized automation, providing dev tools, improving dev experience) are being moved to "platform" and teams with those names. I expect it to change significantly.
Yeah, this is my experience, too. "DevOps" (loosely, the trend you describe in the first paragraph) is eating SRE from one end and "Platform" from the other. SREs are basically evolving into "System Engineers" responsible for operating and sometimes developing common infrastructure and its associated tools.
I don't think that's a bad thing at all! Platform engineering is more fun, you're distributing the load of responsibility in a way that's really sensible, and engineers who are directly responsible for tracking regressions, performance, and whatnot ime develop better products.
> is SRE a byproduct of a bubble economy of easy money? Why not operate without the significant added expense of SRE teams?
I'm a SWE SRE. I think in some cases it is better to be folded into a team. In other cases, less so.
One SRE team can support many different dev teams, and often the dev teams are not at all focusing time on the very complicated infra/distributed systems aspect of their job, it's just not something that they worry about day to day.
So it makes sense to have an 'infra' that operates at a different granularity than specialized dev teams.
That may or may not need to be called SRE, or maybe it's an SRE SWE team, or maybe you just call it 'infrastructure' but at a certain scale you have more cross cutting concerns across teams where it's cheaper to split things out that way.
I think it’s simply swapping one set of trade offs for another. With dedicated SREs you have true specialists in production operations and their accompanying systems (tooling, alerting, etc) with a clear mandate and ownership of outcomes; but they don’t necessarily have full ownership of what they’re keeping running, and that can cause organizational problems (we want to launch X, SRE says no, or vice versa) and make it so non-SREs take no ownership over their hard-to-support code.
Conversely you can have Eng teams without SREs and most of those organizational/social problems, at the cost of production reliability being only one of many priorities.
I think what’s really happening is that a lot of companies are deciding they don’t care about reliability very much as a business outcome, especially when it comes at the cost (at least in opportunity cost) of shipping fewer features.
I work for a large cloud service that is not Google where the SRE culture varies heavily depending on which product you’re building. SREs are a necessity to free up devs to do actual dev work. Platform and infra teams should tightly couple SWEs and SREs to keep SWEs accountable, but not responsible for day to day operations of the infra - you’ll never get anything done :)
The fact is many/most SWEs don't have the skillset or interest to do SRE work. While there is a lot of overlap, the work can be quite different between the two areas. SRE basically maps to the sysadmin role of old, which has never really gone away and I don't think it's a product of a "bubble economy".
If you think of an SRE as an expensive sysadmin then yes, you should absolutely scratch that entire org. SRE, by Google's definition, is supposed to contain software engineers with deep systems expertise, not some kind of less-qualified SWEs.
I haven’t noticed that in my corner of one of those mentioned companies. Also I’m not an SRE, but during the height of the recent tech layoffs the only job postings I was seeing were for SRE.
> much like [...] QA testers have mostly disappeared.
Who told you that?
QA isn't going anywhere... someone is doing testing, and that someone is a tester. They can be an s/w engineer by training, but as long as they are testing they are a tester.
With sysadmins, there are fashion waves, where they keep being called different names like DevOps or SRE. I've not heard of such a thing with testing.
> someone is doing testing, and that someone is a -tester- user
excuse me for remembering something surely HN considers a platitude: "everyone has a TEST environment, few are fortunate enough to also have a PROD one"
Well, let me take this seriously for a moment. I believe that companies which don't have dedicated testers today are the same companies which didn't have dedicated testers before.
We really use the language of "users doing the testing" jokingly. No software is written w/o testing; not even very trivial programs would run first time. So, we just mean that there wasn't enough testing, when we say that.
There is a process, however, that is meant to decrease the number of testers employed. The more testing can be automated, the fewer testers are necessary... but that hinges on the premise that the prior number of testers was somehow sufficient for the amount of testing that was necessary. I believe, though, that the number of testers hired was a function of budget more than anything else. There's never enough testing, and, in principle, it's hard to see how testing can be exhaustive. So, hopefully, with more automation, it's possible to test more, but I believe that the number of testers will remain more or less a function of budget.
> With sysadmins, there are fashion waves, where they keep being called different names like DevOps or SRE.
I don't think the name change really originated with Sysadmins. Basically these new titles were created (with narrow definitions) and then other companies said "We are cool like Google, we have SREs now, no Sysadmins" so all the jobs had new titles.
Source: Me and my last 4 jobs ( Sysadmin -> Devops Engineer -> Infrastructure Developer -> SRE ) which are all basically the same thing
In my G SRE interview, I had to do the same rigorous Software Engineering algorithms rounds as well as show deep distributed systems knowledge in designing highly available systems.
If by rigorous algorithms you mean, spend a month memorizing a few dozen leetcode problems then sure, I’ll agree that is sadly the state of SRE interviews at FAANG.
> is SRE a byproduct of a bubble economy of easy money?
I think it’s definitely one of the aspects.
Talking with some SRE friends, the reasons they think part of their role is important are the multitude of moving parts in the current development environment (partially related to easy money funding resume-driven development and a lot of tech stack side quests) and how much the hiring bar had been lowered (in this case thanks to hiring managers with almost infinite budgets and a lot of questionable product initiatives).
I imagine the threshold is something like 1 SRE for every $1mm of high-margin revenue you can link to guaranteeing the 2nd "9" of $product availability/reliability.
I believe that is indeed a good guide for when it makes sense to have a SRE team supporting a service or product (with the caveat that the number probably isn't $1MM).
There are also good patterns for ensuring you actually have adequate SRE coverage for what the business needs. 2 x 6-person teams, geographically dispersed, doing 7x12 shifts works pretty well (not cheap). You can do it with less but you run into more challenges when individuals leave / get burnt out / etc.
It's marginal revenue attributable to a high-performing SRE (i.e. an SRE who would be able to elevate a product they're supporting from 90.0% availability to 99.0% availability).
It's actually a pretty high bar, because there aren't that many products for which that segment of availability translates to >$1mm in marginal revenue. $1mm is a ballpark figure, but I think it's the right order of magnitude (i.e. the true number might be $5mm).
Expanding on another point in the original post: the decision varies with the profitability of that marginal revenue. For example, it's basically pure profit for Google, Amazon or Netflix – accordingly, it makes sense that they'd have many people who focus exclusively on performance and availability, to make sure they aren't leaving that revenue on the ground.
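To put made-up numbers on that: suppose a product books $15M/year of availability-sensitive revenue and revenue scales roughly linearly with availability (a big simplification). Then lifting it from 90.0% to 99.0% is worth about $1.35M/year, which clears a ~$1mm bar; the same lift on a $5M/year product would not.

    # Illustrative only: assumes revenue scales linearly with availability,
    # which glosses over retries, demand shifting, SLA credits, reputation, etc.
    annual_revenue = 15_000_000           # hypothetical availability-sensitive revenue
    availability_without_sre = 0.90
    availability_with_sre = 0.99

    marginal_revenue = annual_revenue * (availability_with_sre - availability_without_sre)
    print(f"${marginal_revenue:,.0f} per year")       # $1,350,000 per year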
I guess now they have a team of software engineers, where part is focused on infra and part on backend. Sys Admins disappeared? They are DevOps/IT Engineers now. QA? SWE in Test, and so on.
Think long and hard about this one. Multiple times in my three-decade career I've seen automated mitigations make the problem worse. So really consider whether self-healing is something you need.
I built my company's in-house mobile crash reporting solution in 2014. Part of the backend has had one server running Redis as a single point of failure. The failover process is only semi-automated. A human has to initiate it after confirming alerts about it being down are valid. There's also no real financial cost to it going down - at worst my company's mobile app developers are inconvenienced for a bit.
In the decade the system has been operational I can count on two fingers the number of times I've had to failover.
Despite this system having no SLA it's had better uptime than much more critical internal systems.
To be fair, GitHub operates at a much larger scale. My point is only that redundancy and automated mitigations add complexity and are almost by definition rarely tested and operate under unforeseen circumstances.
So really, consider your SLA and the cost of an outage and balance that against the complexity you'll add by guarding against an outage.
I think my first introduction to this was circa 1998 when I had a pair of NetApps clustered into an HA configuration and one of them failed and caused the other to corrupt all its disks. Fun times. A similar thing happened around the same time with a pair of Cisco PIX firewalls. I've been leery of HA and automated failover/mitigations ever since.
I'm curious how people approach big red buttons and intentional graceful degradation in practice, and especially how to ensure that these work when the system is experiencing problems.
E.g. do you use db-based "feature flags"? What do you do then if the DB itself is overloaded, or the API through which you access the DB?
Or do you use static startup flags (e.g. env variables)? How do you ensure these get rolled out quickly enough?
Something else entirely?
When you're a small company, simpler is actually better... It's better to keep things simple so that recovery is easy than to build out a more complicated solution that is reliable in the average case but fragile in the limits. Even if that means there are some places on the critical path where you don't use double redundancy, the result is a system simple enough to fit in the heads of all the maintainers and that can be rebooted or reverted easily.
... But once your firm starts making guarantees like "five nines uptime," there will be some complexity necessary to devise a system that can continue to be developed and improved while maintaining those guarantees.
At Google we also had to routinely do "backend drains" of particular clusters when we deemed them unhealthy, and they had a system to do that quickly at the API/LB layer. At other places I've also seen that done with application-level flags, so you'd do kubectl edit, which is obviously less than ideal but worked.
Implementation details will depend on your stack, but 3 main things I'd keep in mind (rough sketch after the list):
1. Keep it simple. No elaborate logic. No complex data stores. Just a simple checking of the flag.
2. Do it as close to the source as possible, but have limited trust in your clients - you may have old versions, things not propagating, bugs, etc. So it's best to have the option to degrade both in the client and on the server. If you can do only one, do the server side.
3. Real world test! And test often. Don’t trust test environment. Test on real world traffic. Do periodic tests at small scale (like 0.1% of traffic) but also do more full scale tests on a schedule. If you didn’t test it, it won’t work when you need it. If it worked a year ago, it will likely not work now. If it’s not tested, it’ll likely cause more damage than it’ll solve.
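As a rough sketch of those three points (the flag path, feature, and function names are all invented): the check is a cheap, dependency-free read on the serving path, the degrade decision lives on the server, and if the flag can't be read at all the code falls into the degraded (safe) mode rather than erroring out.

    import os

    # Hypothetical kill switch: if this file exists (pushed by config management,
    # or touched by hand during an incident), the expensive feature is disabled.
    DEGRADE_FLAG_PATH = "/etc/myservice/disable_recommendations"

    def recommendations_enabled() -> bool:
        try:
            return not os.path.exists(DEGRADE_FLAG_PATH)
        except OSError:
            return False   # can't even read the flag: choose the degraded, safe mode

    def fetch_recommendations(user_id):
        # Stand-in for the real, expensive dependency (an RPC, a DB query, ...).
        return [f"item-{user_id}-{i}" for i in range(3)]

    def render_page(user_id):
        recs = fetch_recommendations(user_id) if recommendations_enabled() else []
        return {"user": user_id, "recommendations": recs}   # degrades, never errors

    print(render_page(42))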
To make up an example that doesn't depend on any of those things: imagine that I've added a new feature to Hacker News that allows users to display profile pictures next to their comments. Of course, we have built everything around microservices, so this is implemented by the frontend page generator making a call to the profile service, which does a lookup and responds with the image location. As part of the launch plan, I document the "big red button" procedure to follow if my new component is overloading the profile service or image repository: run this command to rate-limit my service's outgoing requests at the network layer (probably to 0 in an emergency). It will fail its lookups and the page generator is designed to gracefully degrade by continuing to render the comment text, sans profile photo.
(Before anyone hits send on that "what a stupid way to do X" reply, please note that this is not an actual design doc, I'm not giving advice on how to build anything, it's just a crayon drawing to illustrate a point)
I worked at enough companies to see that many of them have some notion of "centralized config that's rolled to the edge and can be updated at runtime".
I've done this with djb's CDB (constant database). But I've seen people poll an API for JSON config files or dbm/gdbm/Berkeleydb/leveldb.
This can extend to other big red buttons. It's not that elegant but I've had numerous services that checked for the presence of a file to serve health checks. So pulling a node out of load balancer rotation was as easy as creating a file.
The idea is that then when there's a database outage the system defaults to serving the last known good config.
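A sketch of that "poll for config, fall back to last known good" shape; the URL, cache path, and flag names here are placeholders, not a real system:

    import json
    import os
    import urllib.request

    # Hypothetical centralized config endpoint plus a local cache of the last
    # config we successfully fetched (for cold starts and backend outages).
    CONFIG_URL = "https://config.internal.invalid/flags.json"
    CACHE_PATH = "/var/cache/myservice/flags.json"

    _current = {}   # last known good config, held in memory

    def refresh_config():
        """Poll the config service; on any failure, keep serving what we have."""
        global _current
        try:
            with urllib.request.urlopen(CONFIG_URL, timeout=2) as resp:
                fresh = json.load(resp)
            _current = fresh
            with open(CACHE_PATH, "w") as f:       # persist for the next cold start
                json.dump(fresh, f)
        except Exception:
            # Config backend (or its database) is down: fall back to last known good.
            if not _current and os.path.exists(CACHE_PATH):
                with open(CACHE_PATH) as f:
                    _current = json.load(f)

    def flag(name, default=False):
        return _current.get(name, default)

    # Call refresh_config() from a background loop (say every 30s); the serving
    # path only ever reads the in-memory dict via flag().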
Versioning, feature toggles and rollback - automated and implemented at different levels based on the system. It could be an env configuration, or a db field, or down migration scripts, or deploying the last working version.
I would like to take this moment to really highlight "Recovery mechanisms should be fully tested before an emergency". As the unexpected SRE at Google who became known by the entire company for using a double negative incorrectly, it is something very important to do right away.
For those Googlers curious, you can search my username internally for how I generated more impact than could be measured.
The cheapest way to prevent an outage is to catch it early in its lifecycle. Software bugs are like real bugs. First is the egg, that's the idea of the change. Then there's the nymph, when it hatches; first POC. By the time it hits production, it's an adult.
Wait - isn't there a stage before adulthood? Yes! Your app should have several stages of maturity before it reaches adulthood. It's far cheaper to find that bug before it becomes fully grown (and starts laying its own eggs!)
If you can't do canaries and rollbacks are problematic, add more testing before the production deploy. Linters, unit tests, end to end tests, profilers, synthetic monitors, read-only copies of production, performance tests, etc. Use as many ways as you can to find the bug early.
Feature flags, backwards compatibility, and other methods are also useful. But nothing beats Shift Left.
If you are interested in a similar list but with a bent towards being a SRE for 15 years in FinTech/Banks/Hedge Funds/Crypto, let me humbly suggest you check out:
Teaser:
"25. If you have a rules engine where it's easier to make a new rule than to find an existing rule based on filter criteria: you will end up with lots of duplicate rules."
"We once narrowly missed a major outage because the engineer who submitted the would-be-triggering change unplugged their desktop computer before the change could propagate". Sorry, what?
The change was being orchestrated from their desktop, and they noticed things were going sideways, so they unplugged their desktop to stop the deployment. Aka "pressed the big red button".
It always cracks me up how Google is simultaneously the most web-based company in the world, but their internal political landscape was (Infra, Search, Ads) > everything else. This leads to infra SWEs writing stupid CLIs all day, rather than having any literal buttons. Things were changing a lot by the time I left though.
I do think Google should be more open about their internal outages. This one in particular was very famous internally.
It's a logical consequence of the "zero trust" network. If an engineer's workstation can make RPCs to production systems, and that engineer is properly entitled to assume some privileged role, then there's no difference between running the automation in prod and running it on your workstation. Even at huge scales, shell tools plus RPC client CLIs can contact every machine in the world pretty promptly.
There are still differences. If you're running it in prod then the functionality has at least gone through code review and you have higher confidence that what's running is what you think it is. If you run things from personal boxes there's always the risk of them not having the latest code, of having made a local change and not checked it in, or the worst case of a bad actor doing whatever they want with the privileged role. But if code review isn't required or engineers have unrestricted SSH access to prod hosts then it's pretty much equivalent.
At one point I had to run a script on a substantial portion of their server fleet (hundreds of thousands of machines) and I remember I ran it with a pssh-style utility from my desktop (this was 10y ago so I don't know if they still use this). It was surprisingly quick to do it this way. Could've been something like that.
Yeah, interesting tidbit. It might sound insane today that one engineer's desktop computer could cause such an outage. But that was probably more commonplace 20 years ago and even today in smaller orgs.
There was a famous incident at one point where code search internally went down. It turned out that while they had deployed the tool internally, one piece of the indexing process was still running as a cron job on the original developer's desktop machine. He went on vacation, his credentials aged out, and the crawler stopped refreshing the index.
But my favorite incident will forever be the time they had to drill out a safe because they were disaster-testing the password vault system and discovered that the key needed to restore the password vault system was stored in a safe, the combination for which had been moved into the password vault system. Only with advanced, modern technology can you lock the keys to the safe in the safe itself with so many steps!
Well, mostly-kinda-sorta. ;) It's the internal password vault, and there's only one of them, so it's more like "they broke it on purpose and then had to fix it before the company went off the rails." Among the things kept in that vault are credentials that, if they age out or aren't regularly refreshed, cause key internal infrastructure to start grinding to a halt.
But still, "it broke while engineers were staring at it and trying to break it" is a better scenario than "it broke surprisingly while engineers were trying to do something else."
100% - Including some of the Chaos Engineering features that are recently offered in the platform (e.g., simulating service errors/latencies, region outages, etc)
Error budgets are to control the workload of the guy who is holding the oncall pager, who otherwise has no say over his or her situation. In recent years companies have shifted to 'you build it you own it' and the infra has been abstracted to the point that the SWE can own the entire thing.
Error budgets also only matter if you either give a shit about your guys or have to pay them for that oncall time. Plenty of employers are happy to say 'salary is exempt, suck it up lol', so errors are effectively free.
This feels very apropos to a recent banking outage we had here in Singapore. On Oct 14, I was out buying groceries and was asked a very strange question by the NTUC FairPrice supermarket cashier, "which bank is your card from?". Expecting the usual "would you like this product on offer" type question, I didn't even register the question for a second.
Turns out that we had a major outage at a data centre that served DBS - one of the largest lenders, as well as Citi. [1]
The disruption was attributed to a cooling system failure at a data centre operated by Equinix [2]. Further digging led to information that the culprit was the SG3 data centre [3] marketed as their largest IBX data centre in the Asia Pacific region and one of the newest. It turned out that the cooling system was being upgraded on contract with an external vendor, who applied incorrect settings which brought it down.
Further, this particular data centre has 2N electrical redundancy but a cooling redundancy of only N+1 chillers, in comparison to other financial services organizations like the Singapore Exchange (SGX), which offers [4] colo hosting with 2N chillers, which I believe is essential for warm equatorial climates.
Sadly, this outage was followed by yet another smaller payment related outage the following week, making it the fifth outage this year. DBS was trumpeting their move to the cloud [5] as part of their grand plan to transform themselves from a bank into a software company that also offered banking services [6][7].
In going all out with this questionable and misguided transformation they've lost focus on what made people trust them in the first place - the decades of trust that was built on solid, reliable, transparent and efficient banking services.
There were questions about why their backup data centre didn't kick in, and there are no answers to date.
It's clear to see that the recovery mechanisms weren't tested, performance degrade modes were not implemented or not tested, disaster resilience utterly failed, and there were no working mitigations for cooling system failures or DC failures; as a result, ATMs and other services were down from 3pm on Oct 14 until the following morning. For a country that prides itself on digital transformation, this is just the latest banking systems failure, which makes it far more than egg on the face: it's an erosion of trust.
Items (2), (3), (7), (8), (9) from the Google report directly apply to this failure. I can only hope something good comes out of it and lessons are learnt.