This reminds me very much of Sidney Dekker's work, particularly The Field Guide to Understanding Human Error, and Drift Into Failure.
The former focuses on evaluating the system as a whole, and identifying the state of mind of the participants of the accidents and evaluating what led them to believe that they were making the correct decisions, with the understanding that nobody wants to crash a plane.
The latter book talks more about how multiple seemingly independent changes to complex loosely coupled systems can introduce gaps in safety coverage that aren't immediately obvious, and how those things could be avoided.
I think the CAST approach looks appealing. It seems as though it does require a lot of analysis of failures and near-misses to be best utilized, and the hardest part of implementing it will undoubtedly be the people, who often take the "there wasn't a failure, why should we spend time and energy investigating a success" mindset.
Yes you're 100% right. Dekker is a valuable complement to CAST & STAMP because Dekker emphasizes people aspects of psychology, goals, beliefs, etc., while CAST emphasizes engineering aspects of processes, practices, metrics, etc.
CAST describes how to pragmatically bring together the people aspects and the engineering aspects, by having stakeholders write a short explicit safety philosophy:
Like so many things from Google engineering this will be toxic to your startup. SREs read stuff like this, they get main character syndrome and start redoing the technical designs of all the other teams, and not in a good way.
This phenomenon can occur in all “overlay” functions, for example the legal department will try to run the entire company if you don’t have a good leader who keeps the team in their lane.
In my experience, SREs are usually "enforcers of maintainability". If your engineers don't want to be oncall, they need to produce applications and services that are documented and maintainable. It's an amazing forcing function. SRE doesn't often redo technical designs, there's plenty enough reliability and scalability work to do...
At a 200-person company, sure. But when you're in the tens or hundreds of thousands, that's a hard no. Especially when dealing with out-of-scope dependencies.
What? Engineers should own the code they write, including being on call to maintain it. Out-of-scope dependencies should be irrelevant, and if they're not, get some of those tens or hundreds of thousands of employees to work on better observability.
I agree that if you own the blahblah service then you shouldn't get alerts for a broken dependency foobaz if that team is already aware, but if blahblah itself breaks, not being around to fix it is pretty dangerous.
> But when you're in the tens or hundreds of thousands, that's a hard no.
What? No, not at all. I worked in such a company, and oncall was indeed a thing and it was tremendously easy to deal with upstream and downstream dependencies. You have dashboards monitoring metrics emitted in calls to and from third-party services, and runbooks that made it quite clear who to call when any of the dependencies misbehaved. If anything happened, everyone was briefed and even on a call.
This boils down to ownership and accountability. It means nothing if the company had 10 or 100k employees.
Since the 90s, the whole DNS on which the internet stands today was run successfully, with minimal errors, by a bunch of folks who used to call themselves sysadmins. Developers seem to have run out of things to develop, and they have been reinventing themselves as devops and SREs. They have been pushing out pure sysadmins, but at the same time this trend shows how demand for developers or SWEs falls far short of the supply of developers in the market.
Yes it is worse, much worse. A large part of the reason for that is that it's written in Go. The other part is that it's written by Googlers and sysadmin people; two groups not particularly known for their great software engineering skills. My personal experience here is mostly with cAdvisor (which I guess is not strictly part of Kubernetes but comes from the same ecosystem). It is chock full of horrible error handling (if there is any), uninitialized structures and a dozen layers of indirection.
I wish this article was at most a quarter of its current length. Preferably even shorter. There's so much self-congratulatory and empty talk, it's really hard to get to the main point.
I think the most important (and actually valuable) part is the mention of the work done by someone else (STPA and CAST). That's all there is to the article. Read about Causal Analysis based on Systems Theory (CAST) and System-Theoretic Process Analysis (STPA), and do what the book says.
Agreed that the whole article could've been much shorter. Anyway, for me the key takeaway is not to trust your inputs. It's true that code correctness often boils down to "given input X, the program will correctly give output Y", but the actual issue is that sometimes the input X itself might be wrong. I think it's clearly visible in project management, where people tell you one thing, you plan accordingly, then later they do another thing, and if you haven't predicted this, you're done. If this behavior is so common in human projects in general, I see no reason why it wouldn't emerge in software projects too.
The problem is, software that tries to do something smart with inputs is much harder to reason about, which in turn increases your likelihood of failure, which is exactly the thing you wanted to avoid in the first place. For example, imagine you have an edge case in your script where you want to perform "rm -rf /" but the safety mechanism prevents you from doing this, which effectively makes your script fail.
In conclusion, in my humble opinion, the most important part of safety is choosing tools that are simplest to reason about. If you have a bash script you're guaranteed to have some bug related to some edge case - people managing POSIX realized that bash is so fundamentally broken that it's better to forbid certain filenames rather than fix bash. Use a Python library for 10x the safety but half the comfort. If you have a C++ program it will leak memory no matter how hard you try. And so on.
Similarly, when writing programs, you should give simple and strong promises about its API. Don't ever do "program accepts most sensible date strings and tries to parse that", do "it's either this specific format or an error".
Verifying inputs and being smart about them is a good idea that should be used carefully because it can backfire spectacularly.
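To make the "one specific format or an error" point concrete, here is a minimal sketch in Python. The function name `parse_iso_date` and the choice of `YYYY-MM-DD` are my own assumptions for illustration, not anything from the article. The pattern is checked by hand because `datetime.strptime` is more lenient than it looks (it accepts unpadded months and days, for instance).

```python
import re
from datetime import date

# Exactly four, two and two ASCII digits separated by hyphens; nothing else.
_ISO_DATE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")


def parse_iso_date(text: str) -> date:
    """Parse YYYY-MM-DD and nothing else (hypothetical helper).

    Raises ValueError for any other spelling, and for well-formed strings
    that are not real calendar dates (e.g. 2024-02-30).
    """
    match = _ISO_DATE.fullmatch(text)
    if match is None:
        raise ValueError(f"expected YYYY-MM-DD, got {text!r}")
    year, month, day = (int(part) for part in match.groups())
    # date() itself rejects impossible dates with ValueError.
    return date(year, month, day)
```

The promise to callers stays simple: either you get a `date` back, or you get a `ValueError`. There is no "best guess" branch to reason about.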
It’s not the first article / publication on Google SRE I’ve read, and they’re all similarly (and imho unnecessarily) verbose.
Whilst I’m deeply grateful to the good folks at Google for sharing their hard-earned knowledge with us, I do wish their publications on this important topic were far more succinct.
Couple thoughts here:
1. The “rightsizer” example mentioned might well have had the same outcome if the outage was analyzed in a “traditional” sense. That said, it is much easier and more actionable with this new approach.
2. I’ve always hated software testing because faults can occur external to the software being tested. It’s difficult to reason about those if you have a myopic view of just your component of the system. This line of thinking somewhat fixes that, or at least paves a path to fixing that.
Unfortunately, while this article says a lot, much of it just repeats itself, and I wish there was more detail. For example: who all is involved in this process? Are there limits on what can be controlled? How (politically) does this all shake out with respect to the relationships between SREs and software engineers? Etc.
Agreed, the devil is in the detail for SRE functions, and the organizational details of how to leverage this framework are largely absent from this writeup. With so many teams struggling to get the organizational components right just for traditional SRE (due to budget constraints, internal politics, misunderstanding of SRE by leadership, etc), I'd imagine implementing the changes needed to leverage the ideas in this writeup will be impossible for all but extremely deep-pocketed tech companies.
Nonetheless, lots of interesting concepts, so I would like to see a Google SRE handbook style writeup with more info that might be of more practical value.
I've been reading about CAST (Causal Analysis based on Systems Theory) and noticed some interesting parallels with mechanistic interpretability work. Rather than searching for root causes, CAST provides frameworks for analyzing how system components interact and why they "believe" their decisions are correct - which seems relevant to understanding neural networks.
I'm curious if anyone has tried applying formal safety engineering frameworks to neural net analysis. The methods for tracing complex causal chains and system-level behaviors in CAST seem potentially useful, but I'd love to hear from people who understand both fields better than I do. Is this a meaningful connection or am I pattern-matching too aggressively?
I do AI/ML research for a living (my degrees were in theoretical CS and AI/ML and my [unfinished] PhD work was in computational creativity [essentially AGI]). I also do SRE work for a living.
And yeah, that's a useful way of characterizing some of the behaviors of some kinds of neural networks. There's a point at which the distinction between belief and "frequency (or probability-amplitude) state filter" becomes less apparent, though; that's more of a function-of-medium vs function-of-system distinction.
However, systems like these can often become mediums, themselves, for more complex systems.
Additionally, a system which has "closed-the-loop" by understanding the medium and the system as coupled as "self" and separate from the environment along with a direction/goal is a pretty decent, if imprecise, definition of a strange loop. Contradiction resolution between internal component beliefs gives a possible (imo, highly probable) mechanistic explanation for the phenomenon of free energy minimization in such systems. External contradiction resolution extends it to active inference.
I was listening to a Titus Winters podcast, and I’m not sure he exactly put it like this, but I took it away as:
There are two problems with automated testing. 1) tests take too long to run 2) difficult to root cause breakages.
Most devs solve this with making unit tests ever more granular with heavy use of mocks/fakes. This “solves” both problems in a narrow sense: the tests run faster and are obvious to root cause breakages.
But you didn’t actually solve the problem. Since the entire point of writing tests in the first place was to answer the question: “does my system work”? Granular and mocked unit tests don’t help much.
However, going back to the original question, we can actually reframe the problems as: 1) a work scheduling problem and 2) a signal processing problem.
Those are pretty well understood problems with good solutions. It’s just that this is a somewhat novel way of thinking of tests, so it hasn’t really been integrated into the open source tool chain.
You could imagine integration tests automatically being correlated to a microservice release. Some CI automation constantly running expensive tests over a range of commits and automatically bisecting on failure. Etc.
Put another way, automated tests don’t go far enough. We need yet another higher layer of abstraction. Computers are better at deciding what tests to run and when, and are also better at interpreting the results.
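The "automatically bisecting on failure" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `first_bad_commit` and the `passes` callback are hypothetical names, the oldest commit in the range is known good, the newest is known bad, and the test is not flaky (a real system would need to handle flakiness, which is where the signal processing framing comes in).

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], passes: Callable[[str], bool]) -> str:
    """Binary-search an ordered commit range for the first failing commit.

    Assumes commits[0] passes, commits[-1] fails, and that once the test
    breaks it stays broken for every later commit in the range.
    """
    if len(commits) < 2:
        raise ValueError("need at least one known-good and one known-bad commit")
    good, bad = 0, len(commits) - 1
    # Invariant: commits[good] passes and commits[bad] fails.
    while bad - good > 1:
        mid = (good + bad) // 2
        if passes(commits[mid]):
            good = mid
        else:
            bad = mid
    return commits[bad]
```

The expensive test runs roughly log2(N) times instead of N, which is the scheduling half of the argument: the machine decides which commits are worth spending test time on.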
> Put another way, automated tests don’t go far enough. We need yet another higher layer of abstraction. Computers are better at deciding what tests to run and when, and are also better at interpreting the results.
> 1) tests take too long to run 2) difficult to root cause breakages.
I have gotten a ton of value out of targeting these problems specifically. Some low-cost-high-value test infra changes:
1. Passing tests are not allowed to emit error level logs without expecting them; if the expected error does not occur, the test fails. This is good in general, but also makes root causing test failures easier as if you encounter an unexpected error log you can attach it to the failure message, and often that pinpoints the root cause.
2. Have lots of off-by-default instrumentation. For example, if a test fails auto-rerun it with sanitizers on.
3. In tests, gather backtraces when moving tasks between threads. This makes identifying what triggered an error straightforward.
4. Collect data during test runs, like line coverage, in a sequential order that can be diffed between passing and failing runs.
As for making tests fast, the solution is aggressive optimization just like prod code - profile the test, and fix what makes it slow. Cut corners where it makes sense - should you ever fsync in a test? VM snapshotting can be very useful in this area for things that are slow to start up.
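For item 1 in the list above, here is a minimal sketch of what such a guard could look like with Python's standard `logging` module. The name `error_log_guard` and the substring-matching rule for "expected" errors are assumptions made for illustration; the original comment doesn't say how its infra matches expected logs.

```python
import logging
from contextlib import contextmanager


class _ErrorCollector(logging.Handler):
    """Records the text of every ERROR-or-worse log record it sees."""

    def __init__(self):
        super().__init__(level=logging.ERROR)
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


@contextmanager
def error_log_guard(expected=()):
    """Fail the enclosed test body on unexpected or missing error logs.

    `expected` is a list of substrings; each must appear in some error log,
    and every error log must contain at least one of them.
    """
    collector = _ErrorCollector()
    root = logging.getLogger()
    root.addHandler(collector)
    try:
        yield
    finally:
        root.removeHandler(collector)
    # Only reached if the body itself did not raise.
    unexpected = [m for m in collector.messages
                  if not any(e in m for e in expected)]
    missing = [e for e in expected
               if not any(e in m for m in collector.messages)]
    if unexpected:
        # The log text goes into the failure message, which is often
        # enough to point at the root cause.
        raise AssertionError(f"unexpected error logs: {unexpected}")
    if missing:
        raise AssertionError(f"expected error logs never appeared: {missing}")
```

A test wraps its body in `with error_log_guard(expected=["disk full"]):` and gets both checks for free.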
Are there any resources to show how to apply this in practice? This is too theoretical to grok for me, there are too many terms. It seems too time-consuming to understand (and to perform IMO)
Here's a fast, easy, practical way to think about CAST:
1. Causal: Novices may believe accidents are due to one "root cause" or a few "probable causes", but it turns out that accidents are actually due to many interacting causes.
2. Analysis: Novices may blame people, but it's smarter to do blame-free examination of why the loss occurred, and how it occurred i.e. "ask why and how, not who".
3. Systems: Novices may fix just one thing that broke, but it turns out it's better to discover multiple causes, then plan multiple ways to improve the whole system.
Thanks for writing the summary notes and sharing those here. After reading the Usenix article, I was thinking that we could apply some of the ideas at $WORK, but the exact "How" was still not super clear. Your notes offer a compact and accessible starting point without having to ask colleagues to dive in to a 100+ page PDF. :D
I wonder at what scale this very interesting approach starts yielding more value than cost.
What I mean is: is it faang-only, like so many of the things they seeded, or is it actually relevant at a non-faang scale?
I tend to invest a lot in risk avoidance, so this is appealing to me, but I know that my risk avoidance tendency is rarely shared by my peers/stakeholders.
It's definitely something that requires a high budget and a dedicated reliability team. In most orgs that have got as far as a proper post-mortem and analysis culture, they aren't even reliably draining the action items generated by the post mortems, so attempting to pre-emptively generate action items is kind of a moot point.
The example here seems to have to do with appropriately sizing the requirements of applications, which enables you to schedule more applications per machine, driving down costs.
This is useful for any company larger than say 10 people.
In general this is difficult to do, because there is more at play than memory, CPU and disk usage, especially if you have certain performance requirements.
I find what's in Kubernetes (a Google product) pretty much useless, but maybe it works for web tech.
I understood their example more like: automating the scaling of servers is easy, having proper inputs for this scaling to be reliable is hard.
What they propose is to spend weeks of engineering time performing analysis in the hope of finding some relevant issues.
Are both this engineering time and the issues fixing time relevant for non faang companies?
In other words: the leverage gained from avoiding issues is smaller, so the return on such analysis decreases. Where does the return become negative?
In practice it's pretty much impossible to get precise requirements without automatically learning them from how the application performed in the past.
The problem is that it is high-risk to automatically perform those changes since they might affect the application in ways you do not expect.
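Both points (learn requirements from history, but don't let the automation make big surprising changes) can be combined in a small sketch. This is not how Google's rightsizer works; `recommend_limit`, the nearest-rank percentile, the headroom factor and the step cap are all assumptions chosen for illustration.

```python
import math


def recommend_limit(samples, current_limit, percentile=99,
                    headroom=1.2, max_step=0.25):
    """Suggest a new resource limit from past usage samples.

    Takes the given percentile of observed usage (nearest-rank method),
    adds headroom, then clamps the result so a single adjustment never
    moves the limit by more than max_step (25% by default).
    """
    if not samples:
        # No history: the safest automated action is to change nothing.
        return current_limit
    ordered = sorted(samples)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    target = ordered[rank - 1] * headroom
    lower = current_limit * (1 - max_step)
    upper = current_limit * (1 + max_step)
    return min(max(target, lower), upper)
```

The clamp is the part that addresses the risk: even if the learned target is badly wrong because the input data was wrong, one run can only move the limit a bounded amount, which leaves time for a human or an alert to notice.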
From what I gathered skimming the article, and especially from spending a bit of time on their example, the authors (and whoever else) invented a complicated system into which they try to fit the real world; when it invariably doesn't fit, they use band-aids, self-reported to fix real-world problems. In their example the rightsizer should never have set a wrong size, as that should have been described or prescribed properly; thus they failed.
A quick connection I made is to when I was learning RDF and its various incantations, and trying to describe the real world. I never did figure it out, but did learn it's a very hard problem.
I think the single biggest thing about Google SREs (at least in the early years) was that if your team was going to launch a new product, you had to have an SRE to help and to maintain the service.
Google deliberately limited the amount of SREs, so you had to prove your stuff worked and sell it to the SRE to even get a chance to launch.
This culture was, imo, directly responsible for google's failure to launch a facebook competitor early enough for it to matter.
The Orkut project was basically banned from being launched or marketed as an official google product because it was deemed "not production ready" by SRE. Despite that it gained huge market share in Brazil and a few other countries before eventually losing to FB. By the time their "production ready" product (G+) launched it was hilariously late.
Facebook probably would have won anyway, but who knows what might have happened if Google had actually leaned into this very successful project instead of treating it like an unwanted step-child.
How was it banned from being launched? It did launch and the desire to not be promoted as a Google product came from Orkut himself, iirc.
The reason it was not regarded as 'production ready' was that the architecture didn't scale. In fact it also didn't run on the regular Google infrastructure that everything else used and that SRE teams were familiar with; it was a .NET app that used MS SQL Server.
This design wasn't a big surprise. Facebook won not because Orkut lost but because Facebook were the first to manage gradual scaleup without killing their own adoption, by figuring out naturally isolated social networks they could restrict signup to (American universities). This made their sharding problem much easier. Every other competitor tried to let the whole world sign up simultaneously whilst also offering sophisticated data model features like the ability to make arbitrary friendships and groups, which was hard to implement with the RDBMS tech of the time.
Orkut did indeed suffer drastic scaling problems and eventually had to be rewritten on top of the regular Google infrastructure, but that just replaced one set of problems with another.
The attitude within SRE toward Orkut (the product) was one of disdain if not contempt. A healthy culture does not treat rapidly growing products this way.
I mean, I'm personal friends with a former Orkut SRE. The idea that Google SRE ignored or disdained Orkut just isn't right. Nonetheless, if your job is defined as "make the site reliable" and it's written in a way that can never be reliable then it's understandable that you're going to have at least some frustrations.
Of course restricting rollout to American university emails (including alumni addresses--at least at one point) was also a pretty natural consequence of Facebook's origins.
It's not good when you have an SRE on hand to act as a babysitter of sorts. That is how some companies use SREs these days. They do the toil and sysadmin work so the product engineers can focus on features. Exactly what we hoped to avoid, but here we are.
If by some you mean nearly all, then yes, and yes, it’s terrible.
Super fun being the adult in the room having to explain for the millionth time why someone can’t expect that a network call will always succeed, and will never experience latency.
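For anyone who hasn't had that explanation yet, the minimum a caller owes a network dependency looks roughly like this sketch. `call_with_retries` is a hypothetical helper, and `operation` stands in for whatever client call you make (which should itself carry a timeout); neither comes from a real library.

```python
import random
import time


def call_with_retries(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run operation(), retrying transient network failures.

    Retries on ConnectionError and TimeoutError with exponential backoff
    plus jitter, and re-raises the last error once attempts are exhausted.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of retries: let the caller handle the failure
            # Jitter keeps many clients from retrying in lockstep.
            sleep(base_delay * 2 ** attempt * random.uniform(0.5, 1.5))
```

Retries are only safe for idempotent operations, and the caller still has to decide what to do when the final attempt fails.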
Even at Google it's like this. I spent the holidays watching my on-call Google SRE friend trying to diagnose misbehaving mapreduce jobs devs had written. They are basically glorified first line support so SWEs don't get woken up in the night.
Which seems like the worst possible setup to me - devs should be first on call for code they write. That seems like a basic principle to me and creates the correct incentives.
Thanks for this detail, I worked at Google, with SREs, and didn't know it. It seems like the type of 'design' detail that might be more important than this entire article
It took Google more than 10 years after I showed them the problem with their current approach to service management, which was much aligned with SRE, to get to this point of awareness of the need for service cognition but here we are.
SWEs: are SRE/devops folks part of your day to day?
I have never been in a SWE role where I didn’t do my own “ops”, even at FAANG (I haven’t worked at Google). I know "SRE/devops" was/is buzzy in the industry, but it’s always seemed, in the vast majority of cases, to be a half-assed rebrand of the old school “operations” role -- hardly a revolution. In general, I think my teams have benefited from doing our own ops. The software certainly has. Am I living in a bubble?
I'm assuming that on the ops side of your role, you're not filling in firewall rules paperwork for a network team, or spinning up new servers to SSH in, SCP some files over, and edit a couple of config files, though. Operations just doesn't look like that anymore, so the fact that SWE teams can now do a meaningful amount of operations for their product is the revolution. It may not feel like it if you weren't doing operations the old way, but there are a lot of invisible tools that make things work.
SRE and DevOps is better summarized as 'cloud engineering', IMO. Basically, it's to set up and maintain the infrastructure which allows you to do your own ops as a dev.
That's my impression as well. My SWE team has always done all of that ourselves, I've never felt the need for a dedicated role to maintain IaC and click around in the console.
I don't think Ben Treynor knows what SRE at Google is, anymore. I've heard from multiple sources that he's checked out, retired, and chilling on his ranch.
I'm sure there's some team at Google that does this, but this reads like yet another "how Google works" book that nobody at Google recognises.
They're doing that thing that happened to DevOps. It started out as a guy who wanted a way for devs and sysadmins to talk about deploys together, so they didn't get dead-cat syndrome. It ended up as an entire branch of business management theory, consultants, and a whole lot of ignorant people who think it just means "a dev who does sysadmin tasks".
Abuse of a single word to mean too many things makes it meaningless. SRE now has that distinction. You've got SREs who (mostly) write software, SREs who (mostly) stare at graphs and deal with random system failures, and now SREs who use a framework to develop a complex risk model of multiple systems (which is more quality control than engineering).
> a way for devs and sysadmins to talk about deploys together
My take on this as someone who has 14 years of dev and 7 years of ops experience: DevOps is a flawed concept.
The problem never was the lack of communication between the devs and sysadmins, it’s just a symptom.
The root cause is that the management puts pressure on the devs to innovate and deliver as fast as possible, and puts pressure on the ops to ensure that the system is stable, reliable, scalable, it has a 99.95% uptime and any issues will be solved by the on-call.
So these two groups have conflicting interests and when this leads to conflicts and arguments the conclusion is that they just don’t want to collaborate/communicate.
There are many departments at a company that have conflicting interests and can interfere with each other. If DevOps was a real thing there would be a need for LegalOps, DevSales, DevProduct, HROps, etc..
> DevOps is a flawed concept... conflicting interests... the conclusion is that they just don't want to collaborate/communicate.
Wrong! This tendency is exactly what DevOps explains is the natural state of affairs without DevOps principles. The solution that DevOps advocates for is that such conflicting interests must not be expressed in meetings (where the culture conflict ensures they will get nowhere) but rather expressed in code. Infrastructure must be in code, deploys must be in code, testing must be in code, builds must be in code, policy must be in code, and the implicit pipeline with all the handoffs between teams connecting them all must also be in code. This makes everything fast (at least in comparison to manual processes), and makes everything explicit (in code) so that people can reach outside of their natural organizational silos to propose changes elsewhere in the pipeline, i.e. infra-focused engineers can add failing tests to prove the existence of a bug, developers can add infra that they need, QA can increase infra resources to ensure that sufficient resources are available for expected scale.
The problem most organizations have is that they're not actually willing to force everyone's concerns to be written in code, and people are forbidden from reaching outside their silos. Usually this is due to poor hiring and training practice, e.g. "QA doesn't know how to write code" or "we can't let developers touch security policy". Sometimes it is due to leadership itself misunderstanding DevOps ("developers are forbidden from touching production").
> If DevOps was a real thing there would be a need for LegalOps, DevSales, DevProduct, HROps, etc.
There is such a need, and it is generally being fulfilled by the systems in place. Small example - HROps would be ensuring that changes in the organization (i.e. people moving teams) accurately results in proper loss (of old) and gain (in new) privileges. This is done with integration between HR systems (the system of record as to who reports to whom and what their responsibilities are) and Active Directory or Google Workspace / Google Groups ensuring that people are automatically moved between the groups to which permissions are granted.
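The core of that kind of HR-to-directory integration is just a set difference. Here is a sketch under loud assumptions: `plan_group_changes` is a made-up name, group names are assumed to equal team names, and the inputs are plain dicts rather than anything fetched from a real HR system, Active Directory or Google Workspace API.

```python
def plan_group_changes(hr_teams, directory_groups):
    """Compute membership changes needed to make groups match HR data.

    hr_teams:         person -> team name (the system of record)
    directory_groups: group name -> set of current members
    Returns group -> {"add": [...], "remove": [...]} for groups that differ.
    """
    desired = {}
    for person, team in hr_teams.items():
        desired.setdefault(team, set()).add(person)

    plan = {}
    for group in sorted(set(desired) | set(directory_groups)):
        want = desired.get(group, set())
        have = directory_groups.get(group, set())
        add, remove = want - have, have - want
        if add or remove:
            plan[group] = {"add": sorted(add), "remove": sorted(remove)}
    return plan
```

Producing a plan rather than applying changes directly means the loss of old privileges and the gain of new ones can be reviewed or logged before anything is touched.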
While reaching across silos seems good, in theory, my experience is there's just too much breadth and domain knowledge for it to work consistently.
Sure I know application code and have worked with a handful of frameworks, but if I'm enforcing infrastructure or performance concerns and implementing across a handful of different services, it's extremely time consuming getting up to speed in each repo and understanding the subtleties and patterns between each one.
I can optimize queries and debug performance issues but the usual roadblock is understanding what the code is _supposed_ to do and whether an optimization provides the correct results (which is not always clear from tests, assuming good ones exist)
The argument isn't so much that you can reach across and do everything independently, it's that seeing the underlying code makes collaboration easier. Sometimes it means you copy/paste links to the code into Slack to ask about them. Sometimes it means you open an MR and ask for review because you're not sure. But both of those are highly preferable to opening a ticket for a different team and waiting around for maybe someday they'll get around to tackling something that a) the Business didn't ask of them and b) they didn't open for themselves
> Wrong! This tendency is exactly what DevOps explains is the natural state of affairs without DevOps principles. The solution that DevOps advocates for is that such conflicting interests must not be expressed in meetings (where the culture conflict ensures they will get nowhere) but rather expressed in code.
Yes, there is the theory, principles, advocates, etc, etc. And there is the reality, and based on many, many years of experience as a dev, as an ops, and as a manager at companies from small to enterprise size, the reality isn't even close to this.
> Infrastructure must be in code, deploys must be in code, testing must be in code, builds must be in code, policy must be in code, and the implicit pipeline with all the handoffs between teams connecting them all must also be in code.
This is not DevOps, and you don't need DevOps for this. This is just about having an engineering mindset.
> so that people can reach outside of their natural organizational silos to propose changes elsewhere in the pipeline
Do you know how many times I have seen a developer touching Terraform code, Ansible playbooks, or pipelines described as code? I am not saying that it never happened, but it was a rare occasion.
> The problem most organizations have is that they're not actually willing to force everyone's concerns to be written in code
I managed such an enforcement and change. It did not solve the cultural and collaboration issues.
> HROps would be ensuring that changes in the organization (i.e. people moving teams) accurately results in proper loss
This is just a matter of SoPs and workflows. It has nothing to do with the topic.
No offense, but your comment is a perfect reflection of why DevOps is a flawed concept. You are talking about enforcement of everything described as code, advocates, principles, etc.
If there is a good culture and collaboration the infrastructure as code, the advocates, etc will come naturally. People will find a way to collaborate. But not the other way around, these won't fix the culture.
> If there is a good culture and collaboration the infrastructure as code, the advocates, etc will come naturally. People will find a way to collaborate. But not the other way around, these won't fix the culture.
Sure, I'll agree that DevOps is a culture change. You're right that exposing everything as code does not, in and of itself, get people to start having discussions around code or proposing changes to the code. If people are used to getting things done only by waiting for meetings and opening tickets, then the fact that every team has code in a pipeline doesn't inherently change that. But without ensuring that the code is there first, you cannot enable the cultural change; the fact that the code is available is what makes it possible for management to expect people to first go look at the code, try to understand it, have discussions around it, and even propose changes to code outside of their domain.
Without the code exposed, you can have very "collaborative" cultures. They look like people whose calendars are booked straight through for weeks, where they spend the whole meeting trying to get their heads aligned, sometimes going over basic details, where at the end, hopefully, you get a ticket opened for them or their team. You spend so much time collaborating that nobody actually has any time to get anything done.
The point wasn't that there are multiple job titles in different fields.
> I'm not sure your examples say we don't need DevOps!
The main assumption of DevOps is that the friction between developers and operations is the main cause of slow delivery.
Well, this friction can happen between dev and legal, and dev and sales, and dev and product, and ops and product, and ops and sales, and ops and finance, etc.
The companies that need DevOps have a much more deeply rooted problem: the lack of collaboration between any departments (not just dev and ops). Therefore DevOps won’t solve their problems. And companies that don’t have this problem won’t need DevOps, because they don’t have problems with collaboration.
> You've got SREs who (mostly) write software, SREs who (mostly) stare at graphs and deal with random system failures, and now SREs who use a framework to develop a complex risk model of multiple systems (which is more quality control than engineering).
This was always the case, or at least it has been going back 15 years or more, as highlighted by the so-called “Treynor curve”.
NB: The Treynor Curve is named after Ben Treynor and his ideas. Ben Treynor's name changed to Ben Sloss a few years back, and Ben Sloss is one of the authors of this article.
Without resorting to any "big-D Devops" definition, I have almost always seen devops referring to "supporting / running the code you write", and have never encountered the definition where dev and ops were 2 different roles. That was what things were like before devops, and coordination on product support and planning wasn't great, hence devops.
The reason you never encountered the second definition is two-fold:
1) There is no formal academic education behind the concept (that I'm aware of). If you do a CS major, nobody's going to explain to you the accumulated 15 years of practice and knowledge around the concept.
2) Due to 1), people just repeat what other people tell them. It's like a long game of telephone. It turns out most software development today is just a game of telephone between devs (and now AI). So almost everyone is misinformed.
The Wikipedia page for DevOps is the best generic starting point if you want to know more.
If you want to know more after that, there are a number of books and blog posts. Jez Humble, John Willis, Gene Kim, Patrick Debois, etc are the people to read. It's a much larger body of knowledge than you might think. Almost none of it has to do with devs supporting/running what they write (that's a small subset of a larger category, and there are multiple categories of 'stuff').
A lot of organizations simply renamed the functional area of "systems administration" or "systems engineering" to "DevOps," and at many of these places, "DevOps" is the new name for the group that software developers will throw stuff over the fence to.
The issue with the above names is that they can be applied to a domain or area of practices, or an organizational boundary. In a non-trivial number of organizations, "DevOps" is viewed as a support entity for one or more software development teams, versus software development teams practicing "devops."
This applies to many of the *_Ops names in fashion during the past five years or so.
After almost 10 years in the systems engineering / administration / devops / cloud etc space all I can say is:
The biggest improvement that devops brought is that it made managers feel dumb, outdated and scared because they were not "doing devops" while everybody else was, so they kinda started listening to sysadmins and what they had to say.
Uh, devops engineers did not come out of nowhere. They did not come out of the ground like mushrooms. Most if not all the "devops engineers" I know are just former sysadmins. They were already willing to do whatever devops was supposed to be, it's just that they were largely ignored.
Writing this I just realized that maybe the best way to obtain organizational change is to make management and upper management feel stupid and outdated. Interesting.
I'm an Ops type of person, so I work at companies where there is a split between the two. Ops is a skill not all developers have, or frankly even the mindset for, so you will need a team or person to do it. Generally companies don't like the cost of embedding an Ops person into every team, and that can create redundant work, so they form a DevOps/SRE team.
Never heard of dead-cat syndrome. In case anyone else wonders:
> There is one thing that is absolutely certain about throwing a dead cat on the dining room table – and I don't mean that people will be outraged, alarmed, disgusted. That is true, but irrelevant. The key point, says my Australian friend, is that everyone will shout, "Jeez, mate, there’s a dead cat on the table!" In other words, they will be talking about the dead cat – the thing you want them to talk about – and they will not be talking about the issue that has been causing you so much grief
That's the one. Only old fogies (like me) know it I guess. It was the thing we all referred to as the impetus behind DevOps, when it became a thing a decade ago.
Also, just don't try to be Google and don't microservice the crap out of everything...
The problem is that too many people in our industry are trying to get experience to land a job at Google so they try to turn every job into Google...
Although honestly I suspect that Google could do more to simplify internally, but that is the kind of work that doesn't get you promoted, while layers of additional smart-sounding complexity do.
And it really sounds to me like we've gone wrong as an industry, where you can't bolt together lego blocks and get working larger systems out of them, and have to worry about large scale spooky-action-at-a-distance effects. It is like having to worry about the interaction of your radio with your car's drive train. A simple fuse keeps the radio from killing the engine, and then the designer of the engine never has to think about the radio.
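To make the fuse analogy concrete, here is a minimal sketch of a software "fuse" (roughly what people call a circuit breaker). The `Fuse` class, `flaky_radio`, and `engine_tick` are hypothetical names made up for illustration, not from any particular library; the point is only that the caller keeps running no matter what the wrapped component does.

```python
class Fuse:
    """Hypothetical software 'fuse': after max_failures consecutive
    errors it blows and stops calling the wrapped component at all."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0
        self.blown = False

    def call(self, func, *args, fallback=None, **kwargs):
        # Once blown, never touch the faulty component again.
        if self.blown:
            return fallback
        try:
            result = func(*args, **kwargs)
        except Exception:
            # Contain the failure instead of letting it reach the caller.
            self.failures += 1
            if self.failures >= self.max_failures:
                self.blown = True
            return fallback
        # A success resets the consecutive-failure count.
        self.failures = 0
        return result


def flaky_radio():
    # Stand-in for a misbehaving, non-critical component.
    raise RuntimeError("short circuit")


def engine_tick(fuse):
    # The engine asks for music through the fuse and keeps running
    # whether or not the radio works.
    fuse.call(flaky_radio, fallback="silence")
    return "engine running"
```

With a `Fuse(max_failures=2)`, two calls to `engine_tick` both return normally and the fuse is blown afterwards, so the radio is never invoked again. A real breaker would usually also add a timeout and a way to reset, which this sketch leaves out.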
These days I wonder if Google is really the example to follow. There was a time 10 or 15 years ago where Google seemed to be leading the industry in everything, and I feel like a lot of people still think they do when it comes to engineering culture. These days I tend to see Google as a bit of a red flag on a resume, and I have a set of questions I ask to make sure they didn’t drink too much of the koolaid. Perhaps more importantly, when I look at Google from the outside these days, I see that their products have really gone downhill in terms of quality. I see Google Search riddled with spam, I see Gemini struggling to keep up with OpenAI, Google Chat trying to keep up with Slack but missing the mark, Nest being stagnant, I could go on and on. All this to say that I don’t think Google is the North Star that it used to be in terms of guiding engineering culture throughout the industry.
I would never hire a product person from Google or someone I needed to be visionary. For the most part, their products suck, they have no vision and no follow through.
But their technology is top notch. I hire mostly for startups and green field initiatives though and I wouldn’t hire anyone from any BigTech company unless I had “hard” technical problems to solve.
They have top notch tech, yes, but it’s massively overkill for literally every company that’s not at google’s scale. If you’re not careful you may hire someone who will try to replicate everything google does, when you may need only 1% of the complexity. This is the experience I’ve generally had with xooglers… they lament that they don’t have the same tools/tech stack they had at google, and so their first act is to try to move everything to the closest open source equivalents, even if they’re not a good fit.
There’s good things and bad things to take away from experience at google… you have to be careful to ignore things that won’t actually help you.
I've been the "you're not google" person for several years, but have now softened my position.
The thing is -- it depends. Sometimes when everyone knows some complex system well -- it becomes easy.
One example comes to mind -- Kubernetes. 90% of teams don't need all its complexity. And I've been the "you don't need it" person for some time. But now I see that when everyone knows it -- it's actually much easier to deploy even simple websites on it, because it's a common lingo and you don't spend time explaining how it's deployed.
It's not like civil engineering, where an over-engineered bridge would cost a lot more in materials.
Like what, Lambda? I've seen so many horrible hacks and shit done with it (and other AWS services cough API gateway cough), these days I rather prefer a set of Kubernetes descriptors and Dockerfiles.
At least that combination all but enforces people doing infrastructure-as-code, and there's (almost) no possibility at all for "had to do live hack XYZ in the console and forgot to document it or apply it back in Terraform".
You can’t edit Lambda code in the console when you deploy a Docker image to Lambda.
As far as flexibility, while there have been third party libraries that let you deploy your standard Node/Express, .Net/ASP, Python/Flask app to Lambda, now there is an official first party solution
> Also, I've witnessed people editing Lambda code through the console instead of doing a real deploy. what a mess...
Yeah, exactly that's what I am talking about. Utter nightmare to recover from, especially if whoever set it up thought they needed to follow some uber-complex AWS blog post with >10 involved AWS services and didn't document any of it.
Your response is the perfect example of my point. Each time you use "much simpler services" you still _need to explain_ the setup for the simpler services. Some people might know it, some not. E.g. some project may eventually outgrow Lambda's RAM limitations without anyone on the team having known about them. Kubernetes, meanwhile, is a one-size-fits-all setup, even if I don't like it.
And yes, I use Cloud Run myself, but only for my one-person projects. For team projects consistency is much more important (same way to access/monitor/version etc).
PS: I would say even AWS/GCP is already a huge overkill for most projects. But for some reason you didn't see exactly the same problem starting with clouds right away.
RAM is just one example. Every simpler service has its limitations, and if everyone (including new hires) knows the simpler service well -- it's perfect. E.g. in my experience everyone knew App Engine at some point and it worked well for us. Now it's a zoo of devops pieces, so I tolerate Kubernetes only because everyone kinda knows it.
And Kubernetes was just one example of my "you're not google" point. There are many more technologies that are definitely overkill but make a good common denominator, even when they're 1000x more complex than needed for the task at hand.
PS: Btw, I dunno why people downvoted your comment. It fits the HN "Guidelines" at the bottom, so upvoted.
The problem is that it can create a chain-reaction of complexity because it opens up possibilities for over-engineering. In the sense of: "Yes, it's a bit over-engineered, but k8s makes it manageable for us anyway!" - consciously or subconsciously. Whereas I'd often suspect that some restrictions on what's possible/acceptable would've created a significantly leaner overall design in the end.
> but it’s massively overkill for literally every company that’s not at google’s scale.
Before I worked at Google I was at a small telecom company that was running into limits of what some of the Dataflow/Apache Beam product could do, so we had to rewrite it (and commit it back to Beam).
There are companies that have massive scaling issues even if they're not planet scale cloud providers or something.
You can replicate a lot of Google tech now by....just using the OSS they release and/or jumping on a modern cloud provider (GCP or AWS). It's not 2012. You can use a good database and not have to reinvent it.
I agree. I haven’t run into a “hard problem” in my career.
By hard problem I mean one that is technically in the top 5% of problems in the industry and can’t be solved by throwing money at a SaaS or using a cloud provider.
Most companies are really just crud apps. Very few are doing anything technically innovative.
And that's just fine.
I wonder how much of the early Google technical innovation was more a product of open source tech/distributed systems being a lot more immature (I'm particularly thinking databases) 25 years ago.
Ultimately all companies get bloated and lose their way. It shouldn't be a surprise that this has happened to Google - 25 years on they are a megacorp and idling. Probably for the best, as it allows innovators a chance to compete.
The flip side from this assumption about their technology is that if some service is not working, people are very quick to blame the impacted (paying) user. “You are running into rate limits”, “Google is applying anti-abuse controls to your account”, and so on. But at least for some services, I strongly suspect we are actually experiencing random system failures. In my experience, it's rare to get acknowledgement of this from Google. Tickets may not even make it to them because of this pervasive “Google technology is perfect” assumption. The end result feels a bit like gaslighting (doubting our sanity because we can't spot the pattern that is supposed to be obvious): we are encouraged to attribute meaning to more or less random reactions from a complex system.
I was using "Leetcode" to mean the style of interview. The Leetcode website was founded because everyone in the industry was copying google/big tech in this style of interview.
As a Google Cloud customer, I’d say it might be best to split Google into some divisions or something, as Google Cloud’s reliability is a relative shitshow compared to Google.com.
Unless you have Google sized problems and resources, Google probably is not the best example because the things Google does are done to address Google sized problems with Google sized resources. Its tooling and methods are not commercial products.
For example, Google can get away with the flaws of its AI search results because it is Google.
I'm curious what you have in mind for evidence of "koolaid" there?
Hard to disagree with the general trend you are outlining. Most of that feels driven by product choices, more so than execution. I think a lot of the previous glorification of their work was likely misguided, as well. But I would be hard pressed to be quantitative on that.
An amusingly good quantification of some evidence. Well done! :D
Still, I don't have much to say that I think the engineering was overly good or bad. I typically think that what they captured for a short while, at least, was enthusiasm. In particular, developers were enthusiastic to be near Google technology in a way that I don't think I've seen for other companies, since.
I don't think they identified it as such, though. Which could be why they seem slow to see that a lot of that has evaporated.
Not to say that they have no enthusiasm, now. I'd wager they still have a lot. But as a percentage share of all developers, it feels very different.
The fact that some people prefer ChatGPT over Gemini is not something that SRE can help you with. The fact that ChatGPT is rarely available is something that SRE could help Microsoft avoid.
> There was a time 10 or 15 years ago where Google seemed to be leading the industry in everything
They used to write interesting books and articles about software engineering. It felt that they were maintaining high quality standards and were an industry reference. Nowadays, I wouldn't go as far as saying it's a red flag to have Google on one's resume, but definitely not the same appeal as before.
A product that either moves the needle as far as revenue and/or makes the ecosystem better. It also needs to be a product that gets continuously better as long as there is a market for it and not abandoned quickly.
- “a connected TV device”. How many cancelled lines of products have they abandoned?
How many market failures have they had in their own line of phones? The Pixels aren’t taking the world by storm, and they spent billions on Motorola and then sold it off for scraps.
They have been releasing and cancelling their own tablet initiatives for years.
At one point they had 5 separate messaging initiatives going on simultaneously.
Even today they have three operating system initiatives that are not on the same codebase - Fuchsia, Android and ChromeOS.
They have basically abandoned Flutter and don’t use it for any of their high profile apps.
What have they actually done besides ads?
And the obvious evidence is their money losing “other bets”
Site Reliability Engineering (SRE) is a discipline in the field of Software Engineering that monitors and improves the availability and performance of deployed software systems, often large software services that are expected to deliver reliable response times across events such as new software deployments, hardware failures, and cybersecurity attacks[1]. There is typically a focus on automation and an Infrastructure as code methodology. SRE uses elements of software engineering, IT infrastructure, web development, and operations[2] to assist with reliability. It is similar to DevOps as they both aim to improve the reliability and availability of deployed software systems.
SREs are programmers who specialize in writing programs that manage complex distributed systems.
If you hire SREs and have them doing sysadmin work, then (1) you're massively over-paying and (2) they'll get bored and leave once they find a role that makes better use of their skills.
If you hire sysadmins for SRE work, they'll get lost the first time they need to write a kernel module or design a multi-continent data replication strategy.
> If you hire sysadmins for SRE work, they'll get lost the first time they need to write a kernel module or design a multi-continent data replication strategy.
Ah yes, the old (incorrect) mantra of "sysadmins couldn't code". Which is ironic, as the vast majority of the abstractions that you'll interface with are written by sysadmins.
IDK, writing things like kernel modules to improve the reliability of a complex system doesn't really sound like a task sysadmins get paid for.
Yes, a lot of coding (mostly in scripting languages) is normal, mostly to automate tasks and improve visibility into the system, to make data digestible for tools like Grafana, but other optimizations seem to be out of bounds.
I’ve written kernel code to do various anti-ddos stuff, however it’s the exception for sure.
Debugging complex systems is more in the wheelhouse of sysadmins. When I came up it was a requirement for sysadmins to be proficient in C, a commandline debugger (usually gdb), the unix/linux syscall interface (understanding everything that comes out of strace for example) and perl.
Usually those perl scripts ended up becoming an orchestration/automation platform of some kind; ruby replaced perl at some point. I guess it’s python and Go now?
The modern “kernel module” requirement is more likely to be a kubernetes operator or terraform module, and the modern day sysadmin definitely writes those (the rest of the role is essentially identical, just tools got better)