What I found most fascinating is that Google essentially rediscovered what is important in a sysadmin and codified the contract between feature developers and reliability roles.
Instead of feature developers feeling like they have no say in operational requirements, and instead of reliability staff fighting an unstable mess, making the contract explicit means everyone gets heard.
Contrast that with DevOps, which despite coming along later was in vogue when this book came out, and which muddied the role of sysadmin into meaning one of:
* a sysadmin practicing agile (the original definition btw)
* a software engineer with enough OS skills to carry the pager (the popular one), or
* a team consisting of sysadmins and software developers with no barrier between them (10+ deploys per day style).
Everyone had their own definition of DevOps. So when SRE came along with a clear message (sysadmins are needed, stop trying to push everything onto one person, here's how we fix the tension between teams), it was a breath of fresh air.
The only revisionist history (that even Google seems to forget) is that sysadmins could indeed write code, though it wasn't pretty and didn't have the nice things like mocks and tests. This has changed a little since 2010 at least, but it is still dire, even with Cloud making things much easier.
*EDIT:* I've gone from +4 to 0 points very, very quickly. If I have offended you, how?
Sysadmin originally meant Unix greybeards who were competent at C and could write and implement, say, Kerberos. Then in the 90s-2000s the term came to primarily refer to Windows admins clicking around Active Directory and Group Policy for large enterprises. The "sysadmins can't code" thing came from that time and the early 2010s, when all the cool startups were building on Linux and the available pool of sysadmins was largely Windows specialists. Then DevOps came along trying to get Windows sysadmins that were dabbling in Linux into modern development practices, and SRE came along trying to revitalise the old Unix greybeard style with modern software development practices.
I’m a software engineer that also does operations work. But I really love Go for operations type tasks. Simple, direct, code that’s not too verbose, with predictable performance, and fast compilation to a single executable. I’ve even taken to baking an editor and the compiler into an image with a small library of code for operational tasks, that I can quickly edit and run in the deployed environment for ad hoc operational tasks.
Sorry, docker image. We deploy everything in Kubernetes, so I can spin up this image as a container in the cluster, open a shell on it, and use the Go tools we wrote for interacting with our services to debug and diagnose what’s going on.
That may be what sysadmin meant back in the 80s and 90s but it's not what it means now. I would never describe myself, an SRE-SE, as a sysadmin because I would be describing someone whose primary job is to operate software and iterate on configuration.
SRE-SEs, on the other hand, are SWEs; they just focus on systems-adjacent software, whereas an SRE-SWE is someone who can dig into compiler-level issues and optimization. Both write application code, do analysis, and write policy. A sysadmin of today would be out of place on a team like that.
I don't think that's how the SRE role is described in the book. It shifts the boundaries between sysadmins and developers to give developers much more freedom in deciding the operational parameters of how their software is run/released, but also gives SREs the ability to push back if they get handed some piece of software that doesn't sufficiently follow the recommended patterns that make operating it mostly automated.
There's a sort of bifurcation of SRE responsibilities, with one branch focusing on software-enabled automation and the other on "systems engineering" aka sysadmin. Both are called SREs at Google, which seems to cause widespread confusion externally (and even within Google). Also see the so-called "Ben Treynor Curve"[0]
Within Google, Treynor’s curve is a hiring concept. Once in, you’re doing literally the same job. On a team doing greenfield development, it took me three years to notice that my TL, with whom I’d been defining the mathematical model and designing and implementing the system, was on the SysEng ladder.
ok, but you don't tho (i'm sure there are exceptions). if you do, you will have issues with promo (ask me how i know) so if you don't care about that then sure. also if you're se, you had to interview to transfer out of sre to a pure swe role. i personally didn't have this particular problem but saw a few folks that did.
This might be an age-related perception. I think if you're over 40, you'd consider this complementary to SREs. The role of sysadmin, as it existed in 2000, is almost unimaginable now.
That's unfortunate. I think that means they aren't aware that the job of SRE is functionally identical to a sysadmin from 2005 in terms of responsibilities and required knowledge.
We just live on a higher level of abstraction and have better tools & processes now.
Compared to 2005, we live in a society where a lot of people are very sensitive to words.
For example, "developer" is considered offensive, because for some people it's very important to be called "software engineer".
Really good developers don't care about titles.
They don't have time to worry about such, or they have so much money / experience that even if you call them "smart monkeys" they'll be happy with it.
Same goes with sysadmins, SREs, devops, or whatever role you choose.
Some people have shitty jobs where they don't get such recognition (whether for a good reason or not).
No recognition from work, from colleagues, or financially, so if you take the title and its prestige away from them, obviously they feel bad.
Source: my experience in a school calling itself an "engineering school", with all the other schools calling it a "place where you pee code"
> Really good developers don't care about titles. They don't have time to worry about such, or they have so much money / experience that even if you call them "smart monkeys" they'll be happy with it.
That's about the money, not about being good at your work. Ask anyone on the street if you can call them a rat in exchange for a million dollar salary and they'll say yes. It's quite simple.
As someone who always knew I didn't really need to know code for what I wanted to do, I'd posit that a developer comes across as someone who may not be formally trained. Maybe they're a hacker, maybe they know a language or two and dabble. A software engineer is someone who is comfortable at various levels, from understanding machine-level code to working with software patterns.
-shrug-. That's what I feel like in my org at least.
I was a sysadmin (at uni, in the early 2000s) and I am an SRE today (at Google).
The two jobs are nothing alike, at all, whatsoever.
Sysadmins are support roles. Their functional role is to provide a healthy substrate to run the application layer on top of.
SREs work at the application layer itself. If the system can't scale due to internal architecture, an SRE would be expected to propose a new, scalable design. That would be in addition to maintaining the substrate.
To be clear, there is also nothing inferior about performing a support role. No org can succeed without support.
But the two roles are not the same, and if a job's set of responsibilities doesn't include shared ownership over application layer architecture, then it can be a great job but it's not an SRE role.
This seems unlikely to me solely from a growth perspective. That is, it is possible to reach "Principal" or "Distinguished" SRE as an IC, the equivalent of a director (or similarly: SRE Director). I don't recall companies having, or desiring to have large organizations of sysadmins.
I think one of the key differences to highlight is that usually SREs are engaged early in the design process of new features, and are often driving their own feature changes to the product for reliability or scalability reasons. Those aren't responsibilities or expectations that I've really ever seen in the context of a sysadmin.
On one hand there were definitely support-staff system administrators, but on the other hand there were people writing USENIX papers. Large system administration was system engineering. The people who wrote Nagios, RRDtool, Bcfg2 and cfengine, and so on, were solving their own problems first.
I think people like Evi Nemeth, Tom Limoncelli, Æleen Frisch or David Blank-Edelman would have been the equivalent of Distinguished Sysadmins at the time. But they weren't at startups. The places that needed that level were universities, research facilities, telecommunications companies, and the like.
I was fortunate to work under Geoff Halprin early in my career, who while not as well known as those names was a SAGE and USENIX board member and who definitely planted the "don't just do task, engineer yourself out of a job" seed in me.
I think you’re right, but I want to relate a story where the agile team leader decided that devops meant “developer operations”, or in other words the development team.
> One continual challenge Google faces is hiring SREs: not only does SRE compete for the same candidates as the product development hiring pipeline, but the fact that we set the hiring bar so high in terms of both coding and system engineering skills means that our hiring pool is necessarily small.
I was thinking, ok so does this mean the book is completely useless for most companies in the world, since they don't have such standards for hiring people or run DevOps this way? How much of the rest of the book is still applicable?
Read this, even if you are far away from any actual operation of the systems you work on. Read it especially then.
Learning the principles and philosophy conveyed in that book helped me tremendously in my career (as a software engineer). Thanks to the people at Google for writing and open sourcing it.
The important bit to remember when reading is to understand the origin (why) of the concepts. I've seen Engineers being too dogmatic about the book, saying that "Google does it this way" and not being able to apply the concepts to one's own organizational context. Even at Google, there will be different teams who will deviate from the "described process" given their business context, setup, or stakeholders.
> Even at Google, there will be different teams who will deviate from the "described process" given their business context, setup, or stakeholders.
Seriously. A lot of the book was influenced by Social SRE who had opinions all out of proportion to their own importance and success. At the time, there was some doubt about whether Social's pet theories belonged in the book, considering the varying practices and beliefs of other SRE groups supporting products that people actually use.
This is related to my rule that anybody can title their doc "Best Practices" even if nobody subscribes to them.
I wasn't around Google when the book was written but after having read it and then managing a SRE team at Google my takeaway was: the book was not a documentary on how SRE works day to day. It was a loose collection of aspirational essays and fun historical stories. The reason those stories were so fun is because they are in fact still pretty unusual.
None off the top of my head but for the engineering topics in Part III I think it pays to read them as historical background material, then read the last decade of conference papers, talks, and articles.
Example: the section on backend subsetting in distributed systems is not current. If you want the current Google practice you need to read "Reinventing Backend Subsetting at Google"[1], and there are other interesting publications from other organizations.
Yeah, it is, but there's also a lot more to being an SRE than this book. This book more or less tells you how to stand up a reliability program; what it doesn't really indicate is what SREs do. A lot of people I meet think SRE is just the new title for "operator", which couldn't be further from the truth. Whether you're doing an embedding model, like the one referenced in the book, or you have a central org, both are made up of software and systems software engineers that are focused on performance and reliability. They build software, do analysis, and write policy that improve the bottom line reliability of the organization.
Not an SRE, but I think the main contribution from this book was to popularize terminology of operations (eg SLAs) and to give an opinionated perspective on how to handle operations at scale.
More practically, I don’t think the book is as useful, as it generally only makes sense when you reach a certain scale that few organizations ever do (imo).
However, we are heading into a future where computing will be everywhere and sensors in everything so in maybe a decade even the “smallest” of organizations may be responsible for large scale distributed systems and operating that would require concepts that are provided in the book.
As a non-Googler myself, it still is if you want to know how to set up an SRE team and introduce SRE (ie good sysadmin, for lack of a better word) best practices. The focus on actual indicators such as SLIs and SLOs, and the importance of reducing "toil" (boring repetitive tasks) and automating: these are all valid concerns.
Yes, but not as a checklist of things you have to do. Instead, it's a valuable discussion of lots of problems and how they were solved in specific circumstances.
The front half is for introducing ideas. The back chapters were never that great IMHO. They get too far into the weeds and at the same time miss actionable advice.
Google is planning to shut down the SRE role anyway and transition those people predominantly to SWE roles. There was already an announcement a few months back, and one of its consequences is that they are starting by reducing the numbers - https://archive.ph/YWp4O
It has indeed been a strange time for Google SRE recently. However, they're definitely not planning on shutting down SRE - at least, if you can trust Google leadership's actual explanation of what that meant.
Supposedly, the ratio of SRE to product eng had been growing slowly over the years. The KR to "readjust" that ratio was to bring it back in line with historical norms, i.e., to ensure that SRE continued to scale sub-linearly with SWE/systems. This had (primarily) two facets.
First, it gave SRE teams an effectively-blank check to reevaluate their existing dev engagements and jettison the ones that weren't working well.
Second, it pushed to eliminate old tools/systems/platforms and converge onto the more modern stuff, like Annealing [1]. Fewer crufty platforms means fewer teams needed to run them, and improvements in those platforms have broad impact.
Anecdotally, my own sub-org (within SRE) is growing at the moment. Not by a huge amount, but growing nonetheless.
This doesn't say SRE is shutting down, it says that they're changing the ratio of SRE to SWE. One thing to realise about Google is that the technology is increasingly unified across the company. 10 years ago everything worked in different ways, but now there are very standard technologies and paths, and naturally this requires fewer SREs to the SWEs developing the products. I don't think this is a bad thing, and in the layoffs SREs have not to my knowledge been hit any harder than SWEs.
I can't read tea leaves but I'm fairly confident they're not shutting down SRE. They want to get back to sublinear scaling and move away from the "devs create crap, bribe SRE (headcount) to babysit it in production". It was a major anti-pattern for the role.
Looking at the Book Updates section (https://sre.google/resources/book-update/), there's a bunch of companion articles and resources, but have there been any actual updates to the book since 2017?
Google is reducing its SRE workforce anyway and is planning to eliminate the SRE role completely. There has already been a recent announcement, and they have already started the move - https://archive.ph/YWp4O
The archive link is busted for me, but that sounds like a bad move. 90+% of SWEs are bad at and hate SRE work, and vice versa.
The rare ones that can do a mediocre job at both (and that won't burn out and switch jobs if told to do both) are usually not capable of doing an excellent job at either.
Using analogies from pretty much any other field shows how dumb it is to combine SRE and SWE, or fuse DevOps (or, god forbid, DevSecOps) into one role:
- Would you have a surgeon drive an ambulance?
- An expert car mechanic manage fleet scheduling and logistics?
- Tell a salesman to design marketing graphics, and have your graphic designer manage high-value customer accounts?
Most companies completely missed the point of SRE/PE/DevOps and keep them on separate teams doing sysadmin toil work and oncall thrown over the wall by engineers who are only concerned with feature deadlines. They regress them back to sysadmin duties and get none of the value of a true SRE program.
SRE should always be a subtitle for a SWE and not a separate position, and they should always be embedded with SWEs into one team building either products or infrastructure. The shared ownership and toil reduction only work if you have these two things.
All this said, I think the regression is also due to the fact that real SREs are rare. A solid SWE that also has deep systems domain knowledge, understanding how to sift through dashboards and live data, and root cause complex performance problems is a master of many domains and is hard to find.
The regression is also due to the fact that a real SRE is expensive. It's cheaper to just get some new grads to react to alarms, following a set runbook of what to do if that alarm triggers.
VERY few companies operate at Google's scale. For 99.99% of companies it makes sense to investigate single machine issues.
It's a totally different experience when you have the people who technically own the hardware side of the operations taking no responsibility for the well-being of it, and the people who own the software developing elaborate workarounds for bad machines, and the SREs maintaining blacklists of individual nodes.
I'm curious whether the success of Google in launching software that seems not fully developed can be attributed to their Site Reliability Engineering (SRE) practices.
Not really, the company is massive and until recently very motivated (promo) to launch new things. SREs probably helped get things across the finish line but likely didn't start those projects.
this is a dumb comment, but yes, part of the role of SREs was helping people make (and then implement) trade-offs around system deployment while deploying things that basically worked as intended.
As I understand it (from friends who were SREs in the 2010s) the really clever bit was that projects basically had a budget for "how much SRE attention your deployment needed" - so there was payoff for getting more deployment details right the first time, and structural pushback for just throwing things over the wall. Sounded like an interesting way to connect up the levers...
It seems that there may be issues with accountability within their development teams. The reliability of Google Cloud is in question, as encountering 500 errors appears to be a frequent problem. It has been observed that if one persists in retrying a request, it may eventually succeed. This suggests that their teams may have an error budget and might not take action until the issue is flagged by their Site Reliability Engineering (SRE) team.
Most of the tools I was using when my colleagues were writing the book are either gone or half-forgotten abandonware. The new tools were built for different processes, system layouts and organisational structures.
The book barely talks about tools. It wasn't about tools. The epiphany for many was the concept of an error budget and establishing SLOs. Then, basing investment in reliability on data.
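For anyone who hasn't seen the error budget idea written down, here's a minimal Python sketch of the arithmetic. The `ErrorBudget` class, its field names and the numbers are all made up for illustration, not taken from the book or any Google tool, and it assumes a simple request-based availability SLO over a fixed window.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical helper: request-based error budget for one SLO window."""

    slo_target: float     # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int   # requests served during the window
    failed_requests: int  # requests that counted as failures for the SLI

    @property
    def allowed_failures(self) -> float:
        # The budget is whatever the SLO does not promise: (1 - target) * traffic.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining(self) -> float:
        # Negative means the budget has been overspent.
        return self.allowed_failures - self.failed_requests

    @property
    def consumed_fraction(self) -> float:
        # Share of the budget already used; a 100% SLO has no budget at all.
        if self.allowed_failures == 0:
            return float("inf") if self.failed_requests else 0.0
        return self.failed_requests / self.allowed_failures

    @property
    def exhausted(self) -> bool:
        return self.remaining < 0


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"allowed failures: {budget.allowed_failures:.0f}")
    print(f"budget consumed:  {budget.consumed_fraction:.0%}")
    print(f"exhausted:        {budget.exhausted}")
```

With a 99.9% target over 1,000,000 requests that allows roughly 1,000 failures, so 250 failed requests use about a quarter of the budget. While budget remains the team can keep shipping; once it's exhausted, the data argues for spending effort on reliability instead.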