Google's SRE Book (2017) (sre.google)
179 points by udev4096 10 months ago | 84 comments



What I found most fascinating is that Google essentially rediscovered what is important in a sysadmin and codified the contract between feature developers and reliability roles.

Instead of having feature developers feeling like they have no say in operational requirements, and instead of having reliability staff fighting an unstable mess: properly making the contract means everyone gets heard.

Contrast that with DevOps, which despite coming out later was in vogue when this book came out, and which muddied the role of sysadmin into meaning either:

* a sysadmin practicing agile (the original definition btw)

* a software engineer with enough OS skills to carry the pager (the popular one), or

* a team consisting of sysadmins and software developers with no barrier between them (10+ deploys per day style).

Everyone had their own definition of DevOps. So when SRE came out and made it clear (sysadmins are needed, stop trying to push everything into one person, here's how we fix the tension between teams), it was a breath of fresh air.

The only revisionist history (that even Google seems to forget) is that sysadmins could indeed write code, though it wasn't pretty and didn't have the nice things like mocks and tests. This has changed a little since 2010 at least but it is still dire, even with Cloud making things much easier.

*EDIT:* I've gone from +4 to 0 points in a very, very rapid amount of time. If I have offended you; how?


Sysadmin originally meant Unix greybeards who were competent at C and could write and implement, say, Kerberos. Then in the 90s-2000s the term came to primarily refer to Windows admins clicking around Active Directory and Group Policy for large enterprises. The "sysadmins can't code" thing came from that time and the early 2010s when all the cool startups were building on Linux and the available pool of sysadmins were largely Windows specialists. Then DevOps came along trying to get Windows sysadmins that were dabbling in Linux into modern development practices, and SRE came along trying to revitalise the old Unix greybeard style with modern software development practices.


  * SRE are sysadmins with Go
  * DevOps are sysadmins with Python
  * Sysadmins are sysadmins with Perl
(stolen from twitter, can't find the source)


I’m a software engineer that also does operations work. But I really love Go for operations type tasks. Simple, direct, code that’s not too verbose, with predictable performance, and fast compilation to a single executable. I’ve even taken to baking an editor and the compiler into an image with a small library of code for operational tasks, that I can quickly edit and run in the deployed environment for ad hoc operational tasks.


Image?


Sorry, docker image. We deploy everything in Kubernetes, so I can spin up this image as a container in the cluster, open a shell on it, and use the Go tools we wrote for interacting with our services to debug and diagnose what’s going on.


I took their question to be shorthand for “Can you please share the image?”


Well no, if that’s the question. It’s the IP of the company I work for.


What if they have PowerShell?


That may be what sysadmin meant back in the 80s and 90s but it's not what it means now. I would never describe myself, an SRE-SE, as a sysadmin because I would be describing someone whose primary job is to operate software and iterate on configuration.

SRE-SEs on the other hand are SWEs, they just have a focus in systems adjacent software whereas a SRE-SWE is someone who can dig into compiler level issues and optimization. Both write application code, do analysis, and write policy. A sysadmin of today would be out of place on a team like that.


I don't think that's how the SRE role is described in the book. It shifts the boundaries between sysadmins and developers to give developers much more freedom in deciding the operational parameters of how their software is run/released, but also gives SREs the ability to push back if they get handed some piece of software that's not sufficiently following the recommended patterns that make operating it mostly automated.


That's exactly right.

Can you please point to the part of my comment that you think disagrees with this?


I think I might have misclicked and responded to a different comment than I intended to. You were basically saying the same thing :)


i would think most people object to calling SREs sysadmins.


As someone that has had those titles for many years in the past: I'm only going to object to the one that's paying less at the time.

A good sysadmin was always doing the same thing as a good SRE.


there's a sort of bifurcation of SRE responsibilities with one branch focusing on software-enabled automation and the other on "systems engineering" aka sysadmin. Both are called SREs at Google which seems to cause widespread confusion externally (and even within Google). Also see so called "Ben Treynor Curve"[0]

[0] - https://www.usenix.org/system/files/login/articles/login_jun...


Within Google: Treynor’s curve is a hiring concept. Once in, you’re doing literally the same job. Being in a team doing greenfield development it took me three years to notice that my TL, with whom we’ve been defining the mathematical model, designing and implementing the system, is on the SysEng ladder.


> Once in, you’re doing literally the same job

ok, but you don't tho (i'm sure there are exceptions). if you do, you will have issues with promo (ask me how i know) so if you don't care about that then sure. also if you're se, you had to interview to transfer out of sre to a pure swe role. i personally didn't have this particular problem but saw a few folks that did.


This might be an age-related perception. I think if you're over 40, you'd consider this complementary to SREs. The role of sysadmin, as it existed in 2000, is almost unimaginable now.


that's unfortunate, I think that means they aren't aware that the job of SRE is functionally identical to a sysadmin from 2005 in terms of responsibilities and required knowledge.

We just live on a higher level of abstraction and have better tools & processes now.


Compared to 2005, we live in a society where a lot of people are very sensitive to words.

For example, "developer" is considered offensive, because for some people it's very important to be called "software engineer".

Really good developers don't care about titles.

They don't have time to worry about such, or they have so much money / experience that even if you call them "smart monkeys" they'll be happy with it.

Same goes with sysadmins, SREs, devops, or whatever role you choose.

For some people they have shitty jobs: they don't have such recognition (whether for a good reason or not).

No recognition from work, no recognition from colleagues, no recognition financially, etc, that, if you remove them the title / prestige, obviously they would feel bad.

Source: my experience in a school calling itself "engineering school", and all other schools calling it a "place where to pee code"


> Really good developers don't care about titles. They don't have time to worry about such, or they have so much money / experience that even if you call them "smart monkeys" they'll be happy with it.

That's about the money, not about being good at your work. Ask anyone on the street if you can call them a rat in exchange for a million dollar salary and they'll say yes. It's quite simple.


As someone who always knew I didn't really need to know code for what I wanted to do, I'd posit that a developer comes across as someone who may not be formally trained. Maybe they're a hacker, maybe they know a language or two and dabble. A software engineer is someone who is comfortable at various levels, from understanding machine-level code to being comfortable with software patterns.

-shrug-. That's what I feel like in my org at least.


I was a sysadmin (at uni, in the early 2000s) and I am an SRE today (at Google).

The two jobs are nothing alike, at all, whatsoever.

Sysadmins are support roles. Their functional role is to provide a healthy substrate to run the application layer on top of.

SREs work at the application layer itself. If the system can't scale due to internal architecture, an SRE would be expected to propose a new, scalable design. That would be in addition to maintaining the substrate.

To be clear, there is also nothing inferior about performing a support role. No org can succeed without support.

But the two roles are not the same, and if a job's set of responsibilities don't include shared ownership over application layer architecture, then it can be a great job but it's not an SRE role.


This seems unlikely to me solely from a growth perspective. That is, it is possible to reach "Principal" or "Distinguished" SRE as an IC, the equivalent of a director (or similarly: SRE Director). I don't recall companies having, or desiring to have large organizations of sysadmins.

I think one of the key differences to highlight is that usually SREs are engaged early in the design process of new features, and are often driving their own feature changes to the product for reliability or scalability reasons. Those aren't responsibilities or expectations that I've really ever seen in the context of a sysadmin.


On one hand there were definitely support-staff system administrators, but on the other hand there were people writing USENIX papers. Large system administration was system engineering. The people who wrote Nagios, RRDtool, bcfg2 and cfengine, and so on, were solving their own problems first.

I think people like Evi Nemeth, Tom Limoncelli, Æleen Frisch or David Blank-Edelman would have been the equivalent of Distinguished Sysadmins at the time. But they weren't at startups. The places that needed that level were universities, research facilities, telecommunications companies, and the like.

I was fortunate to work under Geoff Halprin early in my career, who while not as well known as those names was a SAGE and USENIX board member and who definitely planted the "don't just do task, engineer yourself out of a job" seed in me.


I think you’re right, but I want to relate a story where the agile team leader decided that devops meant “developer operations”, or in other words the development team.

Sometimes all you can do is shake your head.


The SRE book is a little more advertisement of Google's internal systems than real actionable advice outside of Google for SREs.

There is some generally useful stuff in there, but it probably fits in a few pages vs a full book.


Only a few paragraphs into the book:

> One continual challenge Google faces is hiring SREs: not only does SRE compete for the same candidates as the product development hiring pipeline, but the fact that we set the hiring bar so high in terms of both coding and system engineering skills means that our hiring pool is necessarily small.

I was thinking, ok so does this mean the book is completely useless for most companies in the world, since they don't have such standards for hiring people or run DevOps this way? How much of the rest of the book is still applicable?


Most SWEs are SREs, they just don't realize it. If you're on call, you're a SRE.


The second book is actually actionable, according to coauthors.


By second book, do you mean the SWE book? https://abseil.io/resources/swe-book

It was written by titans on the SWE ladder at Google, fairly disconnected from the SRE book.



This seems easier to digest, but it's a long way off from The Practice of System and Network Administration of yesterday.


It’s really an incredible marketing piece


Related bunch of discussion from a few months ago:

Lessons Learned from Twenty Years of Site Reliability Engineering

https://news.ycombinator.com/item?id=38037141


Thanks! Macroexpanded:

Lessons Learned from Twenty Years of Site Reliability Engineering - https://news.ycombinator.com/item?id=38037141 - Oct 2023 (124 comments)

Google Online SRE Books - https://news.ycombinator.com/item?id=31373170 - May 2022 (11 comments)

What Is ‘Site Reliability Engineering’? - https://news.ycombinator.com/item?id=14153545 - April 2017 (86 comments)

Site Reliability Engineering - https://news.ycombinator.com/item?id=13503161 - Jan 2017 (111 comments)

Notes on Google's Site Reliability Engineering Book - https://news.ycombinator.com/item?id=11474002 - April 2016 (93 comments)


Read this, even if you are far away from any actual operation of the systems you work on. Read it especially then.

Learning the principles and philosophy conveyed in that book helped me tremendously in my career (as a software engineer). Thanks people at Google for writing and open sourcing it.


The important bit to remember when reading is to understand the origin (why) of the concepts. I've seen Engineers being too dogmatic about the book, saying that "Google does it this way" and not being able to apply the concepts to one's own organizational context. Even at Google, there will be different teams who will deviate from the "described process" given their business context, setup, or stakeholders.


> Even at Google, there will be different teams who will deviate from the "described process" given their business context, setup, or stakeholders.

Seriously. A lot of the book was influenced by Social SRE who had opinions all out of proportion to their own importance and success. At the time, there was some doubt about whether Social's pet theories belonged in the book, considering the varying practices and beliefs of other SRE groups supporting products that people actually use.

This is related to my rule that anybody can title their doc "Best Practices" even if nobody subscribes to them.


I wasn't around Google when the book was written but after having read it and then managing a SRE team at Google my takeaway was: the book was not a documentary on how SRE works day to day. It was a loose collection of aspirational essays and fun historical stories. The reason those stories were so fun is because they are in fact still pretty unusual.


Can you please expound on what Social SRE is? Is it some public handle of an engineer, a name for an internal group, or something else?


the SRE teams that worked around various Google "Social" [media] products


Can you name a better book?


None off the top of my head but for the engineering topics in Part III I think it pays to read them as historical background material, then read the last decade of conference papers, talks, and articles.

Example: the section on backend subsetting in distributed systems is not current. If you wanted the current Google practice you need to read "Reinventing Backend Subsetting at Google"[1], and there are other interesting publications from other organizations.

1: https://queue.acm.org/detail.cfm?id=3570937
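For anyone who hasn't seen it, here's a minimal sketch of the older deterministic style of subsetting, as I remember the book describing it. This is not the newer approach from the linked article. The function name, the parameter names, and the use of a per-round `random.Random` instance are my own assumptions for illustration.

```python
import random


def deterministic_subset(backends, client_id, subset_size):
    """Pick a fixed-size slice of backends for one client (illustrative only)."""
    # How many disjoint subsets fit into the backend list.
    subset_count = len(backends) // subset_size
    if subset_count == 0:
        raise ValueError("subset_size is larger than the number of backends")

    # Clients are grouped into rounds; every client in a round shares a shuffle.
    round_id = client_id // subset_count
    shuffled = list(backends)  # copy so the caller's list is not mutated
    random.Random(round_id).shuffle(shuffled)

    # Within a round, each client takes a different slice of the shuffled list.
    subset_id = client_id % subset_count
    start = subset_id * subset_size
    return shuffled[start:start + subset_size]


if __name__ == "__main__":
    backends = [f"backend-{i}" for i in range(12)]
    for client_id in range(4):
        print(client_id, deterministic_subset(backends, client_id, 3))
```

The point of the shared shuffle is that clients in the same round get non-overlapping slices, so load spreads evenly across backends when the client count is a multiple of the subset count.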


Can any SREs tell us how applicable this book is today? Is it still a useful read?


Yeah, it is, but there's also a lot more to being a SRE than this book. This book more or less tells you how to stand up a reliability program, what it doesn't really indicate is what SREs do. A lot of people I meet think SRE is just the new title for "operator" which can't be farther from the truth. Whether you're doing an embedding model, like is referenced in the book, or you have a central org - both are made up of software and systems software engineers that are focused on performance and reliability. They build software, do analysis, and write policy that improve the bottom line reliability of the organization.


Not an SRE, but I think the main contribution from this book was to popularize terminology of operations (eg SLAs) and to give an opinionated perspective on how to handle operations at scale.

More practically, I don’t think the book is as useful, as it generally only makes sense when you reach a certain scale that few organizations ever do (imo).

However, we are heading into a future where computing will be everywhere and sensors in everything so in maybe a decade even the “smallest” of organizations may be responsible for large scale distributed systems and operating that would require concepts that are provided in the book.


As a non Googler myself, it still is if you want to know how to set up an SRE team and introduce SRE (ie good sysadmin, for lack of a better word) best practices. The focus on actual indicators such as SLI and SLO, the importance of reducing "toil" (boring repetitive tasks) and automating,... these are all valid concerns.

If you want more about system design and how to design reliability, I suggest reading https://google.github.io/building-secure-and-reliable-system...


yes, but not as a checklist of things you have to do, instead it's a valuable discussion of lots of problems and how they were solved in specific circumstances.

learn from it, don't copy from it.


The front half is for introducing ideas. The back chapters were never that great IMHO. They are both too in the weeds and at the same time missing actionable advice.


Google is planning to shut down the SRE role anyway and transition them predominantly to the SWE role. There was an announcement a few months back, and one of the consequences is to start by reducing the numbers - https://archive.ph/YWp4O


It has indeed been a strange time for Google SRE recently. However, they're definitely not planning on shutting down SRE - at least, if you can trust Google leadership's actual explanation of what that meant.

Supposedly, the ratio of SRE to product eng had been growing slowly over the years. The KR to "readjust" that ratio was to bring it back in line with historical norms, i.e., to ensure that SRE continued to scale sub-linearly with SWE/systems. This had (primarily) two facets.

First, it gave SRE teams an effectively-blank check to reevaluate their existing dev engagements and jettison the ones that weren't working well.

Second, it pushed to eliminate old tools/systems/platforms and converge onto the more modern stuff, like Annealing [1]. Fewer crufty platforms means fewer teams needed to run them, and improvements in those platforms have broad impact.

Anecdotally, my own sub-org (within SRE) is growing at the moment. Not by a huge amount, but growing nonetheless.

[1]: https://www.usenix.org/publications/loginonline/prodspec-and...


This doesn't say SRE is shutting down, it says that they're changing the ratio of SRE to SWE. One thing to realise about Google is that the technology is increasingly unified across the company. 10 years ago everything worked in different ways, but now there are very standard technologies and paths, and naturally this requires fewer SREs to the SWEs developing the products. I don't think this is a bad thing, and in the layoffs SREs have not to my knowledge been hit any harder than SWEs.


I can't read tea leaves but I'm fairly confident they're not shutting down SRE. They want to get back to sublinear scaling and move away from the "devs create crap, bribe SRE (headcount) to babysit it in production". It was a major anti-pattern for the role.


Ofc everyone downsizing... smh


In looking at the Book Updates section (https://sre.google/resources/book-update/) there's a bunch of companion articles and resources but has there been any actual updates to the book since 2017?


The other books.


Does it have a chapter on how to deal with end-user support?


For more information on Google's end-user support, please post on the Community Forum.


SREs are not support agents. That said greater signalling from that area over to us would be a good thing.


I think Google's advice is: don't


Google is reducing the SRE task force anyway and is planning to completely eliminate the SRE role. There has been a recent announcement and they have already started the move - https://archive.ph/YWp4O


The archive link is busted for me, but that sounds like a bad move. 90+% of SWEs are bad at and hate SRE work, and vice versa.

The rare ones that can do a mediocre job at both (and that won't burn out and switch jobs if told to do both) are usually not capable of doing an excellent job at either.

Using analogies from pretty much any other field shows how dumb it is to combine SRE and SWE, or fuse DevOps (or, god forbid, DevSecOps) into one role:

- Would you have a surgeon drive an ambulance?

- An expert car mechanic manage fleet scheduling and logistics?

- Tell a salesman to design marketing graphics, and have your graphic designer manage high-value customer accounts?


That’s not what your link says, did you read what you posted?


Most companies completely missed the point of SRE/PE/DevOps and keep them on separate teams doing sysadmin toil work and oncall thrown over the wall by engineers who are only concerned with feature deadlines. They regress them back to sysadmin duties and get none of the value of a true SRE program.

SRE should always be a subtitle for a SWE and not a separate position, and they should always be embedded with SWEs into one team either building products or infrastructure. The shared ownership and toil reduction only works if you have these two things.

All this said, I think the regression is also due to the fact that real SREs are rare. A solid SWE that also has deep systems domain knowledge, understanding how to sift through dashboards and live data, and root cause complex performance problems is a master of many domains and is hard to find.


The regression is also due to the fact that a real SRE is expensive. It's cheaper to just get some new grads to react to alarms following a set runbook of what to do if that alarm triggers.

VERY few companies operate at Google's scale. For 99.99% of companies it makes sense to investigate single machine issues.


Google SREs also end up investigating single machine issues, fyi.


Yes, but At Scale®

It's a totally different experience when you have the people who technically own the hardware side of the operations taking no responsibility for the well-being of it, and the people who own the software developing elaborate workarounds for bad machines, and the SREs maintaining blacklists of individual nodes.


In my experience it's fun to do that but only worth it when SLOs are on the line (so a significant number of bad machines).


I'm curious whether the success of Google in launching software that seems not fully developed can be attributed to their Site Reliability Engineering (SRE) practices.


Not really, the company is massive and until recently very motivated (promo) to launch new things. SREs probably helped get things across the finish line but likely didn't start those projects.


> Not really, the company is massive and until recently very motivated (promo) to launch new things.

What changed?


Layoffs and a change in the performance management system (moving away from perf).


> change in the performance management system (moving away from perf)

This sounds... confusing. They moved away from performance?


this is a dumb comment, but yes, part of the role of SREs was helping people make (and then implement) trade-offs around system deployment while deploying things that basically worked as intended.


As I understand it (from friends who were SREs in the 2010s) the really clever bit was that projects basically had a budget for "how much SRE attention your deployment needed" - so there was payoff for getting more deployment details right the first time, and structural pushback for just throwing things over the wall. Sounded like an interesting way to connect up the levers...


It seems that there may be issues with accountability within their development teams. The reliability of Google Cloud is in question, as encountering 500 errors appears to be a frequent problem. It has been observed that if one persists in retrying a request, it may eventually succeed. This suggests that their teams may have an error budget and might not take action until the issue is flagged by their Site Reliability Engineering (SRE) team.


There's still good advice in the book, but be aware it was published in 2016, with folks likely having started writing it around 2014.

Both Google and SRE/DevOps have advanced greatly since then, and following the book blindly would be cargo culting.

Edit: apparently this is a controversial opinion?


How has it advanced greatly since then?


Most of the tools I was using when my colleagues were writing the book are either gone, or half-forgotten abandonware. The new tools were built for different processes, system layout and organisational structures.


Book barely talks about tools. It wasn't about tools. The epiphany for many was the concept of an error budget and establishing SLOs. Then, basing investment in reliability on data.

That's as applicable today as it was then.
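To make the error-budget idea concrete, here's a minimal sketch, assuming a request-based availability SLO. The function name and the numbers are made up for illustration and aren't taken from the book.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Return the unspent fraction of the error budget (negative if overspent).

    slo_target is the fraction of requests that must succeed, e.g. 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical window: 1M requests, 250 failures, 99.9% SLO.
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"{remaining:.0%} of the error budget is left")
```

With a 99.9% target over a million requests the budget is about 1,000 failed requests, so 250 failures leaves roughly three quarters of it. That remaining fraction is the data you'd base the "ship features or invest in reliability" decision on.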


So, what exactly do you propose as an alternative?


What contemporary book do you recommend?



