A few ops lessons we all learn the hard way (2020) (netmeister.org)
197 points by Tomte on Aug 22, 2022 | 63 comments



> If a post-mortem follow-up task is not picked up within a week, it's unlikely to be completed at all.

This one is literally a law of physics.

> People give talks at conferences not to convince others that their work is awesome and totally worth the time and effort they put in, but themselves.

I would add: "If you see a big name company give a talk about some cool thing they made, it's probably already been abandoned by that company."

> Turning things off permanently is surprisingly difficult.

If you don't have a plan to sunset whatever you're building, you're basically telling your future self to go fuck himself. Unless you plan to quit first, in which case you're telling your successor to go fuck himself.

> The source you're looking at is not the code running in production.

((cries))

> Mandatory code reviews do not automatically improve code quality nor reduce the frequency of incidents.

The primary purpose of mandatory code reviews - without a sensible plan of who, when, why or how - is just for people to nitpick your code.


> I would add: "If you see a big name company give a talk about some cool thing they made, it's probably already been abandoned by that company."

Watching this happen was insane. Some team kept giving talks about how great their thing was, even though no one in the company was actually using it. Major conferences lost a lot of credibility for me after seeing that happen. Sometimes it feels like everybody is just bullshitting, maybe because it really is like that.


[unpopular opinion] It's probably not true for everybody and everything, but (tech) conferences tend to attract a lot of bullshitters on both sides (speakers and audience).


AWS re:Invent just read your comment and quietly backed away...


A "plan to sunset" something looks like an incredibly alien concept to me. How do you do it when people keep finding new uses for whatever it is?


With careful change management and access control. It's easy to turn off a system if you know for sure that nobody is using it anymore. Large companies and militaries do it all the time just fine so there is no reason a dynamic young startup shouldn't be able to do it. :)

(As a non-sarcastic response: back when I was an officer in the navy, how we would get rid of systems was most definitely taken into account from the very start. Even before we started building ships or radars or whatever, budgets and dock space would be reserved ~30 years into the future to do the decommissioning work. I do realize that this works much better for established organizations that can be reasonably sure they will exist in 30 years; after all, most startups are hard pressed to last even two. That said, completely disregarding any planning on how to grow out of your current systems seems to have been the norm at all the startups I have consulted for, and it was a major source of technical debt. IMO a technical leader should know where the skeletons are hidden in their organization's setup and roughly how many months or years remain before those skeletons no longer suffice. Then they can plan the replacement and/or upgrading of said skeletons accordingly.)


> How do you do it when people keep finding new uses for whatever it is

Assuming there is an upgrade path it does make sense to plan to turn a thing off. The thing may consume resources / facilities that could be redeployed, operators could become available for other tasks (or require retraining, etc), or the controlling organisation might need to be restructured. There might be regulatory implications / costs, such as recycling or disposal of controlled substances etc.

If a retired thing is 'pure software', disposal might be simplified, but if it has physical or facilities elements (as per military capabilities mentioned by a sibling poster), disposal can be decidedly non-trivial.


It's very hard, for exactly that reason: more of the organization will depend on it over time, making it harder to extricate yourself from it. But there are a number of things that help.

0. Ownership. Try to manipulate the business to put this project under a part of the org where it's very hard for anyone to have leverage over you, so you can fight back when they try to pressure you to keep it going with no budget or staff. (Can you put it under "finance" or "admin" or "HR"? They won't give a shit about your project and aren't responsible to the tech leadership. Sometimes "IT" is the same.)

1. Money. Assign a fixed budget that runs out after X time. Put yourself in a position that you have no way to ask for more money, so they can't keep trying to stretch your team out with no additional funding. Calculate how much it'll cost the business to try to support it past EOL and put that figure where everyone can see it.

2. Limits. Design in very specific quotas and limits that give very reliable but limited functionality. If somebody wants this to scale 100x, show them how it literally can't. Prevent stakeholders from trying to do more than is possible. If they want fewer limits, tell them to give you money out of their budget to build and staff a single-tenant version of it. They will quickly go away, probably to (poorly) build their own version of it. (A rough sketch of what baking such limits into code might look like follows this list.)

3. Disclosure. Tell all stakeholders what this thing can and can't do, that you won't be able to scale, what your SLA is, when this thing will be EOL and that they need to put on their calendars now to work to move off of it in time. Do not tell them the actual EOL date, tell them a date 6 months before the actual cut-off date. Communicate often and via various means in public places, because most people will never read anything they aren't interested in.

4. Stakeholder management. Tightly control who is using your system and what they're using it for. Document the downstream business risk. Make a big stink if somebody starts using your dinky little project with no funding for something mission critical. Remind them of how your limits and budget and SLA and design are all tied together and can't be worked around without redesigning the whole thing.

5. Transition planning. When your system goes away, something needs to take its place. At design phase, incorporate a timeline that includes a large chunk of time just for supporting getting people off the platform. Also plan for how you could offload the entire system onto some other system. Create a document that lists what a new system will need to have, so whoever is tasked with that will not build something that is impossible to transition to. At sunset time, redirect work towards the transition. Have a solid change management plan and get stakeholder sign-off.

6. Evidence. Rigorously track the value created by this thing, or the value lost by trying to maintain it past its sunset date, and all business risks. Collect hard data. You will need it later to argue to senior leadership why keeping this thing online is a terrible idea.
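
To make point 2 a bit more concrete, here is a minimal sketch (Python; every name, number, and date is made up for illustration, none of it comes from the parent comment) of what baking hard limits and an advertised EOL date directly into a small service might look like, so the ceilings are enforced by code rather than by a wiki page:

  # Hypothetical hard limits for a deliberately small internal service.
  # All names, numbers, and dates are illustrative.
  from datetime import date

  MAX_TENANTS = 20                 # refuse to onboard tenant number 21
  MAX_REQUESTS_PER_DAY = 10_000    # per tenant; a real system would reset this daily
  EOL_DATE = date(2026, 6, 30)     # the *advertised* EOL (the real cut-off comes later)

  _tenants: set[str] = set()
  _request_counts: dict[str, int] = {}

  def register_tenant(tenant_id: str) -> None:
      if tenant_id not in _tenants and len(_tenants) >= MAX_TENANTS:
          raise RuntimeError(f"Tenant cap ({MAX_TENANTS}) reached; fund a single-tenant fork instead.")
      _tenants.add(tenant_id)

  def check_request(tenant_id: str) -> None:
      if date.today() >= EOL_DATE:
          raise RuntimeError(f"Past advertised EOL ({EOL_DATE}); migrate off this service.")
      used = _request_counts.get(tenant_id, 0)
      if used >= MAX_REQUESTS_PER_DAY:
          raise RuntimeError("Daily quota exhausted; the limits are by design.")
      _request_counts[tenant_id] = used + 1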


For more on sunsetting or replacing legacy systems, check out Marianne Bellotti’s "Kill It With Fire: Manage Aging Computer Systems (and Future Proof Modern Ones)". Here's a review of the book:

https://www.usenix.org/publications/loginonline/kill-it-fire


> I would add: "If you see a big name company give a talk about some cool thing they made, it's probably already been abandoned by that company."

Alternatively, the talk should be viewed as the company recruiting someone to replace said cool thing.


> you're telling your successor to go fuck himself.

or themselves* :)


> 3. The severity of an incident is measured by the number of rules broken in resolving it.

We got around this general truth by enumerating the list of people (some by name, some by role) who could declare that a prod issue was emergent enough to authorize any action.

We based it on the concept in the aviation law 14 CFR 91.3, in relevant part: “In an in-flight emergency requiring immediate action, the pilot in command may deviate from any rule of this part to the extent required to meet that emergency.”

https://www.law.cornell.edu/cfr/text/14/91.3

The intent of adding our own version of part a and b of this law (including their origin) was to drive home the sense of responsibility and trust we placed in those employees.

It also allows us to pass SOX control audits, because our stated policies include a “these people can authorize anything they think is needed” clause, which means that when we do that, we’re still following our controls.


> was to drive home the sense of responsibility and trust we placed in those employees.

This seems like the opposite of a blameless culture: it puts the onus on individuals rather than on the system that caused the problem or could have prevented it.


Disagree, but it might be because I phrased/explained it badly. There is a set of employees E, who might be the cause of the production problem. There is a set of employees P, who might be in a position to use this emergency authorization. E contains about 1000 employees. P contains about 5 employees (including me, as it happens). Members of E can do anything the normal rules allow during an emergency. Members of P can authorize anything they think is appropriate during an emergency. Only in that limited sense is the onus on them, but that's clearly time-bound to the emergency. P's responsibility is to restore service to our customers, not to worry during the incident about how service was originally lost.

Later, in figuring out what happened, we do get to the point of who did what, when, and, if we're able, why they thought that was the best course of action. In 19 years here, and 17 since I designed this process, I've never seen a production incident or a mistake in handling one held against a member of E or P, even though we often learn something from our mistakes leading up to and during prod incidents.


> Your network team has a way into the network that your security team doesn't know about.

This one is true, and even if you right now go and try to make it not true for your company...it will become true again later.


This, along with the obvious intersection of skillsets and interests, is why I often now see either network ops and security bundled together on the same team/org, or DevOps people doing the work of both teams.


DevNetSecOps is much more sensible than most of the other ways. The audit team needs to be independent, of course.


Really good article. Some of these are subtle, and really must be learned the hard way. The only one I found myself thinking I disagreed with was "85. Multithreading is rarely worth the added complexity." Maybe I simply have yet to learn it the hard way, but of all the ways to add complexity, I have tended to find multithreading as one of the more legitimate. That being said, it has to be done in a simple, easy to reason about way. Usually for me, this means fork-joining homogeneous tasks.


A. However well you understand multithreading, you only need one coworker who doesn't understand multithreading to make your life an unending hell. B. You always have at least one coworker who doesn't completely understand multithreading. :/


multithreading to A. understand coworker who multithreading, you only need one multithreading. :/ doesn't coworker who However understand well you life an unending hell. B. You always have at least one doesn't make your completely understand


"But I only need to lock when I write!" O_o


I need this on a shirt.


Yes. If you're going to do multi-threading, let the framework/language handle the hard parts[1].

[1] It's _all_ hard parts.


> Usually for me, this means fork-joining homogeneous tasks

I think the article doesn't make the right distinction: parallelism is often worth it, concurrency is what causes the headaches. Fork-join is parallelism and generally safe and relatively easy.
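
As a concrete illustration of the fork-join pattern both commenters describe, here is a minimal sketch using Python's standard concurrent.futures (the workload is made up): independent, homogeneous tasks are forked out to worker processes and joined back into a single result, with no shared mutable state to reason about.

  # Minimal fork-join sketch: parallelism without shared mutable state.
  from concurrent.futures import ProcessPoolExecutor

  def cost(n: int) -> int:
      """Some CPU-bound work on one independent chunk (illustrative)."""
      return sum(i * i for i in range(n))

  def fork_join(chunks: list[int]) -> int:
      # Fork: each chunk goes to its own worker; no locks are needed
      # because workers never touch each other's data.
      with ProcessPoolExecutor() as pool:
          partials = pool.map(cost, chunks)
      # Join: combine the partial results in the parent.
      return sum(partials)

  if __name__ == "__main__":
      print(fork_join([100_000, 200_000, 300_000, 400_000]))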


Many of these are absolutely true. But there's also one that's flat out wrong:

> 37. Nobody knows how git works; everybody simply rm -fr && git checkout's periodically.

I know how it works. Because the first time I ever heard of git was when I had started a new job and was told the dev team were switching to it. So I spent a couple of days reading up on it and learned exactly how it works.

If you work in Ops, I suggest you do your job properly and do likewise.
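
For anyone curious what "how git works" boils down to under the hood: every loose object under .git/objects is just zlib-compressed bytes of the form "<type> <size>\0<content>", named by the SHA-1 of that same payload. A minimal sketch in Python (it only handles loose objects, not packfiles; the script name and arguments are illustrative):

  # Peek at one loose git object by hand: .git/objects/ab/cdef... holds
  # zlib(b"<type> <size>\x00<content>"), and its name is the SHA-1 of that payload.
  import hashlib
  import sys
  import zlib
  from pathlib import Path

  def read_loose_object(repo: str, sha: str) -> tuple[str, bytes]:
      path = Path(repo, ".git", "objects", sha[:2], sha[2:])
      raw = zlib.decompress(path.read_bytes())
      assert hashlib.sha1(raw).hexdigest() == sha, "content address must match the name"
      header, _, body = raw.partition(b"\x00")
      obj_type, size = header.decode().split()
      assert int(size) == len(body)
      return obj_type, body

  if __name__ == "__main__":
      # usage: python peek_git_object.py /path/to/repo <full 40-character sha>
      kind, body = read_loose_object(sys.argv[1], sys.argv[2])
      print(kind, len(body), "bytes")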


I learned how git works but without my cheat sheet and scripts I am a helpless angry baby. (guess where I keep them? git)

Still I happily award the Cryptic Weirdo Savant of the Year Trophy to anyone who can convincingly lie about memorizing that gibberish.


You clearly didn't learn how it works since you're helpless without a cheat sheet. Sounds like you memorized some stuff.

I didn't memorize it, I learned how it worked. These are very different things.


Okay let me actually return to the point. I practically spelled it out already, but why would I need a cheat sheet if I "memorized some stuff"? Why would I accuse anyone claiming to memorize this "stuff" of lying if I actually memorized it?

Let's think hard here. cue jeopardy theme

(i just had to google "jeopardy" because I can't remember how to spell it! But I know how to play!) (but they won't let me)

I mean yes I am a lazy idiot but I'm not that self-contradictory.

Okay i am done with this. Done done. Go flame.


That was mildly convincing but it's been a fairly competitive year


Cute.

I typically use a whole array of Git commands in every hour of work. I consult the manual or a how-to maybe once every other week. Re-cloning a repo happens less than once a year for me; I can't remember the last time I had to do that.

And it's not just me. I see the same for the people I work with.

Of course, for many people, learning Git is not worth the effort. But that doesn't mean people who handle it fluently don't exist.


It's not because I'm super smart, you're just super lazy.


Personal attacks are against the site rules here.


Feel like sharing your cheat sheet? I've been using git for over a decade but I find that personal cheat sheets are still a valuable resource.


An easy request to fulfill, so: https://github.com/zaboople/techknow/blob/master/git/git.txt

Although there is some garbage in there that I just ignore instead of cleaning up. Perhaps useful as a meta-tutorial on maintaining a cheat sheet though: Store it in a github-like repo of its own so that you can download it anywhere, update it, etc.

I also have an equally unhelpful but meta-helpful series of scripts prefixed with "gip-" (pig backwards) in here: https://github.com/zaboople/bin/ This is cheat sheet plus one: At some point you might as well make a script for it rather than digging it out of your cheat sheet. In fact I'm kind of pursuing this idea of "instead of one command with 1000 options, why not 1000 commands?" I think there's maybe a way out of git hell in that idea...


Some real gems in there. I can only imagine how you got to the point of worrying about umask - maybe reusing some directory that was holding old code experiments before?

As for all the grep scripts, you might want to take a look at ripgrep; it lets you filter on file type.

Thank you!


A lot of junior people have no idea how to use git. No doubt, it is confusing, especially if you've never used source control before. I've seen some seriously screwed-up git "flows". I've seen people who have no idea what a merge conflict is, or how to resolve one, so they wind up committing the conflict markers.

It wasn't much different in the subversion days, or before (CVS, anyone?)


I know, and what I'm saying is that it's not confusing if you learn how it works. If you just try to figure it out as you go along you'll end up with a mental model which is vastly more complicated than git itself is!


> The source you're looking at is not the code running in production.

My first boss's favorite words were: check, check, check, check, check. That's also the first thing I teach new engineers: most of your assumptions are completely wrong, so double-check everything.


Most debugging is simply a series of checks to discover which of your assumptions is incorrect.


I do a lot of mentoring with younger engineers. They're surprised the first few times that when we start debugging a problem, the first thing we do is check all of the things that "have to be working" instead of just trying to dig for a bug right where the error looks like it's coming from.


> That janky script you put together during the outage -- the one that uses expect(1) and 'ssh -t -t' -- now is the foundation of the entire team's toolchest.

The only thing that separates tactical from strategic in an enterprise setting is time.

Due to my frustrations with various constraints on running things in parallel (and my complete ignorance of GNU parallel), I wrote a script that does similar work in Perl (with some added custom features we needed). Against all my expectations, 16 years later, that script is still being used (unmodified).

> There's nothing wrong with Perl

Indeed.
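
For a rough idea of what such a script usually boils down to (this is not the commenter's Perl, just a hedged sketch in Python with placeholder hosts and command): fan a command out over a list of hosts with a bounded pool of workers and collect the results.

  # Rough sketch of a "run this command on many hosts, N at a time" helper.
  # The host list and command are placeholders.
  import subprocess
  from concurrent.futures import ThreadPoolExecutor

  HOSTS = ["web01", "web02", "db01"]

  def run_on(host: str, command: str) -> tuple[str, int, str]:
      proc = subprocess.run(
          ["ssh", "-o", "BatchMode=yes", host, command],
          capture_output=True, text=True, timeout=60)
      return host, proc.returncode, proc.stdout.strip()

  def fan_out(command: str, max_workers: int = 8):
      with ThreadPoolExecutor(max_workers=max_workers) as pool:
          return list(pool.map(lambda h: run_on(h, command), HOSTS))

  if __name__ == "__main__":
      for host, rc, out in fan_out("uptime"):
          print(f"{host}: rc={rc} {out}")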


> and my complete ignorance of GNU parallel [...] 16 years later

Parallel was created 2002-01-06: https://www.gnu.org/software/parallel/20th-birthday.html

It did, however, only become a GNU tool in 2011, so I think it is completely fair that you were ignorant about it 16 years ago.


Ah, that is interesting - yeah, we were on Sun Solaris at that time (which, as far as I remember, didn't have the GNU toolkit on it). Thanks for making me feel better about it :)


Quite interesting, but I don't really understand the framing with self-signed certs.

It can (and often does) go poorly, but the only thing you really gain from an external CA is a [minor] reduction in responsibility.


Ok, let's spin up the CA and create a cert for this service. Validity: 15 years.

<15 years later>

Why did this thingy break? How do I get a new cert? The CA was on a virtual machine? The old server? What is ESXi? Ok, we will just spin up this Docker container, create a new cert, and make it last 20 years...


This is the main reason Let's Encrypt gives you three-month certs - so you set up a way of automating renewal.

It’s not too hard but I wish it could be even easier. The best I’ve found is a main system that gets the wildcard cert and then transfers that around where needed.


I'm not doubting it can be done poorly

I'm just curious why an org would do that when they have a domain controller/login infrastructure providing a CA, or one of the many secret-storing engines backing the vast majority of their sensitive data (e.g. Vault)?

Policies can be defined and enforced to serve as guard rails -- we choose the nightmares we accept


Ahhh, yes I see where you got lost. You are assuming that they either:

* Have a team willing to commit to managing CAs as part of the DC/Login infrastructure.

or

* Are actually using something like Vault in their infrastructure.

Common mistake, really.

In all seriousness, you would be amazed how many places neither of those things is true for, and how much sheer effort it would take to make them true at those orgs.


If you have a team that manages a domain, it will take a month to get anything done with them, if they even allow you to get what you need done. If there is a way to make it easier, they aren't interested.


Instead we get to deal with finance for two months to order from a public CA!

Before anyone says it, yes we know about LetsEncrypt, and no we cannot use it


Indeed, I am a bit spoiled... I neglected smaller shops where 'not enough people' rings even more true.

Where I'm at, we have tons and tons of people, enough to dedicate (and form) teams for things. It admittedly skews my view quite a bit. Here... we don't have enough capable people.

I'm also a bit sour at the alternatives. We have certain (publicly visible) certificates that must go through a CA. That's an absolutely painful process that requires about six levels of finance approval -- every year!



No, 11. It took the last place I worked about 5 years to roll out a pretty good solution to a massive legacy set of servers. Though I gotta say, the super-privileged automated certificate renewer thingy does seem like a real honeypot.


I can point to examples for at least 80% of these.

Mostly not in public, though.


  Absence of a signal is itself a signal.
I’ve built a business around this one!


Once got a job offer around it (I declined).


“CAPEX budget always increases, OPEX budget always decreases.” is a great synopsis of how capitalism works.


TAI > UTC. Except for the fact that no-one uses TAI.


Almost everyone is quite ok with whatever, will treat whatever they get as UTC, will be happy to ignore the difference.

Everybody that is not on the above paragraph uses TAI.


> Everybody that is not on the above paragraph uses TAI.

To be fair, even many of us TAI users are quite ok with whatever, will treat whatever we get as TAI, will be happy to ignore the difference.

After all, if you use UTC, you're wrong, and anything that goes wrong as a result is your fault.


> and anything that goes wrong as a result is your fault

Honestly, that doesn't give me the "not care about it" vibes.


> "not care about it"

Never said that. Rather, some of us are committed to attributing blame to what actually caused the problem (namely, leap seconds and anyone who listens to the IAU).


As per #53, #37 could probably use a link to https://xkcd.com/1597/



