A practical guide to incident management (incident.io)
132 points by lawrjone on June 18, 2022 | 36 comments



It should be noted that you should not handle security incidents the way you would outages. I struggle to get this across to folks who came from an IT background all the time.

A lot of things about security incidents feel unintuitive because you are not solving a technical problem; you are addressing compromised risk and malicious human intent. For example, depending on the scenario it can be perfectly acceptable to cause an outage for seemingly trivial reasons, or to let an attacker roam free while you monitor their progress. KPIs carried over from IT to security have similar issues: management measures how long it takes to eradicate a threat as if it were an IT outage, when you should take as much time as you need to eradicate it (ideally, though not always, with the threat contained first), because understanding the scope of the compromise and the attacker's intent matters more. Meanwhile, more meaningful KPIs like dwell time are ignored.


Hey Hacker News!

I work at incident.io, where we’re building an incident management platform that helps your organisation deal with incidents at scale.

Today, we’re launching a set of resources, which we’re calling the practical guide to incident management, intended to help those building their own incident processes.

It includes chapters on:

- On-call, what it means, who’s involved, and how it should be compensated

- Foundations, defining an incident, and advising on how you measure them

- Response, for how to nail actually responding to the incident

- Learning & Improving, with pragmatic tips on how you can level up and learn from your past incidents

If you want to adopt pragmatic, battle-tested incident processes, or are looking to fine-tune an existing process, this guide is for you.

For those familiar with them, this guide differs from the Atlassian and PagerDuty response content in that it:

- Acknowledges the way companies are working today: distributed teams, reliant on tools like Slack.

- Focuses on process and practice outside of just responding to an incident

- Provides actionable advice: real steps you can take right now

- Extends beyond engineering, reflecting our belief that incidents are an entire company affair

Very complementary content, but with different aims.

We hope you find it useful!

https://incident.io/guide


Last time I worked on call I was compensated 2 hours overtime every time I opened the laptop, as well as a fixed rate for the week. I was also able to convert some of the fixed rate to days off the week after. Still, having to wake up multiple times each night wasn't worth any compensation to me. I suppose you're correct that compensation should be tailored to each situation, but the examples you give seem way too low to me. $350 is nothing if there's any meaningful number of incidents to handle. My company paid twice that, 10 years ago. I guess it would be ok if nothing really ever happens.


Someone said similar in another thread: https://www.reddit.com/r/devops/comments/ve9jge/how_do_you_c...

My answer is yes, the on-call compensation is more about keeping the dynamics of on-call healthy than it is about materially impacting your total comp.

I've worked in places where on-call pay was material before, and while it's nice, it comes with disadvantages too. As an example, you try hard to keep shifts balanced because if someone covers you and you don't do a shift that month, you notice it in your paycheck.

That's not always a good thing, if you want to encourage people to switch shifts to prioritise their health and reliable on-call cover.


In my most recent role I wasn't on-call, but would keep an eye on monitors out of hours and, if around, fix issues as they arose (I guess it was unpaid, but I would just take hours off in lieu). That's perfectly fine to me: I'm well paid, should (probably!) have written more resilient code the first time around, and if I'm not doing anything much anyway then it's not that big of a deal to hop online to fix an issue.

I would never accept on-call work (even if paid), even if no incidents ever happened, because of the imposition it puts on your lifestyle. Realistically it means I won't be able to attend church on the Sundays I'm on, go for a long run or a hike (if we have one of the only nice Saturdays of the year), or even visit family (some of my family have a terrible internet connection!), because even if nothing happens there's the risk something _might_ happen: and if it did, I wouldn't be able to respond if I were doing one of those activities. And for me, no amount of money would be enough to compensate for that.


It's hard to assess on-call comp without the context of pager load, stage of company, and a myriad of other factors. We've done some market research though, and have some interesting results we'll be sharing soon! (see https://twitter.com/incident_io/status/1526191169054597120)

I take your point though, and if being on-call is having any meaningful impact on the time you spend working outside of hours, I agree $350 is low. For us, we're an early-stage startup with low numbers of alerts (long may it stay that way!) and the impact is low.

Not to try and defend it too much, but it's worth pointing out that the $350 is there to cover the inconvenience of being 30 mins from a laptop, rather than compensating for the time. We also give folks time off in lieu for any time they spend working outside of hours.


As an example: I love to go for a swim with my son, but this means I'm unreachable for 1.5 hours. Even with 0 incidents during an on-call shift, it severely limits these choices. This kind of thing is normal for most people.


Even without waking up or handling an issue, a silent on-call shift still carries psychological tension and should be compensated with time off.


If you were having to wake up multiple times per night, your on-call system is completely out of whack.


We are discussing this at the place I work at the moment.

There is something I've never known the answer to, perhaps someone here who knows more about incidents can help.

I have joined a small team. There are perhaps 2-3 devs who can actually fix issues. There are lots of issues, e.g. one incident a week. We're obviously trying to fix the software, but every time something goes wrong it's a new problem. We have 100+ microservices interacting at 1k req/second, on a 20-year-old codebase where most of the services were written by people who are no longer at the company, so although we're trying and making improvements, and I'm doing my best to help, it's not going to happen overnight.

Which brings us back to incident support. We're too small a team to have an on-call rota (i.e. with 3 people everyone would have to be on call once every 3 nights, apart from when people are on vacation or sick, when the remaining people would have to be on call even more). So the management's idea is just that everyone has Slack on their phone and whoever is available jumps in, either when an alert goes off or when someone (e.g. support) writes that something isn't working. "Professional jobs aren't just 9-5, they do involve overtime sometimes". To me this just feels like being permanently on-call, which I don't want (burnout risk etc.). It might be different if there was an incident once a year but it's much more frequent than that.

So I know what I think we shouldn't do, but what should I advise the company that we should do?

We can't fix all the issues overnight, every incident seems different, we have too few people to have a rota, and the software is too complex to outsource operations to an external team (e.g. the last fix, last night, was one of my colleagues deploying a quick hack to the codebase to turn off one feature, then building and deploying the code), yet it's an international consumer website that should be available 24/7.


1. Collect data. It will help your case when you talk to management. How many incidents, when, what they relate to, how much of the team's time is spent on firefighting, etc.

2. Read Google’s SRE book for ideas. For example, read about error budgets in this chapter: https://sre.google/sre-book/embracing-risk/
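To make the error-budget idea concrete, here's a rough back-of-the-envelope sketch (the SLO and window are purely illustrative, not taken from the book):

    # An SLO of 99.9% availability over a 30-day window leaves 0.1% of that
    # window as the "budget" you're allowed to burn on incidents.
    slo = 0.999
    window_minutes = 30 * 24 * 60               # 43,200 minutes in a 30-day month
    budget_minutes = window_minutes * (1 - slo)
    print(budget_minutes)                       # ~43.2 minutes of downtime per month

Once the budget is spent, new feature work pauses until reliability recovers, which is the backpressure mechanism described below.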

If the goal is to have a reliable system, then the ultimate creators of unreliable systems must be made to feel the pain themselves.

For example, product can be made to “feel the pain” by not being allowed to have any new features while an engineering team fixes user-facing or business-critical bugs ultimately caused by an endless plethora of new features being requested by product on short timescales.

In one word: backpressure.

Of course, this relies on product and/or management buying into the idea, which is why data can help you argue your case. The rest of your organisation might be more receptive if you can show them how much time is being spent on incidents and bug whack-a-mole.


I agree with collecting data. Monitoring tooling is extremely important. With microservices you need metrics (requests, errors, latency), logs, traces. Every service will have specific signals that the service owner should understand.
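For what it's worth, those request/error/latency signals can start very small. A minimal sketch using the Python prometheus_client (the service name, route, and do_work handler are made up for illustration):

    from prometheus_client import Counter, Histogram, start_http_server

    # RED-style signals: rate, errors (via the status label), duration.
    REQUESTS = Counter(
        "http_requests_total", "Total HTTP requests", ["service", "route", "status"]
    )
    LATENCY = Histogram(
        "http_request_duration_seconds", "Request latency", ["service", "route"]
    )

    def handle_checkout(request):
        with LATENCY.labels("checkout", "/pay").time():
            status = do_work(request)   # hypothetical business logic
        REQUESTS.labels("checkout", "/pay", str(status)).inc()
        return status

    start_http_server(9100)  # exposes /metrics for Prometheus to scrape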

The SRE book is an OK idea but to be honest, it's hard to get developers to focus on Google-level theory of reliable engineering via error budgets when they're so operationally immature that nobody is on-call (so everybody is on-call). Sounds like the gap is too wide. If these alerts are 50% "customer reported an incident" instead of an SLI, that SRE book could spawn a cargo cult that creates work for everybody and achieves nothing. I've seen clueless managers latch onto theory that's too far ahead of the stone age. Like "Hey guys, this month I want everybody to record their SLOs", when nobody is even recording metrics because they don't know how, or what to record.

I'd do extremely boring stuff like keep a spreadsheet of incidents with columns for severity, trigger (eg release, operator error, customer load, hidden bug, system error), and root cause (eg internal software bug, bad config, system resources, etc). If patterns emerge, they should point to gaps in knowledge, testing, automation, monitoring, etc. If there's no pattern at all, then your engineering org is in a bad way. When an engineering department is extremely dysfunctional, it needs external help. Or sometimes it needs a clueless manager / lead to quit, allowing suppressed talent to take over and fix stuff.
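If the spreadsheet gets tedious, the same log is easy to keep as a CSV and tally for patterns. A sketch (the column names mirror the ones suggested above, the rest is made up):

    import csv
    import os
    from collections import Counter

    FIELDS = ["date", "severity", "trigger", "root_cause", "minutes_spent"]

    def log_incident(row, path="incidents.csv"):
        """Append one incident to the log, writing the header on first use."""
        write_header = not os.path.exists(path)
        with open(path, "a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDS)
            if write_header:
                writer.writeheader()
            writer.writerow(row)

    def top_patterns(path="incidents.csv", n=3):
        """Most common (trigger, root_cause) pairs, i.e. where the recurring gaps are."""
        with open(path) as f:
            rows = list(csv.DictReader(f))
        return Counter((r["trigger"], r["root_cause"]) for r in rows).most_common(n)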


When you say there are 2-3 devs who can actually fix issues, is that because there are only 2-3 devs, or do you have more people available who just don't know how to fix things?

Assuming the latter, my suggestion would be:

1) Set up a weekly on-call rotation, where one of the 3 people who can fix issues is the dedicated on-call (this means the other 2 can properly switch off)

2) Find some other folks who might not be able to support immediately, but are happy to learn.

3) Configure a shadow rotation so there's always a primary responder who knows what they're doing and a shadow responder who can learn on the job.

4) Make it clear the shadow person has the same expectations in terms of turning up, but their job is to ask questions, write things down, etc.
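To make the pairing concrete, a rough sketch of generating such a rota (the names, start date, and week count are placeholders):

    from datetime import date, timedelta
    from itertools import cycle

    # Hypothetical rosters: the 3 people who can fix things, plus learners.
    primaries = ["alice", "bob", "carol"]
    shadows = ["dan", "erin", "frank", "grace"]

    def build_rota(start: date, weeks: int):
        """Pair each week's primary with a shadow, cycling both lists independently."""
        p, s = cycle(primaries), cycle(shadows)
        return [(start + timedelta(weeks=w), next(p), next(s)) for w in range(weeks)]

    for week_start, primary, shadow in build_rota(date(2022, 7, 1), 6):
        print(f"{week_start}: primary={primary}, shadow={shadow}")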

I did this in a former job (and wrote about it here https://monzo.com/blog/2018/09/20/on-call) and it worked pretty well. After a few months we had a healthy rotation of 8 on-callers, and a waitlist of shadow folks who wanted to join too.

This advice obviously came from a point of near-zero context, so if you'd like to chat more I'd be happy to!


Did something similar and that program turned into a milestone on many engineers’ promotion journey. (It wasn’t mandatory, but most everyone who made a certain level had successfully completed the program, because you learned so much in it and got such broad exposure from it.)


On-call rota: 1 week on, n weeks off, but make sure it's compensated.

The primary duty of on-call is triage: pushing all non-critical issues to work hours, and taking the issues they do deal with only through mitigation, not resolution.


Define an SLA. Classify incidents by priority. Not everything must be solved immediately. Hire a developer in a different time zone so their daytime covers your night. Make your architecture more resilient. Identify the most impactful improvements and stop implementing new features until they are done. Compensate well for the overtime. Ultimately, run away.
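As a purely illustrative example of the priority classification (the numbers are invented, not a recommendation):

    # Only the highest priority should page anyone outside working hours.
    SLA = {
        "P1": {"ack": "15 min", "cover": "24x7",              "example": "site down, payments failing"},
        "P2": {"ack": "1 hour", "cover": "business hours",    "example": "degraded feature, workaround exists"},
        "P3": {"ack": "next business day", "cover": "backlog", "example": "cosmetic bug"},
    }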


> So the management's idea is just that everyone have slack on their phone and whoever is available jumps in either when an alert goes off

Is this a 24x7 thing? If so, this is the worst idea ever. I would quit immediately. Screw that manager. Honestly. This is clueless, and it shows a lack of care for the whole team.


Whoever handled an issue outside working hours on day x gets a day off on day x+1; otherwise you will burn out.


Hey folks! One of the founders of incident.io here, and the one whose face is on the front cover :) I've written this content so many times at companies I've worked at in the past (though not to this quality!) so it's really nice to be able to share this and avoid others having to start from scratch.


This tool looks great - it automates a lot of the process we have for major incidents at my company!

One feature I was hoping this would have, related to on-call, but for minor week-to-week incidents:

My team gets many low priority alerts every week that need to be triaged, worked on, or ignored (something on the scale of 10-20 alerts per week, often recurring, often firing with slightly different tags).

These alerts are "symptoms" of something wrong with the system, and the on-call engineer always has to manually sort them into "causes". A "cause" is then tracked and worked on, with investigation notes etc.

I've seen this workflow at a number of companies, and it is always handled with poor automation or none. On-call engineers manually copy-paste the PagerDuty alert URL into a spreadsheet (or Google Doc), manually assign a status tag, manually group recurring alerts into the same row, and manually assign engineers (and the assigned engineers have to manually ack being assigned!).

Most of the time, the "symptom-grouping" decisions have to be made by a human, since a lot of the time it requires investigation to determine the link between two symptoms. But just because that one decision has to be made by a human, doesn't mean that all the bookkeeping shouldn't be automated!
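To the point about automating the bookkeeping around that one human decision, a rough sketch of what the plumbing could look like (the fingerprinting rule and the Cause shape are assumptions, not any particular vendor's API):

    from __future__ import annotations

    from dataclasses import dataclass, field

    @dataclass
    class Cause:
        title: str
        assignee: str | None = None
        status: str = "investigating"
        alert_urls: list[str] = field(default_factory=list)

    # The one human decision: which alert fingerprint belongs to which cause.
    FINGERPRINT_TO_CAUSE: dict[str, Cause] = {}

    def fingerprint(alert: dict) -> str:
        """Group alerts that differ only in noisy tags (hosts, pods, etc.)."""
        noisy = {"host", "pod"}
        stable = sorted((k, v) for k, v in alert["tags"].items() if k not in noisy)
        return alert["name"] + "|" + ",".join(f"{k}={v}" for k, v in stable)

    def ingest(alert: dict) -> Cause | None:
        """Auto-file recurrences under their cause; return None if a human still needs to triage."""
        cause = FINGERPRINT_TO_CAUSE.get(fingerprint(alert))
        if cause:
            cause.alert_urls.append(alert["url"])
        return cause

Everything after the human maps a fingerprint to a cause once (status changes, assignment, recurrence counting) can then happen without copy-pasting into a spreadsheet.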

With all these manual steps, there are so many places for human error.

We would pay so much money for a tool that did this for us.


The industry terms are "incident management" (symptoms) and "problem management" (root causes), which might be helpful searching for existing tools.


This is all nice and great, but it needs to be said that this guide is tailored for small-ish companies, mostly tech-oriented and primarily customer-facing. As someone who is working in a relatively big company (1500 employees in Europe) and worked in a corporate before, you need much more defined processes and definitions there. Anyone who has ever dealt with ITIL and is reading this must be very confused.

What also needs to be said: in my time I have seen orthodox ITIL bros, who got all their certificates and wanted to transform every ITSM process to the letter of ITIL; these people would probably get a panic attack from this guide. Nowadays I am seeing Agile bros, who want to do everything the Agile way, replacing one mess with another, just for the sake of being up to date with the newest frameworks.

I like that people at incident.io tailored the Incident Management to their needs and are not sticking to some predefined "rules" which are popular now. What works for them might of course not work for the rest.


I don't disagree with your points, and having worked at massive IT organisations, tiny startups, and scale-ups in between, it's clear that different organisations need different levels of rigour and process!

I see the guide less as a set of dogmatic rules to follow, and more of: a) a set of sensible defaults for small to medium size orgs who are starting with little-to-no process and b) a source of inspiration for larger folks who maybe want to bring their processes out of the ITIL ages and into a world that's a little more applicable to the way folks work today.

As you say, the extremes of the ITIL <---> Agile spectrum are likely to be a mess, and where you should target on that spectrum is highly dependent on your starting point, your culture, and your appetite for change :)


My experience is mostly in a company of about ~700, working in a regulated environment (payments).

This guide is how we ran incidents, or at least at the top end of the complexity of each section.

So if by small you mean companies under 1500, then I'd agree. But also, that's a lot of companies, in fact the majority of them!

We definitely see success with our customers up to the 1500 range adopting these practices, and often throwing out more convoluted or obscure processes that came before.


Oh, that kind of incident management. That's just routine task management.

FEMA has something called the National Incident Management System, which is used when large scale bad stuff happens.[1] Many municipal employees get training on this, so they know what to do in a crisis. Incident commanders require substantial training. There's a certification system and even wallet cards, to keep clueless officials from trying to run things. It's not that complicated, but it's been thought through by people with dozens of disasters behind them.

There's always a single incident commander. It's usually somebody like a fire chief, not a political official.

[1] https://www.fema.gov/emergency-managers/nims


I wouldn't take FEMA as the standard. Thankfully, it does not handle enough incidents a year to build deep experience. Instead, look to organizations that practice the methods daily even without experiencing catastrophic incidents.

A good example is how Walmart was among the first to bring water during Katrina:

https://youtu.be/_djmIfcLTBQ


This guide should also include a day off for on-callers who handled on-call tasks outside of standard working hours.


This is something we spoke about in https://incident.io/blog/on-call-at-incident-io. Specifically:

> On-call payment is not expected to cover any time you spend working outside of hours. If you're paged and end up working in your evening, you should take time off in lieu. We trust you to manage this time yourself.

I'll make sure this is added to the guide. Thanks for the nudge!


> At incident.io, our engineering team runs on-call Friday-to-Friday, with a 5pm handover. This works great for us: as an engineer coming off the pager you get to finish work nice and early on a Friday and enjoy your evening. [0]

[0]: https://incident.io/guide/on-call/compassionate-on-call/

Is this an American thing? 5pm is not “nice and early on a Friday”! 4pm is about what time you should be finishing, and 3pm is “nice and early”.

Love, Australia x


Yeah same here in Germany, it's considered very rude to schedule a meeting that spills over 5pm. I may still be working past that time, but I will decide that.


Same here in the UK: if you scheduled a meeting at 5pm on most days people would be pissed, but on a Friday of all days you'd make a lot of enemies.

My old job where I was last on call did Tuesday to Tuesday with a 1pm handover. It gave you time to deal with any issues still hanging around from the weekend, but then you were done. Handover was happily during the work day, so it could be moved back or forth a bit if needed, and the new person on call had a couple of hours to ask extra questions.


oh god this is funny because that link gives me a 502 error :(


You've got to laugh or cry: the new traffic caused us to hit some Netlify edge function issues and we went down for a bit :(

Happy to say - having opened an incident - that we're now back up, and don't expect to hit any more issues.

Please try again, I promise it's worth another look!


Is it fixed? The first page keeps loading forever.


Right now it is working


[flagged]


JJ, please :) I'm sure you folks have better things to contribute to the conversation than a sales pitch?!



