Dead Air on the Incident Call (danslimmon.com)
132 points by nalgeon on March 19, 2024 | 97 comments


> When there are more than 10 people, the verbal approach stops working. It becomes necessary to have a shared document of some sort, continuously updated by a “scribe.” It’s not sufficient for this document to be merely a timeline of events: it must highlight the current state of the joint diagnostic effort. I recommend clinical troubleshooting for this.

My previous company used Conditions, Actions, Needs (CAN) reports to maintain a consistent shared understanding. The format differs from the recommended "clinical troubleshooting" (symptoms, hypothesis, actions) in having a "Needs" section. I think the Needs section is super helpful because often the right people haven't joined the war room yet, so you can just list the needs and, as people join, they can immediately jump into whatever matches their expertise (rough example below).

https://www.fireengineering.com/firefighter-training/drill-o...
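
For anyone who hasn't seen the format, a CAN update posted into the incident channel might look roughly like this (illustrative only; the scenario and names are made up):

    CONDITIONS: Checkout error rate ~30% since 14:05 UTC; suspect a bad payments-svc deploy.
    ACTIONS:    Deanna rolling back payments-svc; Oscar pulling web server logs for that window.
    NEEDS:      Someone from the payments team to confirm recent config changes;
                someone from support to draft the customer-facing status update.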


> “Oscar, do you mind sharing your screen so Deepak and Deanna can see the weird log messages too?”

It seems so obvious from an Incident Commander's perspective, but so much goes into this workflow during an incident:

* What if the person is a fresher? You're asking them to share their screen, debug, and perform actions in front of 100 people on the incident call, with all the anxiety that comes with it.

* While the IC gets continuous practice handling fires (if there's a fire every week in a 50-team organisation), a specific team might only see their first incident once a year.

* Self-consciousness/awareness instantly triggers a fight-or-flight response in even the most experienced folks.

I don't know how other industries handle this. I'm pretty sure even outside tech there's a hierarchy for anomaly response, and sometimes leaf-level teams get called up to answer questions at the top level of the incident response (a forest fire, say, might have a statewide response team pulling in the local response team and making them answer questions), but they probably get much more time to prepare than in tech, where it's a matter of minutes.


In a previous job, I had a critical incident crop up and we were dealing with the offshore parent company. All the senior management had been cc’ed into the emails about the problem.

Result: nobody was willing to say anything for fear of looking bad in front of those people. This was frustrating to say the least.

I solved this by replying all, but I took out all the senior people. I said something along the lines of, "Hey guys, I'm the guy who needs this fixed. I can see you are all working hard. I'm removing a number of people from the cc list and we will communicate with them in a separate email. Just keep me up to date with how it's going and tell me what you need from my end."

This worked wonders. They worked the issue, and though it took some time, that was to be expected.

When it was solved, I found the original email, replied all (including management) and explained that the problem was solved, and made a point of highlighting the excellent work the team fixing the problem had done on resolving the issue.

I never had any issues with the parent company's dev team after that :-) In fact, they went through our incident reports and fixed 80% of the longstanding issues within the next week! Which I wasn't expecting…

Moral of the story - take as much pressure off the incident team as you can.


Thanks so much, that was good, practical, wise, in-real-life experience.


> 100 people in the incident call

Well, there's your first problem...


I picked a high number to showcase the problem; for a fresher it doesn't change much even if that number is as low as 15 or 20, or even 5 people they don't know or who are at higher levels.

Also, I feel like the number of people who hop on the incident call is almost always related to the category of the incident. Sure, you can always break out to a separate room, but by then the person will often have already realised the impact and the weight of the incident.


And the point is that both of these are problems that an incident commander is there in part to solve, both in the sense of making sure that those investigating have what they need including the ability to focus, and in that of handling communications with stakeholders including leadership.

If whoever feels like it can "hop on" the incident call and stay on it, regardless of whether or not they can contribute to the investigation, then the IC needs to do a better job. Granted, usually this is for lack of institutional competence; I've been at one place where the IC role was taken seriously, and incident response there ranged from solid to legendary, whereas most places never rise above "cautionary tale." But nonetheless.


In my experience, people get pulled in and then never let go for the rest of the incident. The coordinator needs to ask, "Do we still need XYZ? If not, they can go and we can call them back if needed." Not letting anyone go is how you end up with 30+ people on a call. Don't hold them hostage.


Can you comment on why you think it is an issue for anyone to hop on an incident call, whether or not they can contribute?

It is one thing if they are being disruptive, but I don't see a problem with observers.

As for this thread's point that some people may feel scared to share a screen or participate if the group is too large: again, that is for the IC to manage. But I wouldn't kick anyone just for lurking; there may be a good reason, and calling out everyone on the call to ask why they are there is just as disruptive.

TIA


An ongoing major incident is already stressful enough for everyone involved, and looky-loos don't help that at all. Nobody does a better job of debugging for having to fight a helmet fire at the same time, and one of the IC role's responsibilities is to proactively minimize that risk as far as possible.

It does depend somewhat on the situation and the organization, and on the role; IC engineers observing for familiarization is fine, VPs joining never is. My approach is that the incident call is for those actively involved in the investigation or who have been invited to join by those who are, including engineering ICs who wish to observe for familiarization. Meanwhile, stakeholders not directly participating in response receive updates from the incident commander via a separate (usually Slack) channel. Managing that communication is also part of the IC role, whether directly or by delegation.


I've been on an incident call that Jeff Bezos hopped on to listen in on. The "IC" (we had some different name, like problem management engineer or something like that) did not ask him to get off it.


This makes sense. Amazon's corporate culture is famous for its deficits.


Surely you'd want to instead share a link to the logs being investigated so others can investigate concurrently, instead of having 2 backseat drivers observe someone observing logs.


Depends. In some situations it would in fact be better to have everyone discuss one person's shared screen, instead of having to constantly coordinate what they are talking about.


+1 Depending on how complex the system/tooling is, it is rarely just one log file to share in a text editor.

If you have logs, metrics, tracing, other dashboards for context you want to see how they are debugging.

Some of these tools are very complex and other eyes can help pinpoint inefficiencies.


Ideally, wouldn't it be the IC's (or group of ICs') responsibility to introduce a blameless culture before the incident?

I've worked in blameful places, always without ICs; just shouting HIPPOs.

I hope that an org evolved enough to create IC roles would back that up with culture, but I could be wrong.


Indeed - in that kind of environment an important role is "managing upwards", preventing the people who are actually doing the work from being overwhelmed by constant requests for status and explanations.


What is a fresher?


Recently graduated, just entered the workforce.


Fresher is not a good term for this example.

There are engineers who are great coders but bad in an incident environment. They may not be fresh, but they need the same help as a "fresher".


It's a very US-centric term; in the UK we'd just call them graduates, for example.


Nope, not a US term. I've found it in a couple dictionaries as a UK term for "freshman", which is a similar idea but not quite the usage in OP.

The equivalent that I've usually heard in the US is "recent graduate", rather than just "graduate".

https://dictionary.cambridge.org/us/dictionary/english/fresh...


As a US developer for nearly 25 years, I've never heard this term used in business context. I'd call them a graduate as well.


Recent (this generation) Indian immigrants to the US use the term in my experience. I've never heard anyone else say it.


It's mostly a South Asian centric term.


> It's a very US centric term

You've never heard of "freshers week"? That being said, I've never heard the term used to refer to anything other than university students.


I live in the US and have never heard of it.


not a US term. SE Asian.

"Fresher" + "100 people on the call" immediately makes me think Tata or Cognizant.


> There is, however, a healthy kind of dead air

This is the thing that drives me nuts. I was really hoping the article would be about the value of dead air, or at least expound on it more; instead there is barely a paragraph.

What continues to frustrate the hell out of me is that incident commanders keep taking silence as inaction (or ineffective action), even when you tell them in advance that you need to dig through logs and think for a few minutes.

I've now switched to taking my headset off when I need to do it (after letting them know and giving them a chance to respond).

It is practically impossible to debug complex scenarios, especially when you need intuition and your subconscious mind involved, while being pestered with questions.


Culture doesn't seem to be mentioned in TFA, likely because by the time an incident happens it can't be influenced much. But attitude can be. People as a team are working together to solve an issue. Humans vs. issue. Not teams working to prove it isn't their fault, or that it is the fault of some other team.

I have been in places where a team can say "Mea culpa" and the worst thing that happens is next incident people grin and give them friendly jibes. Of course reasonable actions (workplaces can be unreasonable too...) are taken to ensure it doesn't happen again but that is simply part of the learning process.

I have also been in places where vast majority think the issue points at one team. They are silent on comms despite being present. Then miraculously the issue is gone. The response to the question of what changed? "Nothing." And we all go to bed having suspicions but no concrete answer...

Attitude is also related to many comments here expressing concerns over "people watching my screen" or "over my shoulder".

In times of crisis, if I am running a line of investigation then having a second pair of eyes is reassuring! If I think "maybe this thing is related" and someone more experienced can simply glance at it and say "Nope", then great. My idea had its day in the sun and the group can move on.

And if you really, really think it is still related, then you can keep investigating without people looking, but as a second priority to the group effort.


> I have also been in places where vast majority think the issue points at one team. They are silent on comms despite being present. Then miraculously the issue is gone. The response to the question of what changed? "Nothing."

I’m currently in a multi-day troubleshooting issue where some key SSO component isn’t functioning correctly. This component is operated by an offshore outfit we can call Total Computing Screwups, and the entire troubleshooting process is a whole bunch of incredibly expensive folks sitting on a mostly silent call, hitting refresh on login sessions that will suddenly and miraculously work, obviously without any changes being made.

Every single person in the call, except for the outsourced operator, is an expert in the field, and none are allowed to see the logs or configuration of the malfunctioning system. (Which isn’t officially malfunctioning, because they refuse to acknowledge there is actually a failure, which means the issue cannot be escalated)

It is one of the dumbest destructions of capital I have been forced to take part in, and it is all in the name of “cheaper”. It is so stupidly frustrating.


I am glad to hear I am not alone in such an experience.

I had a burn out in the past. I eventually came to the conclusion that sometimes the situation is the result of the next managerial level up from me failing their RealTimeStrategy game and not committing enough peons/wizards/engineers/diplomacy-with-subcontractedCompanies.

Maybe it's not their fault per se because the level above them failed their strategy - and so on.

But while I've learned it isn't my problem/fault - it is indeed damn frustrating. Good luck and best wishes for any clarity.


I'm so glad I work at a company with culture like one of the former and not the latter.

As someone who has worked in real-life, high-stakes physical scenarios (people can and have died at companies I worked with), being able to blamelessly own your mistakes is critical. Lowering the stakes doesn't change that. As long as your mistakes weren't intentional or knowingly and needlessly reckless, you will keep your job. Even when people did exceptionally stupid or willful things, we allowed them to leave with all earned pay and some semblance of fairness. Nothing makes a situation more dangerous, or harder to manage, than when people hide things.


> actions are taken to ensure it doesn't happen again

That sentence captures most incident-related problems well. While I love a culture where mistakes are owned, care must be taken not to end up in a culture where nothing ever gets better.

If something wasn't properly tested, or the test environment was lacking, that policy must be permanently repaired. Not laughed at in a "everyone makes mistakes" kind of way. Everyone does make mistakes, and that must be taken into account.

What I guess I'm trying to say is that a failure to work professionally is not like operational failures. A culture of owning mistakes is good, but not all mistakes are alike.


This reminds me of doing WoW raids with Ventrilo back in the day, how much I miss that, and something we had back then that's missing now.

It didn't have screen sharing, but it had multiple rooms, so the full party, group leaders, tanks, healers, DPS, etc. each had their own room, and you could still go 1-on-1 with someone.

Sometimes I feel like a team/department would like to discuss something, or maybe someone wants to talk 1-on-1, and it seems all modern meeting software misses that today.

I hadn't actually thought about this in a while, but there are few things more stressful than the entire company/raid party watching your every breath and movement. Being able to talk to a coworker or a team can't really be done with today's meeting software because it's one... ONE... shared room. Even in recent memory, in-office teams were in their own spaces/buildings/etc., and they could mute the conference call and talk amongst themselves.


That's actually exactly what I thought of when reading the article: the sinking silence where everyone knows we're gonna wipe, but nobody wants to be the first to say it. Or, conversely, the busy silence where everybody's concentrating on their part and nothing needs to be said because everyone can see what needs doing is being done, especially on well-trodden but still touchy content. Or even the nervous silence as the pulling hunter tries really hard to thread the needle and not get the patrol too. Filling (or not filling) those silences was a big part of good raid leading.

Even more than raiding, large-scale pvp with hundreds of people had some bad silences: if it's quiet, nobody's calling, so the callers are all either dead or tanking, so your Lanchester's square law effects go out the window as piles are lost and damage is dispersed for want of direction.

But specifically to your point: yeah, Discord is probably the closest to that ideal of lots of purpose-specific channels, but it doesn't have the channel and bulk user management features ts3/etc. have. It was really useful to be able to programmatically bulk-move people by role, or give roles priority speaker/mute other roles at certain times - Huhu's dead so we can go back to not caring about the hunters, etc.

MMO leadership and tech work really do have an awful lot in common.


I think about this a lot, having been on both corporate meetings and voice chat in WoW raids, and how much any sort of teleconferencing software is missing stuff like:

* Mandatory PTT so you don't have people's eating/talking/background noise

* Priority speaker (or being able to turn individual people down)

* The ability to leave the main voice chat for a moment and then return.

It's so much worse than what I got with vent or mumble or discord.


Being able to have individuals at a different volume is something I almost desperately want in work meetings.

I'd also love something that would allow me to be in several "breakout" chats at once, but not silence everything, just allow me to turn some rooms down and mute myself in them while I interact in another. Bonus if there's an indicator of how much activity there is in the non-focused ones so I can see if something may need attention, or if everyone in it is silent, and not need to listen in.

Most conferencing software seems to treat meetings like individual entities, which works fine when they are just scheduled meetings, but isn't great when you have several groups to work with at once. Being able to have a team know you're in another chat and be able to say "Hey scaryclam, we found something, can you drop into the conversation for a minute", from their chat would be awesome in something like an incident call.


> Mandatory PTT so you don't have people's eating/talking/background noise

This can somewhat be mitigated by providing people with proper headsets. Every company should issue good, convenient wireless headsets. Emphasis on "convenient" so that people aren't tempted to substitute them with worse-sounding options.


I disagree that this mitigates the quoted section from GP, even if I agree with you in general.

The benefit of mandatory push to talk cannot be significantly mitigated by any current headset system. There is a "habit of intentionality" (for lack of a better term) that comes with mandatory PTT that is missing in corporate meeting culture. It only takes one bad apple or VP with mad mic hygiene to throw away all the benefits of company-supplied noise-cancelling, dynamic-threshold-sensing microphones.


Zoom has breakout rooms


So does Teams, and if you want to 1-1 someone just click their name and the call icon. Add more people with two more clicks. Everyone on the bridge will see you’re on hold in the main call, which prevents people dialing you to rejoin and wasting time.

I find that far more often than not, when someone is lamenting the lack of a feature in communication software, the feature actually exists and they’re just not aware of it.


In the Ventrilo scenario you could hear, but not be heard by, the parent room. This is the only way Eve Online fleets with hundreds of participants could be coordinated, for instance. Your example works for a 1-on-1 but not for hierarchical communication layers.


Interesting, the Eve example. Are the higher-in-command in the root/parent rooms, and then it's hierarchical down the line? Or can you just listen in on certain rooms in addition to whichever one you're talking in?

Never got big into Ventrilo back in the day.


“Oscar announces, “I’m seeing some log entries from the web server that look a little weird. I’m gonna look at those.” This is the beginning of a 5-minute silence. […] So it’s incumbent on you to interrupt this silence.”

This is “we need to do something, this is something, we need to do it” thinking. The role of the commander imo is to insulate the investigators from exactly this sort of meaningless interruption.

““I need 5 minutes” [...] There is, however, a healthy kind of dead air.”

If you need to be told this, you are being managed by your staff, not managing them.


The problem is that, as incident commander, I don’t know the difference between “I’m not saying anything because I’m stuck” and “I’m not saying anything because it’s going to take me five minutes.”

The correct rule of thumb is to always over communicate, regardless of your role. If you’re troubleshooting, tell the incident commander that you’re doing X and it’ll take around 10 minutes before you see results. Then you’ve set expectations.


That's the sort of anxious micromanagement a good manager is shielding their team from in a situation like this. You need to trust your guys. They're the experts.


Not at all. The way you gain trust is by making clear commitments and either meeting them or updating them as necessary.

I'm not saying that the engineer troubleshooting a problem should be narrating moment by moment. I'm saying they should say, at the outset, something like:

"I'm going to go try flushing the cache and restarting the app servers. I expect this to take about fifteen minutes, and if it works I would expect to see the database load normalize."

That's all I need as an incident manager. Otherwise I'm sitting there not knowing what to say in my status updates, my stakeholders are asking questions about estimated duration that I can't answer, and so on.


And as an example, since we've been dealing with a huge amount of incidents over the last few weeks: To me, it's completely normal if BryantD is like "Yo tetha, how are the caches and restarts coming?" and it doesn't take much for me to mumble something like "16 out of 28 done. database, you see anything changing?" No need to be vastly eloquent or anything.

Or I might throw something like that in on my own if things are dead silent right now and no one else needs the radio. And I think that's a good habit other people on the team are picking up. If someone needs the air for something important, they can talk, but otherwise you have various status updates floating around. And more often than not, those result in "Wait, did you just say xyz changed? Did I just do something useful?"


I see we'd enjoy working together. ;)

Out of curiosity, how do you parallelize incident response? I am fairly picky about only changing one thing at once but I'm not sure I'm always right about that.


To me, this depends on the different failure domains and how they interact with each other; based on that, you can decide whether the teams should coordinate changes, communicate changes, or just run.

For example, we've had situations in which... a creative user and amazing code managed to fry both the dedicated Elasticsearch cluster for the application and the dedicated database, and the application servers were also acting weird. Here, we'd split into three small teams, each responsible for one of these pieces.

And looking at these failure domains - ES and postgres don't interact with each other, so both of these teams should just run independently as fast as they can to get their components working again and inform the app team when they are back up. However, the application team should closely coordinate their actions with both of these teams - I've had enough situations in which someone pressed the "hilarious load on ES" button while people were still getting ES up to capacity... and down we go again.

Or in a similar way, we had a central database outage and a couple dozen apps got taken out. Database in the lead, sure, but once that's going again, the different application teams can run free and make changes with loose coordination with the database team.

However, within the same failure domain, I really don't like to make too many changes in parallel. Getting back up ASAP is a priority, for sure, but what about the outage in 2 hours, or tomorrow, because we just hit the system with a couple of wrenches and have no idea why it's back up? Here I strongly prefer deliberate, individual changes so we can get an idea of why the system failed. 10 more minutes of downtime / degraded service now can save us many nerves and hours of downtime over the next few days.


That makes sense. I am slightly envious of the idea of having clearly discernible failure domains but that’s a me problem.

The coordination can get hairy at a huge company but that’s why (for example) Amazon has a really challenging program for becoming an incident manager, which creates a pool of people who can understand failure domains quickly for incidents which span a lot of the stack/company.


This is a constant architectural struggle, I tell you.

Like, inside the infrastructure, I'm constantly updating, poking teams and such to make sure we have clean disaster recovery layers and we have clean documentation about the dependencies between our infrastructural services. And to make sure we don't have cycles - or at least we should have good documentation about these cycles and how to handle them. And to make sure your services don't intermingle too crazily. We should have our postgres bubble, and this should depend on the consul bubble through a clean interface.

And similarly, I'm constantly telling our dev teams that even though we have microservices and everyone is entirely free to do whatever they want within their team (within reason), we should have clean boundaries between these services. Clear interfaces at the HTTP, AMQP and gRPC layer - and dependencies between services maintained by different groups of developers should go through these interfaces.

If you want to share something that's currently an internal service of some application or system, don't just depend on it because you can - that will get annoying for /both/ teams. Rather, separate it out cleanly into a new service with a defined interface, and share that. Our infrastructure should be a DAG of small-ish, independently fixable, deployable and functional bubbles, not a huge ball of mud with everything going everywhere.

It's a constant struggle, but I think it improves our architecture at a development level, overall dev experience because of better separation of responsibilities, and our stability because we can structure incident responses better. Thank you for reading my ted-talk.


The incident commander isn't necessarily a manager. Indeed, where I work it would be uncommon for a manager to take incident command and unlikely that anyone above the first level of management would join the incident call.


This is all in the context of an incident and the role of an IC.

The author nails all the issues here. Dead silence with no shared screen is not good for an incident (without any other context).

You either communicate constantly about what is happening or say "I need 10 minutes to debug," in which case the 10 minutes is given, but they must come back with an update at that time.

An IC absolutely cannot just let someone go off and trust that they are debugging the issue.


Why? To any of this?

If you aren't fixing/investigating the problem, why get in the way of the people who are, on the basis of trust issues or some IC's idea of what's "good for an incident"?


How do you know you are getting in the way? How do you know the guy didn't just go out to lunch? How long do you wait to hear back? Do you just wait forever?

The scenario I have in my head is a conference call where the person in the example has no video, no shared screen, and no response.

The 5 minutes is just an example, maybe a bad one but the idea still holds.


> how do you know the guy just didn't go out to lunch?

You don't; you hire people you can trust, and with whom what is and isn't appropriate during an incident is clear.

> How long do you wait to hear back, do you just wait forever?

A short message "How's that going?" if it seems to be taking a long time and there is no communication.

What you are describing is micromanaging the actions of the people who are meant to be experts in that domain. Why is this needed? What issues have arisen before, and do these issues also arise during normal development?

> person in the example

It's not clear to me the role the people in the example have.

What does being "primary investigator" mean? How does that relate to Ops? Why isn't support, or the devs, investigating the logs? Who in the example would be knowledgeable about this area? Is the Ops person qualified to know if a web log is strange? Why would they be pushing fixes to a web server?

To me, if people in the know are competent enough to investigate an issue, they should be competent enough to communicate about it.


It would be sort of unprofessional I guess, but it seems like it would actually help if people could sing mindless tunes as they work. If somebody is going “do-do-do, doot dadodo” as they work, you know they are in progress. A sudden “hmm” or silence indicates trouble.


But that's not over communication, that's just pollution of radio space.

For example, in a recent incident I ended up kind of in charge of / responsible for the database as the application encountered some weird livelock with row and table locks. In such a situation, if no one needed the radio space, I ended up announcing the status of the database every few minutes, even if nothing really changed.

Or, if someone does something, give feedback if this has a noticeable effect on your system or not. Or quickly ask if you want to do something that could affect them, no matter how little that is. "App, I'll modify setting xyz, tell me if that has an effect".


>I ended up announcing the status of the database every few minutes, even if nothing really changed.

Yes, exactly. In an incident, if there is some long-running task that will clear the incident, this is important.

It's essentially a progress bar.


aka The Everything's Okay Alarm


Yeah this article didn't go where I expected it to go. Silence is nearly always an indication that someone is doing something or thinking about something, and the advice here is to interrupt them?


But how do you know that without any other information?

In a high-stakes incident you cannot let radio silence drag on without clear communication of status updates... this was called out at the bottom of the article.


You train people, delegate to them, and then trust them to do the right thing.

This is discussed in the article as well, but I think expectations are much better than interruptions. And even then, it is better to set those expectations during training rather than during the response. People should be trained to prioritize communication during a response, but not above their work doing the response. And the people doing the work are individually best suited to make the call of whether what they're doing is more or less important than communicating.

In my view, it's really hard to overstate how hard it is to be reading big volumes of logs, reconstructing the runtime state of some big complex distributed system from the breadcrumbs available, and thinking about what to do next to most quickly mitigate or get more information on the issue, while being bombarded with messages.


I read that part as meaning this probably doesn't need to be a call at all at this point in the investigating process.


The elephant in the room is that these "What is Oscar up to? If only I could glance at their monitor… If only I could see their facial expression… If only I could spitball ideas within earshot of him." problems would also be solved with everyone in the office. Don't shoot me though, I'm just the messenger. I love remote work. But the friction is tough.


While trying to focus and troubleshoot, the only thing I love more than people asking to share my screen and explaining everything I'm scrolling past, is having to do it while three people breathe over my physical shoulder.


Yep, especially because those three people are not sitting at their computers, doing work to advance the investigation themselves.

Even when I was still working in an office, coordinating incidents through a Google Hangout and a Google Doc to keep rough notes was the way to go. Want to show something? Share screen. Want to talk in private? Jump into a private hangout. Want to jot down some thoughts/unfinished ideas? Throw them into the document (the Hangout chat was pretty useless because people joining later couldn't scroll back) or into the dedicated Slack channel for the incident.

If anything, incidents have become much easier to coordinate thanks to all the tooling that we now have - though that requires an active incident commander (who also makes sure that Deanna, Deepak, and Sylvain are not just waiting, but investigating other possibilities). Fortunately, someone has written an article on how to become a better incident commander :)


I rather liked using gather.town during a period when our team often needed to pull together to swarm on some outage or performance problem or bug. We were up against a very tight deadline for a client and there was a massive feature bumping up against some hard realities.

I absolutely wouldn't want that to be my daily life, but while we needed it, it worked. More than that, it was better than being in an office- when I wasn't needed or wanted to crack down on something without distraction, it was super easy to get away from the noise. When I saw a bunch of avatars in a meeting room, I could pop in without causing a disturbance to see if it was about anything I could pitch in on.

Thankfully, we're long past that point, and it became a lot less useful to the point that our team stopped using it not long after we were out of crunch mode.


If you need everyone working an incident physically around the same table in order to respond effectively, your organization is not equipped to respond effectively to incidents in general.

I'm speaking as someone who worked full time doing nothing but incident management at a F500 for a couple of years before the pandemic. The incident team for literally every single response I ran in that time was effectively remote; most weren't in the same building as me, probably weren't in the same time zone as me, and it wasn't unusual for them to be on a different hemisphere entirely.

Physical proximity to one another has absolutely nothing to do with the ability to work an incident. Effective communication is vital, but this isn't uncharted territory. Large organizations have been doing this effectively for a very long time before the entire in office/remote debate ever became a popular controversy.


I dunno. A lot of the incidents I’ve dealt with have been at ~3 am. No way I’m driving to the office then. So it’s good to be good at doing this remotely, even if it isn’t your first choice. (Full disclosure, it is my first choice. I agree with the sibling poster that having someone stare at me as I debug is not my preferred debugging environment. I need to be able to stare off into space and think “why the hell is this happening?!?!” without judgment.)


Not all of them. This problem isn’t an artifact of remote work or even geographically distributed teams. You might just call someone from your desk and comfortable setup, instead of walking up a flight of stairs.

Or you might be in a different building, in a different city, or different country halfway across the world.

Or it might just be 4 AM for everyone, and there might be no time to go into the office, even if you all normally sit together.

All of these are real things I’ve seen and not hypothetical in the least.


I was in the middle of what I thought could have been an incident the other day. While debugging, with two others at my desk, I was approached no fewer than 3 other times by groups of people just wanting to say hi, introduce themselves, etc. I told them that we were facing an issue and trying to work through it, but it didn't matter. Trying to break off conversations was more stressful than the incident itself.


When I worked in an office, I had a favorite place I would go when I needed to really focus on getting work done. Now I can just switch to a window that doesn't have any communication apps on it and focus as long as I want. I miss lots of other things about offices, but it's much easier to focus elsewhere.


As someone who has worked on many a network outage, and as someone who values in-person work time to get to know my extended teams and see faces, I will say that live troubleshooting in the same room is not valuable.

A good whiteboard software with incident exec summary, chat history, and "useful links" from troubleshooting chatter is needed.

Maybe I've benefitted from having my own team colocated during an outage. But it has rarely been useful to have a cross-functional outage team in the same room when doing log research.


So, instead of micromanaging your investigation, we micromanage their facial expressions?

If Oscar wants your ideas, maybe he should be able to ask for them, or accept a spitball-session.

Whatever you conclude from his facial expression, maybe he can verbalise that himself, unambiguously if relevant, instead of frantic speculation based on first impressions.


What you are talking about is debugging, this is incident command.

Completely different, no office environment solves this problem.


Half the time, these incident responses are happening in the middle of the night.


Honestly even with all of that it will never be enough. And the anxiety of being watched would likely make you more ineffective.


Simple concept, the author is overthinking it.

I have been "problem manager" for many large outages. I use the term "problem manager" to remind people that an outage is something you manage just like any other kind of project, except on much shorter time scales.

Everything you learned about project management applies to dealing with outages.

> Sometimes an investigator needs to go silent for a while to chase down a hunch, or collect some data, or research some question. As long as such a silence is negotiated in advance, with a specific time to reconvene, it can serve a crucial purpose. I call this functional dead air.

Hey, if you are the kind of project manager that talks and does not listen to your team... that's a problem.

My ideal stance on those occasions is to present myself as somebody who "wants to be educated about the issue". I think it is more helpful and creates less stress. As I ask questions, I try not to seem like I'm interrogating them, but instead emphasise that I'm a noob on the topic who needs to learn quickly.

My ideal is this scene from Margin Call: https://youtu.be/Hhy7JUinlu0?t=67

This usually is actually true, btw.

There is no single way to do it right but as a manager it is your job to maintain good information flow between you and your reports and on an outage, your reports are essentially everybody involved.


For a high-level exec or PM/PG, I agree with you.

However, ICs are usually engineers who are more equipped to help debug the problem. So I sort of disagree with you, at least going by the article's examples.

These aren't managers; these are usually Staff+ commanders who need to fight fires. They don't need to be spoken to as if they were a young child (as in the example in the clip).


Well... I am an engineer with a quarter of a century of development experience, just typically not in the know on the particulars of the given part of the system.

What I describe is my personal style which has been described as "detective". When things do not work well, I tend to get into the thick of things to get a sense of what is really happening "on the shop floor".

I remember, at the start of my career, my disdain for the execs. I couldn't really understand how you could get this far and yet not understand the simplest basics of the business we were doing. Now I know that it is mighty hard to have a true sense of things when everything you are being told is carefully filtered and worded, when every person you talk to is completely focused on how they appear in the discussion rather than on solving the problem.

So to fight this I get into detective mode and I try to appear friendly to people, genuinely interested (which is not hard, because I actually am!), and not trying to sound all-knowing and all-powerful. And I do defer to engineers a lot, but I also tell them that they need to be able to support decisions with information.


I understand your point, but do you do this during a Sev1 incident at a FAANG type company?

Meaning company.com just went down, 1M+ users are unable to login. I don't see how your style works, you need fire fighters not detectives.


Maybe it's a personal problem, but I struggle to communicate and investigate at the same time. I'm fine task switching, but it's one or the other. I've been on numerous incidents where an anxious manager is asking for constant updates, ensuring no work is getting done. My favorite is when they ask engineers to stop investigating in order to send a status update to the wider organization. I don't know, how about maybe the person whose sole role on the call is to manage communication, maybe that person could send the update. But I digress. Communication is important, but it's not free. Seek balance.


Anecdotally, we had a moment of silence once, after it became apparent that the 100x network bill was from a compromised VM, due to two human errors combined.

I appreciate that Google Cloud refunded the $10k despite our faults in the situation

The errors

1. Spinning up a VM for some experimentation, with a public IP

2. Setting a weak password on a well-known username

The VM became involved in a DDoS network.


GC should allow billing restrictions on accounts, say for dev environments. If it's not prod, there is no reason to extend credit, or to require that all resources stay available when billing limits are exceeded.


It was actually the billing alert that let us know something was up


But a restriction would halt service once the money is gone. You can go into debt faster than you can respond to an alert.
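
Right, a hard cap has to be automated rather than human-in-the-loop. Google documents a pattern along these lines for non-prod projects: the budget publishes notifications to a Pub/Sub topic, and a small Cloud Function detaches the billing account once spend exceeds the budget, which shuts off billable resources. A rough sketch from memory (project ID and function name are placeholders; worth verifying against the current docs before relying on it):

    import base64
    import json
    from googleapiclient import discovery

    PROJECT_ID = "my-dev-project"  # hypothetical dev project

    def stop_billing(event, context):
        # Budget notifications arrive as base64-encoded JSON via Pub/Sub.
        msg = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
        if msg["costAmount"] <= msg["budgetAmount"]:
            return  # still under budget, nothing to do
        billing = discovery.build("cloudbilling", "v1", cache_discovery=False)
        # Detaching the billing account disables billable services in the project.
        billing.projects().updateBillingInfo(
            name=f"projects/{PROJECT_ID}",
            body={"billingAccountName": ""},
        ).execute()

It's blunt (the project loses billing entirely), but for a throwaway dev environment that's usually the point.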


One thing I found supremely helpful in my varied experiences was having an engineer step up to be the single voice who starts running things and coordinating.

Some companies have a NOC or support person run calls, but they often feel nervous and just ask sheepishly for updates.

Having a principal or eng manager run the call gives it a different, more commanding feel. They better understand the system and start calling people and teams by name. They also aren't talked down to or snapped at like people tend to do with support people, sadly.


A major tag in this guy's set of blog posts is medical reasoning:

https://blog.danslimmon.com/tag/medical-reasoning/

In that regulated field, a senior or principal engineer runs CAPA (corrective and preventative action) diagnostics and it sounds like the author has worked that way. Look for it on resumes.


The NOC folks I've worked with have been incredibly sharp. No one with half a brain would talk down to them.


Agreed, but that's what I've noticed. Not necessarily degrading them, but snapping back or saying rude things like 'I just told you 5 minutes ago..'

If someone tried that with the guy who ran our calls, they'd get a very public dressing down if not worse.


CAN I GET AN UPDATE? !?! !! every 60 seconds is the only way


If you have an operation so large that 100 people can be involved in an incident, why isn't there a way to shift to a backup system?


The manager and incident commander should be on their own call, with at most a liaison that checks in with the people actually doing the work every 30 minutes. They should be secure enough in their own people that they can effectively communicate "we are aware of the problem and are working to fix it" to affected parties.

The people doing the work should be left the fuck alone.

A manager should not be involved in troubleshooting, nor in coordinating multiple nontechnical third parties onto the same task, because 100% of the time spent doing anything other than fixing the underlying problem is wasted time. The people doing the work should be comfortable coordinating amongst each other as needed - having a two- or three-way conversation, video call, or conference call. The affected parties don't need 30-second blow-by-blow accounts of the things the troubleshooters are doing. They don't need to constantly stop and interrogate the troubleshooters and recap each step of troubleshooting.

Bring the troubleshooters in after the repair to explain the steps taken, the problems found, what could have gone better, what went well, and any recommendations for prevention, mitigation, or resources needed.

The notion that you're supposed to do highly complex real-time technical repairs while juggling personalities and ass-kissing is counterproductive at best, completely moronic at worst.

"I understand your concerns. I just wanted to let you know I have faith in my team and I know for a fact they're doing the best they can to get you back up and running as fast as humanly possible. We'll hear back from them soon, but I don't want to do anything at all to get in their way, or to take time away from this repair." This is what a good manager might say, being adept in handling customer concerns and having confidence and trust in their team.

Coddling and handholding superfluous non-technical stakeholders by hosting incident calls like this is goddamn stupid.

The notion that you need to get everyone together in a giant group - that you need to pressure the people doing the work by introducing personalities and social issues into the process - is a move by a manager deliberately intended to show that the manager is doing something. They coordinate these calls so they can claim credit for the work of the troubleshooters, and place blame on the troubleshooters if anything goes wrong by mischaracterizing the inevitable miscommunications during these boneheaded calls.

If it costs you $10,000 a minute for every minute you're down, then let's do the things that make sense. Giant ass conference calls with a whole bunch of people who aren't involved in fixing the technical problem is stupid. Blitheringly, moronically, stupid. The kind of stupid that picks up a brick and wonders what it would feel like to smash one's own stupid face with the stupid brick.

If you, as a manager, can't cope with this, you shouldn't be managing people. Quit, immediately. Your team will be far better off without your presence if you think this type of incident response is good for anything except politics and shitty games.

If you're a customer and you're treated to one of these giant group calls, know that it's a sign of incompetence, insecurity, toxic office politics, bad corporate culture, top heavy management, and probably high turnover rates.

Fire companies that treat their employees like this, or that reward management for playing stupid games. Find companies with competence and assurance in their products or services, which don't feel the need to trot out their troubleshooters in the middle of a crisis to do talk therapy, customer service, tiktok dances, or anything else other than effectively troubleshooting whatever the technical problem is.

If you're a troubleshooter and you find yourself on these calls frequently, my heart goes out to you. Better jobs exist, you deserve one, and I hope you make it there without too much suffering.


Interesting article. I don't think I agree with some of the points or maybe I just don't follow them exactly.

For example:

> Oscar announces, “I’m seeing some log entries from the web server that look a little weird. I’m gonna look at those.” This is the beginning of a 5-minute silence.

> During the silence, Deanna, Deepak, and Sylvain are all waiting, hoping that these log entries that Oscar just noticed turn out to be the smoking gun. They’re putting their eggs in the basket of Oscar’s intuition. Hopefully he’s seen this issue before, and any minute now he’ll say “Okay, I’m pushing a fix.”

> An incident commander is responsible for keeping the whole problem-solving effort moving forward. So it’s incumbent on you to interrupt this silence.

> Try drawing more information out of Oscar:

> - “Oscar, do you mind sharing your screen so Deepak and Deanna can see the weird log messages too?”

> - “What’s the error message, Oscar? Can you send a link to a log search?”

> - "Do we know when these log events started? Does that line up with when we started receiving these support tickets, Sylvain?”

This is totally a problem that happens during incidents. The problem of the group selecting on the first "I think I see something weird, let me check" idea is a great point made by the author. But having that person share their screen/talk through their thoughts doesn't really solve that problem, it just focuses the group on that idea (leaving any other ideas to be dropped). _Perhaps_ if other investigators are also familiar with the area being investigated, it's helpful to have multiple people looking at Oscar's screen, but that doesn't seem to scale past having ~3 people on the call. It also immediately makes the call be only dedicated to investigating the problem. That's not bad, but if you're in a scenario where support is being involved, you're likely going to be coordinating broader updates, messaging to customers, figuring out who else to pull in, etc. The point of the incident commander (imo) is to do those things, or ensure that all of those things are happening.

> “Let’s see here…”

> In order to keep a problem-solving effort moving forward, an incident commander should ensure that every new participant gets up-to-date knowledge of what the group is doing and why. For example, you could say to Deepak when he joins the call, “Hi Deepak. Right now, Oscar and Deanna are investigating a web server error message that might be related to failed stylesheet loads. You can see the error message in the chat.”

I think this should be done over Slack, as with basically any incident response meeting with more than... 3 people. One thing my org does that I'm happy with is creating a thread for an initial issue (and a Slack channel once it's identified as a bigger issue) with a quick 2-sentence summary. People post comments as they discover new things, which provides a timeline of the investigation and does a good job of showing what's been checked (and what hasn't). Honestly, unless the person giving the verbal summary is technically familiar with the issue at hand, they will frequently gloss over important things or highlight irrelevant things when trying to summarize what's happened so far. Not their fault; it's objectively hard to figure out what's relevant/irrelevant in the spur of the moment.
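
To give a flavour, the top of one of those channels might look like this (scenario, names, and times are invented):

    Summary: checkout error rate ~30% since 14:05 UTC, likely related to today's payments deploy.
    14:09 oscar: web tier is logging TLS handshake failures to payments-svc
    14:14 deanna: rolling back payments-svc to yesterday's build
    14:22 deanna: rollback done, error rate back under 1%
    14:25 sylvain: support tickets tailing off, posting the customer update

Each line is just a reply in the thread, so the post-incident timeline mostly writes itself.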

That said, I'm probably a bit biased because I don't like being on incident response calls in general. When I'm actively investigating an issue, being in a large incident response room makes things much harder for me to think. It feels like there's more pressure when people are waiting on the call for you to solve the problem, or if they're talking about other things it's just a distraction. My org has a culture of people replying to their own comments in Slack as they investigate, which makes the brainstorming over Slack feel a lot more intuitive, and it's easier to share error logs & snippets, or have multiple parallel conversations at once. And once the incident is over, it's a lot easier to have a precise incident timeline when you can use timestamps of comments.



