We've shifted our oncall incident response over to mostly AI at this point. And it works quite well.
One of the main reasons why this works well is because we feed the models our incident playbooks and response knowledge bases.
These playbooks are very carefully written and maintained by people. The current generation of models are pretty much post-human in following them, performing reasoning and suggesting mitigations.
We tried indexing just a bunch of incident slack channels and result was not great. But with explicit documentation, it works well.
Kind of proves what we already know, garbage in, garbage out. But also, other functions, eg: PM, Design have tried automating their own workflows, but doesn't work as well.
I'm really curious to hear more about what kind of thing is covered in your playbooks. I've often heard and read about the value of playbooks, but I've yet to see it bear fruit in practice. My main work these past few years has been in platform engineering, and so I've also been involved in quite a few incidents over that time, and the only standardized action I can think of that has been relevant over that time is comparing SLIs between application versions and rolling back to a previous version if the newer version is failing. Beyond that, it's always been some new failure mode where the resolution wouldn't have been documented because it's never happened before.
On the investigation side of things I can definitely see how an AI driven troubleshooting process could be valuable. Lots of developers are lacking debugging skills, so an AI driven process that looks at the relevant metrics and logs and can reason around what the next line of inquiry should be could definitely speed things up.
Playbooks that I've found value in:
- Generic application version SLI comparison. The automated version of this is automated rollbacks (Harness supports this out of the box, but you can certainly find other competitors or build your own)
- Database performance debugging
- Disaster recovery (bad db delete/update, hardware failure, region failure)
In general, playbooks are useful for either common occurences that happen frequently (ie every week we need to run a script to fix something in the app) or things that happen rarely but when they do happen need a plan (ie disaster recovery)
Expert systems redux?
Being able to provide the expertise in form of plain written English (or another language), will at least make it much more feasible to build them up. And it can also meaningfully be consumed by a human.
If it works well for incident response, then there are many usecases that are similar - basically most kinds of diagnostics/troubleshooting of systems. At least the relatively bounded ones, where it is feasible to on have documentation on the particular system. Say debugging of a building HVAC system.
Why won't it hit the same limits of frame problem or qualification problem?
Expert systems failed in part because of the inability to learn, while HVAC is ladder logic, that I honestly haven't spent much time in, LLMs are inductive.
It will be a useful tool, but expert systems had a very restricted solution space.
I have found it rare that an organization has incident "playbooks that are very carefully written and maintained"
If you already have those, how much can an AI add? Or conversely, not surprising that it does well when it's given a pre-digested feed of all the answers in advance.
Im really interested in the implied restriction/focus on “code changes.”
IME a very very large number of impacting incidents arent strictly tied to “a” code change, if any at all. It _feels_ like theres an implied solution to tying running version back to deployment rev, to deployment artifacts, and vcs.
Boundary conditions and state changes in the distributed system were the biggest bug bear I ran in to at AWS. Then below that were all of the “infra” style failures like network faults, latency, API quota exhaustion, etc. And for all the cloudformation/cdk/terraform in the world its non trivial to really discover those effects and tie them to a “code change.” Totally ignoring older tools that may be managed via CLI or the ol’ point and click.
From my experience, the vast majority of reliability issues at Meta come from 3 areas:
- Code changes
- Configuration changes (this includes the equivalent of server topology changes like cloudformation, quota changes)
- Experimentation rollout changes
There has been issues that are external (like user behavior change for new year / world cup final, physical connection between datacenters being severed…) but they tend to be a lot less frequent.
All the 3 big buckets are tied to a single trackable change with an id so this leads to the ability to do those kind of automated root cause analysis at scale.
Now, Meta is mostly a closed loop where all the infra and product is controlled as one entity so those results may not be applicable outside.
Interesting. It sounds like “all” service state management (admin config, infra, topology) is discoverable/legible for meta. I think that contrasts with AWS where there is a strong DevTools org, but many services and integrations are more of an API centric service-to-service model with distributed state which is much harder to observe. Every cloud provider I know of also has a (externally opaque) division between “native” cloud-service-built-on-cloud-infra and (typically older) “foundational” services that are much closer to “bare metal” with their own bespoke provisioning and management. Ex EC2 has great visibility inside of their placement and launch flows, but itll never look like/interop with cfn & cloudtrail that ~280 other “native” services use.
Definitely agree that the bulk Of “impact” is back to changes introduced in the SDLC. Even for major incidents infrastructure is probably down to 10-20% of causes in a good org. My view in GP is probably skewed towards major incidents impairing multiple services/regions as well. While I worked on a handful of services it was mostly edge/infra side, and I focused the last few years specifically on major incident management.
Id still be curious about internal system state and faults due to issues like deadlocked workflows, incoherent state machines, and invalid state values. But maybe its simply not that prevalent.
> this leads to the ability to do those kind of automated root cause analysis at scale.
I'm curious how well that works in the situation where your config change or experiment rollout results in a time bomb (e.g. triggered by task restart after software rollout), speaking as someone who just came off an oncall shift where that was one of our more notable outages.
Google also has a ledger of production events which _most_ common infra will write to, but there are so many distinct systems that I would be worried about identifying spurious correlations with completely unrelated products.
> There has been issues that are external (like ... physical connection between datacenters being severed…) but they tend to be a lot less frequent.
That's interesting to hear, because my experience at Google is that we'll see a peering metro being fully isolated from our network at least once a year; smaller fiber cuts that temporarily leave us with a SPOF or with a capacity shortfall happen much much more frequently.
(For a concrete example: a couple months ago, Hurricane Beryl temporarily took a bunch of peering infrastructure in Texas offline.)
> IME a very very large number of impacting incidents arent strictly tied to “a” code change, if any at all
Usually this implies there are bigger problems. If something keeps breaking without any change (config / code) then it was likely always broken and just ignored.
So when companies do have most of the low hanging fruit resolved it's the changes that break things.
I've seen places where everything is duck taped together but BUT it still only breaks on code changes. Everyone learns to avoid stressing anything fragile.
See other child reply upthread, lots of service-to-service style interactions that look more like distributed state than a CR. And my view was across an org scope where even “infrequent” quickly accumulated. AWS is on the order of 50,000 SDEs, running 300 public services (plus a multiple more internal), and each team/microservice with 50 independent deployment targets.
At my place 90% of them are 3rd parties going down, and you can't do much other than leave. But the new 3rd parties are just as bad. All you can do gracefully handle failure.
Interestingly, with the move to IaC, diagnosing at the level of code change makes increasing sense. It's impressive to see their results given that perspective. Not obvious!
Seperately, we have been curious about extending louie.ai to work not just with logs/DBs, but go in the reverse direction ('shift right'): talk directly to a live OSAgent like an EDR or OSQuery, whether on a live system or a cloud image copy. If of interest to any teams, would love to chat.
Yes very interesting potential, it looks like it can be increased in accuracy considerably because Llama 3.1 with 405B parameters has very similar performance with the latest GPT-4o.
We're taking a slightly different angle than what Facebook published, in that we're primarily using tool calling and observability data to run investigations.
What we've released really shines at surfacing up relevant observability data automatically, and we're soon planning to add the change-tracking elements mentioned in the Facebook post.
If anyone is curious, I did a webinar with PagerDuty on this recently.
Personal plug: I'm building a self-service AIOps platform for engineering teams (somewhat similar to this work by Meta). If you're looking to read more about it, visit -- https://docs.drdroid.io/docs/doctor-droid-aiops-platform
I would love if they leveraged AI to detect AI on the regular Facebook feed. I visit occasionally and it’s just a wasteland of unbelievable AI content with tens of thousands of bot (I assume…) likes. Makes me sick to my stomach and I can’t even browse.
Way back in the day on FB Ads we trained a GBDT on a bunch of features extracted from the diff that had been (post-hoc) identified as the cause of a SEV.
Unlike a modern LLM (or most any non-trivial NN), a GBDT’s feature importance is defensively rigorous.
After floating the results to a few folks up the chain we burned it and forget where.
nice to see meta investing in AI investigation tools! but 42% accuracy doesn't sound too impressive to me... maybe there's still some fine-tuning needed for better results? glad to hear about the progress though!
Really, a tool where 42% of incident responses the on call engineers are greeted by a pointer that likely lets them resolve the incident almost immediately and move on, rather than spending potentially hours figuring out which component it is they need to address and how, isn't impressive to you?
It depends on whether it's generating 58% of answers that lead on-call engineers down the wrong path. Honestly, it's more of a question -- I did not read the article deeply.
This is really cool. My optimistic take on GenAI, at least with regard to software engineering, is that it seems like we're gonna have a lot of the boring / tedious parts of our jobs get a lot easier!
Claude 3.5 Sonnet still can’t cut me a diff summary based on the patch that I’m generally willing to hand in as my own work and it’s by far the best API-mediated, investor-subsidized one.
Forget the diff, I don’t want my name on the natural language summary.
Even under the most generous nomenclature, no contemporary LLM understands anything.
They approximate argmax(P_sub_theta(token|prefix)).
This approximation is sometimes useful. I’ve found it to never be useful in writing code or prose about code of any difficulty. That’s my personal anecdote, but one will note that OpenAI and Anthropic still employ a great many software engineers.
I know that, likely everyone here knows that. But understanding is a good approximation for what we mean. Pointing out implementation is needlessly pedantic.
Luckily I subscribe to my own consumer AI service to automate all this for me. To paraphrase The Simpsons: "AI: the cause of and solution to all life's problems."
I will be more interested to understand how they deal with injection attacks. Any alert where the attacker controls some parts of the text that ends up in the model could be used to either evade it worse use it to hack it. Slack had an issue like that recently.
This is exactly what we do at OneUptime.com. Show you AI generated possible Incident remediation based on your data + telemetry + code. All of this is 100% open-source.
I'm going to point out the obvious problem here: 42% RC identification is shit.
That means the first person on the call doing the triage has a 58% chance of being fed misinformation and bias which they have to distinguish from reality.
Of course you can't say anything about an ML model being bad that you are promoting for your business.
No. Youre missing the UX forest for the pedantry trees here. Ive worked on a team that did similar change detection with little to no ML magic. It matters how its presented as a hint (“top five suggested”) and not THE ANSWER. In addition its VERY common to do things like present confidence or weight to the user. And why theres a huge need for explainability.
And this is just part of the diagnosis process. The system should still be providing breadcrumbs or short cuts for the user to test the suggested hypothesis.
Which is why any responsible system like this will include feedback loops and evaluation of false positive/negative outcomes and tune for sensitivity & specificity over time.
I have about 30 years experience both on hard engineering (electronics) and software engineering particularly on failure analysis and reliability engineering. Most people are lazy and get led astray with false information. This is a very dangerous thing. You need a proper conceptualisation framework like a KT problem analysis to eliminate incorrect causes and keep people thinking rationally and get your MTTR down to something reasonable.
Sounds like you're projecting your own laziness and shortcomings on others. This is a tool that seems really helpful considering the alternative is 0%.
Calling things 'shit' and 'crap,' and then claiming that the authors actually feel the same but can't say it, is ridiculous and undermines any authority you think you have.
One of the main reasons why this works well is because we feed the models our incident playbooks and response knowledge bases.
These playbooks are very carefully written and maintained by people. The current generation of models are pretty much post-human in following them, performing reasoning and suggesting mitigations.
We tried indexing just a bunch of incident slack channels and result was not great. But with explicit documentation, it works well.
Kind of proves what we already know, garbage in, garbage out. But also, other functions, eg: PM, Design have tried automating their own workflows, but doesn't work as well.