Hacker News
Leveraging AI for efficient incident response (fb.com)
110 points by Amaresh 4 months ago | 55 comments



We've shifted our oncall incident response over to mostly AI at this point. And it works quite well.

One of the main reasons why this works well is because we feed the models our incident playbooks and response knowledge bases.

These playbooks are very carefully written and maintained by people. The current generation of models is pretty much beyond human level at following them, performing reasoning, and suggesting mitigations.

We tried just indexing a bunch of incident Slack channels and the results were not great. But with explicit documentation, it works well.
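
To be clear on mechanics (this is not our internal tooling, just a minimal sketch of the idea using an OpenAI-compatible Python client, with hypothetical playbook paths): the playbooks go in as explicit context and the model is told to follow them.

    import glob
    from openai import OpenAI  # any OpenAI-compatible client works here

    client = OpenAI()

    # Load the human-maintained playbooks and KB articles (hypothetical paths).
    playbooks = "\n\n".join(open(p).read() for p in glob.glob("playbooks/*.md"))

    def suggest_mitigation(alert_text: str) -> str:
        # The playbooks go in as explicit context; the model is asked to follow
        # them rather than free-associate from training data.
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system",
                 "content": "Follow these incident playbooks exactly:\n\n" + playbooks},
                {"role": "user",
                 "content": "Alert:\n" + alert_text
                            + "\n\nSuggest a mitigation and cite the playbook step."},
            ],
        )
        return resp.choices[0].message.content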

Kind of proves what we already know: garbage in, garbage out. But also, other functions (e.g. PM, Design) have tried automating their own workflows, and it doesn't work as well.


I'm really curious to hear more about what kind of thing is covered in your playbooks. I've often heard and read about the value of playbooks, but I've yet to see it bear fruit in practice. My main work these past few years has been in platform engineering, and so I've also been involved in quite a few incidents over that time, and the only standardized action I can think of that has been relevant over that time is comparing SLIs between application versions and rolling back to a previous version if the newer version is failing. Beyond that, it's always been some new failure mode where the resolution wouldn't have been documented because it's never happened before.

On the investigation side of things I can definitely see how an AI-driven troubleshooting process could be valuable. Lots of developers lack debugging skills, so an AI-driven process that looks at the relevant metrics and logs and can reason about what the next line of inquiry should be could definitely speed things up.
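
Something like the loop below is what I have in mind. Just a sketch; llm, fetch_metrics and fetch_logs are hypothetical stand-ins for whatever model client and observability APIs you have.

    # Sketch of an AI-driven "what should we look at next?" loop.
    # llm, fetch_metrics and fetch_logs are hypothetical stand-ins.
    def investigate(incident_summary, llm, fetch_metrics, fetch_logs, max_steps=5):
        context = ["Incident: " + incident_summary]
        for _ in range(max_steps):
            step = llm("\n".join(context)
                       + "\nWhat single metric or log query should we check next, and why?")
            evidence = fetch_metrics(step) or fetch_logs(step)
            context += ["Proposed step: " + step, "Evidence: " + str(evidence)]
            verdict = llm("\n".join(context)
                          + "\nIs the root cause clear yet? If so state it; otherwise reply CONTINUE.")
            if "CONTINUE" not in verdict:
                return verdict
        return "No conclusion after %d steps" % max_steps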


Playbooks that I've found value in:

- Generic application version SLI comparison. The automated version of this is automated rollbacks (Harness supports this out of the box, but you can certainly find other competitors or build your own)

- Database performance debugging

- Disaster recovery (bad db delete/update, hardware failure, region failure)

In general, playbooks are useful either for common occurrences that happen frequently (e.g. every week we need to run a script to fix something in the app) or for things that happen rarely but need a plan when they do (e.g. disaster recovery).
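
The automated SLI-comparison/rollback version boils down to roughly this sketch, where get_sli and rollback are hypothetical stand-ins for your metrics backend and deploy tooling:

    # Sketch of the "compare SLIs between versions, roll back if the new one
    # regresses" playbook. get_sli() and rollback() are hypothetical stand-ins.
    ERROR_BUDGET_FACTOR = 2.0  # tolerate up to 2x the baseline error rate

    def check_release(service, old_version, new_version, get_sli, rollback):
        baseline = get_sli(service, old_version, "error_rate")
        candidate = get_sli(service, new_version, "error_rate")
        if candidate > baseline * ERROR_BUDGET_FACTOR:
            rollback(service, to_version=old_version)
            return False  # regressed, rolled back
        return True  # new version looks healthy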


Expert systems redux? Being able to provide the expertise in the form of plain written English (or another language) will at least make it much more feasible to build them up. And it can also meaningfully be consumed by a human.

If it works well for incident response, then there are many similar use cases - basically most kinds of diagnostics/troubleshooting of systems. At least the relatively bounded ones, where it is feasible to have documentation on the particular system. Say, debugging a building's HVAC system.


Why won't it hit the same limits, like the frame problem or the qualification problem?

Expert systems failed in part because of their inability to learn. And while HVAC is ladder logic, which I honestly haven't spent much time in, LLMs are inductive.

It will be a useful tool, but expert systems had a very restricted solution space.


I have found it rare that an organization has incident "playbooks that are very carefully written and maintained"

If you already have those, how much can an AI add? Or conversely, not surprising that it does well when it's given a pre-digested feed of all the answers in advance.


Meanwhile, we’ve tried AI products just for assigning incidents and are forced to turn them off because of how shitty of a job they do.


That's great to hear. What is your current tool chain for this effort? Do you have a structure for playbooks and KBs you would recommend?


Curious if you explored any external tools before building in-house? Looking to do something similar at my company


What does AI add to your playbooks?


I'm guessing the being awake and fresh at 3am within a few seconds of the incident occurring part.


I can execute a playbook at 3am in a few seconds using some orchestration tools. Without any AI.


Are you happy about waking up to do so?


If you get compensation for being on-call then why not? Unless it’s on a holiday eve.


Been automatically executing playbooks (Ansible) since before you were born. I sleep fine.

This is standard SRE/Ops practice. The monitoring system detects failures and automatically runs remediation.
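
E.g. the no-AI version is just an alert webhook wired to a runbook run. A minimal sketch using Flask and the ansible-playbook CLI (the playbook names are made up):

    # Minimal alert-webhook -> remediation sketch, no AI involved.
    # Playbook file names are made up; map alert names to your own runbooks.
    import subprocess
    from flask import Flask, request

    app = Flask(__name__)

    REMEDIATIONS = {
        "disk_full": "playbooks/clean_tmp.yml",
        "service_down": "playbooks/restart_service.yml",
    }

    @app.route("/alert", methods=["POST"])
    def handle_alert():
        alert = request.get_json()
        playbook = REMEDIATIONS.get(alert.get("name"))
        if playbook is None:
            return "no playbook for this alert", 404
        subprocess.run(["ansible-playbook", playbook], check=True)
        return "remediation started", 200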

You didn’t read the part where I said “using orchestration tools”.


> Been automatically executing playbooks (Ansible) since before you were born.

This made me look up how old Ansible was.

> Initial release: February 20, 2012; 12 years ago

https://en.m.wikipedia.org/wiki/Ansible_(software)


I'm really interested in the implied restriction/focus on “code changes.”

IME a very, very large number of impacting incidents aren't strictly tied to “a” code change, if any at all. It _feels_ like there's an implied solution of tying the running version back to the deployment rev, deployment artifacts, and VCS.

Boundary conditions and state changes in the distributed system were the biggest bugbear I ran into at AWS. Then below that were all of the “infra”-style failures like network faults, latency, API quota exhaustion, etc. And for all the cloudformation/cdk/terraform in the world, it's non-trivial to really discover those effects and tie them to a “code change.” Totally ignoring older tools that may be managed via CLI or the ol’ point and click.


From my experience, the vast majority of reliability issues at Meta come from 3 areas:

- Code changes

- Configuration changes (this includes the equivalent of server topology changes like cloudformation, quota changes)

- Experimentation rollout changes

There have been issues that are external (like user behavior changes for New Year / the World Cup final, a physical connection between datacenters being severed…) but they tend to be a lot less frequent.

All 3 big buckets are tied to a single trackable change with an ID, so this leads to the ability to do this kind of automated root cause analysis at scale.
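
Roughly, the shape of it: pull every change (code, config, experiment) that landed shortly before the incident and rank it as a candidate. A deliberately naive sketch, with list_changes as a hypothetical API over the unified change log:

    # Sketch: rank recent changes (code, config, experiment) as root-cause
    # candidates for an incident. list_changes() is a hypothetical API over a
    # unified change log; the scoring is deliberately naive.
    from datetime import timedelta

    def candidate_changes(incident_start, affected_service, list_changes, top_k=5):
        changes = list_changes(since=incident_start - timedelta(hours=2),
                               until=incident_start)

        def score(change):
            s = 2.0 if affected_service in change.get("services", []) else 0.0
            age_min = (incident_start - change["time"]).total_seconds() / 60
            return s + max(0.0, 1.0 - age_min / 120)  # newer changes score higher

        return sorted(changes, key=score, reverse=True)[:top_k]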

Now, Meta is mostly a closed loop where all the infra and product is controlled as one entity so those results may not be applicable outside.


Interesting. It sounds like “all” service state management (admin config, infra, topology) is discoverable/legible for Meta. I think that contrasts with AWS where there is a strong DevTools org, but many services and integrations are more of an API-centric service-to-service model with distributed state, which is much harder to observe. Every cloud provider I know of also has an (externally opaque) division between “native” cloud-service-built-on-cloud-infra and (typically older) “foundational” services that are much closer to “bare metal” with their own bespoke provisioning and management. E.g. EC2 has great visibility inside of their placement and launch flows, but it'll never look like/interop with cfn & cloudtrail that ~280 other “native” services use.

Definitely agree that the bulk of “impact” traces back to changes introduced in the SDLC. Even for major incidents, infrastructure is probably down to 10-20% of causes in a good org. My view in GP is probably skewed towards major incidents impairing multiple services/regions as well. While I worked on a handful of services, it was mostly edge/infra side, and I focused the last few years specifically on major incident management.

I'd still be curious about internal system state and faults due to issues like deadlocked workflows, incoherent state machines, and invalid state values. But maybe it's simply not that prevalent.


> this leads to the ability to do this kind of automated root cause analysis at scale.

I'm curious how well that works in the situation where your config change or experiment rollout results in a time bomb (e.g. triggered by task restart after software rollout), speaking as someone who just came off an oncall shift where that was one of our more notable outages.

Google also has a ledger of production events which _most_ common infra will write to, but there are so many distinct systems that I would be worried about identifying spurious correlations with completely unrelated products.

> There have been issues that are external (like ... a physical connection between datacenters being severed…) but they tend to be a lot less frequent.

That's interesting to hear, because my experience at Google is that we'll see a peering metro being fully isolated from our network at least once a year; smaller fiber cuts that temporarily leave us with a SPOF or with a capacity shortfall happen much much more frequently.

(For a concrete example: a couple months ago, Hurricane Beryl temporarily took a bunch of peering infrastructure in Texas offline.)


> IME a very, very large number of impacting incidents aren't strictly tied to “a” code change, if any at all

Usually this implies there are bigger problems. If something keeps breaking without any change (config / code) then it was likely always broken and just ignored.

So when companies do have most of the low hanging fruit resolved it's the changes that break things.

I've seen places where everything is duct-taped together, but it still only breaks on code changes. Everyone learns to avoid stressing anything fragile.


See the other child reply upthread: lots of service-to-service style interactions that look more like distributed state than a CR. And my view was across an org scope where even “infrequent” quickly accumulated. AWS is on the order of 50,000 SDEs running 300 public services (plus many more internal), each team/microservice with 50 independent deployment targets.


At my place 90% of them are 3rd parties going down, and you can't do much other than leave. But the new 3rd parties are just as bad. All you can do is gracefully handle failure.


Interestingly, with the move to IaC, diagnosing at the level of code change makes increasing sense. It's impressive to see their results given that perspective. Not obvious!

Separately, we have been curious about extending louie.ai to work not just with logs/DBs, but to go in the reverse direction ('shift right'): talk directly to a live OSAgent like an EDR or OSQuery, whether on a live system or a cloud image copy. If of interest to any teams, would love to chat.


> The biggest lever to achieving 42% accuracy was fine-tuning a Llama 2 (7B) model

42% accuracy on a tiny, outdated model - surely it would improve significantly by fine-tuning Llama 3.1 405B!


Yes, very interesting potential. It looks like the accuracy could be increased considerably, because Llama 3.1 with 405B parameters has very similar performance to the latest GPT-4o.


We've open sourced something with similar goals that you can use today: https://github.com/robusta-dev/holmesgpt/

We're taking a slightly different angle than what Facebook published, in that we're primarily using tool calling and observability data to run investigations.

What we've released really shines at surfacing relevant observability data automatically, and we're soon planning to add the change-tracking elements mentioned in the Facebook post.

If anyone is curious, I did a webinar with PagerDuty on this recently.



Can we see the recording of this webinar somewhere?



The paper goes out of its way not to compare the 42% figure with anything. Is "42% within the top 5 suggestions" good or bad?

How would an experienced engineer score on the same task?


Interesting. Just a few weeks back, I was reading about their previous work https://atscaleconference.com/the-evolution-of-aiops-at-meta... -- didn't realise there's more work!

Also, some more research in a similar space by other enterprises:

Microsoft: https://yinfangchen.github.io/assets/pdf/rcacopilot_paper.pd...

Salesforce: https://blog.salesforceairesearch.com/pyrca/

Personal plug: I'm building a self-service AIOps platform for engineering teams (somewhat similar to this work by Meta). If you're looking to read more about it, visit -- https://docs.drdroid.io/docs/doctor-droid-aiops-platform


I would love if they leveraged AI to detect AI on the regular Facebook feed. I visit occasionally and it’s just a wasteland of unbelievable AI content with tens of thousands of bot (I assume…) likes. Makes me sick to my stomach and I can’t even browse.


I do think AI will automate a lot of the grunt work involved with incidents and make the life of on-call engineers better.

We are currently working on this at: https://github.com/opslane/opslane

We are starting by tackling alert enrichment.


Way back in the day on FB Ads we trained a GBDT on a bunch of features extracted from the diff that had been (post-hoc) identified as the cause of a SEV.

Unlike a modern LLM (or almost any non-trivial NN), a GBDT’s feature importance is defensibly rigorous.
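
The idea, sketched here with scikit-learn standing in for the internal trainer we actually used, and with made-up diff features:

    # Sketch of training a GBDT on diff features and reading off feature
    # importances. Feature names are invented; scikit-learn stands in for
    # the internal trainer.
    from sklearn.ensemble import GradientBoostingClassifier

    FEATURES = ["lines_changed", "files_touched", "touches_config",
                "author_tenure_days", "test_coverage_delta"]

    def train_sev_model(X, y):
        # X: rows of diff features in FEATURES order; y: 1 if the diff caused a SEV
        model = GradientBoostingClassifier().fit(X, y)
        for name, importance in zip(FEATURES, model.feature_importances_):
            print(f"{name}: {importance:.3f}")
        return model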

After floating the results to a few folks up the chain we burned it and forgot where.


PSA:

9 times out of 10, you can and should write "using" instead of "leveraging".


Given how AI can automate and scale bad decisions, isn’t leveraging the right word here?


nice to see meta investing in AI investigation tools! but 42% accuracy doesn't sound too impressive to me... maybe there's still some fine-tuning needed for better results? glad to hear about the progress though!


Really? A tool where, in 42% of incident responses, the on-call engineers are greeted by a pointer that likely lets them resolve the incident almost immediately and move on, rather than spending potentially hours figuring out which component they need to address and how, isn't impressive to you?


It depends on whether the other 58% of answers lead on-call engineers down the wrong path. Honestly, it's more of a question -- I did not read the article deeply.


This is really cool. My optimistic take on GenAI, at least with regard to software engineering, is that it seems like we're gonna have a lot of the boring / tedious parts of our jobs get a lot easier!


Claude 3.5 Sonnet still can’t cut me a diff summary based on the patch that I’m generally willing to hand in as my own work and it’s by far the best API-mediated, investor-subsidized one.

Forget the diff, I don’t want my name on the natural language summary.


You mean it doesn't understand the change you've made based on the diff?


Even under the most generous nomenclature, no contemporary LLM understands anything.

They approximate argmax_token P_theta(token | prefix).

This approximation is sometimes useful. I’ve found it to never be useful in writing code or prose about code of any difficulty. That’s my personal anecdote, but one will note that OpenAI and Anthropic still employ a great many software engineers.


I know that, and likely everyone here knows that. But "understanding" is a good approximation for what we mean. Pointing out the implementation is needlessly pedantic.


AI 1: This user is suspicious, lock account

User: Ahh, got locked out, contact support and wait

AI 2: The user is not suspicious, unlock account

User: Great, thank you

AI 1: This account is suspicious, lock account


Luckily I subscribe to my own consumer AI service to automate all this for me. To paraphrase The Simpsons: "AI: the cause of and solution to all life's problems."


I would be more interested to understand how they deal with injection attacks. Any alert where the attacker controls some part of the text that ends up in the model could be used to either evade it or, worse, to hack it. Slack had an issue like that recently.


This is exactly what we do at OneUptime.com: show you AI-generated possible incident remediations based on your data + telemetry + code. All of this is 100% open-source.


I'm going to point out the obvious problem here: 42% RC identification is shit.

That means the first person on the call doing the triage has a 58% chance of being fed misinformation and bias which they have to distinguish from reality.

Of course you can't say anything about an ML model being bad that you are promoting for your business.


No. You're missing the UX forest for the pedantry trees here. I've worked on a team that did similar change detection with little to no ML magic. It matters how it's presented: as a hint (“top five suggested”) and not THE ANSWER. In addition it's VERY common to do things like present confidence or weight to the user. And it's why there's a huge need for explainability.

And this is just part of the diagnosis process. The system should still be providing breadcrumbs or shortcuts for the user to test the suggested hypothesis.

Which is why any responsible system like this will include feedback loops and evaluation of false positive/negative outcomes and tune for sensitivity & specificity over time.


No I'm not. It's crap.

I have about 30 years experience both on hard engineering (electronics) and software engineering particularly on failure analysis and reliability engineering. Most people are lazy and get led astray with false information. This is a very dangerous thing. You need a proper conceptualisation framework like a KT problem analysis to eliminate incorrect causes and keep people thinking rationally and get your MTTR down to something reasonable.


Sounds like you're projecting your own laziness and shortcomings on others. This is a tool that seems really helpful considering the alternative is 0%.


Personal insults aside, "seems" requires no evaluation if the success rate is outside what could be considered a sane confidence interval on trust.

I would literally be fired if I implemented this tool.


Calling things 'shit' and 'crap,' and then claiming that the authors actually feel the same but can't say it, is ridiculous and undermines any authority you think you have.



