Forgive me if this is a dumb comment to make, as I'm only just starting to get into monitoring and the statistics knowledge that goes along with it, but adaptive fault detection does tend to scare me a bit. If a problem isn't a spike, and instead builds up gradually over hours/days/weeks, I wouldn't be confident in something picking a dynamic threshold for me. I'd be afraid of it deeming the ever-rising resource usage normal behavior, if it happens slowly enough, and of not being alerted before it's too late (servers becoming unresponsive).
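To make the worry concrete, here's a toy sketch (this is not Kale's actual algorithm, and the function names and parameters are made up for illustration): a threshold that adapts via an exponentially weighted moving average and variance will happily absorb a slow ramp as the new "normal," while still catching a sudden jump.

```python
def ewma_alerts(series, alpha=0.1, k=3.0, warmup=30):
    """Toy adaptive threshold: flag points more than k deviations
    above an exponentially weighted moving average."""
    mean, var = series[0], 0.0
    alerts = []
    for i, x in enumerate(series[1:], start=1):
        if i >= warmup and x > mean + k * var ** 0.5:
            alerts.append(i)
        # The threshold adapts: mean and variance chase the data,
        # so a slow enough trend gets absorbed as the new "normal".
        diff = x - mean
        mean += alpha * diff
        var = (1 - alpha) * (var + alpha * diff * diff)
    return alerts

slow_ramp = [10 + 0.1 * i for i in range(800)]  # creeps from 10 up to ~90
spike = [10.0] * 100 + [60.0]                   # flat, then a sudden jump

print(ewma_alerts(slow_ramp))  # → []    the gradual drift is never flagged
print(ewma_alerts(spike))      # → [100] the spike is caught
```

The ramp raises resource usage by a factor of nine without ever tripping the alert, because the moving average and deviation grow right along with it. That's exactly the scenario described above.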
That's not at all a dumb comment. As I alluded to in the post, I think it's important that we understand how these systems determine what is - or isn't - an abnormality or fault. Unfortunately, that often means revealing their "secret sauce" and risking the exposure of their product differentiation. It's going to be interesting to see how these products earn our trust.
Absolutely - this is one of the reasons we made Kale open source, so that people can see what we consider an anomaly and adapt it for their own use cases if needed. If your anomaly detection contains secret sauce, it'll be very hard for people to have confidence in it.