A Benchmark for Streaming Anomaly Detection (dominodatalab.com)
102 points by gk1 on April 14, 2017 | 13 comments



I highly recommend Jeff Hawkins' (the founder of Numenta) book On Intelligence to anyone interested in anomaly detection and AI in general. Although HTMs haven't been as successful as neural networks, his account of what intelligence is comes across as very lucid and well reasoned. Even if he turns out to be wrong about the technology side of things, I'm happy to have learned a thing or two about the neocortex.


Okay, I used to work in anomaly detection, right, for the more important server farms and networks, in particular on real-time, zero-day detection, that is, detecting problems and anomalies never seen before.

The OP has 6 "ideal characteristics of a real-time anomaly detection algorithm". My work seems to do relatively well on the criteria.

A challenging criterion is the 6th one:

"The algorithm should minimize false positives and false negatives."

So, no false positives and no false negatives would make a perfect detector, and in practice that is rarely possible.

A more realistic formulation of the criterion is: for a given rate of false positives (false alarms), achieve the lowest possible rate of false negatives (missed detections of actual anomalies).

How to do this is the content of the now classic Neyman-Pearson result. Right, in some cases the computations reduce to a knapsack problem, which is, IIRC, NP-hard.

Really, another important criterion is to have the false alarm rate adjustable, that is, to be able to set the false alarm rate in advance and get that rate exactly in practice.

Yup, that's possible.

Why? Really, anomaly detection is nearly necessarily a continually applied statistical hypothesis test where the null hypothesis is that the system being monitored, that is, the source of the data, is healthy and the data are normal, not an anomaly or a symptom of a problem.

So, we are into statistical hypothesis testing, and that is the context of the classic Neyman-Pearson result.

The literature on statistical hypothesis testing goes back at least to the Neyman-Pearson lemma of 1933; the field is old, with a lot known.

In that field, it is usual to be able to set the rate of false alarms (type I error) in advance and get that rate exactly in practice.
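
As a minimal sketch (mine, not from the article or this commenter, and assuming, purely for illustration, a Gaussian null and a one-sided test on a z-scored statistic): pick the alarm threshold as the (1 - alpha) quantile of the null distribution, and the type I error rate is alpha by construction.

    # Minimal sketch: threshold for a target false alarm (type I error) rate,
    # assuming (only for this illustration) a standard Gaussian null.
    from scipy.stats import norm

    alpha = 0.01                         # target false alarm rate
    threshold = norm.ppf(1.0 - alpha)    # (1 - alpha) quantile of the null

    def alarm(z):
        # If the Gaussian null really holds, P(alarm | healthy) = alpha.
        return z > threshold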

For monitoring server farms and networks, there are two additional criteria that are just crucial -- right, in particular for getting a low rate of missed detections (false negatives, type II errors) for a given rate of false positives:

(1) Distribution-free.

The easy way to be able to do the calculations for a hypothesis test is to assume that we know the probability distribution of the input data when the null hypothesis holds. There, sure, the most common distribution assumed is the Gaussian.

But there are also statistical hypothesis tests that are distribution-free (also called non-parametric, e.g., not using the parameters of a Gaussian distribution), that is, tests that make no assumptions about the probability distribution of the data when the null hypothesis holds.

For monitoring server farms and networks, more often than not we should be using distribution-free tests.
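
A minimal sketch of the distribution-free idea (my illustration, not this commenter's code): under the null, a new healthy observation is exchangeable with n held-out healthy observations, so the chance it exceeds their m-th largest value is exactly m/(n + 1), with no distributional assumption at all. That gives an exact, adjustable false alarm rate from healthy historical data alone.

    import numpy as np

    def distribution_free_threshold(healthy_scores, alpha):
        # Under the null, a new score is exchangeable with the held-out healthy
        # scores, so P(new > m-th largest) = m / (n + 1) exactly, for any
        # continuous distribution.  Pick the largest m with m/(n+1) <= alpha.
        n = len(healthy_scores)
        m = int(np.floor(alpha * (n + 1)))
        if m < 1:
            raise ValueError("need more healthy data for this small an alpha")
        return np.sort(healthy_scores)[n - m]       # the m-th largest score

    rng = np.random.default_rng(0)
    healthy = rng.lognormal(size=5000)              # decidedly non-Gaussian
    t = distribution_free_threshold(healthy, alpha=0.01)
    print((rng.lognormal(size=100000) > t).mean())  # close to 0.01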

(2) Multi-dimensional

The data readily available from monitoring server farms and networks is horrendously large, like flowing oceans of data.

So, it's easy to get data on each of dozens of variables at rates from one point every few seconds up to hundreds of points a second.

Treating such jointly multi-dimensional data one variable at a time must typically throw away a huge fraction of the relevant information. So, the data should be handled jointly. So, we need statistical hypothesis tests that are multi-dimensional, and those are not so easy to find in the literature.

So, the work I did for detecting zero-day problems in server farms and networks was to create a large collection of statistical hypothesis tests that are both distribution-free and multi-dimensional.

E.g., if the data consist of a time series of points on a checkerboard and the normal data fall on the red squares, then my work will detect, with a known and adjustable false alarm rate achieved exactly in practice, points on the black squares. In this case, treating the data projected onto either of two perpendicular edges (axes) of the checkerboard separately yields exactly nothing. So, this example illustrates the extra power of being multi-dimensional. Also, the distribution of the data when the null hypothesis holds is just the red squares, that is, far from anything common in statistics.
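
A quick numerical illustration of the checkerboard point (my own sketch, not the commenter's code): both one-dimensional marginals of the healthy data are uniform, so any single-variable test is blind, while the joint view separates red squares from black squares immediately.

    import numpy as np

    rng = np.random.default_rng(0)

    def on_red(x, y):
        # "red" squares of a unit checkerboard: floor(x) + floor(y) is even
        return (np.floor(x) + np.floor(y)) % 2 == 0

    # healthy (null) data: uniform over the red squares of an 8x8 board
    pts = rng.uniform(0, 8, size=(200000, 2))
    healthy = pts[on_red(pts[:, 0], pts[:, 1])]

    # each one-dimensional marginal is flat, exactly as it would be for points
    # on the black squares, so a per-axis test sees nothing...
    print(np.histogram(healthy[:, 0], bins=8, range=(0, 8))[0])

    # ...while the joint test flags a point on a black square at once
    print(on_red(0.5, 1.5))   # False -> anomalous under the checkerboard null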

Moreover, one can argue that there are some good qualitative reasons to believe that, really, a good detector for zero-day problems in server farms and networks should be able to tell when it is inside a fractal such as the Mandelbrot set and when it is not. Yes, my work does that.

But, I have to say, I did my work -- at IBM's Watson lab to improve on our approaches via AI -- and published it decades ago, and since then I've seen essentially no interest at all in the work.

For more, for years I jerked the chains of a huge fraction of all the venture capital firms in the US with essentially no interest.

My conclusion is that good work in anomaly detection and a dime won't cover a ten-cent cup of coffee.

But, if by now anyone really is interested, then the remarks here should help.


Assumptions of stationarity do limit things a bit, but it might be possible to extend the argument to weak or limited dependence. Once I was looking at this problem through a differential-geometry lens. Adapted to your setting, the nearest neighbors would be computed according to a data-dependent metric, not just the Euclidean one. The working hypothesis was that under normal circumstances the process is constrained to stay in the vicinity of some lower-dimensional manifold. This is reasonable because of the many "near invariants" among the dimensions.

I wonder why no one really bites on this. I think the only way to monetize it would be to build one's own operation where the monitoring is done better, rather than to sell better monitoring to others. Unfortunately, that's a very capital-intensive option.

Someday we should chat more on this.


"Should chat more"? Likely yes. Except anomaly detection and a dime won't cover a 10 cent cup of coffee.

Ah, maybe no one cares about aspirin until 3 AM with a really bad headache.


Heh! I know what you are saying, but I don't know why the state of affairs has been this way.


I would be really interested to read your publication on this. We are trying to do anomaly detection on a large stream of traceroutes coming in continuously from approx. 10,000 locations across the internet.


I would also like to read this.


As in my profile here,

https://news.ycombinator.com/user?id=graycat

somehow let me have an e-mail address.


I've included my email in my profile description. Could you please also send the paper to me?


I left my email address in the `about` field of my hn profile.


Now you should have both the reference to the paper and a PDF of the paper.

Thanks for your interest. Hope the paper is useful.

My posts in this thread should help in reading the paper.


Part I

Here at HN I'm trying to stay anonymous but still make a contribution on anomaly detection.

The basic idea of my research is, for a positive integer k, to use k-nearest neighbors or, really, any of a wide variety of such measures, metrics, functions, windows, however you want to regard them.

So, right, we are on the way to some approaches to multi-dimensional probability density estimation.
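
To make that concrete, here is a minimal sketch of the kind of k-nearest-neighbor score meant above (my own illustration; the specific score and the choice of k are mine, not necessarily what the paper uses): the distance from a new observation to its k-th nearest healthy neighbor is a crude surrogate for low local density, hence a multi-dimensional anomaly score.

    import numpy as np

    def knn_score(x, healthy, k=10):
        # distance from the new observation x (a vector in R^n) to its k-th
        # nearest neighbor among the healthy historical points; a large
        # distance means low estimated local density, i.e., more anomalous
        d = np.linalg.norm(healthy - x, axis=1)
        return np.partition(d, k - 1)[k - 1]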

The main contribution of my research is the applied probability derivations that show how to calculate, adjust, set, etc. the false alarm rate. In practice, a false alarm rate that is unknown, out of control, and often too high is a big problem -- e.g., it can cause staff just to ignore too many alarms. E.g., see the movie Tora! Tora! Tora! about the 12/7/1941 attack on Pearl Harbor, where the US actually had just what the heck it needed for early detection, early enough to get fighter airplanes in the air, maybe early enough to get the ships out of the harbor, etc. -- radar on top of a hill. And the radar operators DID get the detection and DID report it, but the chain of command assumed that the detection was a false alarm. Big, huge, gigantic bummer. The US came within the thickness of an onion skin of getting its airplanes and ships out of harm's way, shooting down many of the attacking airplanes, and having the US carriers, already at sea, attacking the attacking carriers while all their airplanes were off over Hawaii. The US could have sustained little damage and sunk all six of the attacking carriers.

Having a good anomaly detector and ignoring it can be a bummer.

But if we know what the false alarm rate is and know that it's low, then maybe the chain of command will take a detection more seriously and, then, take the next step -- diagnosis.

Sometimes the spook business, say, the Soviets in England, would watch the comings, goings, lights on late, etc. at British military HQs and try to detect anomalies that would indicate war. Sure, for such detection we need tests that are both distribution-free and multi-dimensional, with good control over the false alarm rate and good evidence of a relatively high detection rate. So, sure, the US CIA should want my work. I wrote them -- never heard back.

Once I did give a talk on my work at the Institute for Defense Analyses (IDA), a think-tank resource for the Joint Chiefs. Some in the audience mentioned credit card fraud detection. Okay. But there should also be national security applications.

Once I gave a talk on the work at the Johns Hopkins University Applied Physics Lab (JHU/APL) -- I might as well have talked about a recipe for pancakes.

Once I gave a talk on the work at the NASDAQ site in Trumbull, CT. I got a nice tour, a nice lunch, and a nice tote bag.

I gave some talks on this work to my management chain at IBM's Watson lab -- the suits were afraid I might have some good work they didn't understand, couldn't control, and couldn't throttle. They were PISSED.

I presented my work to some IBM people outside of Watson Research -- they liked the work, but Watson Research was PISSED.

For the applied probability stuff in my research, for a detection of an anomaly I can also report the lowest false alarm rate at which the data would still count as a detection (essentially a p-value). Then, right along, we can graph these lowest rates and use them as a heuristic for how serious a detection is. Net, getting a handle on, a knob to turn for, and knowledge of the false alarm rate is important.

The applied probability stuff is based on some cute symmetry in the calculations. The symmetry, then, gives some measure-preserving transformations (transformations that don't change the probability distribution). So, we end up with a finite algebraic group of such transformations. Then sum over the group and, presto, bingo, get what we need to know the false alarm rate.

To adjust the false alarm rate, adjust the diameter, cut-off, whatever, of the k nearest neighbors or of whatever other metric, convolution window, etc. is used. Of course, all of this is in terms of conditional probabilities, conditioned on the historical data used, where we assume that the system being monitored was healthy when the historical data were collected. How to confirm healthy? Let the data age and see whether any other, later symptoms of problems show up by other means, e.g., angry customer complaints, malware, attacks and intrusions, etc.
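
A minimal sketch of that adjustment (mine; the leave-one-out scoring and the names are my assumptions, not the paper's construction): score the healthy history with a k-nearest-neighbor score like the one sketched above, then take an upper order statistic of those scores as the cut-off, so the target false alarm rate is hit, conditionally on the history, by construction.

    import numpy as np

    def knn_score(x, pool, k=10):
        # as in the earlier sketch: distance to the k-th nearest healthy neighbor
        d = np.linalg.norm(pool - x, axis=1)
        return np.partition(d, k - 1)[k - 1]

    def calibrated_cutoff(healthy, alpha=0.01, k=10):
        # score each historical point against the rest (leave-one-out), then
        # take an upper order statistic of those scores as the alarm cut-off;
        # a higher cut-off means a lower false alarm rate, and vice versa
        scores = np.array([knn_score(healthy[i], np.delete(healthy, i, axis=0), k)
                           for i in range(len(healthy))])
        n = len(scores)
        m = max(1, int(np.floor(alpha * (n + 1))))
        return np.sort(scores)[n - m]   # alarm when a new score exceeds this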

So, being brief here, we have essentially the conditional expectation E[Y|X] where Y is the random variable of detections and X is the historical data. Then of course E[E[Y|X]] = E[Y], which says that we have set the false alarm rate "exactly" in the long run or, if you will, conditionally exactly in the short run.

A result is that the naive way to evaluate the false alarm rate from the historical data does, with lots of data, converge to the right answer. Proving that, however, was a bit tricky.

Achieving the Neyman-Pearson best result needs lots of data from both healthy and sick systems, and in practice the second is in short supply.

The Neyman-Pearson result is intuitive, like investing in real estate: buy the plots of real estate that give the best return per dollar invested until you have invested all the available money.

Well, suppose the observed data are, for the set of real numbers R and some positive integer n, in R^n. Then partition R^n into little patches. For each little patch, suppose that data landing in that patch would raise an alarm and, then, from the data (lots of data are needed, especially as n grows), get the ratio of the detection rate to the false alarm rate for that patch. Then the false alarm rate is really like the money you have to invest, and the detection rate is your return on that investment. So, sure, you buy the patches with the highest ratio until you have spent, invested, all of your false alarm rate, your money.
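
A toy sketch of that "buy the best patches first" idea (my own illustration; the patch numbers are made up): given each patch's alarm probability under the null (its price in false alarm rate) and under a sick system (its detection return), spend the false alarm budget greedily by ratio. The all-or-nothing choice of patches is what turns the exact optimum into a knapsack problem.

    # toy patches: (false_alarm_prob, detection_prob) for each patch of R^n
    patches = [(0.002, 0.30), (0.010, 0.25), (0.001, 0.02), (0.005, 0.40)]

    def buy_patches(patches, alpha=0.01):
        # buy patches in order of detection return per unit of false alarm
        # rate spent, until the false alarm budget alpha is exhausted
        ranked = sorted(patches, key=lambda p: p[1] / p[0], reverse=True)
        spent, detection, chosen = 0.0, 0.0, []
        for fa, det in ranked:
            if spent + fa > alpha:
                continue
            spent += fa
            detection += det
            chosen.append((fa, det))
        return chosen, spent, detection

    print(buy_patches(patches))   # ~0.8% false alarms, 72% detection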

One can find proofs in terms of freshman calculus. I wrote out a quite general proof in terms of the Hahn decomposition from the Radon-Nikodym theorem in measure theory.

So, right, if we have only finitely many patches in R^n, then we essentially have a knapsack problem, which is, IIRC, NP-hard (although likely not a concern in practice).

One can write out a little derivation, mostly just Fubini's theorem (the measure theory version of the calculus result on interchanging the order of integration), to show that, in a cute sense in which the distribution of sick systems is translation invariant (right, a bit much to assume -- but here I'm typing quickly from memory of what I did 20+ years ago), my techniques give, for the assumed false alarm rate, the highest detection rate possible. So, this is a simple, quick and dirty, data-free, crude approach to the full Neyman-Pearson result of best detection.


Part II

There is another cute point: the usual way to use the historical data ignores its order. So, we should be able to get some cute theorems about this that permit assuming much less than that the input data are independent and identically distributed (i.i.d.). That is, we get to assume only that the historical data have been permuted randomly, and I'm guessing that from this one could get a cute approximation result. One place to start would be the work of Michel Talagrand on "a new look at independence". Why have I not pursued this research? Did I mention that I have 20+ years of evidence that anomaly detection and a dime won't cover a 10-cent cup of coffee?

Then, with the math clear, the data collected, etc., I needed to write some code. I put my feet up and thought of a way. Soon I learned that part of my work reinvented k-D trees, that is, for a positive integer k, search trees in k dimensions. IIRC, k-D trees are covered in

Robert Sedgewick and Kevin Wayne, Algorithms, Fourth Edition, Addison-Wesley, 2011.

So, k-D trees are a lot like a k-dimensional generalization of the usual one-dimensional binary search.

But, there, we also need some cutting planes and a little backtracking in the tree. For this, for the computer hardware in serious production, the new several-TB solid state disk (SSD) drives would be just terrific: load up one of those with a k-D tree of the historical data, and then in production deployment query that data many times a second. Since we are using the data write-rarely, read-often -- write, say, once a week or month and read hundreds or thousands of times a second -- the SSD would give fantastically fast data rates and not wear out.
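
For what it's worth, a minimal sketch of the query step with an off-the-shelf k-D tree (my illustration, using SciPy rather than anything from the paper): build the tree over the healthy history once, then each incoming observation's k-th nearest-neighbor distance is one fast query.

    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)
    healthy = rng.normal(size=(100000, 8))   # healthy historical data in R^8
    tree = cKDTree(healthy)                  # built once, queried many times

    k = 10
    x = rng.normal(size=8)                   # one new observation
    dists, _ = tree.query(x, k=k)            # distances to the k nearest points
    score = dists[-1]                        # k-th nearest-neighbor distance
    print(score)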

Of course, more could be done:

(1) How much historical data is really needed?

(2) Of the data on several variables, which are needed? Or, maybe we should prove some theorems that show when too many variables without enough historical data hurt the results?

(3) Should we think about scaling the data on some of the variables? If so, then how do we know which variables and how much scaling? We could use some helpful theorems here.

(4) For a system-level view, could we be hierarchical, that is, say that this one server is sick, then with another detector drill down and say that this one virtual machine on that server is sick, then drill down and say that this one application program is sick, or some such? In all cases? No. In some cases, maybe?

(5) After a detection with a low false alarm rate, the next step is diagnosis. But detectors that in some useful sense localize the source of the sickness should be able to help with diagnosis and, indeed, with the third step, correction. So, a question is: at what scales and where should detectors be deployed? We could use some useful theorems, analysis, etc. here.

(6) Networks and server farms are changing constantly, but we need good historical data that still describes a healthy system. Okay, we could use some work to show when the changes have been enough to require new historical data.

(7) Maybe of high interest to the suits in the C-suite: do some decision theory, that is, essentially cost minimization. So, assign costs to false alarms and costs to missed detections, and try to set the false alarm rate, and/or the number of detectors, etc., to minimize the sum of all these costs.

Since a missed detection, i.e., a problem detected too late instead of ASAP, can mean headlines, as for, IIRC, Sony, Target, the NYSE, Yahoo, etc., the suits might be eager for relatively a lot of such monitoring.

(8) My work was for problems never seen before. The 50,000-foot view of that is that, once we have seen a problem, detected it, diagnosed it, and corrected it, then, with the corrections, we really shouldn't see the problem again. So, really, what we should be looking for are problems never seen before, and we should slap our own wrists for any problem we keep seeing over and over.

Given some e-mail addresses, I'll send a reference to my published paper. But I'm trying to be anonymous here at HN.

Again, I'm no longer motivated by money to pursue anomaly detection: for 20+ years I got overwhelming evidence that work in anomaly detection and a dime won't cover a 10-cent cup of coffee. I tried to make a startup out of this work but, gotta tell you, after hundreds of e-mail messages to venture capital firms, about 98% said nothing back, one gave a weak reply with no indication of any real understanding of the market, the technical challenges, my work, what such a company would look like, etc., and the remaining 1% or 2% gave only some boilerplate about "not in our focus area".

So, I have another startup in progress, easier to do, easier to sell, much more promising and close to going live and making money.

So, here I'm willing to do an anomaly detection technology giveaway! Get your free anomaly detection tutorial and technology!

Uh, when I did the research at IBM's Watson lab, the Watson lab patent office was getting excited by my work -- "there are several projects in the lab attacking anomaly detection, and your work is the only one taking your approach."

I had my research in a nicely written paper, complete with theorems and proofs, some good tests on real data (from a computer cluster at Allstate; the cluster was part of the motivation to be multi-dimensional, e.g., to detect some of the failures due to chain reactions in the cluster), and also some hypothetical data that would make a severe test (the checkerboard example is a simplification). A guy at IBM claimed to have had my work "reviewed", pronounced it "not publishable", and I got fired, walked out the door. Not good. He didn't give me a chance to submit the work for publication.

But, out of IBM, with a copy of my paper and a letter from IBM permitting me to publish, I submitted the paper, and it was accepted without revision (except something about indenting the first paragraph after a subheading!) by the first journal where I submitted, Information Sciences, a good enough journal.

I've been burned enough pursuing anomaly detection. I don't like being burned; it's no fun.

My conclusion is that any very good work in the field is just way too darned difficult for essentially everyone else that would be involved -- management chain, journal reviewers, computer science departments, venture capital partners, suits at target customers, etc.

Right, as suggested elsewhere in this thread, try to sell a solution or a service, maybe cloud based. But setting that up and getting it to good traction would take some cash, more than I'm going to put forward.



