I highly recommend Jeff Hawkins' (the founder of Numenta) book On Intelligence to anyone interested in anomaly detection and AI in general. Although HTMs haven't been as successful as neural networks, his concept of what intelligence is is very lucid and well-reasoned. Even if he turns out to be wrong about the technology side of things, I'm happy to have learned a thing or two about the neocortex.
Okay, I used to work in anomaly detection for the more important server farms and networks, in particular for real-time, zero-day detection, that is, detecting problems, anomalies, never seen before.
The OP has 6 "ideal characteristics of a
real-time anomaly detection algorithm".
My work seems to do relatively well on the
criteria.
A challenging criterion is the 6th one:
"The algorithm should minimize false
positives and false negatives."
So, no false positives and no false negatives would be a perfect detector, and in practice that is rarely possible.
A more realistic formulation of the criterion is that, for a given rate of false positives (false alarms), we achieve the lowest rate possible of false negatives (missed detections of actual anomalies).
How to do this is the now classic Neyman-Pearson result. Right, in some cases the computations amount to a knapsack problem, which is, IIRC, NP-complete.
Really, another important criterion is to have the false alarm rate adjustable, that is, to be able to set the false alarm rate in advance and get that rate exactly in practice.
Yup, that's possible.
Why? Really, anomaly detection is nearly necessarily some continually applied statistical hypothesis test where the null hypothesis is that the system being monitored, that is, the source of the data, is healthy and the data is normal and not an anomaly or a symptom of a problem.
So, we are into statistical hypothesis
testing, and that is the context of the
classic Neyman-Pearson result.
The literature on statistical hypothesis
testing goes back to at least the
Neyman-Pearson result before 1950; the
field is old with a lot known.
In that field, it is usual to be able to
set the rate of false alarms (type I
error) in advance and get that rate
exactly in practice.
For monitoring server farms and networks,
there are two additional criteria that are
just crucial -- right, in particular for
getting a low rate of missed detections
(false negatives, type II errors) for a
given rate of false positives:
(1) Distribution-free.
The easy way to be able to do the calculations of a hypothesis test is to assume that we know the probability distribution of the input data when the null hypothesis holds. There, sure, the most common distribution assumed is the Gaussian.
But there are also statistical hypothesis tests that are distribution-free (also called non-parametric, e.g., not using the parameters of the Gaussian distribution), that is, that make no assumptions about the probability distribution of the data when the null hypothesis holds.
For monitoring server farms and networks, too often the distributions are unknown, so we should be using distribution-free tests (a small one-dimensional sketch follows after (2) below).
(2) Multi-dimensional.
The data readily available from monitoring
server farms and networks is horrendously
large, like flowing oceans of data.
So, it's easy to get data on each of dozens of variables at rates from one point every few seconds up to hundreds of points a second.
Treating such jointly multi-dimensional
data one variable at a time must typically
throw away a huge fraction of the relevant
information. So, the data should be handled jointly. So, we need statistical hypothesis tests that are multi-dimensional, and those are not so easy to find in the literature.
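To make criterion (1) concrete, here is a minimal, hypothetical one-dimensional sketch (my illustration, not the method described elsewhere in this thread) of a distribution-free test whose false alarm rate is set in advance by a rank argument, assuming only that healthy observations are exchangeable:

```python
# A minimal, hypothetical sketch of a distribution-free test: threshold at
# the m-th largest of n healthy historical values.  If the n historical
# values and the new value are exchangeable under the null ("healthy"),
# the rank of the new value is uniform, so the false alarm rate is exactly
# m / (n + 1) -- no Gaussian or other distributional assumption needed.
import numpy as np

def alarm(history, x_new, alpha):
    n = len(history)
    m = max(1, int(np.floor(alpha * (n + 1))))  # use the m-th largest value
    threshold = np.sort(history)[-m]
    return x_new > threshold

# Hypothetical data: 9,999 healthy observations, deliberately non-Gaussian.
rng = np.random.default_rng(0)
history = rng.lognormal(size=9_999)
print(alarm(history, x_new=50.0, alpha=0.01))   # 1% false alarm rate
```

The point is only that the threshold comes from the healthy data itself, so the stated rate holds no matter what the underlying distribution is.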
So, the work I did for detecting zero-day
problems in server farms and networks was
to create a large collection of
statistical hypothesis tests that are both
distribution-free and multi-dimensional.
E.g., if the data consists of a time
series of points on a checkerboard and
normal data is on the red squares, then
my work will detect, with known and
adjustable false alarm rate achieved
exactly in practice, points on black
squares. In this case, treating the data
on each of two perpendicular edges (axes)
of the checkerboard separately yields
exactly nothing. So, this example
illustrates the extra power of being
multi-dimensional. Also, the distribution of the data when the null hypothesis holds is supported on just the red squares, that is, far from anything common in statistics.
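For anyone who wants to see the checkerboard point numerically, here is a small sketch with hypothetical data (my code, not the published work): each coordinate of the healthy data is uniform by itself, so one-variable tests see nothing, while the joint view immediately separates red squares from black squares.

```python
# Checkerboard sketch (hypothetical data, not the published work): healthy
# data lives on the "red" squares of a 4x4 board on the unit square.  Each
# marginal is uniform on [0, 1], so any one-variable test is blind; only the
# joint, two-dimensional view separates red squares from black squares.
import numpy as np

rng = np.random.default_rng(1)
pts = rng.random((200_000, 2))
cell = np.floor(pts * 4).astype(int)
healthy = pts[(cell[:, 0] + cell[:, 1]) % 2 == 0]   # keep only red squares

# Each axis alone: four nearly equal bins, i.e., no one-dimensional signal.
print(np.histogram(healthy[:, 0], bins=4, range=(0, 1))[0])
print(np.histogram(healthy[:, 1], bins=4, range=(0, 1))[0])

# A black-square point is unremarkable on each axis but jointly anomalous:
# its distance to nearby healthy points is large compared to that of a
# typical red-square point.
black_point = np.array([0.325, 0.625])              # middle of a black cell
d = np.sqrt(((healthy - black_point) ** 2).sum(axis=1))
print(np.sort(d)[9])                                 # 10th-NN distance, large
```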
Moreover, one can argue that there are some good qualitative reasons to believe that, really, a good detector for zero-day problems in server farms and networks should be able to know when it is inside a fractal such as the Mandelbrot set and when it is not. Yes, my work does that.
But, I have to say, I did my work -- at
IBM's Watson lab to improve on our
approaches via AI -- and published it
decades ago, and since then I've seen
essentially no interest at all in the
work.
For more, for years I jerked the chains of
a huge fraction of all the venture capital
firms in the US with essentially no
interest.
My conclusion is that good work in anomaly detection and a dime won't cover a ten cent cup of coffee.
But, if by now anyone really is
interested, then the remarks here should
help.
Assumptions of stationarity do limit things a bit, but it might be possible to extend the argument to weak or limited dependence. Once I was looking at this problem through a differential geometry lens. Adapting to your setting, the nearest neighbors would be computed according to a data-dependent metric, not just the Euclidean one. The working hypothesis was that under normal circumstances the process is constrained to stay in the vicinity of some lower-dimensional manifold. This is reasonable because of the many "near invariants" among the dimensions.
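If it helps make that idea concrete, here is one hypothetical version of such a data-dependent metric (my sketch, not what either of us actually built): a local Mahalanobis distance whose covariance comes from the query point's Euclidean neighbors, so stepping off the local manifold directions costs more than moving along them.

```python
# Hypothetical sketch of a data-dependent metric: a local Mahalanobis
# distance.  The covariance is estimated from the query point's Euclidean
# neighbors, so if healthy data hugs a lower-dimensional manifold, departures
# off the manifold score much higher than movement along it.
import numpy as np

def local_mahalanobis(history, x, k=50, ridge=1e-6):
    d2 = ((history - x) ** 2).sum(axis=1)
    nbrs = history[np.argsort(d2)[:k]]                  # k Euclidean neighbors
    mu = nbrs.mean(axis=0)
    cov = np.cov(nbrs, rowvar=False) + ridge * np.eye(history.shape[1])
    diff = x - mu
    return float(diff @ np.linalg.solve(cov, diff))     # squared local distance
```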
I wonder why no one really bites on this. I think the only way to monetize it would be to build one's own operation where the monitoring is done better, rather than sell better monitoring to others. Unfortunately, that's a very capital-intensive option.
I would be really interested to read the publication you did on this. We are trying to do anomaly detection on a large stream of traceroutes coming in continuously from approx. 10,000 locations over the internet.
Here at HN I'm trying to be anonymous, but on anomaly detection I'll make a contribution.
The basic idea of my research is, for positive integer k, to use k-nearest neighbors or, really, any of a wide variety of such measures, metrics, functions, windows, however you want to regard them.
So, right, we are on the way to some approaches to multi-dimensional probability density estimation.
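As a rough, hypothetical illustration of that idea (my sketch, not the published method): score a new observation by the distance to its k-th nearest healthy neighbor, and set the alarm threshold as an empirical quantile of the same score on held-out healthy data, so a chosen false alarm rate is hit, at least approximately, with no distributional assumption.

```python
# Rough sketch in the k-nearest-neighbor spirit (not the published method).
# Score = distance to the k-th nearest healthy point; threshold = empirical
# (1 - alpha) quantile of scores on held-out healthy data, so the false alarm
# rate is about alpha, distribution-free and multi-dimensional.
import numpy as np

def knn_score(reference, x, k=10):
    d = np.sqrt(((reference - x) ** 2).sum(axis=1))
    return np.sort(d)[k - 1]                            # distance to k-th neighbor

def fit_threshold(healthy, k=10, alpha=0.01):
    fit_part, cal_part = healthy[::2], healthy[1::2]    # split: fit vs. calibrate
    cal_scores = np.array([knn_score(fit_part, x, k) for x in cal_part])
    return fit_part, np.quantile(cal_scores, 1.0 - alpha)

def is_anomaly(fit_part, threshold, x, k=10):
    return knn_score(fit_part, x, k) > threshold
```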
The main contribution of my research is
the applied probability derivations to
show how to calculate, adjust, set, etc.
the false alarm rate. In practice, a false alarm rate that is unknown, out of control, and often too high is a big problem -- e.g., it can cause staff just to ignore the alarms. E.g., see the movie Tora, Tora,
Tora about the 12/7/1941 attack on Pearl
Harbor where the US actually had just what
the heck it needed for early detection,
early enough to get fighter airplanes in
the air, maybe early enough to get the
ships out of the harbor, etc. -- radar on
top of a hill. And the radar operators
DID get the detection and DID report it,
but the chain of command assumed that the
detection was a false alarm. Big, huge,
gigantic bummer. The US came within the
thickness of an onion skin of getting the
US airplanes and ships out of harm's way,
shooting down many of the attacking airplanes, and having the US carriers,
already at sea, attacking the attacking
carriers when all their airplanes were off
over Hawaii. The US could have sustained
little damage and sunk all six of the
attacking carriers.
Having a good anomaly detector and
ignoring it can be a bummer.
But if you know what the false alarm rate is and know that it's low, then maybe the chain of command will take a detection more seriously and, then, take the next step -- diagnosis.
Sometimes the spook business, say, the Soviets in England, would watch the comings, goings, lights on late, etc. at British military HQs and try to detect anomalies that would indicate war. Sure, for such detection you need tests that are both distribution-free and multi-dimensional, with good control over the false alarm rate and good evidence of a relatively high detection rate. So, sure, the US CIA should want my work. I wrote them -- never heard back.
Once I did give a talk on my work at the Institute for Defense Analyses (IDA), a think tank resource for the Joint Chiefs.
Some in the audience mentioned credit card
fraud detection. Okay. But there should
also be national security applications.
Once I gave a talk on the work at the Johns Hopkins University Applied Physics Lab (JHU/APL) -- I might as well have talked about a recipe for pancakes.
Once I gave a talk on the work at the
NASDAQ site in Trumbull, CT. I got a nice
tour, a nice lunch, and a nice tote bag.
I gave some talks on this work to my
management chain at IBM's Watson lab --
the suits were afraid I might have some
good work they didn't understand, couldn't
control, and couldn't throttle. They were
PISSED.
I presented my work to some IBM people
outside of Watson Research -- they liked
the work, but Watson Research was PISSED.
For the applied probability stuff in my research, for a detection of an anomaly the detector can also report the lowest false alarm rate at which the data would still be a detection. Then, right along, one can graph these lowest rates and use them as a heuristic for how serious a detection is.
Net, getting a handle on, a knob to turn for, and knowledge of the false alarm rate is important.
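Put in rank terms (my paraphrase, under the same exchangeability assumption as the sketches above), that lowest false alarm rate behaves like a p-value computed against the healthy calibration scores:

```python
# Sketch: the smallest false alarm rate at which this observation would still
# be a detection.  With n healthy calibration scores and a new score s, under
# exchangeability this is (1 + #{calibration scores >= s}) / (n + 1), i.e., a
# p-value-like quantity one can graph over time as a seriousness heuristic.
import numpy as np

def min_false_alarm_rate(cal_scores, s):
    n = len(cal_scores)
    return (1 + np.sum(cal_scores >= s)) / (n + 1)
```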
The applied probability stuff is based on some cute symmetry in the calculations. The symmetry, then, gives some measure-preserving transformations (transformations that don't change the probability distribution). So, we end up with a finite algebraic group of such transformations. Then sum over the group and, presto, bingo, we get what we need to know the false alarm rate.
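For readers who want the general shape of that kind of argument, here is the standard group-invariance statement in my own wording (the paper's actual derivation may differ):

```latex
% Standard group-invariance (randomization test) statement, my wording.
% Let G be a finite group of transformations that preserve the distribution
% of the data X under the null, and let T be the test statistic.  Reject
% when T(X) is among the top m = floor(alpha * |G|) values of
% { T(gX) : g in G }.  Since gX has the same distribution as X for every
% g in G,
\[
  \Pr[\,\mbox{reject under the null}\,]
  \;\le\; \frac{\lfloor \alpha \, |G| \rfloor}{|G|} \;\le\; \alpha ,
\]
% with equality (up to the floor) when there are no ties, so the false alarm
% rate is known and adjustable with no assumption beyond the invariance.
```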
To adjust the false alarm rate, adjust the
diameter, cut-off, whatever, of the k
nearest neighbors or other metric,
convolution window, etc. Of course all of this is conditional probabilities conditioned on the historical data used, where we assume that the system being monitored was healthy when the historical data was collected. How to confirm it was healthy? Let the data age and see whether any other, later symptoms of problems show up by other means, e.g., angry customer complaints, malware, attacks and intrusions, etc.
So, being brief here, we essentially have the conditional expectation E[Y|X], where Y is the random variable of detections and X is the historical data. Then of course E[E[Y|X]] = E[Y], which says that we have set the false alarm rate "exactly" in the long run or, if you will, conditionally exactly in the short run.
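Spelled out, in the same notation:

```latex
% Y = the indicator that an alarm is raised (on healthy data), X = the
% historical data.  The construction gives a conditional false alarm rate of
% exactly alpha:  E[Y | X] = alpha.  Then, by the tower property,
\[
  E[Y] \;=\; E\bigl[\, E[Y \mid X] \,\bigr] \;=\; E[\alpha] \;=\; \alpha ,
\]
% so the unconditional, long-run false alarm rate is alpha as well.
```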
A result is that the naive way to evaluate the false alarm rate from the historical data does, for lots of data, converge to the right answer. Proving that, however, was a bit tricky.
Achieving the Neyman-Pearson best result
needs lots of data from both healthy and
sick systems, and in practice the second
is in short supply.
The Neyman-Pearson result is intuitive, like investing in real estate: Buy the plots of real estate that give the best return per dollar invested until you have invested all the available money.
Well, suppose the data observed is, for the set of real numbers R and some positive integer n, in R^n. Then partition R^n into little patches. Then, for each little patch, consider letting data in that patch raise an alarm and, from the data (you need lots of data, especially as n grows), get the ratio of the detection rate to the false alarm rate. Then the false alarm rate is really like the money you have to invest, and the detection rate is your return on that investment. So, sure, you buy the patches with the highest ratio until you have spent, invested, all your false alarm rate, your money.
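A toy version of that patch-buying construction, with made-up per-patch probabilities (p0 = probability a healthy system lands in the patch, i.e., false alarm cost; p1 = probability a sick system does, i.e., detection payoff):

```python
# Toy version of the "buy the best patches first" construction.  Greedily add
# patches in order of p1 / p0 until the false alarm budget alpha is spent --
# the Neyman-Pearson idea.  All numbers here are hypothetical.
def choose_alarm_patches(patches, alpha):
    """patches: list of (patch_id, p0, p1); returns ids that raise an alarm."""
    chosen, spent = [], 0.0
    for pid, p0, p1 in sorted(patches, key=lambda t: t[2] / t[1], reverse=True):
        if spent + p0 > alpha:
            break                       # budget exhausted (toy: no partial buys)
        chosen.append(pid)
        spent += p0
    return chosen

# Hypothetical usage: three patches, 5% false alarm budget.
print(choose_alarm_patches(
    [("a", 0.01, 0.20), ("b", 0.04, 0.30), ("c", 0.02, 0.05)], 0.05))
```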
One can find proofs in terms of freshman calculus. I wrote out a quite general proof in terms of the Hahn decomposition from the Radon-Nikodym theorem in measure theory.
So, right, if there are only finitely many patches in R^n, then we have essentially a knapsack problem, which is, IIRC, NP-complete (although likely not a concern in practice).
One can write out a little derivation, mostly just Fubini's theorem (the measure theory version of the calculus interchange of the order of integration), to show that, in a cute sense, if the distribution of sick systems is translation invariant (right, a bit much to assume -- but here I'm typing quickly from memory of what I did 20+ years ago), then my techniques give, for the assumed false alarm rate, the highest detection rate possible. So, this is a simple, quick and dirty, data-free, crude approach to the full Neyman-Pearson result of best detection.
There is another cute point: The usual
way to use the historical data has its
order ignored. So, one should be able to get some cute theorems about this that permit assuming much less than that the input data is independent and identically distributed (i.i.d.). That is, one gets to assume only that the historical data has been permuted randomly, and, I'm guessing, from this one could get a cute approximation result. One start would be the work of
Michel Talagrand on "a new look at
independence". Why have I not pursued
this research? Did I mention that I have
20+ years of evidence that anomaly
detection and a dime won't cover a 10 cent
cup of coffee?
Then, with the math clear, the data collected, etc., one needs to write some code. I put my feet up and thought of a way.
Soon I learned that part of my work reinvented k-D trees, that is, for positive integer k, search trees in k dimensions. IIRC, k-D trees are in Robert Sedgewick and Kevin Wayne, Algorithms, 4th edition, Addison-Wesley, New York, 2011.
So, k-D trees are a lot like a k-dimensional generalization of the usual one-dimensional binary search.
But, there, one also needs some cutting planes and a little backtracking in the tree.
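For the coding step, one option today (a sketch, not the original implementation) is to lean on an off-the-shelf k-d tree, e.g., scipy's, which handles the cutting planes and backtracking internally; the lookup below is the kind of query the detector would run many times a second:

```python
# Sketch of the k-d tree lookup a detector would run per incoming observation.
# The historical data here is random and hypothetical; scipy's cKDTree handles
# the cutting planes and backtracking internally.
import numpy as np
from scipy.spatial import cKDTree

history = np.random.default_rng(2).random((1_000_000, 8))  # hypothetical healthy data
tree = cKDTree(history)                                     # build once, offline

x_new = np.random.default_rng(3).random(8)                  # one new observation
dists, idxs = tree.query(x_new, k=10)                       # 10 nearest neighbors
print(dists[-1])                                            # k-th NN distance = the score
```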
For this, for the computer hardware in serious production, the new several-TB solid state disk (SSD) drives would be just terrific: Load up one of those with a k-D tree of historical data, and then in production deployment query that data many times a second. Since we are using the data -- write, say, once a week or month and read hundreds or thousands of times a second -- the SSD would give fantastically fast data rates and not wear out.
Of course, more could be done:
(1) How much historical data is really
needed?
(2) Of the data on several variables,
which are needed? Or, maybe we should
prove some theorems that show when too
many variables without enough historical
data hurt the results?
(3) Should we think about scaling the data on some of the variables? If so, then how do we know which variables and how much scaling? We could use some useful theorems here.
(4) For a system-level view, could we be hierarchical, that is, say that this one server is sick, have another detector drill down and say that this one virtual machine on that server is sick, drill down further and say that this one application program is sick, or some such? In all cases? No. In some cases, maybe?
(5) After detection with a low false alarm
rate, the next step is diagnosis. But
detectors that in some useful sense
localize the source of the sickness
should be able to help with diagnosis and,
indeed, the third step, correction. So, a question is, at what scales and where should we deploy detectors? We could use some useful theorems, analysis, etc. here.
(6) Networks and server farms are changing constantly, but we need some good historical data that still describes a healthy system. Okay, we could use some work to show when the changes have been enough to need new historical data.
(7) Maybe of high interest to the suits in the C-suite, do some work on decision theory, that is, essentially cost minimization. So, there are costs for false alarms and costs for missed detections, and we try to set the false alarm rate, and/or the number of detectors, etc., to minimize the sum of all these costs (a toy sketch follows after this list).
Since a missed detection, i.e., a problem detected too late instead of ASAP, can be headlines for, IIRC, Sony, Target, the NYSE, Yahoo, etc., the suits might be eager for a relatively large amount of such monitoring.
(8) My work was for problems never seen before. The 50,000 foot view of that is that, once we have seen a problem, detected it, diagnosed it, and corrected it, then, with the corrections, we really shouldn't see the problem again. So, really, what we should be looking for are problems never seen before, and we should slap our own wrists for any problem we keep seeing over and over.
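On (7), as promised after the list, here is a toy cost sweep; every number and the miss-rate curve below are made up purely for illustration:

```python
# Toy version of the decision-theory point in (7), with made-up numbers.
# Pick the false alarm rate alpha that minimizes expected total cost:
# the cost of chasing false alarms plus the cost of missed detections.
def expected_cost(alpha, miss_rate, c_false_alarm, c_missed,
                  n_healthy_checks, n_incidents):
    return (c_false_alarm * alpha * n_healthy_checks
            + c_missed * miss_rate(alpha) * n_incidents)

# Hypothetical miss-rate curve: a lower alpha means more missed detections.
miss_rate = lambda a: max(0.0, 0.5 - 20.0 * a)

costs = {a: expected_cost(a, miss_rate, c_false_alarm=50.0, c_missed=100_000.0,
                          n_healthy_checks=10_000, n_incidents=5)
         for a in (0.001, 0.005, 0.01, 0.02, 0.05)}
print(min(costs, key=costs.get), costs)
```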
Given some e-mail addresses, I'll send a
reference to my published paper. But I'm
trying to be anonymous here at HN.
Again, I'm no longer motivated by money to
pursue anomaly detection: For 20+ years I
got overwhelming evidence that work in
anomaly detection and a dime won't cover a
10 cent cup of coffee. I tried to make a
startup out of this work, but gotta tell
you, after hundreds of e-mail messages to
venture capital firms, about 98% said
nothing back, one gave a weak reply with
no indication of any real understanding of
the market, technical challenges, my work,
what such a company would look like, etc.,
and the remaining 2%, or maybe 1%, gave
only some boilerplate about "not in our
focus area".
So, I have another startup in progress,
easier to do, easier to sell, much more
promising and close to going live and
making money.
So, here I'm willing to do an anomaly detection technology giveaway! Get your free anomaly detection tutorial and technology!
Uh, when I did the research at IBM's
Watson lab, the Watson lab patent office
was getting excited by my work -- "there
are several projects in the lab attacking
anomaly detection, and your work is the only one taking your approach."
I had my research in a nicely written
paper, complete with theorems and proofs
and some good tests on some real data
(from a computer cluster at Allstate --
the cluster was part of the motivation to
be multi-dimensional, e.g., to detect some
of the failures due to chain reactions
in clustering) and also some hypothetical
data that would make a severe test (the
checkerboard example is a simplification),
and a guy at IBM claimed to have my work
"reviewed" and pronounced the work as "not
publishable," and I got fired, walked out
the door. Not good. He didn't give me a
chance to submit my work for publication.
But, out of IBM, with a copy of my paper
and a letter from IBM permitting me to
publish, I submitted the paper, and it was
accepted without revision (except
something about indenting the first
paragraph after a subheading!) by the
first journal where I submitted,
Information Sciences, a good enough
journal.
I've been burned enough pursuing anomaly
detection. I don't like being burned;
it's no fun.
My conclusion is that any very good work
in the field is just way too darned
difficult for essentially everyone else
that would be involved -- management
chain, journal reviewers, computer science
departments, venture capital partners,
suits at target customers, etc.
Right, as suggested elsewhere in this
thread, try to sell a solution or a
service, maybe cloud based. But setting
that up and getting it to good traction
would take some cash, more than I'm going
to put forward.