Anomaly detection in graphs is a critical problem for finding suspicious behavior in innumerable systems, such as intrusion detection, fake ratings, and financial fraud. However, most existing systems focus either on static graphs or, when they do consider dynamic (time-evolving) graphs, on entire graph snapshots.
To minimize the effect of malicious activities and start recovery as soon as possible, we need to detect anomalies in real time or near real time, i.e., to identify whether an incoming edge is anomalous as soon as we receive it. In addition, since the number of vertices can grow as we process the stream of edges, we need an algorithm whose memory usage is constant in the graph size. Moreover, fraudulent or anomalous events in many applications occur in microclusters, i.e., suddenly arriving groups of suspiciously similar edges, e.g., denial-of-service attacks in network traffic data and lockstep behavior.
In this work, we propose MIDAS, which detects microcluster anomalies, or suddenly arriving groups of suspiciously similar edges, in edge streams, using constant time and memory. In addition, by using a principled hypothesis testing framework, MIDAS provides theoretical bounds on the false positive probability, which earlier methods do not provide. It is also up to 644 times faster and 48% more accurate than previous state-of-the-art approaches.
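For readers who want a concrete picture, the core scoring idea can be sketched as follows. This is a simplified illustration, not the paper's implementation: it tracks a single edge count exactly with Python dictionaries, whereas MIDAS uses count-min sketches (which is what makes the memory constant in graph size); the `MidasSketch` class name and structure here are our own.

```python
from collections import defaultdict

class MidasSketch:
    """Simplified sketch of the MIDAS scoring idea (exact counts, no CMS)."""

    def __init__(self):
        self.cur = defaultdict(int)    # a: count of edge (u, v) in the current tick
        self.total = defaultdict(int)  # s: count of edge (u, v) over all ticks
        self.t = 1                     # current timestamp (ticks start at 1)

    def score(self, u, v, t):
        if t > self.t:                 # new tick: reset current-tick counts
            self.cur.clear()
            self.t = t
        self.cur[(u, v)] += 1
        self.total[(u, v)] += 1
        a, s = self.cur[(u, v)], self.total[(u, v)]
        if t == 1:
            return 0.0
        # Chi-squared statistic comparing this tick's count against the
        # mean count per tick; higher values indicate a sudden burst.
        return (a - s / t) ** 2 * t ** 2 / (s * (t - 1))
```

A sudden burst of repeated edges at one timestamp drives the score up sharply relative to a steady background rate, which is exactly the microcluster pattern described above.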
Before I read your paper, I didn't realize the power of hypothesis tests. I was surprised to see theoretical bounds for anomaly detection obtained under such weak constraints via a simple chi-squared goodness-of-fit test. Elegant idea!
1) Do you use a sliding-window approach or exponential decay?
2) When you say groups of similar edges, do they have to be spatially close to each other, or can they be distributed evenly across the graph?
3) Can an attacker figure out the optimal time gap between his attacks so that MIDAS doesn't flag them as anomalous, i.e., a gap just long enough for the algorithm to discard the potentially malicious microcluster as obsolete?
Seems like an interesting extension to your work. Best of luck!
Thank you :)
1. We use a temporal decay factor (alpha).
2. Good question! We consider edges similar if they share at least one endpoint, i.e., the same source node or the same destination node.
3. Very interesting direction for future work! We can try using a variable decay instead of a fixed one to tackle the adversary.
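To make answer (1) concrete, here is a minimal sketch of one common way to implement such temporal decay (in the spirit of the paper's MIDAS-R variant): rather than resetting the current-tick counts to zero at each new timestamp, scale them by a factor alpha in (0, 1) so that recent history fades smoothly. The `DecayCounter` class here is our own illustration, not the authors' code.

```python
from collections import defaultdict

class DecayCounter:
    """Edge counts with exponential temporal decay instead of hard resets."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha
        self.cur = defaultdict(float)  # decayed count per edge (u, v)
        self.t = 1

    def add(self, u, v, t):
        if t > self.t:
            # Decay all counts once per elapsed tick, then advance the clock.
            for key in self.cur:
                self.cur[key] *= self.alpha ** (t - self.t)
            self.t = t
        self.cur[(u, v)] += 1.0
        return self.cur[(u, v)]
```

With alpha = 0.5, an edge seen once at tick 1 and again at tick 2 yields a count of 1.0 * 0.5 + 1.0 = 1.5, so older activity still contributes but with diminishing weight.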
We handle spatial locality in terms of not just the source but also the destination; therefore, we should be able to handle DDoS-like attacks, where simultaneous edges arrive from several sources trying to deny one particular destination.
Ah, alright, that makes sense. But that works only when the destination is the same. In a setting with multiple web-hosting servers, you would need to treat a group of source and destination points as a microcluster itself.
In Figure 7 of the paper, we show an example of detection where no edge, source, or destination is individually anomalous, but the group as a whole is a microcluster anomaly. Such cases can similarly be detected when there are multiple web-hosting servers.
Great work! However, from what I understand, microcluster detection seems very similar to subgraph detection. Is there a difference between the two problems? If so, how do their complexities compare?
Thanks! There is a subtle difference. We define microclusters within the category of anomalous edge detection as "suddenly arriving groups of suspiciously similar edges", e.g., denial-of-service attacks in network traffic data and lockstep behavior.
Subgraph detection, on the other hand, is the detection of anomalous subgraphs, and it may not be able to detect microclusters (suddenly arriving groups of suspiciously similar edges).
We have extended the work to multi-aspect data, i.e., records with multiple features. Currently, our approach can handle structured data only; it would be interesting to see how this approach could detect anomalies in data that lacks a regular structure.
I found the paper quite exciting, and it also tackles one of the main tasks in anomaly detection: real-time detection. With other machine/deep learning based algorithms, one can model the data distribution quite effectively, but at the cost of flagging latency. With the statistical method proposed in the paper, you not only get state-of-the-art results but can also report them in real time, which is a big plus.