Anomaly detection in graphs is a critical problem for finding suspicious behavior in innumerable systems, such as intrusion detection, fake ratings, and financial fraud. However, most existing systems focus either on static graphs or, when they do consider dynamic (time-evolving) graphs, on entire graph snapshots.
To minimize the effect of malicious activities and start recovery as soon as possible, we need to detect anomalies in real time or near real time, i.e., to identify whether an incoming edge is anomalous as soon as we receive it. In addition, since the number of vertices can grow as we process the stream of edges, we need an algorithm whose memory usage is constant in the graph size. Moreover, fraudulent or anomalous events in many applications occur in microclusters, i.e., suddenly arriving groups of suspiciously similar edges, e.g., denial-of-service attacks in network traffic data and lockstep behavior.
In this work, we propose MIDAS, which detects microcluster anomalies, or suddenly arriving groups of suspiciously similar edges, in edge streams, using constant time and memory. In addition, by using a principled hypothesis testing framework, MIDAS provides theoretical bounds on the false positive probability, which earlier methods do not provide. It is also up to 644 times faster and 48% more accurate than previous state-of-the-art approaches.
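For readers who want a concrete picture, the core scoring idea can be sketched as follows. This is a simplified illustration, not the paper's implementation: it tracks a single edge count exactly with Python dictionaries, whereas MIDAS uses count-min sketches (which is what makes the memory constant in graph size); the `MidasSketch` class name and structure here are our own.

```python
from collections import defaultdict

class MidasSketch:
    """Simplified sketch of the MIDAS scoring idea (exact counts, no CMS)."""

    def __init__(self):
        self.cur = defaultdict(int)    # a: count of edge (u, v) in the current tick
        self.total = defaultdict(int)  # s: count of edge (u, v) over all ticks
        self.t = 1                     # current timestamp (ticks start at 1)

    def score(self, u, v, t):
        if t > self.t:                 # new tick: reset current-tick counts
            self.cur.clear()
            self.t = t
        self.cur[(u, v)] += 1
        self.total[(u, v)] += 1
        a, s = self.cur[(u, v)], self.total[(u, v)]
        if t == 1:
            return 0.0
        # Chi-squared statistic comparing this tick's count against the
        # mean count per tick; higher values indicate a sudden burst.
        return (a - s / t) ** 2 * t ** 2 / (s * (t - 1))
```

A sudden burst of repeated edges at one timestamp drives the score up sharply relative to a steady background rate, which is exactly the microcluster pattern described above.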
Before I read your paper, I didn't realize the power of hypothesis tests. I was surprised to see theoretical bounds for anomaly detection obtained under such weak constraints via a simple chi-squared goodness-of-fit test. Elegant idea!
1) Do you use a sliding-window approach or exponential decay?
2) When you say groups of similar edges, do they have to be spatially close to each other, or can they be distributed evenly across the graph?
3) Can an attacker figure out the optimal time gap between his attacks so that MIDAS doesn't flag them as anomalous, i.e., a gap just long enough for the algorithm to discard the potentially malicious microcluster as obsolete?
Seems like an interesting extension to your work. Best of luck!
Thank you :)
1. We use a temporal decay factor (alpha).
2. Good question! We consider edges similar if they share at least one endpoint, i.e., the same source node or the same destination node.
3. Very interesting direction for future work! We can try using a variable decay instead of a fixed one to tackle the adversary.
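To make answer (1) concrete, here is a minimal sketch of one common way to implement such temporal decay (in the spirit of the paper's MIDAS-R variant): rather than resetting the current-tick counts to zero at each new timestamp, scale them by a factor alpha in (0, 1) so that recent history fades smoothly. The `DecayCounter` class here is our own illustration, not the authors' code.

```python
from collections import defaultdict

class DecayCounter:
    """Edge counts with exponential temporal decay instead of hard resets."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha
        self.cur = defaultdict(float)  # decayed count per edge (u, v)
        self.t = 1

    def add(self, u, v, t):
        if t > self.t:
            # Decay all counts once per elapsed tick, then advance the clock.
            for key in self.cur:
                self.cur[key] *= self.alpha ** (t - self.t)
            self.t = t
        self.cur[(u, v)] += 1.0
        return self.cur[(u, v)]
```

With alpha = 0.5, an edge seen once at tick 1 and again at tick 2 yields a count of 1.0 * 0.5 + 1.0 = 1.5, so older activity still contributes but with diminishing weight.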
We handle spatial locality in terms of not just the source but also the destination; therefore, we should be able to handle DDoS-like attacks, where simultaneous edges arrive from several sources trying to deny one particular destination.
Ah, alright, that makes sense. But that works only when the destination is the same. In a setting with multiple web-hosting servers, you would need to treat a group of source and destination points as a microcluster itself.
In Figure 7 of the paper, we show an example of detection where no edge, source, or destination is individually anomalous, but the group as a whole is a microcluster anomaly. Such cases can similarly be detected when there are multiple web-hosting servers.
Great work! However, from what I understand, microcluster detection seems very similar to subgraph detection. Is there a difference between the two problems? If so, how do their complexities compare?
Thanks! There is a subtle difference. We define microclusters within the category of anomalous edge detection as "suddenly arriving groups of suspiciously similar edges", e.g., denial-of-service attacks in network traffic data and lockstep behavior.
Subgraph detection, on the other hand, is the detection of anomalous subgraphs, and it may not be able to detect microclusters (suddenly arriving groups of suspiciously similar edges).
We have extended the work to multi-aspect data, i.e., records with multiple features. Currently, our approach can handle structured data only; it would be interesting to see how this approach could detect anomalies in data that lacks a regular structure.
I found the paper quite exciting, and it also tackles one of the main tasks in anomaly detection: real-time detection. With other machine/deep learning based algorithms, one can model the data distribution quite effectively, but at the cost of flagging latency. With the statistical method proposed in the paper, you not only get state-of-the-art results but can also report them in real time, which is a big plus.