Hacker News
Introducing Streaming K-Means in Spark MLlib 1.2 (databricks.com)
69 points by rxin on Jan 28, 2015 | 10 comments



This is a cool feature, and is one of the prime examples of what Spark's tight integration of various libraries can enable (in this case, Spark Streaming and MLlib). It was originally designed by Jeremy Freeman to handle neuroscience workloads, which IIRC were generating data at 1 TB every 30 minutes.


Sounds similar to an exponential moving average[1], which is itself a one-pole IIR digital filter. [1] http://en.wikipedia.org/wiki/Moving_average#Exponential_movi...
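For readers unfamiliar with the connection: the EMA recurrence is exactly a one-pole IIR filter, and its geometric discounting of old samples is the same idea as the decay factor in streaming k-means. A minimal sketch in plain Python (names are illustrative):

```python
def ema(xs, alpha):
    """Exponential moving average: a one-pole IIR filter.

    s_t = alpha * x_t + (1 - alpha) * s_{t-1}

    Older samples are discounted geometrically, which is the same
    "forgetfulness" that a streaming k-means decay factor provides.
    """
    s = None
    out = []
    for x in xs:
        s = x if s is None else alpha * x + (1 - alpha) * s
        out.append(s)
    return out
```

With `alpha` close to 1 the filter tracks new data aggressively; close to 0 it averages over a long history — the same trade-off the blog post describes for the decay factor.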


Is it true that this doesn't support dynamic values of k? That is, the algorithm isn't adaptive to a changing number of clusters? That said, I suppose for some small range of k values, you could do this trivially by tracking them all and picking the best.


Right now we only detect dying clusters while keeping k constant. Dynamic values of k would be a nice feature to add in a later release :)
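The "dying clusters" behavior falls out of the forgetful center update described in the Databricks post: each center is a count-weighted average of its old position and the mean of the newly assigned points, with old counts discounted by the decay factor. A minimal sketch of that update rule — the exact bookkeeping inside MLlib may differ:

```python
def update_center(c_old, n_old, x_new_mean, m_new, alpha):
    """One streaming k-means center update, per the MLlib 1.2 blog post:

        n = n_old * alpha + m_new
        c = (c_old * n_old * alpha + x_new_mean * m_new) / n

    alpha in [0, 1] is the decay factor: alpha = 1 weights all history
    equally; alpha = 0 forgets it and uses only the newest batch.
    """
    n = n_old * alpha + m_new
    c = (c_old * n_old * alpha + x_new_mean * m_new) / n
    return c, n
```

A cluster whose discounted count `n` keeps shrinking (because no new points are assigned to it) is exactly a "dying" cluster in the sense above.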


This is a common question for me: how do you dynamically determine the number of clusters (including splitting/merging)? I've looked into Jenks natural breaks, but it also seems to require a cluster count.

http://en.wikipedia.org/wiki/Jenks_natural_breaks_optimizati...

Do you have any advice or ideas for automatically picking the number of clusters in an unknown data set?


One common approach is to look for the elbow in the curve of <metric> vs. k (the number of clusters). This is essentially finding the number of clusters after which the rate of information gained/variance explained/<metric> slows. I believe it's possible to binary search for this point if you can assume the curve is convex.
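One simple way to locate that elbow programmatically (short of the binary search mentioned above) is to pick the k where the improvement drops off most sharply, i.e. the largest second difference of the cost curve. A sketch, assuming a roughly convex, decreasing curve:

```python
def find_elbow(costs):
    """Pick the elbow of a decreasing cost-vs-k curve.

    costs[i] is the metric value at k = i + 1.  The elbow is taken to
    be the k whose improvement slows most abruptly, i.e. the largest
    second difference costs[i-1] - 2*costs[i] + costs[i+1].
    Returns the chosen k (1-based).
    """
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(costs) - 1):
        bend = costs[i - 1] - 2 * costs[i] + costs[i + 1]
        if bend > best_bend:
            best_k, best_bend = i + 1, bend
    return best_k
```

For a curve like `[100, 50, 10, 9, 8, 7]` (big gains up to k = 3, negligible gains after), this picks k = 3.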


That's a general drawback of k-means: you have to select k. In practice you try a range of values and see how well each summarizes your data (e.g. via leave-one-out cross-validation).

You could do that here too: just try a range of k. If only there were a streaming leave-one-out cross-validation for k-means to complement this approach ...

(It is possible to do this in a streaming style; see LWPR.)
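Short of a streaming leave-one-out scheme, the batch version of "try a range of k" can be sketched with a tiny 1-D Lloyd's algorithm scored on held-out data. All names here are hypothetical illustrations, not Spark APIs:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Tiny 1-D Lloyd's algorithm: returns k cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest center
            j = min(range(k), key=lambda j: (p - centers[j]) ** 2)
            groups[j].append(p)
        # move each center to the mean of its assigned points
        centers = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers

def held_out_cost(train, test, k):
    """Fit on train, score squared-error cost on held-out points."""
    centers = kmeans(train, k)
    return sum(min((p - c) ** 2 for c in centers) for p in test)
```

On data with two well-separated clumps, `held_out_cost` drops sharply going from k = 1 to k = 2 and then flattens out, which is the signal you'd scan a range of k for.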


I understand that's a drawback of k-means. I was just wondering if this was something that Spark solved natively. Looks like the answer is no for the time being. Thanks for the pointer to LWPR :)


Maybe using BP-Means: http://arxiv.org/abs/1212.2126

K-means without a preset k.


Very interesting post. Fittingly, Hacker News itself uses a similar type of time-decay algorithm!
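For reference, the commonly cited (unofficial) approximation of the HN ranking formula is a gravity-style time decay — `gravity = 1.8` is the usual quoted value, not something confirmed here:

```python
def hn_score(points, age_hours, gravity=1.8):
    """Commonly cited approximation of HN's ranking formula.

    Score decays polynomially with age, so old stories sink -- the
    same spirit as the decay factor in streaming k-means, which
    discounts old data so stale clusters fade away.
    """
    return (points - 1) / (age_hours + 2) ** gravity
```

With equal points, a newer story always outranks an older one, which is the time-decay behavior the comment is alluding to.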



