This is a cool feature, and one of the prime examples of what Spark's tight integration of various libraries can enable (in this case Spark Streaming and MLlib). It was originally designed by Jeremy Freeman to handle workloads in neuroscience, which IIRC were generating data at around 1TB per 30 minutes.
Is it true that this doesn't support dynamic values of k? That is, the algorithm isn't adaptive to a changing number of clusters? That said, I suppose for some small range of k values, you could do this trivially by tracking them all and picking the best.
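As a rough illustration of that "track them all" idea (just a sketch, assuming Spark MLlib's StreamingKMeans; the k range, feature dimension, and decay settings are placeholders):

```scala
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.streaming.dstream.DStream

// Train one streaming model per candidate k on the same DStream and
// compare them by WSSSE on each incoming batch.
def trackCandidates(training: DStream[Vector], ks: Seq[Int], dim: Int): Unit = {
  val models = ks.map { k =>
    k -> new StreamingKMeans()
      .setK(k)
      .setDecayFactor(1.0)          // 1.0 = weight all past batches equally
      .setRandomCenters(dim, 0.0)   // random initial centers with zero weight
  }
  models.foreach { case (_, m) => m.trainOn(training) }

  training.foreachRDD { batch =>
    if (!batch.isEmpty()) {
      models.foreach { case (k, m) =>
        // Lower cost on the same batch ~ better fit for that k
        println(s"k=$k cost=${m.latestModel().computeCost(batch)}")
      }
    }
  }
}
```

Note that each batch has already been folded into the models by the time the cost is printed, so this is a running comparison across k rather than a true held-out evaluation.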
This is a common question for me: how to dynamically determine the number of clusters (including splitting / merging clusters). I've looked into Jenks natural breaks, but it also seems to require the number of clusters up front.
One common approach is to look for the elbow in the curve of <metric> vs. k (number of clusters). This is essentially finding the number of clusters beyond which the rate of information gained / variance explained / <metric> slows. I believe it's possible to binary search for this point if you can assume the curve is convex.
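A minimal sketch of that cost-vs-k sweep with the batch MLlib API (the iteration count and the 10% gain threshold are arbitrary placeholders):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Compute the within-set sum of squared errors (WSSSE) for each candidate k.
def costCurve(data: RDD[Vector], ks: Seq[Int]): Seq[(Int, Double)] =
  ks.map { k =>
    val model = KMeans.train(data, k, 20)   // 20 iterations, placeholder
    (k, model.computeCost(data))
  }

// Crude elbow heuristic: return the last k whose relative improvement over
// the previous k is still above a minimum gain (e.g. 10%).
def elbow(curve: Seq[(Int, Double)], minGain: Double = 0.1): Int =
  curve.sliding(2).collectFirst {
    case Seq((kPrev, prev), (_, cost)) if (prev - cost) / prev < minGain => kPrev
  }.getOrElse(curve.last._1)
```

A binary search over k instead of the linear sweep would lean on the convexity assumption mentioned above.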
That's a general drawback of k-means: you have to select k yourself. In practice you try a range of values and see how well each summarizes your data (e.g. with leave-one-out cross-validation).
You could do that here too: just run a range of k values. If only there were a streaming leave-one-out cross-validation for k-means to complement this approach ...
(it is possible to do this in a streaming style, see LWPR)
I understand that's a drawback of k-means. I was just wondering if this was something that Spark solved natively. Looks like the answer is no for the time being. Thanks for the pointer to LWPR :)