Is it true that this doesn't support dynamic values of k? That is, the algorithm isn't adaptive to a changing number of clusters? That said, I suppose that for a small range of k values you could do this trivially by tracking them all and picking the best.
This is a common question for me: how to dynamically determine the number of clusters (including splitting / merging clusters). I've looked into Jenks natural breaks, but it also seems to require the number of clusters up front.
One common approach is to look for the elbow in the curve of <metric> vs. k (number of clusters). This is essentially finding the number of clusters after which the rate of information gained / variance explained / <metric> improvement slows. I believe it's possible to binary search for this point if you can assume the curve is convex.
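For illustration, here's a minimal sketch of that idea. It assumes scikit-learn's KMeans rather than Spark, uses made-up blob data, and swaps the binary search for the common "farthest point from the chord" heuristic, which rests on the same convexity assumption:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Toy data just for the example.
    X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

    # <metric> here is k-means inertia (within-cluster sum of squares).
    ks = list(range(1, 11))
    inertias = []
    for k in ks:
        inertias.append(KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_)

    # Normalize both axes so neither scale dominates the distance.
    pts = np.column_stack([ks, inertias]).astype(float)
    pts = (pts - pts.min(axis=0)) / (pts.max(axis=0) - pts.min(axis=0))

    # Elbow ~ the point farthest from the chord joining the curve's
    # endpoints; a reasonable proxy when the curve is convex.
    chord = pts[-1] - pts[0]
    chord = chord / np.linalg.norm(chord)
    rel = pts - pts[0]
    dists = np.abs(rel[:, 0] * chord[1] - rel[:, 1] * chord[0])
    print("elbow at k =", ks[int(np.argmax(dists))])

A true binary search would probe the curve's discrete curvature instead, but the chord trick is simpler and leans on the same convexity assumption.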
That's a general drawback of k-means: you have to select k yourself. In practice you try a range and see how well each choice summarizes your data (e.g. with leave-one-out cross-validation).
You could do that here too; just run a range of k values. If only there were a streaming leave-one-out cross-validation for k-means to complement this approach ...
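Something like the following rough batch-mode sketch (scikit-learn rather than Spark, with a single held-out split standing in for true leave-one-out, which would be expensive; the 10% improvement cutoff is an arbitrary choice for the example):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import train_test_split

    # Toy data just for the example.
    X, _ = make_blobs(n_samples=1000, centers=5, random_state=1)
    X_train, X_val = train_test_split(X, test_size=0.3, random_state=1)

    ks = list(range(1, 11))
    costs = []
    for k in ks:
        model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X_train)
        # Held-out cost: squared distance from each validation point
        # to its nearest learned centroid.
        d = model.transform(X_val)
        costs.append(float((d.min(axis=1) ** 2).sum()))

    # Held-out cost keeps shrinking as k grows, so instead of taking the
    # raw minimum, stop at the first k whose marginal improvement is small.
    best_k = ks[-1]
    for i in range(1, len(ks)):
        if (costs[i - 1] - costs[i]) / costs[i - 1] < 0.10:
            best_k = ks[i - 1]
            break
    print("chosen k:", best_k)

The streaming analogue would be to score each incoming batch against the current centroids for every candidate k before folding it into the update, but as noted, nothing ships that out of the box here.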
(it is possible to do this in a streaming style, see LWPR)
I understand that's a drawback of k-means. I was just wondering if this was something that Spark solved natively. Looks like the answer is no for the time being. Thanks for the pointer to LWPR :)