Is it true that this doesn't support dynamic values of k? That is, the algorithm isn't adaptive to a changing number of clusters? That said, I suppose that for a small range of k values you could do this trivially by tracking them all and picking the best.
This is a common question for me: how to dynamically determine the number of clusters (including splitting / merging clusters). I've looked into Jenks natural breaks, but it also seems to require the number of clusters up front.
One common approach is to look for the elbow in the curve of <metric> vs. k (number of clusters). This is essentially finding the number of clusters after which the rate of information gained / variance explained / <metric> improvement slows. I believe it's possible to binary search for this point if you can assume the curve is convex.
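For illustration, here's a minimal sketch of that idea. It assumes scikit-learn's KMeans rather than Spark, uses made-up blob data, and swaps the binary search for the common "farthest point from the chord" heuristic, which rests on the same convexity assumption:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Toy data just for the example.
    X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

    # <metric> here is k-means inertia (within-cluster sum of squares).
    ks = list(range(1, 11))
    inertias = []
    for k in ks:
        inertias.append(KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_)

    # Normalize both axes so neither scale dominates the distance.
    pts = np.column_stack([ks, inertias]).astype(float)
    pts = (pts - pts.min(axis=0)) / (pts.max(axis=0) - pts.min(axis=0))

    # Elbow ~ the point farthest from the chord joining the curve's
    # endpoints; a reasonable proxy when the curve is convex.
    chord = pts[-1] - pts[0]
    chord = chord / np.linalg.norm(chord)
    rel = pts - pts[0]
    dists = np.abs(rel[:, 0] * chord[1] - rel[:, 1] * chord[0])
    print("elbow at k =", ks[int(np.argmax(dists))])

A true binary search would probe the curve's discrete curvature instead, but the chord trick is simpler and leans on the same convexity assumption.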
That's a general drawback of k-means: you have to select k yourself. In practice you try a range and see how well each choice summarizes your data (e.g. with leave-one-out cross-validation).
You could do that here too; just run a range of k values. If only there were a streaming leave-one-out cross-validation for k-means to complement this approach ...
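Something like the following rough batch-mode sketch (scikit-learn rather than Spark, with a single held-out split standing in for true leave-one-out, which would be expensive; the 10% improvement cutoff is an arbitrary choice for the example):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.model_selection import train_test_split

    # Toy data just for the example.
    X, _ = make_blobs(n_samples=1000, centers=5, random_state=1)
    X_train, X_val = train_test_split(X, test_size=0.3, random_state=1)

    ks = list(range(1, 11))
    costs = []
    for k in ks:
        model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X_train)
        # Held-out cost: squared distance from each validation point
        # to its nearest learned centroid.
        d = model.transform(X_val)
        costs.append(float((d.min(axis=1) ** 2).sum()))

    # Held-out cost keeps shrinking as k grows, so instead of taking the
    # raw minimum, stop at the first k whose marginal improvement is small.
    best_k = ks[-1]
    for i in range(1, len(ks)):
        if (costs[i - 1] - costs[i]) / costs[i - 1] < 0.10:
            best_k = ks[i - 1]
            break
    print("chosen k:", best_k)

The streaming analogue would be to score each incoming batch against the current centroids for every candidate k before folding it into the update, but as noted, nothing ships that out of the box here.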
(it is possible to do this in a streaming style, see LWPR)
I understand that's a drawback of k-means. I was just wondering if this was something that Spark solved natively. Looks like the answer is no for the time being. Thanks for the pointer to LWPR :)