Depends on your data and what assumptions you can make. I work with sequence data a lot, and for that type of data the MMseqs2 library (https://github.com/soedinglab/MMseqs2) is both very powerful and very popular.
For tabular data as in this blog post, there are a lot of options. For small datasets, hierarchical clustering is very powerful -- you can build and visually inspect a dendrogram and this can give you a lot of insight. It's implemented in SciPy and scikit-learn (e.g. https://docs.scipy.org/doc/scipy/reference/cluster.hierarchy... ). Hierarchical clustering, however, scales poorly: standard implementations need the full pairwise distance matrix, so memory grows quadratically with the number of points. For relatively low-dimensional data the hdbscan algorithm is really nice and is implemented in Python (https://pypi.org/project/hdbscan/).
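To make the dendrogram workflow concrete, here's a minimal sketch using SciPy's hierarchical clustering on a small synthetic 2-D dataset (the data and parameter choices are purely illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs in 2-D
X = np.vstack([rng.normal(0, 1, (20, 2)),
               rng.normal(8, 1, (20, 2))])

Z = linkage(X, method="ward")  # build the cluster tree bottom-up
# scipy.cluster.hierarchy.dendrogram(Z) would plot the tree for visual
# inspection; here we just cut it into a flat clustering of 2 clusters:
labels = fcluster(Z, t=2, criterion="maxclust")
```

Inspecting the dendrogram before cutting is the real payoff here: the cut height (or cluster count) can be chosen by eye rather than guessed up front.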
If you have reason to think your data is modeled reasonably well as a mixture of Gaussians (think lots of elliptical clusters of various sizes) and it's not too high-dimensional, a mixture of Gaussians can work well; unlike k-means, it's probabilistic and doesn't assume all clusters are spherical and roughly equal in size. This too is implemented in scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.mi...). If you think a mixture of Gaussians is reasonable but you know there are outliers, a mixture of Student t-distributions will work better; this is not in scikit-learn but there are multiple implementations on github.
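As a sketch of the mixture-of-Gaussians route (synthetic data and illustrative settings only):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Two elliptical clusters of different sizes -- exactly the setting
# where k-means's spherical, equal-size assumptions hurt.
X = np.vstack([
    rng.multivariate_normal([0, 0], [[3.0, 0.0], [0.0, 0.3]], 200),
    rng.multivariate_normal([6, 6], [[0.3, 0.0], [0.0, 3.0]], 50),
])

gmm = GaussianMixture(n_components=2, covariance_type="full",
                      random_state=0).fit(X)
labels = gmm.predict(X)       # hard assignments
probs = gmm.predict_proba(X)  # soft (probabilistic) assignments
```

covariance_type="full" is what lets each component learn its own elliptical shape; n_components still has to be chosen, e.g. by comparing BIC across candidate values.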
It's also possible to improve k-means by using approximate kernel k-means: compute a random Fourier features (https://people.eecs.berkeley.edu/~brecht/papers/07.rah.rec.n...) representation for each input datapoint, then run k-means on that. This approximates kernel k-means, so it relaxes some of the unrealistic assumptions of k-means -- we no longer assume clusters are spherical. It may still work poorly if there are outliers, though, and it still requires us to choose both the number of clusters and the lengthscale of the kernel we are approximating, which may be hard unless you already have a pretty good intuition for what are appropriate choices for your data.
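A sketch of the random-Fourier-features trick, using scikit-learn's RBFSampler as the feature map (the toy data, gamma, and feature count are illustrative guesses -- which is exactly the tuning difficulty just mentioned):

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Two concentric rings: a shape plain k-means cannot separate.
theta = rng.uniform(0, 2 * np.pi, 200)
radii = np.where(np.arange(200) < 100, 1.0, 5.0)
X = np.c_[radii * np.cos(theta), radii * np.sin(theta)]
X += rng.normal(0, 0.05, X.shape)

# Approximate an RBF kernel with random Fourier features, then run
# ordinary k-means in that feature space. gamma (the inverse squared
# lengthscale) and n_components both have to be chosen by hand.
rff = RBFSampler(gamma=1.0, n_components=500, random_state=0)
Z = rff.fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
```

Whether this actually recovers the rings depends heavily on gamma; a poor lengthscale choice can leave you no better off than plain k-means.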
There are other options that are sometimes useful; in fact, I could easily write a blog post about this (maybe I should). But the thing with clustering is that the "right" choice of algorithm depends on your data and on what assumptions are reasonable to make. People sometimes end up using k-means because it's fast (especially if you use minibatch k-means) and can scale to crazy large datasets. But it makes very strong assumptions which are usually wrong (most datasets do not subdivide well into some number of spherical Gaussians of roughly equal size), and this can result in truly absurd partitions, especially when there are lots of outliers or clusters with highly irregular shapes.
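For reference, minibatch k-means in scikit-learn is a near drop-in replacement for the standard version (synthetic data and parameters below are illustrative):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(3)
# 10,000 points in 10-D: small enough to run anywhere, but the same
# code scales to very large datasets because each update only touches
# one minibatch of rows.
X = np.vstack([rng.normal(0, 1, (5000, 10)),
               rng.normal(5, 1, (5000, 10))])

mbk = MiniBatchKMeans(n_clusters=2, batch_size=256, n_init=3,
                      random_state=0).fit(X)
labels = mbk.labels_
```

It's fast, but all the caveats above still apply: on non-spherical clusters or outlier-heavy data the partitions can be just as absurd, only computed more quickly.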