
What are people using k-means for? I can count on one hand the number of times I’ve had a good a priori rationale for the value of k.



Polis (and Twitter's community notes, I believe)

Participation At Scale Can Repair The Public Square https://www.noemamag.com/participation-at-scale-can-repair-t...

Polis: Scaling deliberation by mapping high dimensional opinion spaces https://scholar.google.com/scholar?q=Polis:+Scaling+delibera...

Restricting clustering to 2-5 groups impacts group aware/informed consensus and comment routing https://github.com/compdemocracy/polis/issues/1289


The k-means objective is exactly the metric you would want to optimize for the performance of an algorithm like [bolt](https://arxiv.org/abs/1706.10283). In that and other discretization routines, the value of k is a parameter related to compression ratio, efficiency, and other metrics more predictable than some ethereal notion of how many clusters the data "naturally" has.
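A rough sketch of that framing, assuming scikit-learn and made-up data (Bolt itself uses product quantization over subspaces; this is just the basic codebook idea):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy sketch: a k-means codebook as a vector quantizer. Here k sets the
# compression ratio (log2(k) bits per vector) and the reconstruction error,
# not a "natural" number of clusters.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(5_000, 32)).astype(np.float32)

k = 256  # one byte per encoded vector
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)
codes = km.labels_.astype(np.uint8)          # compressed representation
reconstructed = km.cluster_centers_[codes]   # decode by centroid lookup

mse = float(np.mean((vectors - reconstructed) ** 2))
print(f"stored {vectors.nbytes} bytes as {codes.nbytes} bytes (+ codebook), MSE={mse:.3f}")
```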


I would recommend checking out DBSCAN; it solves a similar problem without having to provide a number k https://en.m.wikipedia.org/wiki/DBSCAN
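For example, a toy comparison with scikit-learn (dataset and parameters made up): DBSCAN takes a neighborhood radius and a minimum density instead of k, and can mark points as noise:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Toy data with two crescent-shaped clusters plus a bit of noise.
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

# DBSCAN: no k; instead a neighborhood radius (eps) and a minimum
# neighborhood size. Points that fit nowhere are labeled -1 (noise).
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
n_found = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("DBSCAN found", n_found, "clusters")

# k-means needs k up front and splits the moons along straight boundaries.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
```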


I've used something similar for tissue segmentation from hyperspectral images of animals where I know there should be K different tissue types I care about.


We used k-means clustering on a project that tracked fruit fly memory and learning behaviors

http://git.ceux.org/FlyTracking.git/


Sometimes you can use a heuristic to estimate K, or use a variant that terminates at some distance threshold.

That said, something like hdbscan doesn’t suffer from this problem.


Used it in college to reduce an image with X colors down to Y colors. Sure, Photoshop does it, but it was informative to do it manually.
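A sketch of that exercise with Pillow and scikit-learn (the filenames and Y are placeholders):

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

Y = 16  # target number of colors (placeholder)

# Cluster all pixels in RGB space, then snap each pixel to its cluster's mean color.
img = np.asarray(Image.open("input.png").convert("RGB"))
pixels = img.reshape(-1, 3).astype(np.float32)

km = KMeans(n_clusters=Y, n_init=10, random_state=0).fit(pixels)
quantized = km.cluster_centers_[km.labels_].astype(np.uint8).reshape(img.shape)
Image.fromarray(quantized).save("quantized.png")
```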


Google's Material You uses this to initiate color theming

(n.b. Celebi's variant; note the usage of Lab, not HSL, respect the Cartesian / polar nature of inputs / outputs, and ask for a high K. 128 is what we went with, but it's arbitrary. Can get away with as few as 32 if you're e.g. doing brand color from a favicon.)


Ooh this is a much nicer approach than the kind of brute force approach we took at work for theme gen


I did a modified version of this once for a map of auto dealerships, although rather than working with a fixed k, I used a fixed threshold for cluster distance. The algorithm I was working with had O(n³) complexity so to keep the pregeneration of clusters manageable, I partitioned data by state. The other fun part was finding the right metric formula for measuring distances. Because clusters needed to correspond to the rectangular view window on the map, rather than a standard Euclidean distance, I used d = max(|Δx|, |Δy|) (the Chebyshev distance), which gives square neighborhoods rather than round ones.
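A rough sketch of that scheme using hierarchical clustering from SciPy rather than whatever the original algorithm was (coordinates and threshold are made up); the Chebyshev metric gives the square neighborhoods described above:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Fake dealership coordinates for one state's partition.
points = np.random.default_rng(0).uniform(0, 100, size=(200, 2))

# Chebyshev distance d = max(|dx|, |dy|): "square" neighborhoods that match
# a rectangular map viewport better than round Euclidean ones.
dists = pdist(points, metric="chebyshev")

# Cut the dendrogram at a fixed distance threshold instead of a fixed k.
tree = linkage(dists, method="complete")
labels = fcluster(tree, t=5.0, criterion="distance")
print("clusters:", labels.max())
```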


I implemented an algorithm which used k-means to reduce noise in a path tracer.

For each pixel instead of a single color value it generated k mean color values, using an online algorithm. These were then combined to produce the final pixel color.

The idea was that a pixel might have several distinct contributions (from different light sources, for example), but due to the random sampling used in path tracing, the variance of the sample values is usually large.

The value k then was chosen based on scene complexity. There was also a memory trade-off of course, as memory usage was linear in k.


> For each pixel instead of a single color value it generated k mean color values, using an online algorithm.

What does online mean here?


Online means it processes the items as they come[1]. This means the algorithm can't consider all the items at once, and has to try to be clever on the spot.

In my case the algorithm used would take the first k samples as the initial means, and would then find which of the k means was closest to the current sample and update that mean[2].

Given that in path tracing one would typically use a fairly large number of samples per pixel relative to k, this approach did a reasonable job of approximating the k means.

[1]: https://en.wikipedia.org/wiki/Online_algorithm

[2]: https://yokolet.com/2017/05/29/online-algorithm.html
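A minimal sketch of that update rule, assuming Euclidean distance and made-up sample data:

```python
import numpy as np

def online_kmeans(samples, k):
    """Seed the means with the first k samples, then move the nearest
    mean toward each new sample as it arrives (running-mean update)."""
    means, counts = [], []
    for x in samples:
        x = np.asarray(x, dtype=float)
        if len(means) < k:
            means.append(x.copy())
            counts.append(1)
            continue
        i = int(np.argmin([np.linalg.norm(x - m) for m in means]))
        counts[i] += 1
        means[i] += (x - means[i]) / counts[i]
    return means

# e.g. noisy 1-D "radiance" samples drawn from three distinct contributions
rng = np.random.default_rng(0)
samples = rng.choice([0.1, 0.5, 0.9], size=500) + rng.normal(0, 0.02, 500)
print(online_kmeans(samples.reshape(-1, 1), k=3))
```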


k-means is good for fast unsupervised clustering on an unknown low-dimensional dataset. It's helpful for EDA.

If you want accuracy at an order of magnitude more compute, you can use something like DBSCAN.


Real estate price estimates would be the classic, and frankly still common, example


Measuring multiple physical objects with the same sensor, you can use k-means to separate the measurements from each object, given that you know how many objects are being sensed. I can't get more specific than that.


Sounds like you're describing my carpet-color classification project! [1]

Built as part of a larger carpet-based localisation project [2]

1: https://nbviewer.org/github/tim-fan/carpet_color_classificat...

2: https://github.com/tim-fan/carpet_localisation/wiki/Carpet-L...


Baseball pitch types based on physics profiles.


Constantly in remote sensing and GIS work.


Frequently used in e-commerce, such as RFM clustering for targeted marketing.
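As an illustration only (the table and segment count are invented), an RFM segmentation might look roughly like this:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Tiny invented orders table.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "order_date": pd.to_datetime(["2024-01-05", "2024-03-01", "2024-02-10",
                                  "2024-01-20", "2024-02-15", "2024-03-10"]),
    "amount": [50.0, 20.0, 200.0, 10.0, 15.0, 12.0],
})

# Recency / Frequency / Monetary features per customer.
now = orders["order_date"].max()
rfm = orders.groupby("customer_id").agg(
    recency=("order_date", lambda d: (now - d.max()).days),
    frequency=("order_date", "count"),
    monetary=("amount", "sum"),
)

# Scale the features and cluster into k marketing segments (k chosen by hand).
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(rfm))
print(rfm.assign(segment=segments))
```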


Additional small molecule pharmaceutical candidates via molecular descriptors


I've used it for identifying dominant colors in images.


Data segmentation (e.g. Voronoi partitioning), grouping (like search query results), anomaly detection. Lots of different things.


So after re-reading your comment a few times, I am left with this thought: either you don't understand what k-means clustering is, or I don't understand what k-means clustering is. I wouldn't describe myself as a machine learning expert, but I have taken some grad-level classes in statistics, analytics, and methods like or related to this.

So my question is... could you elaborate?


Not GP, but I understood their question as follows.

Assume you gather some kindergartners and top NBA players in a room and collect their heights. Now say you pass these heights to two hapless grad students and ask them to perform K-means clustering.

Suppose one of the grad students knows the composition of the people you measured and can guess that the heights should clump into 2 nice clusters. The other student doesn't know the composition of the group: what should they guess K to be?

I understood the GP's comment to refer to the state of the second grad student. How useful is K-means clustering without knowing K in advance?


>I understood the GP's comment to refer to the state of the second grad student. How useful is K-means clustering without knowing K in advance?

There are several heuristics for this. Googling, I see that the elbow method, average silhouette method, and gap statistic method are the most used.

I think you could play around with your own heuristics as well. Simple KDE plots showing the number of peaks, for instance. Or requiring that the variance between clusters be greater than the variance inside any cluster could work. (Edit: this seems to be the main point of the average silhouette method.)
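For instance, the average-silhouette heuristic can be a few lines with scikit-learn (toy data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data with 4 well-separated blobs.
X, _ = make_blobs(n_samples=600, centers=4, random_state=0)

# Fit k-means for a range of k and keep the k with the best mean silhouette.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores)  # on this toy data best_k should come out as 4
```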


The first problem is picking k. The second problem is the definition of distance or, equivalently, the uniformity of the space. Naively using the Euclidean distance in an embedding where similarity is non-uniform leads to bad outcomes. This problem is solved by learning a uniform embedding, and this is much harder than running k-means.

k-means assumes these hard parts of the problem are taken care of and offers a trivial solution to the rest. Thanks for the help, I'll cluster it myself.


You have to choose the number of clusters before using k-means.

Imagine that you have a dataset where you think there are likely meaningful clusters, but you don't know how many, especially when the data is high-dimensional.

If you pick a k that is too small, you lump unrelated points together.

If k is too large, your meaningful clusters will be fragmented/overfitted.

There are some algorithms that try to estimate the number of clusters or try to find the k with the best fit to the data to make up for this.


Couldn’t you make some educated guesses and then stop when you arrive at a K that gives you meaningful clusters that are neither too high-level nor too atomized?


Probably not the best in terms of efficiency.

Easier just to deliberately overshoot (with a k that's too high) and then merge any clusters with too much overlap.
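A sketch of that idea (overshoot, then merge centroids closer than a hand-picked threshold; data and threshold are made up):

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with 3 true clusters.
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=1.0, random_state=0)

# Deliberately overshoot with a too-high k...
km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X)

# ...then merge centroids that end up within a distance threshold of each other.
merge = fcluster(linkage(km.cluster_centers_, method="single"),
                 t=2.5, criterion="distance")
final_labels = merge[km.labels_] - 1
print("final cluster count:", final_labels.max() + 1)
```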


K is 3.

It’s honestly fine for just finding key differences like a principal component for light storytelling. They don’t need to be distinct clusters


SMOTE


knowledge graphs



