
What are people using k-means for? I can count on one hand the number of times I’ve had a good a priori rationale for the value of k.



Polis (and Twitter's community notes, I believe)

Participation At Scale Can Repair The Public Square https://www.noemamag.com/participation-at-scale-can-repair-t...

Polis: Scaling deliberation by mapping high dimensional opinion spaces https://scholar.google.com/scholar?q=Polis:+Scaling+delibera...

Restricting clustering to 2-5 groups impacts group aware/informed consensus and comment routing https://github.com/compdemocracy/polis/issues/1289


The k-means objective is exactly the metric you would want to optimize for the performance of an algorithm like [bolt](https://arxiv.org/abs/1706.10283). In that and other discretization routines, the value of k is a parameter related to compression ratio, efficiency, and other metrics more predictable than some ethereal notion of how many clusters the data "naturally" has.
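A rough sketch of that framing, assuming scikit-learn and made-up data (Bolt itself uses product quantization over subspaces; this is just the basic codebook idea):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy sketch: a k-means codebook as a vector quantizer. Here k sets the
# compression ratio (log2(k) bits per vector) and the reconstruction error,
# not a "natural" number of clusters.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(5_000, 32)).astype(np.float32)

k = 256  # one byte per encoded vector
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)
codes = km.labels_.astype(np.uint8)          # compressed representation
reconstructed = km.cluster_centers_[codes]   # decode by centroid lookup

mse = float(np.mean((vectors - reconstructed) ** 2))
print(f"stored {vectors.nbytes} bytes as {codes.nbytes} bytes (+ codebook), MSE={mse:.3f}")
```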


I would recommend checking out DBSCAN; it solves a similar problem without having to provide a number k https://en.m.wikipedia.org/wiki/DBSCAN
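For example, a toy comparison with scikit-learn (dataset and parameters made up): DBSCAN takes a neighborhood radius and a minimum density instead of k, and can mark points as noise:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Toy data with two crescent-shaped clusters plus a bit of noise.
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

# DBSCAN: no k; instead a neighborhood radius (eps) and a minimum
# neighborhood size. Points that fit nowhere are labeled -1 (noise).
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
n_found = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("DBSCAN found", n_found, "clusters")

# k-means needs k up front and splits the moons along straight boundaries.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
```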


I've used something similar for tissue segmentation from hyperspectral images of animals where I know there should be K different tissue types I care about.


We used k-means clustering on a project that tracked fruit fly memory and learning behaviors

http://git.ceux.org/FlyTracking.git/


Sometimes you can use a heuristic to estimate K, or use a variant that terminates at some distance threshold.

That said, something like hdbscan doesn’t suffer from this problem.


Used it in college to reduce an image with X colors down to Y colors. Sure, Photoshop does it, but it was informative to do it manually.
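A sketch of that exercise with Pillow and scikit-learn (the filenames and Y are placeholders):

```python
import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

Y = 16  # target number of colors (placeholder)

# Cluster all pixels in RGB space, then snap each pixel to its cluster's mean color.
img = np.asarray(Image.open("input.png").convert("RGB"))
pixels = img.reshape(-1, 3).astype(np.float32)

km = KMeans(n_clusters=Y, n_init=10, random_state=0).fit(pixels)
quantized = km.cluster_centers_[km.labels_].astype(np.uint8).reshape(img.shape)
Image.fromarray(quantized).save("quantized.png")
```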


Google's Material You uses this to initiate color theming

(n.b. Celebi's variant; note the usage of Lab, not HSL, respect the Cartesian / polar nature of inputs / outputs, and ask for a high K. 128 is what we went with, but it's arbitrary. Can get away with as few as 32 if you're e.g. doing brand color from a favicon.)


Ooh this is a much nicer approach than the kind of brute force approach we took at work for theme gen


I did a modified version of this once for a map of auto dealerships, although rather than working with a fixed k, I used a fixed threshold for cluster distance. The algorithm I was working with had O(n³) complexity so to keep the pregeneration of clusters manageable, I partitioned data by state. The other fun part was finding the right metric formula for measuring distances. Because clusters needed to correspond to the rectangular view window on the map, rather than a standard Euclidean distance, I used d = max(|Δx|, |Δy|) (the Chebyshev distance), which gives square neighborhoods rather than round ones.
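A rough sketch of that scheme using hierarchical clustering from SciPy rather than whatever the original algorithm was (coordinates and threshold are made up); the Chebyshev metric gives the square neighborhoods described above:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Fake dealership coordinates for one state's partition.
points = np.random.default_rng(0).uniform(0, 100, size=(200, 2))

# Chebyshev distance d = max(|dx|, |dy|): "square" neighborhoods that match
# a rectangular map viewport better than round Euclidean ones.
dists = pdist(points, metric="chebyshev")

# Cut the dendrogram at a fixed distance threshold instead of a fixed k.
tree = linkage(dists, method="complete")
labels = fcluster(tree, t=5.0, criterion="distance")
print("clusters:", labels.max())
```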


I implemented an algorithm which used k-means to reduce noise in a path tracer.

For each pixel instead of a single color value it generated k mean color values, using an online algorithm. These were then combined to produce the final pixel color.

The idea was that a pixel might have several distinct contributions (from different light sources, for example), but due to the random sampling used in path tracing, the variance of the sample values is usually large.

The value k then was chosen based on scene complexity. There was also a memory trade-off of course, as memory usage was linear in k.


> For each pixel instead of a single color value it generated k mean color values, using an online algorithm.

What does online mean here?


Online means it processes the items as they come[1]. This means the algorithm can't consider all the items at once, and has to try to be clever on the spot.

In my case the algorithm used would take the first k samples as the initial means, and would then find which of the k means was closest to the current sample and update that mean[2].

Given that in path tracing one would typically use a fairly large number of samples per pixel relative to k, this approach did a reasonable job of approximating the k means.

[1]: https://en.wikipedia.org/wiki/Online_algorithm

[2]: https://yokolet.com/2017/05/29/online-algorithm.html
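A minimal sketch of that update rule, assuming Euclidean distance and made-up sample data:

```python
import numpy as np

def online_kmeans(samples, k):
    """Seed the means with the first k samples, then move the nearest
    mean toward each new sample as it arrives (running-mean update)."""
    means, counts = [], []
    for x in samples:
        x = np.asarray(x, dtype=float)
        if len(means) < k:
            means.append(x.copy())
            counts.append(1)
            continue
        i = int(np.argmin([np.linalg.norm(x - m) for m in means]))
        counts[i] += 1
        means[i] += (x - means[i]) / counts[i]
    return means

# e.g. noisy 1-D "radiance" samples drawn from three distinct contributions
rng = np.random.default_rng(0)
samples = rng.choice([0.1, 0.5, 0.9], size=500) + rng.normal(0, 0.02, 500)
print(online_kmeans(samples.reshape(-1, 1), k=3))
```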


k-means is good for fast unsupervised clustering on an unknown low-dimensional dataset. It's helpful for EDA.

If you want accuracy at an order of magnitude more compute, you can use something like DBSCAN.


Real estate price estimates would be the classic, and frankly still common, example


Measuring multiple physical objects with the same sensor, you can use k-means to separate the measurements from each object, given that you know how many objects are being sensed. I can't get more specific than that.


Sounds like you're describing my carpet-color classification project! [1]

Built as part of a larger carpet-based localisation project [2]

1: https://nbviewer.org/github/tim-fan/carpet_color_classificat...

2: https://github.com/tim-fan/carpet_localisation/wiki/Carpet-L...


Baseball pitch types based on physics profiles.


Constantly in remote sensing and GIS work.


Frequently used in e-commerce, such as RFM clustering for targeted marketing.
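As an illustration only (the table and segment count are invented), an RFM segmentation might look roughly like this:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Tiny invented orders table.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "order_date": pd.to_datetime(["2024-01-05", "2024-03-01", "2024-02-10",
                                  "2024-01-20", "2024-02-15", "2024-03-10"]),
    "amount": [50.0, 20.0, 200.0, 10.0, 15.0, 12.0],
})

# Recency / Frequency / Monetary features per customer.
now = orders["order_date"].max()
rfm = orders.groupby("customer_id").agg(
    recency=("order_date", lambda d: (now - d.max()).days),
    frequency=("order_date", "count"),
    monetary=("amount", "sum"),
)

# Scale the features and cluster into k marketing segments (k chosen by hand).
segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(rfm))
print(rfm.assign(segment=segments))
```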


Additional small molecule pharmaceutical candidates via molecular descriptors


I've used it for identifying dominant colors in images.


Data segmentation (e.g. Voronoi partitioning), grouping (like search query results), anomaly detection. Lots of different things.


So after re-reading your comment a few times, I am left with this thought: either you don't understand what k-means clustering is, or I don't understand what k-means clustering is. I wouldn't describe myself as a machine learning expert, but I have taken some grad-level classes in statistics, analytics, and methods like or related to this.

So my question is... could you elaborate?


Not GP, but I understood their question as follows.

Assume you gather some kindergartners and top NBA players in a room and collect their heights. Now say you pass these heights to two hapless grad students and ask them to perform K-means clustering.

Suppose one of the grad students knows the composition of the people you measured and can guess that the heights should clump into 2 nice clusters. The other student doesn't know the composition of the group: what should they guess K to be?

I understood the GP's comment to refer to the state of the second grad student. How useful is K-means clustering without knowing K in advance?


>I understood the GP's comment to refer to the state of the second grad student. How useful is K-means clustering without knowing K in advance?

There are several heuristics for this. Googling, I see that the elbow method, average silhouette method, and gap statistic method are the most used.

I think you could play around with your own heuristics as well. Simple KDE plots showing the number of peaks, for instance. Or requiring that the variance between clusters be greater than the variance inside any cluster could work. (Edit: this seems to be the main point of the average silhouette method.)
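For instance, the average-silhouette heuristic can be a few lines with scikit-learn (toy data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data with 4 well-separated blobs.
X, _ = make_blobs(n_samples=600, centers=4, random_state=0)

# Fit k-means for a range of k and keep the k with the best mean silhouette.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores)  # on this toy data best_k should come out as 4
```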


The first problem is picking k. The second problem is the definition of distance or, equivalently, the uniformity of the space. Naively using the Euclidean distance in an embedding where similarity is non-uniform leads to bad outcomes. This problem is solved by learning a uniform embedding, and this is much harder than running k-means.

k-means assumes these hard parts of the problem are taken care of and offers a trivial solution to the rest. Thanks for the help, I'll cluster it myself.


You have to choose the number of clusters before using k-means.

Imagine that you have a dataset where you think there are likely meaningful clusters, but you don't know how many, especially when the data is high-dimensional.

If you pick a k that is too small, you lump unrelated points together.

If k is too large, your meaningful clusters will be fragmented/overfitted.

There are some algorithms that try to estimate the number of clusters or try to find the k with the best fit to the data to make up for this.


Couldn’t you make some educated guesses and then stop when you arrive at a K that gives you meaningful clusters that are neither too high-level nor too atomized?


Probably not the best in terms of efficiency.

Easier just to deliberately overshoot (with a k that's too high) and then merge any clusters with too much overlap.
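A sketch of that idea (overshoot, then merge centroids closer than a hand-picked threshold; data and threshold are made up):

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with 3 true clusters.
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=1.0, random_state=0)

# Deliberately overshoot with a too-high k...
km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X)

# ...then merge centroids that end up within a distance threshold of each other.
merge = fcluster(linkage(km.cluster_centers_, method="single"),
                 t=2.5, criterion="distance")
final_labels = merge[km.labels_] - 1
print("final cluster count:", final_labels.max() + 1)
```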


K is 3.

It’s honestly fine for just finding key differences like a principal component for light storytelling. They don’t need to be distinct clusters


SMOTE


knowledge graphs



