I bet language models could definitely help here, yeah. Perhaps something like (...

gwern · 2024-04-02T01:50:11 1712022611

Yep. That would be a classic sort of k-means problem. Just throw them all into a standard embedding, like the OpenAI API embeddings, run k-means from scikit-learn, then convert them into a list-of-lists: one RSS item (containing a list of title-URLs) per cluster.
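
For anyone wanting to try it, here is a rough sketch of that pipeline. It assumes you already have one embedding vector per item; the embedding call itself is left out, and the tiny 2-D vectors, titles, and example.com URLs below are made up purely for illustration. `group_by_cluster` is a hypothetical helper name, not something from scikit-learn.

```python
from collections import defaultdict

import numpy as np
from sklearn.cluster import KMeans


def group_by_cluster(items, embeddings, k, seed=0):
    """Cluster items by embedding vector; return one list of items per cluster."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(embeddings)
    groups = defaultdict(list)
    for item, label in zip(items, labels):
        groups[int(label)].append(item)
    # List-of-lists: each inner list would become one RSS item.
    return [groups[label] for label in sorted(groups)]


# Hypothetical (title, URL) pairs with stand-in "embeddings".
# Real embeddings would come from an embedding API and have many more dimensions.
items = [
    ("Intro to k-means", "https://example.com/a"),
    ("Clustering text", "https://example.com/b"),
    ("Sourdough basics", "https://example.com/c"),
    ("Baking with rye", "https://example.com/d"),
]
embeddings = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]])

clusters = group_by_cluster(items, embeddings, k=2)
for cluster in clusters:
    print([title for title, _url in cluster])
```

Here k is passed in by hand, which is the part you'd still need to decide.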

woodglyst · 2024-04-02T15:07:57 1712070477

The problem with this approach is determining what k should be for k-means. But again, we could use the “elbow” technique to find the optimal k and then start grouping them together. I wonder if there are any more sophisticated clustering algorithms that pick the number of clusters automatically?
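
A minimal sketch of the elbow idea, assuming scikit-learn's `KMeans` and synthetic data in place of real embeddings. The "largest second difference of the inertia curve" rule used to pick the elbow is one crude heuristic among several (an assumption here, not a standard API), and `elbow_k` is a hypothetical helper name.

```python
import numpy as np
from sklearn.cluster import KMeans


def elbow_k(embeddings, k_max=6, seed=0):
    """Return (best_k, inertias) using a crude largest-second-difference rule."""
    ks = list(range(1, k_max + 1))
    inertias = [
        KMeans(n_clusters=k, n_init=10, random_state=seed).fit(embeddings).inertia_
        for k in ks
    ]
    # Second difference of the inertia curve; the sharpest bend is the "elbow".
    # bends[i] corresponds to k = ks[i + 1].
    bends = np.diff(inertias, n=2)
    best_k = ks[int(np.argmax(bends)) + 1]
    return best_k, inertias


# Synthetic stand-in for embeddings: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 9.0]])
embeddings = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

best_k, inertias = elbow_k(embeddings)
print(best_k, [round(v, 1) for v in inertias])
```

On real embeddings the curve is usually much smoother than on toy blobs, so it is worth plotting the inertias and eyeballing them rather than trusting the automatic pick.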

memhole · 2024-04-02T19:15:36 1712085336

Hierarchical clustering and DBSCAN don’t require knowing the number of clusters up front.
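
A quick sketch of both, assuming scikit-learn's `AgglomerativeClustering` and `DBSCAN` and the same kind of synthetic blobs standing in for embeddings. Worth noting that neither is parameter-free: you swap k for a distance threshold (hierarchical) or for `eps` and `min_samples` (DBSCAN). The values below are hand-picked for this toy data and would need tuning on real embeddings.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering

# Synthetic stand-in for embeddings: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 9.0]])
embeddings = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Hierarchical: no n_clusters; stop merging once clusters are farther apart
# than distance_threshold (average pairwise distance, given linkage="average").
agg = AgglomerativeClustering(
    n_clusters=None, distance_threshold=5.0, linkage="average"
).fit(embeddings)
n_agg = agg.n_clusters_

# DBSCAN: clusters are dense regions; points labelled -1 are treated as noise.
db_labels = DBSCAN(eps=2.0, min_samples=3).fit_predict(embeddings)
n_db = len(set(db_labels.tolist()) - {-1})

print("hierarchical clusters:", n_agg)
print("DBSCAN clusters:", n_db)
```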