
There's a second half of a two-hour video on YouTube that talks about creating embeddings using some pre-transforms followed by SVD with some distance shenanigans:

https://www.youtube.com/watch?v=Z6s7PrfJlQ0&t=3084s

It's 4 years old and seems to be a bit of a hidden gem. Someone even pipes up at 1:26 to say "This is really cool. Is this written up somewhere?"

[snapshot of the code shown]

    %%time
    import numpy as np
    import scipy.sparse
    import scipy.sparse.linalg
    import sklearn.preprocessing
    import vectorizers

    # Count co-occurrences in a 20-token window after each token,
    # weighting nearby tokens more heavily (harmonic kernel).
    # tokenized_news (prepared earlier in the video) is a list of token lists.
    cooc = vectorizers.TokenCooccurrenceVectorizer(
        window_orientation="after",
        kernel_function="harmonic",
        min_document_occurrences=5,
        window_radius=20,
    ).fit(tokenized_news)

    context_after_matrix = cooc.transform(tokenized_news)
    # "Before" contexts are the transpose of the "after" contexts.
    context_before_matrix = context_after_matrix.transpose()

    # Stack before/after contexts, normalize columns (max) and rows (l1),
    # then dampen large values with a fourth root.
    cooc_matrix = scipy.sparse.hstack([context_before_matrix, context_after_matrix])
    cooc_matrix = sklearn.preprocessing.normalize(cooc_matrix, norm="max", axis=0)
    cooc_matrix = sklearn.preprocessing.normalize(cooc_matrix, norm="l1", axis=1)
    cooc_matrix.data = np.power(cooc_matrix.data, 0.25)

    # Truncated SVD: scale the left singular vectors by sqrt(singular values)
    # to get 160-dimensional word vectors.
    u, s, v = scipy.sparse.linalg.svds(cooc_matrix, k=160)
    word_vectors = u @ scipy.sparse.diags(np.sqrt(s))

CPU times: user 3min 5s, sys: 20.2 s, total: 3min 25s
Wall time: 1min 26s
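
The snapshot stops at word_vectors without showing the "distance shenanigans". A minimal sketch of how you might query the result (variable names follow the snapshot; the token_label_dictionary_ attribute and the cosine nearest-neighbour lookup are my assumptions, not necessarily what the video does):

    import numpy as np
    import sklearn.preprocessing

    # L2-normalize rows so a plain dot product is cosine similarity.
    wv = sklearn.preprocessing.normalize(word_vectors, norm="l2", axis=1)
    # Assumed vectorizers attribute: maps token -> row index.
    token_to_row = cooc.token_label_dictionary_
    row_to_token = {i: t for t, i in token_to_row.items()}

    def nearest(token, k=10):
        # Cosine similarity of every word vector against the query token.
        sims = wv @ wv[token_to_row[token]]
        order = np.argsort(-sims)
        return [(row_to_token[i], float(sims[i]))
                for i in order if row_to_token[i] != token][:k]

    nearest("president")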



That’s Leland McInnes, author of UMAP, the widely used dimension reduction tool.


I know; I mentioned his name in a post last week and figured doing so again might seem a bit fanboy-ish. I am kind of a fan, but mostly a fan of good explanations. He's just self-selecting for the group.



