After working with LLMs for long enough, I found myself wanting a lightweight utility for small tasks like preparing inputs, locating information and creating evaluators. This library is two things: a very simple model, and utilities that run inference with it (e.g. fuzzy deduplication). The target platform is CPU, and it's intended to be light, fast and pip installable: a library that lowers the barrier to working with strings semantically. You don't need to install PyTorch or any other deep learning runtime to use it.
How is this accomplished? The model is simply token embeddings that are average pooled. To create it, I extracted the token embedding (nn.Embedding) vectors from LLMs, concatenated them along the embedding dimension, added a learnable weight parameter, and projected them down to a smaller dimension. Using the sentence-transformers framework and datasets, I trained the pooled embeddings with multiple negatives ranking loss and Matryoshka representation learning so they can be truncated. After training, the weights and projection are no longer needed, because there are no contextual calculations: I run inference over the entire token vocabulary once and save the resulting token embeddings to be loaded with numpy.
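To make that concrete, here is a rough sketch of what average-pooled static token embeddings look like with numpy. The function and argument names here are hypothetical, not this library's actual API:

    import numpy as np

    # Hypothetical sketch: average-pool precomputed static token embeddings.
    # `token_embeddings` is the saved (vocab_size, dim) matrix and `tokenize`
    # maps text to a list of token ids; neither name is the library's API.
    def embed(texts, token_embeddings, tokenize, dim=None):
        out = []
        for text in texts:
            ids = tokenize(text)                     # token ids for the text
            vecs = token_embeddings[ids]             # (n_tokens, dim) lookup
            pooled = vecs.mean(axis=0)               # average pooling, no context
            if dim is not None:                      # Matryoshka-style truncation
                pooled = pooled[:dim]
            out.append(pooled / np.linalg.norm(pooled))
        return np.vstack(out)

Because each token's vector is fixed, inference reduces to a matrix lookup plus a mean, which is why no deep learning runtime is needed at query time.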
While the results are not impressive compared to transformer models, they perform well on MTEB benchmarks against word embedding models (which they are most similar to), while being much smaller in size (the smallest model, with a 32k vocabulary and 64 dimensions, is only 4MB).
On the utility side, I've been adding tools I think it will be useful for. In addition to general embedding, there are algorithms for ranking, filtering, clustering, deduplication and similarity. Some of them have a Cython implementation, and I'm continuing to benchmark and improve them as I have time. In addition to the "standard" models that use cosine similarity for some algorithms, there are binarized models that use Hamming distance. This is a slightly faster similarity computation with significantly less memory per embedding (float32 -> 1 bit).
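As an illustration of the binarized variant (again a numpy sketch, not the library's own API), binarization keeps only the sign of each dimension and packs it into bits, and similarity becomes a bit count over the XOR of two vectors:

    import numpy as np

    # Hypothetical sketch of binarized embeddings, not the library's API.
    def binarize(embs):
        # Keep only the sign of each float32 dimension and pack into uint8,
        # so each dimension costs 1 bit instead of 32.
        return np.packbits(embs > 0, axis=-1)

    def hamming_similarity(a_bits, b_bits):
        # Normalized Hamming similarity: 1 - (differing bits / total bits).
        diff = np.unpackbits(np.bitwise_xor(a_bits, b_bits), axis=-1).sum(axis=-1)
        return 1.0 - diff / (a_bits.shape[-1] * 8)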
Hope you enjoy it, and find it useful. PS I haven’t figured out Windows builds yet, but Linux and Mac are supported.
It seems quite dated technically - which I understand is a tradeoff for performance - but can you provide a way to toggle between different types of similarity (e.g. semantic, NLI, noun-abstract)?
E.g. I sometimes want "Freezing" and "Burning" to be very similar (1) when grouping/clustering newspaper articles into categories like "Extreme environmental events", as in MTEB/Sentence-Similarity and as classic Word2Vec/GloVe would do. But if this were a chemistry article, I'd want them to be opposites, as ChatGPT embeddings would score them. And sometimes I want to use NLI embeddings to work out the causal link between two things. Because the latter two embedding types are more recent (2019+), they are where the technical opportunity is, not the older MTEB/semantic-similarity ones, which have been performant enough for many use cases since 2014 and got a big boost in 2019 with mini-lm-v2 etc.
For the above 3 embedding types I can use SBERT, but the dimensions are large, the models are quite large, and having to load multiple models for different similarity types is straining on resources: it often takes about 6GB, because generative embedding models (or E5 etc.) are large, as are NLI models.