Show HN: Llama2 Embeddings FastAPI Service

I wanted a quick and easy way to submit strings to a REST API and get back the embedding vectors as JSON using Llama2 and similar LLMs, so I put this together over the past couple of days. It's simple to set up, totally self-contained, and self-hosted. You can add new models by adding the HuggingFace URL of the GGML-format model weights. Two models are included by default, and they are downloaded automatically the first time it's run.

It lets you not only submit text strings and get back the embeddings, but also compare two strings and get back their similarity score (i.e., the cosine similarity of their embedding vectors). You can also upload a plaintext file or PDF and get back the embeddings for every sentence in the file as a zipped JSON file (and you can specify the layout of this JSON file).
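
For anyone unfamiliar with the similarity score mentioned above, here is a minimal sketch of cosine similarity between two embedding vectors. It is plain Python for illustration only, not the code from the service; the function name `cosine_similarity` is just a name picked for this example.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # The angle is undefined for a zero vector.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 2-dimensional "embeddings"; real ones have thousands of dimensions.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Scores near 1 mean the two vectors point the same way, scores near 0 mean they are unrelated, and negative scores mean they point in opposite directions.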

Each time an embedding is computed for a given string with a given LLM, that vector is stored in the SQLite database, so later requests for the same string and model can be returned immediately. You can also search across all stored vectors using a query string; this uses FAISS, which is integrated.
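
The caching idea can be sketched roughly like this. This is a simplified illustration under my own assumptions, not the project's actual schema: the table layout, the names `open_cache` and `get_or_compute_embedding`, and the `compute_fn` callback (standing in for the real model call) are all hypothetical.

```python
import json
import sqlite3


def open_cache(path=":memory:"):
    """Open a SQLite database and create the (hypothetical) cache table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS embeddings ("
        " text TEXT NOT NULL,"
        " model_name TEXT NOT NULL,"
        " vector_json TEXT NOT NULL,"
        " PRIMARY KEY (text, model_name))"
    )
    return conn


def get_or_compute_embedding(conn, text, model_name, compute_fn):
    """Return the cached vector for (text, model_name), computing it once if needed."""
    row = conn.execute(
        "SELECT vector_json FROM embeddings WHERE text = ? AND model_name = ?",
        (text, model_name),
    ).fetchone()
    if row is not None:
        return json.loads(row[0])  # cache hit: no model call
    vector = compute_fn(text)  # cache miss: run the (slow) model
    conn.execute(
        "INSERT OR REPLACE INTO embeddings (text, model_name, vector_json)"
        " VALUES (?, ?, ?)",
        (text, model_name, json.dumps(vector)),
    )
    conn.commit()
    return vector
```

Keying on both the string and the model name matters because the same text produces a different vector under each model.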

There are lots of nice performance enhancements, including parallel inference, a db write queue, fully async everything, and even a RAM disk feature to speed up model loading.
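
To give a feel for what a db write queue looks like, here is a minimal sketch using asyncio from the standard library. It is an illustration of the general pattern, not the service's implementation: the names `db_writer` and `demo`, the table layout, and the use of `None` as a shutdown signal are assumptions made for this example.

```python
import asyncio
import sqlite3


async def db_writer(queue, conn):
    """Single consumer: all writes go through here, so they never collide."""
    while True:
        item = await queue.get()
        try:
            if item is None:  # assumed shutdown signal for this sketch
                return
            conn.execute(
                "INSERT INTO embeddings (text, vector_json) VALUES (?, ?)", item
            )
            # Simplification: a real service would keep blocking DB calls
            # off the event loop (e.g. with an async SQLite driver).
            conn.commit()
        finally:
            queue.task_done()


async def demo(n_requests=5):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE embeddings (text TEXT, vector_json TEXT)")
    queue = asyncio.Queue()
    writer = asyncio.create_task(db_writer(queue, conn))

    async def handle_request(i):
        # A request handler just enqueues its write and moves on.
        await queue.put((f"string {i}", "[0.0]"))

    await asyncio.gather(*(handle_request(i) for i in range(n_requests)))
    await queue.put(None)  # tell the writer to stop after draining the queue
    await writer
    count = conn.execute("SELECT COUNT(*) FROM embeddings").fetchone()[0]
    conn.close()
    return count


rows_written = asyncio.run(demo())
```

The point of the pattern is that many concurrent requests can hand off their writes without waiting on each other, while only one task ever touches the database for writing.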

I personally find it annoying when a simple program is split into dozens of tiny files, so everything is in one ~1,000 line .py file for your reading convenience!

I'd love to get some feedback and maybe a few PRs. There is obviously a lot that could be added on top of this foundation, like exposing more functionality from LangChain, but I think it's probably better for it to stay focused on embeddings, since there are already a lot of more general LLM APIs out there.