I would love to read more about your experience. We need more content with feature, performance, and architecture comparisons. Currently, there's a lot of developer evangelism hype in the space.
Yep, we're (https://www.definite.app/) using pgvector and I was initially concerned about scaling, but it doesn't seem it will be a problem for our use case. I definitely wouldn't use it if I was building a feature for Slack, but works for us!
Yes, working on that landing page right now (currently it's pretty weak)!
We're building an AI data analyst. You can ask questions of your database and get answers immediately. We also auto generate entire dashboards based on common patterns (e.g. a "Sales Dashboard", "Marketing Dashboard", "Finance / Burn" etc.).
If you want to give it a try (there's a demo database embedded in the app), you can use it here: https://ui.definite.app/
Can you search both by an equality comparison and a vector search in weaviate? I'd like to do something along the lines of `SELECT * FROM table t WHERE cosine_dist(:my_embedding, t.doc_embedding) < :x AND some_column = 'XYZ'`
I have a ChatGPT session where I asked it to do a hybrid search using filtering, pg fts, and vector search. It looks reasonable; I just need to test it and write it up somewhere.
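Roughly the shape of what I'm planning to test (untested; table and column names are placeholders, and the `<=>` cosine-distance operator comes from pgvector):

```python
# Hypothetical sketch: metadata filter + Postgres full-text search + pgvector
# similarity in a single statement. Table/column names are made up.
import psycopg

query_embedding = [0.1] * 1536  # stand-in for a real embedding from your model

HYBRID_SQL = """
SELECT id, title
FROM docs
WHERE category = %(category)s                                             -- plain equality filter
  AND to_tsvector('english', body) @@ plainto_tsquery('english', %(q)s)   -- pg full-text search
ORDER BY embedding <=> %(vec)s::vector                                    -- pgvector cosine distance
LIMIT 10;
"""

with psycopg.connect("dbname=mydb") as conn:
    rows = conn.execute(HYBRID_SQL, {
        "category": "XYZ",
        "q": "401k match eligibility",
        "vec": str(query_embedding),  # pgvector accepts the '[x, y, ...]' text form
    }).fetchall()
```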
Amen. After suffering through many years of people telling me to use document databases when I was much better served with—at most—Postgres with a jsonb field, I feel vindicated enough to feel justified in doing my due diligence before going off the beaten track.
Not that document databases don’t have their place, but…MongoDB is webscale and all that.
Yup. pgvector will do it for a lot of projects, especially if you're just trying things out. I think of it as using PostgreSQL full text search before you need to deploy a dedicated solution.
Also plugging my crappy vector database, which you probably shouldn't use for anything but a fun project, however it can be set up and used in seconds. https://github.com/corlinp/Victor
I'm bullish on pgvector as well. Now that RDS supports it, along with plenty of other cloud providers, it seems like a no-brainer to be able to stick with your existing stack (assuming it's Postgres). Andrew Kane is such a prolific open-source maintainer, too.
It seems like if the goal is to "play around with vector databases", why not just install it on your local machine? Part of using these tools is learning how they work and configuring them yourself.
If the goal is "start developing products using vector data bases" then it seems like you would surely want something a bit more under your control than using replit.
I would say my use case is the same as many other people who use it. Repl.it is fantastic for getting started, for sharing your code, and for creating small applications.
Curious why you went for an Apache license. Aren't you worried about copy-cat services? Or does the OSS version lack the scaling/distributed features that would be more difficult to replicate? I think that was ES's fatal mistake, and their licensing games are unlikely to pan out.
The Coral Project [0] (commenting platform used on Washington Post, New York Times, The Verge) uses an Apache 2.0 license [1]. Which doesn't seem to have prevented it from raking in big SaaS customers.
A lot of people worry about copy-cat services, but it's kind of rare that someone will be able to compete with you as the original in hosting your own service as well as you can. Especially when you consider support and maintenance requirements of a new product you aren't personally developing.
I could see copy-cat services being more of an issue in the late stage of a product though? When everyone knows lots about how to stand it up and use it?
> I could see copy-cat services being more of an issue in the late stage of a product though? When everyone knows lots about how to stand it up and use it?
The concern isn't random small companies. The concern is the big cloud providers like AWS, Azure, and Google. And you are right, they aren't going to build out a hosted version of your product until there is enough traction. But at that point, customers might indeed trust them more than you to run your own software! Redis and Elastic ran into this problem, for example.
The most likely scenario, though, is never getting traction, so anything that improves traction, such as permissive licensing, is probably a better tradeoff.
hey ffback, contributors get 100% of the bounty award :) the organization pays the fee on top of the bounty. will update the docs to make this more clear, thank you!
A +1 for qdrant from a happy user. We use qdrant in production at a 50-100MM row scale. We haven't experienced many bottlenecks thus far, and it has performed quite well.
@qdrant_team: perhaps you should look into offering it as a service, a la pinecone.
edit: oops just checked your (updated) website and notice you have an offering already. Congrats! will check it out. ty =)
Well, to work on the core of the Qdrant engine https://github.com/qdrant/qdrant you should have some db knowledge, but even more important are Rust skills. However, we also have other products, like the cloud platform https://cloud.qdrant.io, where we are looking for different skills.
If anyone wants to try a FOSS vector-relational-graph hybrid database for more complicated workloads than simple vector search, here it is: https://github.com/cozodb/cozo/
Glad I hopped into this thread while your comment was recent enough to be at the top. This is super interesting! Apologies if you went over this in your other post (or the docs, I'll be digging into this over the weekend) but could you share a bit about why you went this route? What you tried, what the hangups were/are with other approaches, and if there are any interesting possibilities with your approach that other vector databases just wouldn't be able to do?
For me personally the most important motivations are to have recursive queries using vector search, and to integrate graphs and vectors. Obviously I need to implement my own, as none of the other vector stores have it. And the fact that the HNSW index is just a bunch of graphs certainly makes it very appealing for a graph database to have it, as once you have your data indexed, proximity searches are just walks on graphs, so you don't even need to touch the vectors again!
Thanks for the links and discussions. I'm keeping an eye on this one; it looks really promising, at least in the hybrid area compared to the much-hyped SurrealDB, whose graph implementation looks more like an afterthought when you get down to the technical details, functionality, and performance.
Unfortunately this piece is nebulous on what an embedding is. Apparently it is saved as an array of floats, and it has some string of text it is associated with, and the float arrays are compared by "similarity".
None of these explains what an embedding really is. My best guess is that the embedding represents the meaning of the natural language string it was generated from, such that strings with "similar" embeddings have similar meaning. But that's just speculation.
> My best guess is that the embedding represents the meaning of the natural language string it was generated from, such that strings with "similar" embeddings have similar meaning. But that's just speculation.
Yeah, you've got it. A mapping from words to vectors such that semantic similarity between words is reflected in mathematical similarity between vectors.
An idea of how you might train this thing: let's say the words "king" and "queen" are being embedded. In your training data there are lots of examples where "king" and "queen" are interchangeable; for example, in the sentence "The ___ is dead, long live the ____", either word is appropriate in either slot, so each time we see an example like this we nudge "king" and "queen" a little closer together in some sense. However, you also find phrases where they are not interchangeable, such as "The first born male will one day be ____". So when you see those examples you nudge "king" a little closer in some sense to other words which appropriately complete the sentence (which does not include "queen" in this case).
In this way, repeated over a giant training set with thousands of words, concepts like "male/female" and "royalty", "person/object" and tons of others end up getting reflected in the relationships between the vectors.
These vectors are then useful representations of words to ML models.
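If you want to poke at this yourself, here's a quick check with pre-trained vectors (gensim's downloader API; the model name is just one of the published GloVe conversions):

```python
# Toy check of the intuition above using pre-trained word vectors.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")   # 100-dimensional GloVe vectors

print(wv.similarity("king", "queen"))      # high: the words share many contexts
print(wv.similarity("king", "sweater"))    # low: they rarely share contexts
# The classic analogy: king - man + woman lands near queen.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```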
Right, makes sense. But then what do you actually do with a database?
Starting with: what do you store in it?
Maybe sentence/vector pairs. But what does that give you? What do you do with that data algorithmically? What's the equivalent of a SELECT statement? What's the application that benefits an end user? That part still seems rather hazy.
I haven't worked in this space, but from what I gather, the idea would be something along the lines of the following:
An autoencoder is a model that takes a high dimensional input, distills it down to a low dimensional middle layer, and then tries to rebuild the high dimensional input again. You train the model to minimize reconstruction error, and the point is then that you can run an input on just the first half to get a low-dimensional representation that captures the "essence" of the thing (in the "latent space"). In this representation, images that are similar should have similar "essences", so their latent vectors should be near to each other.
The low dimensional representation must do a good job capturing the "essence" of your things, otherwise your reconstruction error would be large. The lower the dimension you manage to use while still managing to reconstruct your things, the better of a job it must do at making those parameters really encode the salient features of your thing without wasting any information. So similar things should be encoded similarly.
So imagine you've got a database of images, and you have a table of all of the low dimensional encoded vectors. You want to do a reverse image search. The user sends you an image, you run the encoder on it to get the latent representation, and then you want to essentially run "SELECT ei.image_id FROM encoded_images ei ORDER BY distance(encode(input_image), ei.encoding) LIMIT 10".
So you want a database that supports indexes that let you efficiently run vector similarity queries/nearest neighbor search, i.e. that support an efficient "ORDER BY distance(_, indexed_column)". Since the whole process was fuzzy anyway, you may actually want to support an approximate "ORDER BY distance" for speed.
In practice apparently the encoding might be taking the output of the first or nth layer in a deep network or something rather than specifically using an autoencoder. Or you may have some other way to hash/encode things to produce a latent representation that you want to do distance searches on. And of course images could instead be documents or whatever you want to run similarity searches on.
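Again with the caveat that I haven't worked in this space: a brute-force version of that query might look something like the sketch below (everything here is made up for illustration; a vector database essentially replaces the linear scan with an approximate index):

```python
# Brute-force "encode then nearest neighbour" flow. encode() would be the
# first half of an autoencoder or an intermediate layer of some network.
import numpy as np

encoded_images = np.random.rand(100_000, 128).astype("float32")  # stand-in latents
image_ids = np.arange(100_000)

def search(query_latent, k=10):
    # Distance from the query latent to every stored latent, then take the k closest.
    dists = np.linalg.norm(encoded_images - query_latent, axis=1)
    nearest = np.argsort(dists)[:k]
    return image_ids[nearest], dists[nearest]

ids, dists = search(np.random.rand(128).astype("float32"))
```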
Often the use case is search. Ex. You have a basic text search engine to find musicians on your site which does some string matching and basic tokenization and so on. But you want to be able to surface similar types of musicians on search too.
In that case you might store vectors representing a user based on some features you've selected, or a word embedding of their common genres/tags.
To actually search this thing, you need something to compare against. You could directly use the word embeddings of the search query. You could also do a search against your existing method, and then use the top results from that as a seed to search your vectors.
Since everything's a vector, you can also ask questions like "what musician is similar to Tom AND Sally" by looking for vectors near T+S. T-S could represent like Tom but not like Sally, etc.
So the answer to what do you store is, what will be your seed to search against?
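A toy illustration of the "Tom AND Sally" / "Tom but not Sally" idea (vectors here are random stand-ins; real ones would come from whatever embedding you chose):

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

# Made-up musician embeddings, normalized so a dot product is cosine similarity.
musicians = {name: unit(np.random.rand(64)) for name in ["Tom", "Sally", "Ringo", "Joan"]}

tom, sally = musicians["Tom"], musicians["Sally"]
like_both = unit(tom + sally)        # "like Tom AND Sally"
tom_not_sally = unit(tom - sally)    # "like Tom but not like Sally"

# Rank everyone by similarity to the combined query vector.
ranked = sorted(musicians, key=lambda name: -float(musicians[name] @ like_both))
print(ranked)
```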
Wow, that's interesting. So we can do vector arithmetic and the results make sense as a form of embedding/concept logic. Addition seems to work like "and" / set intersection. Subtraction works like set difference ("T\S", also literally written "T-S", i.e. "without"), which logically says "T, but not S", or in terms of predicate calculus, "T(x) & not S(x)".
Perhaps there is also some unary vector operation which directly corresponds to negation (not Sally)? Perhaps multiplying the vector by -1? Or would (-1)S rather pick out "the opposite of Sally in conceptual space" instead of "not / anyone but, Sally"? And what about logical disjunction (union)? One could go further here, and ask whether there is an analog to logical quantifiers. Then there is of course the question whether there is anything in vector logic which would correspond to relations, binary predicates, like R(x, y), not just unary ones, etc.
The vectors are usually (if you use OpenAI API anyway) unit in length, and so you can imagine them on the surface of a hypersphere.
You measure the cosine distance between documents, or between search queries and documents. (Cosine is fast, there are other distance metrics).
The vector database queries will do things like given one embedding (document or query) find the nearest embeddings (documents). Or given two embeddings (e.g. a query and a context) with a weight for each one, find the ones that triangulate to being near both.
Simple answer - you normally store text in it, but with the state of neural networks these days most things can be vectorized and searched.
So, coming myself from a database background but working in search, the SELECT statement (and joins) probably aren't the best way to get your head wrapped around things. I would think of the vector as a unique key for a record, and only use a LIKE statement for all my queries, but one that will return a probability of a match instead of an actual match.
A great use case is to think about similarity, where we want the things that are closest to what we want to see, but there isn't an exact match.
For example: a user gives me a sentence that says, "How long do I have to be with the company before I get a 401K match?". My vector store has a bunch of vectors, including "A new employee will be eligible for 401K after 6 months." and "The 401K program is run by <MEGACORP X>."
I would like to be able to see that the first vector is a closer match to the user sentence than the second, and by how much. I would also like to do this without having to change my code much based on the structure of the text. Luckily, there is a very simple algorithm for doing this (cosine similarity) that doesn't change regardless of the sentence structure or the question answered. Also, it doesn't matter what kind of question/answer you do as long as it can be vectorized, so you could even give me a vector representing an image and I can give you an image that is most similar.
Here is the most interesting thing about vectors -- with very little effort they turn the english language into a programming language.
Instead of typing "SELECT document_id, document_name, document_body FROM documents WHERE (document_body LIKE '%401K%' AND document_body LIKE '%match%' AND document_body LIKE '%existing employee%')" I can just ask, "How long do I have to be with the company before I get a 401K match?" and I will get back a result and a match probability. How I change my text will change the matches, and can do so in ways that are profound and unexpected. Note that the SQL query I gave would not return any values because I didn't have any documents that had the term "existing" in them. Building the correct SQL query could be quite complex in comparison to just using the text.
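To make that concrete, here's a rough sketch of the 401K example with a sentence-embedding model (the model name here is just a commonly used example, not a recommendation):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "How long do I have to be with the company before I get a 401K match?"
docs = [
    "A new employee will be eligible for 401K after 6 months.",
    "The 401K program is run by <MEGACORP X>.",
]

# Normalized embeddings, so cosine similarity is a simple dot product.
q_vec = model.encode(query, normalize_embeddings=True)
d_vecs = model.encode(docs, normalize_embeddings=True)

scores = util.cos_sim(q_vec, d_vecs)[0]   # one similarity score per document
for doc, score in zip(docs, scores):
    print(f"{float(score):.3f}  {doc}")   # the first doc should score noticeably higher
```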
This is pretty great for long-tailed search, q&a, image search, recommendations, classification, etc.
BTW, I am biased, I work for Elastic (makers of Elasticsearch) and we have been doing traditional search forever, and vector/hybrid search for the last few years.
Because the model used to compute the embeddings is the same across scenarios. You can infer meaning for each dimension by checking which inputs get embeddings that have large values for the dimension.
If the inputs are images, you may find that some dimension scores e.g. how much blue there is in the image. Though often it's not that simple (there could be multiple dimensions that relate to how blue the image is, especially if the embedding dimensionality is large, which it does tend to be these days. Though you could reduce the embedding dimensionality first using PCA, and see what input images correspond to high/low values of the first principal component, etc.).
Dimensions themselves do not carry any meaning; what matters are the neighbors, to maintain a sense of similarity. Think of it like a very complex point cloud. Applying an n-dimensional rotation leads to the same point cloud, content-wise.
As for the number of dimensions, in a sense they are a training variable just like the content itself. The more dimensions you utilize for your embeddings, the more complex your relations can be during clustering. Too many dimensions can easily lead to overfitting, however, and too few dimensions usually cannot accurately represent the training corpus.
There are good sibling explanations by @ta20211004_1 and @HarHarVeryFunny, but if I can try in an additional way:
Imagine you wanted to go from words to numbers (which are easier to work with mathematically), like you wanted to assign a number to some words.
How could you do it? Well you could do it randomly: cat could be 2, dog could be 10, sweater could be 4.534 and frog could be 8.
Not super useful, but hey - words are now numbers! How can we make this "better"?
What if we decided on a way to put words on a line - let's say we ordered words by how much they had to do with animals. Let's say 10 meant it's a very animal-related word, and 0 is very not-animal related. So cat and dog would be 10, and maybe zoo would be 9, and fur could be 8. But something like sweater would be 1 (depending if the sweater was made from animal wool...?)
What now? Well what's cool is that if you assign words on that "animal-ness" line, you can find the words that are "similar" by looking at the numbers that are close. So, words whose value is around 6 are probably similar in meaning. At least, in terms of how much they relate to animals.
That's the core idea. Ordering words by animal-ness is not that useful in the real world, so maybe we can place words on a 2d grid instead of a line. Horizontally, it would go from 0 to 10 (not animal at all - very animal) and vertically, it could be ordered by brightness - 0 for dark, and 10 for bright.
So now, bright animals will congregate together in one part of the grid, and dark non animals will also live close together. For example, a dark frog might be in the bottom right at position (10, 0) - very animal (right end of the x axis) but not bright (bottom of the y axis). Any other word whose position is close to (10, 0) would presumably also be animal-y and dark.
That's really it. The magic is that... this works in thousands of dimensions. Each dimension being some way that "AIs" see words / our world. It's harder to think about what each dimension "is" or represents. But embeddings are really just that - the position in a space with a huge number of dimensions. Just like dark frogs were (10, 0) in our simple example, the word "frog" might be (0.124, 0.51251, 0.61, 0.2362, 0.236236, ..............) as an embedding.
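If it helps, here is the toy 2-D version in code (same made-up grid; "nearest embedding" is just "nearest point"):

```python
import numpy as np

# (animal-ness, brightness) on a 0-10 scale; all values invented.
words = {
    "frog":    np.array([10.0, 0.0]),   # very animal, dark
    "cat":     np.array([10.0, 6.0]),
    "sweater": np.array([1.0, 5.0]),
    "lamp":    np.array([0.0, 10.0]),   # not animal, very bright
}

query = np.array([9.0, 1.0])            # "some dark, animal-ish thing"
nearest = min(words, key=lambda w: np.linalg.norm(words[w] - query))
print(nearest)                          # -> frog
```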
Each vector is an array of n floats that represent a location of a thing in an n-dimensional space. The idea of learning an embedding is that you have some learning process that will put items that are similar into similar parts of that vector space.
The vectors don't necessarily need to represent words, and the model that produces them doesn't necessarily need to be a language model.
For example, embeddings are widely used to generate recommendations. Say you have a dataset of users clicking on products on a website. You could assume that products that get clicked in the same session are probably similar and use that dataset to learn an embedding for products. This would give you vector representing each product. When you want to generate recommendations for a product, you take the vector for that product and then search through the set of all product vectors to find those that are closest to it in the vector space.
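A hedged sketch of that co-click idea (often called "item2vec"): treat each session's clicked product IDs as a sentence and train a word2vec-style model on them, so products clicked together end up with nearby vectors. Product IDs and parameters below are invented.

```python
from gensim.models import Word2Vec

sessions = [
    ["prod_1", "prod_7", "prod_3"],
    ["prod_7", "prod_3", "prod_9"],
    ["prod_2", "prod_5"],
    # ... one list of clicked product IDs per user session
]

model = Word2Vec(sentences=sessions, vector_size=64, window=5, min_count=1, sg=1)

# Recommend products whose vectors are closest to a given product's vector.
print(model.wv.most_similar("prod_3", topn=3))
```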
An embedding is a way to map words into a high-dimensional "concept space", so they can be processed by ML algorithms. The most popular one is word2vec.
A vector database is used for things where you're trying query "I have this image, give me a list of the 10 closest images and metrics of how similar they are."
You use a machine learning model (like word2vec, OpenAI, etc.) to produce an "embedding" that describes the image, text, video, etc., which is your "vector".
For all of the other images in your database, you also run them through the same model, and store their embedding vectors in the vector database.
Then, you ask the database "I have this vector, what are the most similar vectors, and what are their primary keys, so I can see what content they refer to".
Think: you want to implement google "search by image". This is the basics of how you'd do that.
Why use the word "embedding" if there are already much more familiar words for it (isn't this the same as feature vector)?
I want to convince myself that this isn't similar to blockchain. In the sense that blockchain renamed an old and simple idea and advertised it as something complex and groundbreaking...
Also, relational databases and graph databases have a rich theory that yields many interesting sub-problems, each interesting in its own right. Contrast this with "document databases", which have no theory and nothing interesting behind them. So, if I were to invest my time learning about one w/o a financial incentive to do so, I'd not want to concentrate on some accidental concept that just happened to solve an immediate problem but isn't applicable / transferable to other problems.
For example, graph databases and relational databases create interesting storage problems wrt' optimal layout for various database components. If hash-table is all there is to the vector database, then it's not an interesting storage problem.
Similarly, with querying the database: if key lookup from a hash-table is all there is, then it's not an interesting problem.
Okay, "mapping into concept space" is at least compatible with my meaning theory, but by itself it doesn't say much, since in principle anything can be mapped to anything.
Embeddings are a mapping of some type of thing (pictures, words, sentences, etc) to points in a high-dimensional space (e.g. few hundred dimensions) such that items that are close together in this space have some similarity.
The general idea is that the items you are embedding may vary in very many different ways, so trying to map them into a low dimensional space based on similarity isn't going to be able to capture all of that (e.g. if you wanted to represent faces in a 2-D space, you could only use 2 similarity measures such as eye and skin color). However a high enough dimensional space is able to represent many more axes of similarity.
Embeddings are learnt from examples, with the learning algorithm trying to map items that are similar to be close together in the embedding space, and items that are dissimilar to be distant from each other. For example, one could generate an embedding of face photos based on visual similarity by training it with many photos of each of a large number of people, and have the embedding learn to group all photos of the same person to be close together, and further away from those of other individuals. If you now had a new photo and wanted to know who it is (or who it most looks like), you'd generate the embedding for the new photo and determine what other photos it is close to in the embedding space.
Another example would be to create an embedding of words, trying to capture the meanings of words. The common way to do this is to take advantage of the fact that words are largely defined by use/context, so you can take a lot of texts and embed the constituent words such that words that are physically close together in the text are close together in the embedding space. This works surprisingly well, and words that end up close together in the embedding space can be seen to be related in terms of meaning.
Word embeddings are useful as an input to machine learning models/algorithms where you want the model to "understand" the words, and so it is useful if words with similar meaning have similar representations (i.e. their embeddings are close together), and vice versa.
As opwieurposiu said, embeddings are high-dimensional vectors. Often, they're created by classic math techniques (e.g. principal component analysis), or they are extracted from a model that proved useful for something else.
For example, a neural net model accepts a massive number of input values that directly map to the input. So those initial values don't add any info. But a layer further inside the model, with fewer values and probably close to the end, is smaller and should reflect what the model's learned. Like a lot of deep learning, these values work but don't give much insight.
A word or sentence embedding is a long array of numbers that represents the semantic "position" in a high dimensional space, which allows you to find the distance between any two sentences in this semantic space. My understanding of paragraph and document embeddings is that they are an average of all the sentence vectors combined as one point, which lets you find the distance between any two documents in this semantic space.
Yeah... for a while I wanted to understand what a vector database is, but this article reads like a thinly-veiled advertorial: too many buzzwords, and the content feels like the author doesn't really have a good knowledge of the subject and is just trying to advertise the tech their company is selling.
An embedding is a series of numbers that have been gradually shifted to better fit some purpose. The gradients tell me that if I increase the first number of embedding X a little, the model will perform better, so I do.
I don't understand how so much money has been poured in to these companies?
I get why the techniques are suitable, but I just assumed whoever wants to do this kind of retrieval can probably implement a suitable approximate NN library themselves?
Especially so, because getting good embeddings is the hard part, not the search?
>I don't understand how so much money has been poured in to these companies?
First time here? Just kidding. But not.
You have to separate the VC hype from the product, because the VCs always need something to overhype. Half these people were pumping money into crypto and whatever-the-hell-web3-is/was just a couple months ago; this is just the next thing they like. Half these companies probably aren't remotely good companies.
I am on a small team that initially rolled our own semantic search system. We quickly ran into issues around scaling, maintenance, and performance. Since we want to focus on delivering features and not turning into a DevOps team, we switched to Pinecone and it has met our needs pretty well. We would like to see auto-scaling and I believe that this feature is in the works. Support has been very responsive and helpful when we do have questions and issues.
There are plenty of LLMs to choose from with regard to finding sources of embeddings. Some free, some for money.
For the same reason you have money going to various SQL-as-a-service companies that run Postgres / MySQL for you as a service: Some folks would rather eat the network latency, give up control of their data, and complicate their compliance process than operate a database themselves.
The difference with SQL is that it's not like storing vector embeddings outside your perimeter poses a big compliance issue --at least in security or legal terms. Giving up control of their data and network latency are legit concerns, that's for sure.
While Pinecone isn't available as a self-hosted option (see many comments with alternatives), we do offer the option of running Pinecone for you on a managed VPC, and we do have SOC2 compliance, and we do pass enterprise-level security reviews regularly. Whether that's sufficient is up to you of course.
A lot of people use some form of managed services if they are in the cloud. Be it S3 or Dynamo DB. Generally cheaper than running things yourself and operationally much easier too.
Anyone who wants this can implement their own library themselves? When has this worked for any problem ever?
Searching efficiently is a problem, and there's several open source and proprietary solutions but I don't get how you can put it in the "everyone should roll their own" category.
There are plenty of options available to run your own local vector database, txtai is one of them. Ultimately depends if you have a sizable development team or not. But saying it is impossible is a step too far.
Even in that article, with much smaller vectors than what GPT puts out (1536 dimensions), QPS drops below 100 if recall@1 is more than 0.4. That's to say nothing of the cost of regenerating this index using incremental updates. I don't get why people on HN are so adamant on the idea that no one needs scale beyond 1 machine ever.
If you have a billion vectors, is "yourself" a large tech company who does stuff like roll their own browsers and programming languages, invents kubernetes, etc.? Probably could roll this! And indeed sell this.
Last time I had to deal with vector representation of documents was more than 10 years ago, so I'm a bit rusty, but billion vector scale sound relatively trivial.
With retrieval time in the milliseconds? The entries may be ads, or something else user facing. Your users are not going to sit around while you leisurely retrieve them.
You do realize you have to query an index of all of that data for every single query your user makes, right? Computing that index is not entirely trivial, nor is the operation of partitioning the data so it fits in RAM across a pool of nodes.
Sure, roll your own, but don't act like making a highly scalable database is a weekend project.
Consumer hardware can still handle that with 1TB RAM + ThreadRipper Pro.
> You do realize you have to query an index of all of that data for every single query your user makes, right? Computing that index is not entirely trivial, nor is the operation of partitioning the data so it fits in RAM across a pool of nodes.
I don't know what any of this means -- and it sounds like you're slapping a bunch of terminology together, rather than communicating a well-thought-out idea.
Yes, in the general case you're going to have to use an index. Computing an index or a key to that index? Computing the index is a solved problem, that does not have a hard real-time component -- you can do it outside of normal query executions. Computing the key to the index on each query is also a solved problem.
Have dimensions stored in columnar format, generate a sparse primary index on said columns, and then use binary search to quickly find the blocks of interest to do a sequential search on viz. distance function. Or you could even just use regular old SS trees, SR trees, or M Trees for high-dimensional indexing -- they're not expensive to use at all.
There, you can easily run a query on a single dimension (1 billion entries) under a second. You want 300 dimensions? Ok, parallelize it. 128 threads, easy. At most this will take 3 seconds if everything is configured properly (big IF, that seems like few can get right).
This is literally a weekend project. Anyone can build something like this, but not everyone has the integrity to be upfront about how they're reinventing the wheel, and spinning it like they've just broken ground in database R&D.
What kind of QPS are you looking at? How are you handling 1536 dimensions? How long does an incremental index update take? These are the problems you run into in building such a system.
I'm not familiar with the index part, but you can get at least 2TB on a single CPU socket these days. You shouldn't need multiple machines to fit in RAM. Depending on what QPS you need to handle, you might also be fine to not have the whole thing fit in RAM.
lol, not true. Even for huge vectors (1000 page docs), today you can do this with enough disk storage with something like leveldb on a single node, and in memory with something like ScaNN for nearest neighbor.
It is at the intersection of technology investors "know" (databases) and technology investors don't know, but have been told is about to blow up (ML).
It is also effectively "roll your own Google/Shazam/whatever", which probably makes for a fancy demo to those who don't know how trivial it is to implement.
I don't see it addressed in the article, but Elastic 8 has ANN support, and every other feature you'd expect out of a ranking system. Vectors are only one piece of the puzzle for building such a system. (honest question, not trying to troll, as I truly do <3 these pinecone articles)
There currently isn't a way to filter docs alongside a KNN query, and the dimension support is limited to 1024 (a Lucene limitation) and OpenAI embeddings are 1536 dimensions - also indexing performance is not comparable. Wishing this changes, as they're a good stack for the reasons you state
Not really, I've been operating 10+ node Elasticsearch cluster for years, running on a workload scheduler (Nomad). I never have to perform any maintenance or housekeeping except deleting old indices, and updates are performed by bumping a container version number and then restarting nodes one-by-one with a delay in between.
Vectors will eventually just be another data type in all db systems. Already so many production systems have their data replicated across multiple dbs just to accommodate different use cases. I'm not keen on adding yet another one.
We always encourage folks to do their own testing. Everyone has different performance requirements, data shapes/sizes, budgets, and expectations of the user experience.
Elasticsearch is a great option. But clearly there's a large cohort of smart teams that decided the combination of performance + cost + scale + [etc] on Pinecone makes more sense for them.
Hey Greg! Yes I am trolling a bit, to see what the answers might be.
IMO - the real reason "Y Not Elasticsearch" is not because they're dumb or its bad. It's actually because they're not building for the search / AI market like you all are :)
When someone runs out of RAM with their NumPy array, they google, and you guys come up, really speaking to that audience, building features, showing people how to build specific solutions, etc.
I looked at the concepts in FAISS and it seems fairly straightforward. In non-jargon you have dimensionality reduction and neighborhoods.
DR is taking a long embedding and doing something to make it shorter. An easy to follow method for this is minhash.
Neighborhoods means representing a cluster of embeddings with a single representative to speed up comparisons. For example: find me the two closest representatives, then do a deeper comparison on all the residents.
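Roughly what that looks like in faiss (an IVF index; parameters here are illustrative, not tuned):

```python
import numpy as np
import faiss

d, n = 128, 100_000
xb = np.random.rand(n, d).astype("float32")    # stand-in embeddings

nlist = 256                                     # number of "neighborhoods" (clusters)
quantizer = faiss.IndexFlatL2(d)                # used to assign vectors to centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                                 # learn the cluster representatives
index.add(xb)

index.nprobe = 8                                # how many nearby clusters to actually search
D, I = index.search(np.random.rand(1, d).astype("float32"), 10)
```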
Now, the feature I haven't seen, and that will probably cause me to build instead of buy: most seem designed for a single organization and a single use. For example, a Spotify song recommender.
I would like to store embeddings from multiple models and be able to search per model. I would also like fine-grained user access control, so users could search their embeddings and grant access to others.
If the different models use the same dimensionality, you can keep their embeddings within different namespaces inside the same index. See: https://docs.pinecone.io/docs/namespaces
If you mean for your end-users, you can use namespaces again to separate embeddings for different users inside one index. See: https://docs.pinecone.io/docs/multitenancy
There isn't yet a combination of the two, where you provide Pinecone API access to end-users.
Thank you, I'll definitely play with pinecone before I build. The dimensionality might vary between models or versions of models. Additionally, the end goal would be to expose it to users and not have to post filter. So probably an index per user. Not sure how expensive that is to recalculate regularly.
Do any of the vector databases have support for bit embeddings? We have created bit embeddings[1] for sentences and they save a lot of space. Currently we are just using numpy and sometimes faiss to search through these bit embeddings. Would love for one of the vector dbs to support bit embeddings natively. Then we don't have to engineer that piece :)
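For context, roughly the kind of thing we do with plain numpy today: pack the bits and rank by Hamming distance (sizes and details here are just illustrative):

```python
import numpy as np

n, dim = 100_000, 256                           # 256-bit sentence embeddings
db_bits = np.random.randint(0, 2, size=(n, dim), dtype=np.uint8)
db = np.packbits(db_bits, axis=1)               # n x 32 bytes

query = np.packbits(np.random.randint(0, 2, size=dim, dtype=np.uint8))

# Hamming distance = popcount of the XOR of the packed bytes.
xor = np.bitwise_xor(db, query)
dists = np.unpackbits(xor, axis=1).sum(axis=1)
top = np.argsort(dists)[:10]                    # indices of the 10 nearest bit embeddings
```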
I imagine it's because there is a *massive* scramble to incorporate language models into existing applications and most of the time you need a vector database to do that. As a proxy for understanding the scale of this scramble, ChatGPT is the fastest growing software product ever by a pretty substantial margin.
Without a doubt LLMs are driving the interest in vector Dbs.
More so, I am wondering how it is that Pinecone manages to land on the HN front page so frequently given the large number of alternatives (see all the other comments in this thread). It suggests to me a coordinated marketing effort (brigading, cronyism, etc.)
Having myself recently bootstrapped an understanding of language models et al, I would not be surprised if the pinecone learning center gets a lot of traffic.
Does anyone have some real, production experience with Milvus? I'm interested in how this database performs at larger scale. Let's say you have millions of vectors and traffic reaching thousands of requests/s.
Not production, but yes to scale: I pushed Milvus to ~140 million vectors (768 dimensions), though only a handful of requests per second (~10), and it fared alright once everything was up and running and relatively static on the document side. Rebuilding indexes and stability were a bit of a hassle at times (I was live-adding more documents to it, ~1 million per 30 minutes, and it would occasionally fall over and need to rebuild, subsequently causing a lot more load, rejecting new documents, etc.). There's probably lots of tuning I could have done to eke out more performance and stability, though. It ended up being hours of effort on the rebuilds and lots of careful management of RAM (on a 300 GB RAM machine).
For the scale you are calling "larger scale": at the few-million-documents scale I would suggest just using any library, e.g. `hnsw` in `nmslib` or `faiss`.
I just did some benchmarks with 1M docs, `cosinesimil_sparse` on `78628`-dimensional binary vectors (nmslib `hnsw`) -> 30 seconds to build the index, and it can process a batch of 100 document queries in 3ms (each with 100 KNN). Based on this question, I just put a loop over it and it handled 1000 random queries (non-batched) in 1.11 seconds. (~1 GB peak RAM usage, and using 24 threads)
All in all, my personal opinion is: even up to few "millions" scale, i'm finding using the underlying libraries (`faiss` and `nmslib`) significantly easier than using the wrapper tools / databases (milvus and pinecone). I don't really get the point of a separate piece of infra for something that is essentially ~15 lines of python at most scales that matter (~few millions). (Note, in the ~10k-100k scale or less, simple numpy and sort seems to be fast enough (and exact) or just exact NN w/ sklearn.neighbors)... And when you push to scales that it does start breaking (100 million+), then the database versions seem to break as well (and require fiddling with lots of bespoke config)
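For reference, roughly the "~15 lines of python" I have in mind, using nmslib's `hnsw` on dense vectors (parameters are illustrative, not my exact benchmark setup):

```python
import numpy as np
import nmslib

data = np.random.rand(1_000_000, 384).astype("float32")   # stand-in embeddings

index = nmslib.init(method="hnsw", space="cosinesimil")
index.addDataPointBatch(data)
index.createIndex({"M": 16, "efConstruction": 200}, print_progress=True)

queries = np.random.rand(100, 384).astype("float32")
# 100 nearest neighbours for each query, searched across threads.
results = index.knnQueryBatch(queries, k=100, num_threads=24)
```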
Thanks for the input! I asked about the scale of items and traffic because my use case actually requires a separate piece of infrastructure. It's around 100 million items and live production traffic from millions of users with strict latency requirements. So it's not a batch job that can be performed in memory, as I understand your case to be.
Currently I use Elasticsearch with the Open Distro approximate kNN plugin by the way.
Yes! We've been running Milvus in production for about three years now, powering some customers that do have queries at that scale. It has its foibles like all of these systems (the lack of non-int id fields in the 1.x line is maddening and has required a bunch of additional engineering by us to work with our other systems), but it has held up pretty well in our experience.
(I can't speak to Milvus 2.x as we are probably not going to upgrade to that for a number of non-performance reasons)
I have a dumb question: isn't this the same set of algorithms and technology we've needed to develop for geospatial search (given coordinates, find the nearest coordinates)?
I remember reading about how Google maps did a very similar thing to figure out which points of interest to load based on your coordinates and zoom.
Can't we repurpose that technology? Or did those bake in assumptions around being 2 dimensional (while this is highly dimensional?)
Apart from the comments you've already gotten, another goal of geospatial systems is to support range queries (e.g. for the bounding box of the user's screen, what are all the businesses in that box). In higher dimensions range queries are mostly useless and the focus is on NN queries.
But as the other comments have mostly said, it's mainly dimensionality and scale differences that drive the design differences (e.g. graphs end up working better than trees in high dimensions)
The main problem is that the embeddings are getting larger and larger (1,000+ dimensions). The pressure is then on reducing memory use through techniques such as Product Quantization while not losing too much accuracy.
Once this is done, the search heuristics are not difficult (find the cells to explore and return nearest neighbors).
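For a concrete example, faiss's IVF+PQ index does exactly this: split each vector into sub-vectors and store each as a small code, trading some accuracy for a big memory reduction (parameters below are only illustrative):

```python
import numpy as np
import faiss

d = 1024
xb = np.random.rand(200_000, d).astype("float32")     # stand-in embeddings

nlist, m, nbits = 1024, 64, 8                          # 1024 cells; 64 sub-vectors of 8 bits each
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
index.train(xb)                                        # learn coarse centroids + PQ codebooks
index.add(xb)                                          # each vector now stored as 64 bytes of codes

index.nprobe = 16                                      # cells to explore at query time
D, I = index.search(xb[:5], 10)
```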
Hey all, I'm from Pinecone (shocker). Addressing common questions...
What are vector DBs used for? > Storing and searching through embeddings at scale, which are created and consumed by LLMs and other AI models for applications like semantic search and chatbots (e.g., to avoid hallucinations).
Why use a managed vector DB like Pinecone instead of [Faiss, pgvector, self-hosted thing, numpy.array, etc]? > Usually comes down to scale and convenience. If you're dealing with a small amount of embeddings, say anything less than 10M, you're probably fine just reaching for the closest and most convenient option. (We try to make Pinecone that convenient option, and our free plan holds up to ~100k 1536-dimension embeddings.) If you're dealing with larger scale -- say hundreds of millions to billions of embeddings -- and have strict performance requirements, and aren't thrilled by the thought of managing your own vector database like we are, then you should consider Pinecone. It turns out there's a sufficiently large population that falls into the latter category, just as with any other database category.
Allow me one more plug: We're hosting a webinar next week about testing Pinecone performance with your own data and performance requirements. I have a feeling lots of folks reading this would find that useful. → https://pinecone-io.zoom.us/webinar/register/WN_z9JqLjLGTyu4...
How are you guys thinking about the embedding generation side of things? It seems like that part has a generally hefty compute cost before it even gets into the index - I just open sourced a swift package to try to make that part as easy as possible, the example project exports directly to pinecone. https://github.com/ZachNagengast/similarity-search-kit
That’s nearest neighbour search which scales O(n^2) for the number of vectors in your DB, what these DBs (and libraries like FAISS) use is approximate nearest neighbour, which makes the search much, much faster.
Salakhutdinov R. R, and Hinton, G. E. (2007) "Semantic Hashing" Proceedings of the SIGIR Workshop on Information Retrieval and Applications of Graphical Models
For prototyping purposes, I found that the chromadb integration in langchain is very easy to use. That said, I'm not sure about production usage.
https://blog.langchain.dev/langchain-chroma/
I get this question frequently: why not use FAISS or ANNOY directly instead of a vector database? So I'm glad to see this aspect covered in the article.
Suppose one is writing a desktop app doing searches of embeddings, is there a lightweight vector database that runs on all major desktop platforms? I'm aware of redis and sqlite-vss, but none runs natively on Windows.
I wouldn't call postgres lightweight of course, but it's definitely lightweight in the sense that it doesn't add a whole bunch of new garbage to an otherwise traditional application.
An embedding is the output of a mathematical process that converts tokens (text or otherwise) into a vector of floating point numbers. This vector captures semantics such that words that have similar meaning are close to each other when their distance is measured using the metric of cosine similarity.
An embedding is a numerical representation of a complex object.
In an image, each pixel is a dimension, but it does not have any meaning in itself - you need to look at the rest of the image to understand that the pixel is part of a cat. Embeddings are a way to represent this meaning. Think about it as a "summary" of an image / document.
I'm not a data scientist or anything, but as far as my understanding goes, you can think of it as a list of numbers (floats, integers, whatever), each number representing a feature of the thing you're trying to represent as a vector (an image, a video, whatever).
For example, you can create an embedding for an image using a neural net that has been trained to receive an image and output a vector of 1024 floats, which represent the content of the image. This vector is a lossy compressed version of the image.
And (if I'm understanding correctly) vectors that are near each other (in a mathematical sense) represent inputs that are "near" each other (in a conceptual sense).
So... a vector database can be organized to quickly retrieve objects with particular characteristics, without rigidly defining what those characteristics are.
Yup I think you got it perfectly. Just a small note: yes one isn’t rigidly defining what those characteristics are while finding similar embeddings (aka nearest vectors using some distance metric), but those characteristics are implicitly encoded in the model that creates the embeddings depending on how the model is trained.
"Since the vector database provides approximate results, the main trade-offs we consider are between accuracy and speed. The more accurate the result, the slower the query will be. However, a good system can provide ultra-fast search with near-perfect accuracy."
What does "near-perfect" mean in this context, type 1 or type 2?
An alternative to Pinecone is MyScale DB. It is a SQL-based vector database and has a generous free tier now. https://myscale.com/
1. Manages both structured and vectorized data in a single database and can perform joint queries and analytics on both types of data.
2. Cloud-native OLAP database architecture enables operations on vectorized data to be executed with astounding speed.
3. Complete and extended SQL support for all data operations, accessible via developer tools such as Python SDK.
Karpathy (OpenAI, ex-Tesla) recently tweeted about KISS and just using np.array instead of a vector database!
Searching for similar vectors is basically the (approximate) KNN problem, although I imagine more specialized search methods might apply depending on what you are doing.
100% agree. There are so many simple use cases where people are jumping to a complex option to start. For something like 10K records, a NumPy or PyTorch matrix operation could be enough.
Well, we have about 4B vectors we want to index, with a constantly changing dataset. We are actually running just fine on Elasticsearch with ~1B docs indexed right now, but the hardware costs are looking expensive.
Elasticsearch uses HNSW, not sure what options they have but quantization/compression will help reduce disk storage requirements. Alternatively, you can look at dimensionality reduction algorithms and only store that output in ES. Or pick a model with a small number of dimensions. For example https://huggingface.co/sentence-transformers/all-MiniLM-L6-v... only has 384 dims vs 768/1024/2048/4096.
>Karpathy (OpenAI, ex. Tesla) recently tweeted about KISS and just using np.array instead of a vector database
The context was a very underwhelming side project of his: A movie search engine but you had to use the exact titles of the movies to get results. It only revealed that he doesn't appreciate what similarity search actually is.
It feels almost blasphemous to call a Karpathy side project underwhelming. He is a genius and it really felt unlike him to write that "just use np.array" tweet.
I don't recall the context in that much detail, but I'd have to give him the benefit of the doubt!
Surely the whole point of a vector "database" in that context would be to store semantic sentence embeddings of the movies titles to support approximate / semantically-related search ? Could do the same thing for movie plot synopsis too - allow user to search via vague descriptions of movie. ChatGPT actually does very well at this, although massive overkill.
It definitely depends on your use case. If you are just searching through the entire array at all times, then this is certainly an acceptable option (you could even flip it all onto a GPU too).
But when you start to require filtering or combining the vector search with a lexical search, then something like Pinecone, Vespa, Qdrant, Lucene-based options (e.g. Solr and ES) etc. become a lot more practical than you building all that functionality yourself.
For NLP use-cases, you can use a vector database to index the embeddings of your texts.
For example, if you implement document retrieval (a search engine like Google), you train a transformer model that takes a text as input (the content of your document) and produces a vector of numbers as output (the embedding). You then index your documents by transforming them to their embeddings and storing them inside your vector database.
When you want to perform a query using keywords, you transform your keywords into a vector, and then ask your vector database to send you the most similar documents, using a similarity function such as cosine similarity.
It seems like it would be better to index a document multiple times by generating embeddings for every paragraph rather than once per document. What am I missing?
After working through several projects that utilized local hnswlib and different databases for text and vector persistence, I integrated open source hnswlib with sqlite to create an embedded vector search engine that can easily scale up to millions of embeddings. For self-hosted situations of under 10M embeddings and less than insane throughput I think this combo is hard to beat.
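The rough shape of the combo, if it helps (schema and parameters are made up for illustration; the real thing also needs index persistence, deletes, error handling, etc.):

```python
import sqlite3
import numpy as np
import hnswlib

dim = 384
db = sqlite3.connect("docs.db")
db.execute("CREATE TABLE IF NOT EXISTS docs (id INTEGER PRIMARY KEY, body TEXT)")

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=1_000_000, M=16, ef_construction=200)

def add(doc_id, body, vector):
    # Text/metadata goes to sqlite, the vector goes to hnswlib, keyed by the same id.
    db.execute("INSERT INTO docs VALUES (?, ?)", (doc_id, body))
    index.add_items(np.asarray([vector], dtype="float32"), [doc_id])

def search(query_vector, k=10):
    labels, dists = index.knn_query(np.asarray([query_vector], dtype="float32"), k=k)
    rows = [db.execute("SELECT body FROM docs WHERE id = ?", (int(i),)).fetchone()
            for i in labels[0]]
    return list(zip(labels[0].tolist(), dists[0].tolist(), rows))
```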
Not really. For starters not all the vectors in the database are of the same dimension (at least in ones I've used). You can for example in pgvector have multiple tables with a column of vector type and those tables can have different cardinality.
In any case your description is strangely reductionist. The important thing about a vector db is that it is typically designed to store embeddings used in various ML applications. So say you are doing NLP: you can tokenize some input and then store the token and positional embeddings in a vector db and then use it for similarity search, training, etc.
Mixed tables each with reduced cardinalities makes sense.
> your description is strangely reductionist.
Sure - pure | applied math background, old enough to have used Postgres when it was known as Ingres, to have patched in Spatial relations before it had the GIS functions it has now, and to have written libraries for GIS linked { 256 | 1024 | 2048 } D vector databases for signal acquisition | processing.
I'm late to the 'modern' discussions & just checking my read - I can think of applications for mixed dimensions and discrete space vectors and there are analogs to trad R^N ops for those cases.
If you take a 2d piece of paper and draw the numbers 1 to 5 horizontally, then look at the distance between 1 and 5, or 1 and 3, they vary. Now roll it into a cylinder: it's a manifold, and now the distance between 1 and 5, and between 1 and 3, are the same. No matter how many numbers you had written on the paper, they would be connected by the shortest path, something like a helix I believe, the helix being the geodesic. Now increase the dimension, but keep rolling it up into that 'cylinder' and using a new geodesic to connect the shortest path. This is my layman's take on it; if you are #math please jump in and put me right!
and a "vector database" is a collection of vectors, sure.
The question comes from is it the CompSci | HN | AI domain nomenclature to assume a vector database is made up of vectors that are all of the same dimension over a continuum (eg. N real numbers for fixed N) or are vector databases made up of mixed vectors (no fixed dimension) and discrete values, etc.
I ask as the linked article doesn't specify but does appear to imply.
An embedding model will map a string of text (of variable length) to R^D. Each model has its own fixed D, yes. (Typically the vector is a unit vector for performance reasons.) The main function of vector database is similarity lookup so you would calculate approximate nearest neighbours of a vector with a scalar vector distance metric (e.g cosine similarity). These similarity metrics operate on two vectors of the same dimension.
You would not mix and match embedding models (e.g. with differing dimensions) at look-up time. The target vector table assumes you will look it up with a vector created from the exact same embedding model and version that was used to backfill it.
The API documentation for a look up operation may be more illuminating here:
>vector (array of floats)
>The query vector. This should be the same length as the dimension of the index being queried. Each query() request can contain only one of the parameters id or vector.
As per my comment earlier, the dimension isn't fixed. The usual use case (storing embeddings) is instructive as to the range of values. For token embeddings, often the embedding is generated via a lookup from a token in a fixed vocabulary to a token ID. So say your vocab is words: the value is a word ID, which would obviously be an integer, not a real. Here's an intro to word embeddings https://wiki.pathmind.com/word2vec and here's one for positional embeddings (the new hotness given how zeitgeisty GPTs are) https://theaisummer.com/positional-embeddings/
If anyone is looking for a vector search engine, see here https://github.com/marqo-ai/marqo. Has additional functionality to make vector search much easier.
In the quest for ultimate speed, I started developing a vector database in assembly using gpt4 as a side project https://github.com/jn2clark/GPT4Memory.
Supabase wrote a solid tutorial[1] (you don't need to run it on Supabase).
0 - https://github.com/pgvector/pgvector
1 - https://supabase.com/blog/openai-embeddings-postgres-vector