I would love to read more about your experience. We need more content with feature, performance, and architecture comparisons. Currently, there's a lot of developer evangelism hype in the space.
Yep, we're (https://www.definite.app/) using pgvector and I was initially concerned about scaling, but it doesn't seem it will be a problem for our use case. I definitely wouldn't use it if I was building a feature for Slack, but works for us!
Yes, working on that landing page right now (currently it's pretty weak)!
We're building an AI data analyst. You can ask questions of your database and get answers immediately. We also auto generate entire dashboards based on common patterns (e.g. a "Sales Dashboard", "Marketing Dashboard", "Finance / Burn" etc.).
If you want to give it a try (there's a demo database embedded in the app), you can use it here: https://ui.definite.app/
Can you search both by an equality comparison and a vector search in weaviate? I'd like to do something along the lines of `SELECT * FROM table t WHERE cosine_dist(:my_embedding, t.doc_embedding) < :x AND some_column = 'XYZ'`
I have a ChatGPT session where I asked it to do a hybrid search using filtering, pg fts, and vector search. It looks reasonable; I just need to test it and write it up somewhere.
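Roughly the shape of what I'm planning to test (untested; table and column names are placeholders, and the `<=>` cosine-distance operator comes from pgvector):

```python
# Hypothetical sketch: metadata filter + Postgres full-text search + pgvector
# similarity in a single statement. Table/column names are made up.
import psycopg

query_embedding = [0.1] * 1536  # stand-in for a real embedding from your model

HYBRID_SQL = """
SELECT id, title
FROM docs
WHERE category = %(category)s                                             -- plain equality filter
  AND to_tsvector('english', body) @@ plainto_tsquery('english', %(q)s)   -- pg full-text search
ORDER BY embedding <=> %(vec)s::vector                                    -- pgvector cosine distance
LIMIT 10;
"""

with psycopg.connect("dbname=mydb") as conn:
    rows = conn.execute(HYBRID_SQL, {
        "category": "XYZ",
        "q": "401k match eligibility",
        "vec": str(query_embedding),  # pgvector accepts the '[x, y, ...]' text form
    }).fetchall()
```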
Amen. After suffering through many years of people telling me to use document databases when I was much better served with—at most—Postgres with a jsonb field, I feel vindicated enough to feel justified in doing my due diligence before going off the beaten track.
Not that document databases don’t have their place, but…MongoDB is webscale and all that.
Yup. pgvector will do it for a lot of projects, especially if you're just trying things out. I think of it as using PostgreSQL full text search before you need to deploy a dedicated solution.
Also plugging my crappy vector database, which you probably shouldn't use for anything but a fun project, however it can be set up and used in seconds. https://github.com/corlinp/Victor
I'm bullish on pgvector as well. Now that RDS supports it, along with plenty of other cloud providers, it seems like a no-brainer to be able to stick with your existing stack (assuming it's Postgres). Andrew Kane is such a prolific open-source maintainer, too.
It seems like if the goal is to "play around with vector databases", why not just install it on your local machine? Part of using these tools is learning how they work and configuring them yourself.
If the goal is "start developing products using vector data bases" then it seems like you would surely want something a bit more under your control than using replit.
I would say my use case is the same as many other people who use it. Repl.it is fantastic for getting started, for sharing your code, and for creating small applications.
Curious why you went for an Apache license. Aren't you worried about copy-cat services? Or does the OSS version lack the scaling/distributed features that would be more difficult to replicate? I think that was ES's fatal mistake, and their licensing games are unlikely to pan out.
The Coral Project [0] (commenting platform used on Washington Post, New York Times, The Verge) uses an Apache 2.0 license [1]. Which doesn't seem to have prevented it from raking in big SaaS customers.
A lot of people worry about copy-cat services, but it's kind of rare that someone will be able to compete with you as the original in hosting your own service as well as you can. Especially when you consider support and maintenance requirements of a new product you aren't personally developing.
I could see copy-cat services being more of an issue in the late stage of a product though? When everyone knows lots about how to stand it up and use it?
> I could see copy-cat services being more of an issue in the late stage of a product though? When everyone knows lots about how to stand it up and use it?
The concern isn't random small companies. The concern is the big cloud providers like AWS, Azure, and Google. And you are right, they aren't going to build out a hosted version of your product until there is enough traction. But at that point, customers might indeed trust them more than you to run your own software! Redis and Elastic ran into this problem, for example.
The most likely scenario, though, is never getting traction, so anything that improves traction, such as permissive licensing, is probably a better tradeoff.
hey ffback, contributors get 100% of the bounty award :) the organization pays the fee on top of the bounty. will update the docs to make this more clear, thank you!
A +1 for qdrant from a happy user. We use qdrant in production at a 50-100MM row scale. We haven't experienced many bottlenecks thus far, and it has performed quite well.
@qdrant_team: perhaps you should look into offering it as a service, a la pinecone.
edit: oops just checked your (updated) website and notice you have an offering already. Congrats! will check it out. ty =)
Well, to work on the core of the Qdrant engine https://github.com/qdrant/qdrant you should have some db knowledge, but even more important are Rust skills. However, we also have other products, like the cloud platform https://cloud.qdrant.io, where we are looking for different skills.
If anyone wants to try a FOSS vector-relational-graph hybrid database for more complicated workloads than simple vector search, here it is: https://github.com/cozodb/cozo/
Glad I hopped into this thread while your comment was recent enough to be at the top. This is super interesting! Apologies if you went over this in your other post (or the docs, I'll be digging into this over the weekend) but could you share a bit about why you went this route? What you tried, what the hangups were/are with other approaches, and if there are any interesting possibilities with your approach that other vector databases just wouldn't be able to do?
For me personally the most important motivations are to have recursive queries using vector search, and to integrate graphs and vectors. Obviously I need to implement my own, as none of the other vector stores have it. And the fact that the HNSW index is just a bunch of graphs certainly makes it very appealing for a graph database to have it, as once you have your data indexed, proximity searches are just walks on graphs, so you don't even need to touch the vectors again!
Thanks for the links and discussions. I'm keeping an eye on this one; it looks really promising, at least in the hybrid area compared to the much-hyped SurrealDB, whose graph implementation looks more like an afterthought when you get down to the technical details, functionality, and performance.
Unfortunately this piece is nebulous on what an embedding is. Apparently it is saved as an array of floats, and it has some string of text it is associated with, and the float arrays are compared by "similarity".
None of these explains what an embedding really is. My best guess is that the embedding represents the meaning of the natural language string it was generated from, such that strings with "similar" embeddings have similar meaning. But that's just speculation.
> My best guess is that the embedding represents the meaning of the natural language string it was generated from, such that strings with "similar" embeddings have similar meaning. But that's just speculation.
Yeah, you've got it. A mapping from words to vectors such that semantic similarity between words is reflected in mathematical similarity between vectors.
An idea of how you might train this thing: let's say the words "king" and "queen" are being embedded. In your training data there are lots of examples where "king" and "queen" are interchangeable; for example, in the sentence "The ___ is dead, long live the ____", either word is appropriate in either slot, so each time we see an example like this we nudge "king" and "queen" a little closer together in some sense. However, you also find phrases where they are not interchangeable, such as "The first born male will one day be ____". So when you see those examples you nudge "king" a little closer in some sense to other words which appropriately complete the sentence (which does not include "queen" in this case).
In this way, repeated over a giant training set with thousands of words, concepts like "male/female" and "royalty", "person/object" and tons of others end up getting reflected in the relationships between the vectors.
These vectors are then useful representations of words to ML models.
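If you want to poke at this yourself, here's a quick check with pre-trained vectors (gensim's downloader API; the model name is just one of the published GloVe conversions):

```python
# Toy check of the intuition above using pre-trained word vectors.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")   # 100-dimensional GloVe vectors

print(wv.similarity("king", "queen"))      # high: the words share many contexts
print(wv.similarity("king", "sweater"))    # low: they rarely share contexts
# The classic analogy: king - man + woman lands near queen.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```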
Right, makes sense. But then what do you actually do with a database?
Starting with: what do you store in it?
Maybe sentence/vector pairs. But what does that give you? What do you do with that data algorithmically? What's the equivalent of a SELECT statement? What's the application that benefits an end user? That part still seems rather hazy.
I haven't worked in this space, but from what I gather, the idea would be something along the lines of the following:
An autoencoder is a model that takes a high dimensional input, distills it down to a low dimensional middle layer, and then tries to rebuild the high dimensional input again. You train the model to minimize reconstruction error, and the point is then that you can run an input on just the first half to get a low-dimensional representation that captures the "essence" of the thing (in the "latent space"). In this representation, images that are similar should have similar "essences", so their latent vectors should be near to each other.
The low dimensional representation must do a good job capturing the "essence" of your things, otherwise your reconstruction error would be large. The lower the dimension you manage to use while still managing to reconstruct your things, the better of a job it must do at making those parameters really encode the salient features of your thing without wasting any information. So similar things should be encoded similarly.
So imagine you've got a database of images, and you have a table of all of the low dimensional encoded vectors. You want to do a reverse image search. The user sends you an image, you run the encoder on it to get the latent representation, and then you want to essentially run "SELECT ei.image_id FROM encoded_images ei ORDER BY distance(encode(input_image), ei.encoding) LIMIT 10".
So you want a database that supports indexes that let you efficiently run vector similarity queries/nearest neighbor search, i.e. that support an efficient "ORDER BY distance(_, indexed_column)". Since the whole process was fuzzy anyway, you may actually want to support an approximate "ORDER BY distance" for speed.
In practice apparently the encoding might be taking the output of the first or nth layer in a deep network or something rather than specifically using an autoencoder. Or you may have some other way to hash/encode things to produce a latent representation that you want to do distance searches on. And of course images could instead be documents or whatever you want to run similarity searches on.
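Again with the caveat that I haven't worked in this space: a brute-force version of that query might look something like the sketch below (everything here is made up for illustration; a vector database essentially replaces the linear scan with an approximate index):

```python
# Brute-force "encode then nearest neighbour" flow. encode() would be the
# first half of an autoencoder or an intermediate layer of some network.
import numpy as np

encoded_images = np.random.rand(100_000, 128).astype("float32")  # stand-in latents
image_ids = np.arange(100_000)

def search(query_latent, k=10):
    # Distance from the query latent to every stored latent, then take the k closest.
    dists = np.linalg.norm(encoded_images - query_latent, axis=1)
    nearest = np.argsort(dists)[:k]
    return image_ids[nearest], dists[nearest]

ids, dists = search(np.random.rand(128).astype("float32"))
```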
Often the use case is search. Ex. You have a basic text search engine to find musicians on your site which does some string matching and basic tokenization and so on. But you want to be able to surface similar types of musicians on search too.
In that case you might store vectors representing a user based on some features you've selected, or a word embedding of their common genres/tags.
To actually search this thing, you need something to compare against. You could directly use the word embeddings of the search query. You could also do a search against your existing method, and then use the top results from that as a seed to search your vectors.
Since everything's a vector, you can also ask questions like "what musician is similar to Tom AND Sally" by looking for vectors near T+S. T-S could represent like Tom but not like Sally, etc.
So the answer to what do you store is, what will be your seed to search against?
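A toy illustration of the "Tom AND Sally" / "Tom but not Sally" idea (vectors here are random stand-ins; real ones would come from whatever embedding you chose):

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

# Made-up musician embeddings, normalized so a dot product is cosine similarity.
musicians = {name: unit(np.random.rand(64)) for name in ["Tom", "Sally", "Ringo", "Joan"]}

tom, sally = musicians["Tom"], musicians["Sally"]
like_both = unit(tom + sally)        # "like Tom AND Sally"
tom_not_sally = unit(tom - sally)    # "like Tom but not like Sally"

# Rank everyone by similarity to the combined query vector.
ranked = sorted(musicians, key=lambda name: -float(musicians[name] @ like_both))
print(ranked)
```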
Wow, that's interesting. So we can do vector arithmetic and the results make sense as a form of embedding/concept logic. Addition seems to work like "and" / set intersection. Subtraction works like set difference ("T\S", also literally written "T-S", i.e. "without"), which logically says "T, but not S", or in terms of predicate calculus, "T(x) & not S(x)".
Perhaps there is also some unary vector operation which directly corresponds to negation (not Sally)? Perhaps multiplying the vector by -1? Or would (-1)S rather pick out "the opposite of Sally in conceptual space" instead of "not / anyone but, Sally"? And what about logical disjunction (union)? One could go further here, and ask whether there is an analog to logical quantifiers. Then there is of course the question whether there is anything in vector logic which would correspond to relations, binary predicates, like R(x, y), not just unary ones, etc.
The vectors are usually (if you use OpenAI API anyway) unit in length, and so you can imagine them on the surface of a hypersphere.
You measure the cosine distance between documents, or between search queries and documents. (Cosine is fast, there are other distance metrics).
The vector database queries will do things like given one embedding (document or query) find the nearest embeddings (documents). Or given two embeddings (e.g. a query and a context) with a weight for each one, find the ones that triangulate to being near both.
Simple answer - you normally store text in it, but with the state of neural networks these days most things can be vectorized and searched.
So, coming myself from a database background but working in search, the SELECT statement (and joins) probably aren't the best way to get your head wrapped around things. I would think of the vector as a unique key for a record, and only use a LIKE statement for all my queries, but one that will return a probability of a match instead of an actual match.
A great use case is to think about similarity, where we want the things that are closest to what we want to see, but there isn't an exact match.
For example: a user gives me a sentence that says, "How long do I have to be with the company before I get a 401K match?". My vector store has a bunch of vectors, including "A new employee will be eligible for 401K after 6 months." and "The 401K program is run by <MEGACORP X>."
I would like to be able to see that the first vector is a closer match to the user sentence than the second, and by how much. I would also like to do this without having to change my code much based on the structure of the text. Luckily, there is a very simple algorithm for doing this (cosine similarity) that doesn't change regardless of the sentence structure or the question answered. Also, it doesn't matter what kind of question/answer you do as long as it can be vectorized, so you could even give me a vector representing an image and I can give you an image that is most similar.
Here is the most interesting thing about vectors -- with very little effort they turn the english language into a programming language.
Instead of typing "SELECT document_id, document_name, document_body FROM documents WHERE (document_body LIKE '%401K%' AND document_body LIKE '%match%' AND document_body LIKE '%existing employee%')" I can just ask, "How long do I have to be with the company before I get a 401K match?" and I will get back a result and a match probability. How I change my text will change the matches, and can do so in ways that are profound and unexpected. Note that the SQL query I gave would not return any values because I didn't have any documents that had the term "existing" in them. Building the correct SQL query could be quite complex in comparison to just using the text.
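To make that concrete, here's a rough sketch of the 401K example with a sentence-embedding model (the model name here is just a commonly used example, not a recommendation):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "How long do I have to be with the company before I get a 401K match?"
docs = [
    "A new employee will be eligible for 401K after 6 months.",
    "The 401K program is run by <MEGACORP X>.",
]

# Normalized embeddings, so cosine similarity is a simple dot product.
q_vec = model.encode(query, normalize_embeddings=True)
d_vecs = model.encode(docs, normalize_embeddings=True)

scores = util.cos_sim(q_vec, d_vecs)[0]   # one similarity score per document
for doc, score in zip(docs, scores):
    print(f"{float(score):.3f}  {doc}")   # the first doc should score noticeably higher
```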
This is pretty great for long-tailed search, q&a, image search, recommendations, classification, etc.
BTW, I am biased, I work for Elastic (makers of Elasticsearch) and we have been doing traditional search forever, and vector/hybrid search for the last few years.
Because the model used to compute the embeddings is the same across scenarios. You can infer meaning for each dimension by checking which inputs get embeddings that have large values for the dimension.
If the inputs are images, you may find that some dimension scores e.g. how much blue there is in the image. Though often it's not that simple (there could be multiple dimensions that relate to how blue the image is, especially if the embedding dimensionality is large, which it does tend to be these days. Though you could reduce the embedding dimensionality first using PCA, and see what input images correspond to high/low values of the first principal component, etc.).
Dimensions themselves do not carry any meaning; what matters are the neighbors, to maintain a sense of similarity. Think of it like a very complex point cloud. Applying an n-dimensional rotation leads to the same point cloud, content-wise.
As for the number of dimensions, in a sense they are a training variable just like the content itself. The more dimensions you utilize for your embeddings, the more complex your relations can be during clustering. Too many dimensions can easily lead to overfitting, however, and too few dimensions usually cannot accurately represent the training corpus.
There are good sibling explanations by @ta20211004_1 and @HarHarVeryFunny, but if I can try in an additional way:
Imagine you wanted to go from words to numbers (which are easier to work with mathematically), like you wanted to assign a number to some words.
How could you do it? Well you could do it randomly: cat could be 2, dog could be 10, sweater could be 4.534 and frog could be 8.
Not super useful, but hey - words are now numbers! How can we make this "better"?
What if we decided on a way to put words on a line - let's say we ordered words by how much they had to do with animals. Let's say 10 meant it's a very animal-related word, and 0 is very not-animal related. So cat and dog would be 10, and maybe zoo would be 9, and fur could be 8. But something like sweater would be 1 (depending if the sweater was made from animal wool...?)
What now? Well what's cool is that if you assign words on that "animal-ness" line, you can find the words that are "similar" by looking at the numbers that are close. So, words whose value is around 6 are probably similar in meaning. At least, in terms of how much they relate to animals.
That's the core idea. Ordering words by animal-ness is not that useful in the real world, so maybe we can place words on a 2d grid instead of a line. Horizontally, it would go from 0 to 10 (not animal at all - very animal) and vertically, it could be ordered by brightness - 0 for dark, and 10 for bright.
So now, bright animals will congregate together in one part of the grid, and dark non animals will also live close together. For example, a dark frog might be in the bottom right at position (10, 0) - very animal (right end of the x axis) but not bright (bottom of the y axis). Any other word whose position is close to (10, 0) would presumably also be animal-y and dark.
That's really it. The magic is that... this works in thousands of dimensions. Each dimension being some way that "AIs" see words / our world. It's harder to think about what each dimension "is" or represents. But embeddings are really just that - the position in a space with a huge number of dimensions. Just like dark frogs were (10, 0) in our simple example, the word "frog" might be (0.124, 0.51251, 0.61, 0.2362, 0.236236, ..............) as an embedding.
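If it helps, here is the toy 2-D version in code (same made-up grid; "nearest embedding" is just "nearest point"):

```python
import numpy as np

# (animal-ness, brightness) on a 0-10 scale; all values invented.
words = {
    "frog":    np.array([10.0, 0.0]),   # very animal, dark
    "cat":     np.array([10.0, 6.0]),
    "sweater": np.array([1.0, 5.0]),
    "lamp":    np.array([0.0, 10.0]),   # not animal, very bright
}

query = np.array([9.0, 1.0])            # "some dark, animal-ish thing"
nearest = min(words, key=lambda w: np.linalg.norm(words[w] - query))
print(nearest)                          # -> frog
```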
Each vector is an array of n floats that represent a location of a thing in an n-dimensional space. The idea of learning an embedding is that you have some learning process that will put items that are similar into similar parts of that vector space.
The vectors don't necessarily need to represent words, and the model that produces them doesn't necessarily need to be a language model.
For example, embeddings are widely used to generate recommendations. Say you have a dataset of users clicking on products on a website. You could assume that products that get clicked in the same session are probably similar and use that dataset to learn an embedding for products. This would give you vector representing each product. When you want to generate recommendations for a product, you take the vector for that product and then search through the set of all product vectors to find those that are closest to it in the vector space.
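A hedged sketch of that co-click idea (often called "item2vec"): treat each session's clicked product IDs as a sentence and train a word2vec-style model on them, so products clicked together end up with nearby vectors. Product IDs and parameters below are invented.

```python
from gensim.models import Word2Vec

sessions = [
    ["prod_1", "prod_7", "prod_3"],
    ["prod_7", "prod_3", "prod_9"],
    ["prod_2", "prod_5"],
    # ... one list of clicked product IDs per user session
]

model = Word2Vec(sentences=sessions, vector_size=64, window=5, min_count=1, sg=1)

# Recommend products whose vectors are closest to a given product's vector.
print(model.wv.most_similar("prod_3", topn=3))
```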
An embedding is a way to map words into a high-dimensional "concept space", so they can be processed by ML algorithms. The most popular one is word2vec.
A vector database is used for things where you're trying query "I have this image, give me a list of the 10 closest images and metrics of how similar they are."
You use a machine learning model (like word2vec, OpenAI, etc.) to produce an "embedding" that describes the image, text, video, etc., which is your "vector".
For all of the other images in your database, you also run them through the same model, and store their embedding vectors in the vector database.
Then, you ask the database "I have this vector, what are the most similar vectors, and what are their primary keys, so I can see what content they refer to".
Think: you want to implement google "search by image". This is the basics of how you'd do that.
Why use the word "embedding" if there are already much more familiar words for it (isn't this the same as feature vector)?
I want to convince myself that this isn't similar to blockchain. In the sense that blockchain renamed an old and simple idea and advertised it as something complex and groundbreaking...
Also, relational databases and graph databases have a rich theory that yields many interesting sub-problems, each interesting in its own right. Contrast this with "document databases", which have no theory and nothing interesting behind them. So, if I were to invest my time learning about one w/o a financial incentive to do so, I'd not want to concentrate on some accidental concept that just happened to solve an immediate problem but isn't applicable / transferable to other problems.
For example, graph databases and relational databases create interesting storage problems wrt' optimal layout for various database components. If hash-table is all there is to the vector database, then it's not an interesting storage problem.
Similarly, with querying the database: if key lookup from a hash-table is all there is, then it's not an interesting problem.
Okay, "mapping into concept space" is at least compatible with my meaning theory, but by itself it doesn't say much, since in principle anything can be mapped to anything.
Embeddings are a mapping of some type of thing (pictures, words, sentences, etc) to points in a high-dimensional space (e.g. few hundred dimensions) such that items that are close together in this space have some similarity.
The general idea is that the items you are embedding may vary in very many different ways, so trying to map them into a low dimensional space based on similarity isn't going to be able to capture all of that (e.g. if you wanted to represent faces in a 2-D space, you could only use 2 similarity measures such as eye and skin color). However a high enough dimensional space is able to represent many more axes of similarity.
Embeddings are learnt from examples, with the learning algorithm trying to map items that are similar to be close together in the embedding space, and items that are dissimilar to be distant from each other. For example, one could generate an embedding of face photos based on visual similarity by training it with many photos of each of a large number of people, and have the embedding learn to group all photos of the same person to be close together, and further away from those of other individuals. If you now had a new photo and wanted to know who it is (or who it most looks like), you'd generate the embedding for the new photo and determine what other photos it is close to in the embedding space.
Another example would be to create an embedding of words, trying to capture the meanings of words. The common way to do this is to take advantage of the fact that words are largely defined by use/context, so you can take a lot of texts and embed the constituent words such that words that are physically close together in the text are close together in the embedding space. This works surprisingly well, and words that end up close together in the embedding space can be seen to be related in terms of meaning.
Word embeddings are useful as an input to machine learning models/algorithms where you want the model to "understand" the words, and so it is useful if words with similar meaning have similar representations (i.e. their embeddings are close together), and vice versa.
As opwieurposiu said, embeddings are high-dimensional vectors. Often, they're created by classic math techniques (e.g. principal component analysis), or they are extracted from a model that proved useful for something else.
For example, a neural net model accepts a massive number of input values that directly map to the input. So those initial values don't add any info. But a layer further inside the model, with fewer values and probably close to the end, is smaller and should reflect what the model's learned. Like a lot of deep learning, these values work but don't give much insight.
A word or sentence embedding is a long array of numbers that represents the semantic "position" in a high dimensional space, which allows you to find the distance between any two sentences in this semantic space. My understanding of paragraph and document embeddings is that they are an average of all the sentence vectors combined as one point, which lets you find the distance between any two documents in this semantic space.
Yeah... for a while I wanted to understand what a vector database is, but this article reads like a thinly-veiled advertorial: too many buzzwords, and the content feels like the author doesn't really have a good knowledge of the subject and is just trying to advertise the tech their company is selling.
An embedding is a series of numbers that have been gradually shifted to better fit some purpose. The gradients tell me that if I increase the first number of embedding X a little, the model will perform better, so I do.
I don't understand how so much money has been poured in to these companies?
I get why the techniques are suitable, but I just assumed whoever wants to do this kind of retrieval can probably implement a suitable approximate NN library themselves?
Especially so, because getting good embeddings is the hard part, not the search?
>I don't understand how so much money has been poured in to these companies?
First time here? Just kidding. But not.
You have to separate the VC hype from the product, because the VCs always need something to overhype. Half these people were pumping money into crypto and whatever-the-hell-web3-is/was just a couple months ago; this is just the next thing they like. Half these companies probably aren't remotely good companies.
I am on a small team that initially rolled our own semantic search system. We quickly ran into issues around scaling, maintenance, and performance. Since we want to focus on delivering features and not turning into a DevOps team, we switched to Pinecone and it has met our needs pretty well. We would like to see auto-scaling and I believe that this feature is in the works. Support has been very responsive and helpful when we do have questions and issues.
There are plenty of LLMs to choose from with regard to finding sources of embeddings. Some free, some for money.
For the same reason you have money going to various SQL-as-a-service companies that run Postgres / MySQL for you as a service: Some folks would rather eat the network latency, give up control of their data, and complicate their compliance process than operate a database themselves.
The difference with SQL is that it's not like storing vector embeddings outside your perimeter poses a big compliance issue --at least in security or legal terms. Giving up control of their data and network latency are legit concerns, that's for sure.
While Pinecone isn't available as a self-hosted option (see many comments with alternatives), we do offer the option of running Pinecone for you on a managed VPC, and we do have SOC2 compliance, and we do pass enterprise-level security reviews regularly. Whether that's sufficient is up to you of course.
A lot of people use some form of managed services if they are in the cloud. Be it S3 or Dynamo DB. Generally cheaper than running things yourself and operationally much easier too.
Anyone who wants this can implement their own library themselves? When has this worked for any problem ever?
Searching efficiently is a problem, and there's several open source and proprietary solutions but I don't get how you can put it in the "everyone should roll their own" category.
There are plenty of options available to run your own local vector database, txtai is one of them. Ultimately depends if you have a sizable development team or not. But saying it is impossible is a step too far.
Even in that article, with much smaller vectors than what GPT puts out (1536 dimensions), QPS drops below 100 if recall@1 is more than 0.4. That's to say nothing of the cost of regenerating this index using incremental updates. I don't get why people on HN are so adamant on the idea that no one needs scale beyond 1 machine ever.
If you have a billion vectors, is "yourself" a large tech company who does stuff like roll their own browsers and programming languages, invents kubernetes, etc.? Probably could roll this! And indeed sell this.
Last time I had to deal with vector representation of documents was more than 10 years ago, so I'm a bit rusty, but billion vector scale sound relatively trivial.
With retrieval time in the milliseconds? The entries may be ads, or something else user facing. Your users are not going to sit around while you leisurely retrieve them.
You do realize you have to query an index of all of that data for every single query your user makes, right? Computing that index is not entirely trivial, nor is the operation of partitioning the data so it fits in RAM across a pool of nodes.
Sure, roll your own, but don't act like making a highly scalable database is a weekend project.
Consumer hardware can still handle that with 1TB RAM + ThreadRipper Pro.
> You do realize you have to query an index of all of that data for every single query your user makes, right? Computing that index is not entirely trivial, nor is the operation of partitioning the data so it fits in RAM across a pool of nodes.
I don't know what any of this means -- and it sounds like you're slapping a bunch of terminology together, rather than communicating a well-thought-out idea.
Yes, in the general case you're going to have to use an index. Computing an index or a key to that index? Computing the index is a solved problem, that does not have a hard real-time component -- you can do it outside of normal query executions. Computing the key to the index on each query is also a solved problem.
Have dimensions stored in columnar format, generate a sparse primary index on said columns, and then use binary search to quickly find the blocks of interest to do a sequential search on viz. distance function. Or you could even just use regular old SS trees, SR trees, or M Trees for high-dimensional indexing -- they're not expensive to use at all.
There, you can easily run a query on a single dimension (1 billion entries) under a second. You want 300 dimensions? Ok, parallelize it. 128 threads, easy. At most this will take 3 seconds if everything is configured properly (big IF, that seems like few can get right).
This is literally a weekend project. Anyone can build something like this, but not everyone has the integrity to be upfront about how they're reinventing the wheel, and spinning it like they've just broken ground in database R&D.
What kind of QPS are you looking at? How are you handling 1536 dimensions? How long does an incremental index update take? These are the problems you run into in building such a system.
I'm not familiar with the index part, but you can get at least 2TB on a single CPU socket these days. You shouldn't need multiple machines to fit in RAM. Depending on what QPS you need to handle, you might also be fine to not have the whole thing fit in RAM.
lol, not true. Even for huge vectors (1000 page docs), today you can do this with enough disk storage with something like leveldb on a single node, and in memory with something like ScaNN for nearest neighbor.
It is at the intersection of technology investors "know" (databases) and technology investors don't know, but have been told is about to blow up (ML).
It is also effectively "roll your own Google/Shazam/whatever", which probably makes for a fancy demo to those who don't know how trivial it is to implement.
I don't see it addressed in the article, but Elastic 8 has ANN support, and every other feature you'd expect out of a ranking system. Vectors are only one piece of the puzzle for building such a system. (honest question, not trying to troll, as I truly do <3 these pinecone articles)
There currently isn't a way to filter docs alongside a KNN query, and the dimension support is limited to 1024 (a Lucene limitation) and OpenAI embeddings are 1536 dimensions - also indexing performance is not comparable. Wishing this changes, as they're a good stack for the reasons you state
Not really, I've been operating 10+ node Elasticsearch cluster for years, running on a workload scheduler (Nomad). I never have to perform any maintenance or housekeeping except deleting old indices, and updates are performed by bumping a container version number and then restarting nodes one-by-one with a delay in between.
Vectors will eventually just be another data type in all db systems. Already so many production systems have their data replicated across multiple dbs just to accommodate different use cases. I'm not keen on adding yet another one.
We always encourage folks to do their own testing. Everyone has different performance requirements, data shapes/sizes, budgets, and expectations of the user experience.
Elasticsearch is a great option. But clearly there's a large cohort of smart teams that decided the combination of performance + cost + scale + [etc] on Pinecone makes more sense for them.
Hey Greg! Yes I am trolling a bit, to see what the answers might be.
IMO - the real reason "Y Not Elasticsearch" is not because they're dumb or its bad. It's actually because they're not building for the search / AI market like you all are :)
When someone runs out of RAM with their NumPy array, they google, and you guys come up, really speaking to that audience, building features, showing people how to build specific solutions, etc.
I looked at the concepts in FAISS and it seems fairly straightforward. In non-jargon you have dimensionality reduction and neighborhoods.
DR is taking a long embedding and doing something to make it shorter. An easy to follow method for this is minhash.
Neighborhoods means representing a cluster of embeddings with a single representative to speed up comparisons. For example: find me the two closest representatives, then do a deeper comparison on all the residents.
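Roughly what that looks like in faiss (an IVF index; parameters here are illustrative, not tuned):

```python
import numpy as np
import faiss

d, n = 128, 100_000
xb = np.random.rand(n, d).astype("float32")    # stand-in embeddings

nlist = 256                                     # number of "neighborhoods" (clusters)
quantizer = faiss.IndexFlatL2(d)                # used to assign vectors to centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                                 # learn the cluster representatives
index.add(xb)

index.nprobe = 8                                # how many nearby clusters to actually search
D, I = index.search(np.random.rand(1, d).astype("float32"), 10)
```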
Now, the feature I haven't seen, and that will probably cause me to build instead of buy: most seem designed for a single organization and a single use. For example, a Spotify song recommender.
I would like to store embeddings from multiple models and be able to search per model. I would also like fine-grained user access control, so users could search their embeddings and grant access to others.
If the different models use the same dimensionality, you can keep their embeddings within different namespaces inside the same index. See: https://docs.pinecone.io/docs/namespaces
If you mean for your end-users, you can use namespaces again to separate embeddings for different users inside one index. See: https://docs.pinecone.io/docs/multitenancy
There isn't yet a combination of the two, where you provide Pinecone API access to end-users.
Thank you, I'll definitely play with pinecone before I build. The dimensionality might vary between models or versions of models. Additionally, the end goal would be to expose it to users and not have to post filter. So probably an index per user. Not sure how expensive that is to recalculate regularly.
Do any of the vector databases have support for bit embeddings? We have created bit embeddings[1] for sentences and they save a lot of space. Currently we are just using numpy and sometimes faiss to search through these bit embeddings. Would love for one of the vector dbs to support bit embeddings natively. Then we don't have to engineer that piece :)
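For context, roughly the kind of thing we do with plain numpy today: pack the bits and rank by Hamming distance (sizes and details here are just illustrative):

```python
import numpy as np

n, dim = 100_000, 256                           # 256-bit sentence embeddings
db_bits = np.random.randint(0, 2, size=(n, dim), dtype=np.uint8)
db = np.packbits(db_bits, axis=1)               # n x 32 bytes

query = np.packbits(np.random.randint(0, 2, size=dim, dtype=np.uint8))

# Hamming distance = popcount of the XOR of the packed bytes.
xor = np.bitwise_xor(db, query)
dists = np.unpackbits(xor, axis=1).sum(axis=1)
top = np.argsort(dists)[:10]                    # indices of the 10 nearest bit embeddings
```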
I imagine it's because there is a *massive* scramble to incorporate language models into existing applications and most of the time you need a vector database to do that. As a proxy for understanding the scale of this scramble, ChatGPT is the fastest growing software product ever by a pretty substantial margin.
Without a doubt LLMs are driving the interest in vector Dbs.
More so, I am wondering how it is that Pinecone manages to land on the HN front page so frequently given the large number of alternatives (see all the other comments in this thread). It suggests to me a coordinated marketing effort (brigading, cronyism, etc.)
Having myself recently bootstrapped an understanding of language models et al, I would not be surprised if the pinecone learning center gets a lot of traffic.
Does anyone have some real, production experience with Milvus? I'm interested in how this database performs at larger scale. Let's say you have millions of vectors and traffic reaching thousands of requests/s.
Not production, but yes to scale: I pushed Milvus to ~140 million vectors (768 dimensions), though only a handful of requests per second (~10), and it fared alright once everything was up and running and relatively static on the document side. Rebuilding indexes and stability were a bit of a hassle at times (I was live-adding more documents to it, ~1 million per 30 minutes, and it would occasionally fall over and need to rebuild, subsequently causing a lot more load, rejecting new documents, etc.). There's probably lots of tuning I could have done to eke out more performance and stability, though. It ended up being hours of effort on the rebuilds and lots of careful management of RAM (on a 300 GB RAM machine).
For the scale you are calling "larger scale": at the few-million-documents scale I would suggest just using any library, e.g. `hnsw` in `nmslib` or `faiss`.
I just did some benchmarks with 1M docs, `cosinesimil_sparse` on `78628`-dimensional binary vectors (nmslib `hnsw`) -> 30 seconds to build the index, and it can process a batch of 100 document queries in 3ms (each with 100 KNN). Based on this question, I just put a loop over it and it handled 1000 random queries (non-batched) in 1.11 seconds. (~1 GB peak RAM usage, and using 24 threads)
All in all, my personal opinion is: even up to few "millions" scale, i'm finding using the underlying libraries (`faiss` and `nmslib`) significantly easier than using the wrapper tools / databases (milvus and pinecone). I don't really get the point of a separate piece of infra for something that is essentially ~15 lines of python at most scales that matter (~few millions). (Note, in the ~10k-100k scale or less, simple numpy and sort seems to be fast enough (and exact) or just exact NN w/ sklearn.neighbors)... And when you push to scales that it does start breaking (100 million+), then the database versions seem to break as well (and require fiddling with lots of bespoke config)
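For reference, roughly the "~15 lines of python" I have in mind, using nmslib's `hnsw` on dense vectors (parameters are illustrative, not my exact benchmark setup):

```python
import numpy as np
import nmslib

data = np.random.rand(1_000_000, 384).astype("float32")   # stand-in embeddings

index = nmslib.init(method="hnsw", space="cosinesimil")
index.addDataPointBatch(data)
index.createIndex({"M": 16, "efConstruction": 200}, print_progress=True)

queries = np.random.rand(100, 384).astype("float32")
# 100 nearest neighbours for each query, searched across threads.
results = index.knnQueryBatch(queries, k=100, num_threads=24)
```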
Thanks for the input! I asked about the scale of items and traffic because my use case actually requires a separate piece of infrastructure. It's around 100 million items and live production traffic from millions of users with strict latency requirements. So it's not a batch job that can be performed in memory, as I understand your case to be.
Currently I use Elasticsearch with the Open Distro approximate kNN plugin by the way.
Yes! We've been running Milvus in production for about three years now, powering some customers that do have queries at that scale. It has its foibles like all of these systems (the lack of non-int id fields in the 1.x line is maddening and has required a bunch of additional engineering by us to work with our other systems), but it has held up pretty well in our experience.
(I can't speak to Milvus 2.x as we are probably not going to upgrade to that for a number of non-performance reasons)
I have a dumb question: isn't this the same set of algorithms and technology we've needed to develop for geospatial search (given coordinates, find the nearest coordinates)?
I remember reading about how Google maps did a very similar thing to figure out which points of interest to load based on your coordinates and zoom.
Can't we repurpose that technology? Or did those bake in assumptions around being 2 dimensional (while this is highly dimensional?)
Apart from the comments you've already gotten, another goal of geospatial systems is to support range queries (e.g. for the bounding box of the user's screen, what are all the businesses in that box). In higher dimensions range queries are mostly useless and the focus is on NN queries.
But as the other comments have mostly said, it's mainly dimensionality and scale differences that drive the design differences (e.g. graphs end up working better than trees in high dimensions)
The main problem is that the embeddings are getting larger and larger (1,000+ dimensions). The pressure is then on reducing memory use through techniques such as Product Quantization while not losing too much accuracy.
Once this is done, the search heuristics are not difficult (find the cells to explore and return nearest neighbors).
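For a concrete example, faiss's IVF+PQ index does exactly this: split each vector into sub-vectors and store each as a small code, trading some accuracy for a big memory reduction (parameters below are only illustrative):

```python
import numpy as np
import faiss

d = 1024
xb = np.random.rand(200_000, d).astype("float32")     # stand-in embeddings

nlist, m, nbits = 1024, 64, 8                          # 1024 cells; 64 sub-vectors of 8 bits each
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
index.train(xb)                                        # learn coarse centroids + PQ codebooks
index.add(xb)                                          # each vector now stored as 64 bytes of codes

index.nprobe = 16                                      # cells to explore at query time
D, I = index.search(xb[:5], 10)
```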
Hey all, I'm from Pinecone (shocker). Addressing common questions...
What are vector DBs used for? > Storing and searching through embeddings at scale, which are created and consumed by LLMs and other AI models for applications like semantic search and chatbots (e.g., to avoid hallucinations).
Why use a managed vector DB like Pinecone instead of [Faiss, pgvector, self-hosted thing, numpy.array, etc]? > Usually comes down to scale and convenience. If you're dealing with a small amount of embeddings, say anything less than 10M, you're probably fine just reaching for the closest and most convenient option. (We try to make Pinecone that convenient option, and our free plan holds up to ~100k 1536-dimension embeddings.) If you're dealing with larger scale -- say hundreds of millions to billions of embeddings -- and have strict performance requirements, and aren't thrilled by the thought of managing your own vector database like we are, then you should consider Pinecone. It turns out there's a sufficiently large population that falls into the latter category, just as with any other database category.
Allow me one more plug: We're hosting a webinar next week about testing Pinecone performance with your own data and performance requirements. I have a feeling lots of folks reading this would find that useful. → https://pinecone-io.zoom.us/webinar/register/WN_z9JqLjLGTyu4...
How are you guys thinking about the embedding generation side of things? It seems like that part has a generally hefty compute cost before it even gets into the index - I just open sourced a swift package to try to make that part as easy as possible, the example project exports directly to pinecone. https://github.com/ZachNagengast/similarity-search-kit
That’s nearest neighbour search which scales O(n^2) for the number of vectors in your DB, what these DBs (and libraries like FAISS) use is approximate nearest neighbour, which makes the search much, much faster.
Salakhutdinov R. R, and Hinton, G. E. (2007) "Semantic Hashing" Proceedings of the SIGIR Workshop on Information Retrieval and Applications of Graphical Models
For prototyping purposes, I found that the chromadb integration in langchain is very easy to use. That said, I'm not sure about production usage.
https://blog.langchain.dev/langchain-chroma/
I get this question frequently: why not use FAISS or ANNOY directly instead of a vector database? So I'm glad to see this aspect covered in the article.
Suppose one is writing a desktop app doing searches of embeddings, is there a lightweight vector database that runs on all major desktop platforms? I'm aware of redis and sqlite-vss, but none runs natively on Windows.
I wouldn't call postgres lightweight of course, but it's definitely lightweight in the sense that it doesn't add a whole bunch of new garbage to an otherwise traditional application.
An embedding is the output of a mathematical process that converts tokens (text or otherwise) into a vector of floating point numbers. This vector captures semantics such that words that have similar meaning are close to each other when their distance is measured using the metric of cosine similarity.
An embedding is a numerical representation of a complex object.
In an image, each pixel is a dimension, but it does not have any meaning in itself - you need to look at the rest of the image to understand that the pixel is part of a cat. Embeddings are a way to represent this meaning. Think about it as a "summary" of an image / document.
I'm not a data scientist or anything, but as far as my understanding goes, you can think of it as a list of numbers (floats, integers, whatever), each number representing a feature of the thing you're trying to represent as a vector (an image, a video, whatever).
For example, you can create an embedding for an image using a neural net that has been trained to receive an image and output a vector of 1024 floats, which represent the content of the image. This vector is a lossy compressed version of the image.
And (if I'm understanding correctly) vectors that are near each other (in a mathematical sense) represent inputs that are "near" each other (in a conceptual sense).
So... a vector database can be organized to quickly retrieve objects with particular characteristics, without rigidly defining what those characteristics are.
Yup I think you got it perfectly. Just a small note: yes one isn’t rigidly defining what those characteristics are while finding similar embeddings (aka nearest vectors using some distance metric), but those characteristics are implicitly encoded in the model that creates the embeddings depending on how the model is trained.
"Since the vector database provides approximate results, the main trade-offs we consider are between accuracy and speed. The more accurate the result, the slower the query will be. However, a good system can provide ultra-fast search with near-perfect accuracy."
What does "near-perfect" mean in this context, type 1 or type 2?
An alternative to Pinecone is MyScale DB. It is a SQL-based vector database and has a generous free tier now. https://myscale.com/
1. Manages both structured and vectorized data in a single database and can perform joint queries and analytics on both types of data.
2. Cloud-native OLAP database architecture enables operations on vectorized data to be executed with astounding speed.
3. Complete and extended SQL support for all data operations, accessible via developer tools such as Python SDK.
Karpathy (OpenAI, ex-Tesla) recently tweeted about KISS and just using np.array instead of a vector database!
Searching for similar vectors is basically the (approximate) KNN problem, although I imagine more specialized search methods might apply depending on what you are doing.
100% agree. There are so many simple use cases where people are jumping to a complex option to start. For something like 10K records, a NumPy or PyTorch matrix operation could be enough.
Well, we have about 4B vectors we want to index, with a constantly changing dataset. We are actually running just fine on Elasticsearch with ~1B docs indexed right now, but the hardware costs are looking expensive.
Elasticsearch uses HNSW, not sure what options they have but quantization/compression will help reduce disk storage requirements. Alternatively, you can look at dimensionality reduction algorithms and only store that output in ES. Or pick a model with a small number of dimensions. For example https://huggingface.co/sentence-transformers/all-MiniLM-L6-v... only has 384 dims vs 768/1024/2048/4096.
>Karpathy (OpenAI, ex. Tesla) recently tweeted about KISS and just using np.array instead of a vector database
The context was a very underwhelming side project of his: A movie search engine but you had to use the exact titles of the movies to get results. It only revealed that he doesn't appreciate what similarity search actually is.
It feels almost blasphemous to call a Karpathy side project underwhelming. He is a genius and it really felt unlike him to write that "just use np.array" tweet.
I don't recall the context in that much detail, but I'd have to give him the benefit of the doubt!
Surely the whole point of a vector "database" in that context would be to store semantic sentence embeddings of the movies titles to support approximate / semantically-related search ? Could do the same thing for movie plot synopsis too - allow user to search via vague descriptions of movie. ChatGPT actually does very well at this, although massive overkill.
It definitely depends on your use case. If you are just searching through the entire array at all times, then this is certainly an acceptable option (you could even flip it all onto a GPU too).
But when you start to require filtering or combining the vector search with a lexical search, then something like Pinecone, Vespa, Qdrant, Lucene-based options (e.g. Solr and ES) etc. become a lot more practical than you building all that functionality yourself.
For NLP use-cases, you can use a vector database to index the embeddings of your texts.
For example, if you implement document retrieval (a search engine like Google), you train a transformer model that takes a text as input (the content of your document) and produces a vector of numbers as output (the embedding). You then index your documents by transforming them to their embeddings and storing them inside your vector database.
When you want to perform a query using keywords, you transform your keywords into a vector, and then ask your vector database to send you the most similar documents, using a similarity function such as cosine similarity.
It seems like it would be better to index a document multiple times by generating embeddings for every paragraph rather than once per document. What am I missing?
After working through several projects that utilized local hnswlib and different databases for text and vector persistence, I integrated open source hnswlib with sqlite to create an embedded vector search engine that can easily scale up to millions of embeddings. For self-hosted situations of under 10M embeddings and less than insane throughput I think this combo is hard to beat.
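The rough shape of the combo, if it helps (schema and parameters are made up for illustration; the real thing also needs index persistence, deletes, error handling, etc.):

```python
import sqlite3
import numpy as np
import hnswlib

dim = 384
db = sqlite3.connect("docs.db")
db.execute("CREATE TABLE IF NOT EXISTS docs (id INTEGER PRIMARY KEY, body TEXT)")

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=1_000_000, M=16, ef_construction=200)

def add(doc_id, body, vector):
    # Text/metadata goes to sqlite, the vector goes to hnswlib, keyed by the same id.
    db.execute("INSERT INTO docs VALUES (?, ?)", (doc_id, body))
    index.add_items(np.asarray([vector], dtype="float32"), [doc_id])

def search(query_vector, k=10):
    labels, dists = index.knn_query(np.asarray([query_vector], dtype="float32"), k=k)
    rows = [db.execute("SELECT body FROM docs WHERE id = ?", (int(i),)).fetchone()
            for i in labels[0]]
    return list(zip(labels[0].tolist(), dists[0].tolist(), rows))
```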
Not really. For starters not all the vectors in the database are of the same dimension (at least in ones I've used). You can for example in pgvector have multiple tables with a column of vector type and those tables can have different cardinality.
In any case your description is strangely reductionist. The important thing about a vector db is that it is typically designed to store embeddings used in various ML applications. So say you are doing NLP: you can tokenize some input and then store the token and positional embeddings in a vector db and then use it for similarity search, training, etc.
Mixed tables each with reduced cardinalities makes sense.
> your description is strangely reductionist.
Sure - pure | applied math background, old enough to have used Postgres when it was known as Ingres, to have patched in Spatial relations before it had the GIS functions it has now, and to have written libraries for GIS linked { 256 | 1024 | 2048 } D vector databases for signal acquisition | processing.
I'm late to the 'modern' discussions & just checking my read - I can think of applications for mixed dimensions and discrete space vectors and there are analogs to trad R^N ops for those cases.
If you take a 2d piece of paper and draw the numbers 1 to 5 horizontally, then look at the distance between 1 and 5, or 1 and 3, they vary. Now roll it into a cylinder: it's a manifold, and now the distance between 1 and 5, and between 1 and 3, are the same. No matter how many numbers you had written on the paper, they would be connected by the shortest path, something like a helix I believe, the helix being the geodesic. Now increase the dimension, but keep rolling it up into that 'cylinder' and using a new geodesic to connect the shortest path. This is my layman's take on it; if you are #math please jump in and put me right!
and a "vector database" is a collection of vectors, sure.
The question comes from is it the CompSci | HN | AI domain nomenclature to assume a vector database is made up of vectors that are all of the same dimension over a continuum (eg. N real numbers for fixed N) or are vector databases made up of mixed vectors (no fixed dimension) and discrete values, etc.
I ask as the linked article doesn't specify but does appear to imply.
An embedding model will map a string of text (of variable length) to R^D. Each model has its own fixed D, yes. (Typically the vector is a unit vector for performance reasons.) The main function of vector database is similarity lookup so you would calculate approximate nearest neighbours of a vector with a scalar vector distance metric (e.g cosine similarity). These similarity metrics operate on two vectors of the same dimension.
You would not mix and match embedding models (e.g. with differing dimensions) at look-up time. The target vector table assumes you will look it up with a vector created from the exact same embedding model and version that was used to backfill it.
The API documentation for a look up operation may be more illuminating here:
>vector (array of floats)
>The query vector. This should be the same length as the dimension of the index being queried. Each query() request can contain only one of the parameters id or vector.
As per my comment earlier, the dimension isn't fixed. The usual use case (storing embeddings) is instructive as to the range of values. For token embeddings, often the embedding is generated via a lookup from a token in a fixed vocabulary to a token ID. So say your vocab is words: the value is a word ID, which would obviously be an integer, not a real. Here's an intro to word embeddings https://wiki.pathmind.com/word2vec and here's one for positional embeddings (the new hotness given how zeitgeisty GPTs are) https://theaisummer.com/positional-embeddings/
If anyone is looking for a vector search engine, see here https://github.com/marqo-ai/marqo. Has additional functionality to make vector search much easier.
In the quest for ultimate speed, I started developing a vector database in assembly using gpt4 as a side project https://github.com/jn2clark/GPT4Memory.
Supabase wrote a solid tutorial[1] (you don't need to run it on Supabase).
0 - https://github.com/pgvector/pgvector
1 - https://supabase.com/blog/openai-embeddings-postgres-vector