Pgvector: Open-source vector similarity search for Postgres (github.com/pgvector)
201 points by craigkerstiens on Feb 28, 2023 | 32 comments



See also this post:

"Storing OpenAI embeddings in Postgres with pgvector"

https://supabase.com/blog/openai-embeddings-postgres-vector

Discussion: https://news.ycombinator.com/item?id=34684593

And this one:

"What's Postgres Got To Do With AI?"

https://www.crunchydata.com/blog/whats-postgres-got-to-do-wi...

Discussion: https://news.ycombinator.com/item?id=34903454


PostgreSQL also has built-in support for exact (as opposed to approximate) vector similarity search for some problems with the following constraints:

- you have smaller vectors (no more than 50D)

- Euclidean, taxicab, or Chebyshev distance is appropriate for your problem

Put a GiST index [0] on a cube column [1] and ORDER BY target_cube_value <-> cube_column.
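
For example, a minimal sketch (made-up table and column names):

    CREATE EXTENSION IF NOT EXISTS cube;

    -- small fixed-size vectors stored as cubes
    CREATE TABLE items (id bigserial PRIMARY KEY, features cube);
    CREATE INDEX ON items USING gist (features);

    -- exact nearest neighbours: <-> is Euclidean, <#> taxicab, <=> Chebyshev
    SELECT id
    FROM items
    ORDER BY cube(array[1.0, 2.0, 3.0]) <-> features
    LIMIT 10;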

[0] https://www.postgresql.org/docs/current/gist.html

[1] https://www.postgresql.org/docs/current/cube.html


Does this mean we are getting closer to being able to do typical inference/search stuff straight in the db? I'm the CTO of an ecommerce company, and I'm just now starting to look into similarity-based recommendations, where we can show related products based on a bunch of variables that suggest similarity.

My knowledge of this topic is very tangential since I have not had a need to dive in, but isn't a completed / trained model just represented as a bunch of vectors?

We run Elixir, so I could run the models in Bumblebee, but ideally we can stick to Postgres only until vertical scaling starts to fall apart (which is double-digit terabytes away).


Typically you have (1) an ML model to generate embeddings from the items in your db and (2) vector support in the db to order/compare those rows by similarity (i.e. cosine distance between embeddings). So this just gives you the second part in a more convenient / efficient package. That's super cool, but only the second half.

For the encoding model, you could use any ML model you want, from cheap/less complex models to expensive/incredibly complex models like GPT-3. You could even use a face recognition model to encode face images and sort/compare them the same way.

So this just makes it a lot easier to roll your own similarity system with an encoding model of your choice plugged in. If you have a lot of data to encode and aren't afraid of running your own model, it is a great part of a solution. But it is not an all-in-one solution.
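
To make the "second half" concrete, here's a minimal pgvector sketch (toy 3-dimensional vectors; a real encoding model would give you hundreds or thousands of dimensions):

    CREATE EXTENSION IF NOT EXISTS vector;

    -- embeddings come from whatever encoding model you picked
    CREATE TABLE docs (id bigserial PRIMARY KEY, embedding vector(3));
    INSERT INTO docs (embedding) VALUES ('[1,1,1]'), ('[2,2,2]'), ('[-1,0,1]');

    -- <=> is cosine distance; pgvector also has <-> (L2) and <#> (negative inner product)
    SELECT id FROM docs ORDER BY embedding <=> '[1,1,2]' LIMIT 5;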


Probably better to use an off-the-shelf recommender or Amazon Personalize. Amazon Personalize will turn your product text data into embeddings and use that + collaborative filtering to generate various types of recommendations.


I’ve worked on e-commerce recs. Typically you would represent each product with a vector. Then finding similar products becomes a nearest-neighbour search over these vectors. Depending on your use-case it’s feasible now to do this search in the db using pgvector, or using something like Solr/Elasticsearch, which both support vector search in recent releases. You could also use something like Faiss or one of the many nearest-neighbour libraries or dedicated vector search engines. (Since you are working with Elixir, you might find ExFaiss interesting [1][2][3].)
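
For the pgvector option, a rough sketch of a "similar to product 123" lookup (schema and index parameters are made up; the real work is producing the embedding column, more on that below):

    -- product vectors produced offline by your embedding model
    CREATE TABLE products (id bigint PRIMARY KEY, embedding vector(100));

    -- optional approximate index for large catalogues (exact scans are fine when small)
    CREATE INDEX ON products USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

    -- 10 nearest neighbours of product 123 by cosine distance
    SELECT p.id
    FROM products p
    WHERE p.id <> 123
    ORDER BY p.embedding <=> (SELECT embedding FROM products WHERE id = 123)
    LIMIT 10;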

But I would say that for recommendations, searching the vectors is the easy part. The main work in getting good recommendations is generating a good set of product vectors in the first place. The quality of the recommendations is directly related to how you generate the vectors. You could use one of the many open-source language models to generate vectors, but typically that approach isn’t very good for product recommendations. It will just give you items that are textually similar, which usually isn’t what makes a good recommendation.

To get good product recommendations you’d probably want to build a custom embedding that captures some notion of product similarity, trained on signals you get from user behaviour. E.g. products clicked in the same session, or added to cart at the same time, give a signal about product similarity that you can use to train a product embedding for recommendations.

This is a bit more involved, but the main work is in generating the training data. Once you have that you can use open-source tools such as fastText [4] to learn the embedding and output the product vectors. (Or if you want to avoid training your own embedding, I’d guess that there are services that will take your interaction data and generate product vectors from them, but I’m not familiar with any.)

[1] https://github.com/elixir-nx/ex_faiss

[2] https://dockyard.com/blog/2023/01/04/search-and-clustering-w...

[3] https://dockyard.com/blog/2023/01/11/semantic-search-with-ph...

[4] https://fasttext.cc/docs/en/unsupervised-tutorial.html


Unfortunately, it's not available yet in AWS, Azure, or GCP, where most managed PostgreSQL installs are happening thanks to integrated monitoring and one-click replication, backup/restore, scaling, etc.

https://docs.aws.amazon.com/AmazonRDS/latest/PostgreSQLRelea...

https://cloud.google.com/sql/docs/postgres/extensions#postgr...

https://learn.microsoft.com/en-us/azure/postgresql/single-se...

Let alone AWS Aurora, AWS Redshift, Snowflake, etc.


I looked, and DigitalOcean doesn't offer this extension. Then I tried their "ideas" app, where users can vote for new ideas/suggestions; the "pgvector extension" idea URL [1] is full of spam. Looks like they don't care anymore.

[1]: https://ideas.digitalocean.com/app-framework-services/p/pgve...


Here's the issue tracking the current hosted options:

https://github.com/pgvector/pgvector/issues/54


Feel free to try it out on https://neon.tech/


Hey, any plans to let us know about pricing? Thanks!


wow, you guys rock!


Are any of the extensions big competitors with their managed services? For example Timescale on AWS - I might not bother with Redshift depending on scale and needs. I could be wrong, but it seems like business pressure will prevent the spread of the most useful extensions on first-class managed Postgres.


GCP supports lots of extensions, including query hints, etc.: https://cloud.google.com/sql/docs/postgres/extensions#postgr...

As an aside, I question Timescale’s use case. It’s highly niche: you need to store arrays of data at >>1k entries per day, you can’t use batch export of the data, and the export process is largely concerned with only recent data (i.e. it doesn’t need to incorporate future modifications of older entries).

1 year’s worth of 4-byte ints at 1-minute intervals is only ~2MB uncompressed (365 × 1,440 samples × 4 bytes ≈ 2.1MB). Do you really need second-level or sub-second resolution in the db? Because 2MB chunks (realistically much less) are very cheap. 2MB fits in 250 8kB pages. Many join+limit-25 queries on unclustered tables use about that much. It’s so, so cheap to process.

Last I checked, Timescale also doesn’t make it easy to combine data at different resolutions. If you store 10ms, 1m, 1day resolutions, you do need to write 3 queries and combine them to get a real-time aggregate.
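
By "combine them" I mean stitching together something like this by hand (hypothetical table/aggregate names):

    -- raw 10ms table plus two hypothetical pre-aggregated rollups
    SELECT bucket AS ts, avg_value AS value
    FROM   agg_1day  WHERE bucket < date_trunc('day', now())
    UNION ALL
    SELECT bucket, avg_value
    FROM   agg_1min  WHERE bucket >= date_trunc('day', now())
                       AND bucket <  date_trunc('minute', now())
    UNION ALL
    SELECT time, value
    FROM   raw_10ms  WHERE time >= date_trunc('minute', now())
    ORDER BY ts;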

I hope I get owned and learn some stuff. So can you share more about your use case? It’d be interesting to see what is getting you interested in timescale.


The true killer feature in my book is the compression: I get ~93% for my largest datasets, which makes disk space requirements far more sane and dramatically speeds up big analytical queries. You’re right that if you have a small (<1GB) table, Timescale is overkill, but shrinking a 70GB table to 5GB is quite a big deal if you’re trying to run on a budget VPS.

Continuous aggs and the job daemon are really nice bonuses too.


PG arrays are also compressed on disk. It may not be all that different from Timescale. So yeah, if you store data in a ts|id|value row format it’s 70GB, but in an id|values layout the values array is all compressed. I’ve seen a 40x disk-size reduction in practice using this method.
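
Roughly the two layouts being compared (made-up names):

    -- one row per reading: the ts|id|value shape, key repeated for every sample
    CREATE TABLE readings_rows (ts timestamptz, id bigint, value int);

    -- one row per series: the id|values shape, the int[] gets TOASTed/compressed
    CREATE TABLE readings_arrays (id bigint PRIMARY KEY, start_ts timestamptz, vals int[]);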


I’m surprised by that 40x - I remember benchmarking TOAST compression and being very underwhelmed by the savings (I believe it took me from ~60GB to ~20GB).

What sorta impact does that schema have on insert performance? I would expect the DB to have to rewrite the entire array on every update, though that cost could be mitigated by chunking the arrays.

Are you ditching timestamps and assuming data is evenly spaced?


In the real-life situation where I tried this, the keys were very wide and compound, with text keywords. That's where most of the savings came from, since before, each row was mostly key and not very much actual associated data. Sorry to mislead you about TOAST compression.

It was daily data too, literally 10 or more orders of magnitude less ingestion than Timescale is built for, so the arrays fit inside 1-2 Postgres pages, and writing was absolutely not a problem even for 5-10 years of data.

Timescale may be the right solution when you need, to quote their docs, "Insert rates of hundreds of thousands of writes per second", or somewhere within a few orders of magnitude of that. Good for the niche, but the niche is uncommon.

Yes, ditching timestamps. The full row configuration used Amazon keywords as keys, so they looked something like start_date|end_date|company_id|keyword_match_type|keyword_text|total_sales_by_day. The match_type and keyword text were like 40 bytes per row, so that's where the huge savings came from.

The data was assumed to be contiguous, i.e. there are end_date - start_date + 1 entries in the total_sales_by_day array.
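
So roughly this shape (the types here are just illustrative):

    CREATE TABLE keyword_sales (
        start_date         date,
        end_date           date,
        company_id         bigint,
        keyword_match_type text,
        keyword_text       text,
        total_sales_by_day numeric[],  -- end_date - start_date + 1 entries, one per day
        PRIMARY KEY (company_id, keyword_match_type, keyword_text, start_date)
    );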

If this data were in Timescale, the giant keys would be compressed on disk, but I believe (would need to check) that there would be a lot of noise in memory/caches/processing after decompressing while processing the rows.

Anyway, in conclusion, I do think Timescale has its niche uses, but I've seen a lot of people think they have time series data and need Timescale when they really just have data that is pre-aggregated on a daily or even hourly basis. For these situations Timescale is overkill.


Yeah, fully agreed with the overall thesis. Postgres is quite capable straight out of the box.

I really like the idea of TOAST compression as an archival format too for big aggregates - I’ll have to check out the performance of a (grouping_id, date, length_24_array_of_hourly_averages) schema next time I get an excuse.


That said, pgvector could be used for applications that "ship" PostgreSQL with themselves, e.g. adding semantic search to apps like Zulip.


What needs to happen to make it supported? Is it just a matter of Postgres version, or is there some reason for specific support matrices?


It would be interesting to include it in the benchmarks with the vector databases: https://qdrant.tech/benchmarks/

Are there any limits besides the 2,000-dimension maximum that can be indexed?


So this could be used to find similar images in a database?

For now I've sort of tested a naive method where I convert all images to 64x64 black and white and use a simple Levenshtein distance. It's not efficient, but for a not-too-large dataset it works.

I guess I could just add the image histogram.

I'm still curious about how TinEye works.


It is not only about visual similarity, but the semantics of the images as well (for example, all the dog photos should be close to each other, no matter the colours or the scene).

The simplest way to do that is probably to use one of the pretrained neural networks (like ResNet), convert the images into embeddings, index them, and use them for search.

I'm sharing my article describing how to implement it: https://medium.com/p/5515270d27e3


This is really great. Thanks for sharing.


Have you looked at CLIP? You can use that to create a vector embedding for an image that includes semantic information (what's actually in the image - animals, colours, etc) - those could then be used with pgvector to find similar images.

CLIP is how search engines like https://lexica.art/?q=6dc768e2-7a7c-494d-9a39-fd8f27e69248 work.


Using CLIP could let you search by text sentences. And this could be used to accelerate the embedding-similarity sorting part, like https://mazzzystar.github.io/2022/12/29/Run-CLIP-on-iPhone-t...


Anyone know if Neon supports pgvector?


Yes it does! As of Feb 14th, a Valentine's Day present: https://neon.tech/docs/release-notes


Huzzah! Thank you


Interesting. I want to try this on a bunch of chemical structures


Here's a demo using Milvus if you're interested: https://molsearch.milvus.io/.



