Does this mean we are getting closer to being able to do typical inference search stuff straight in the db? I'm a CTO in ecommerce and I'm just now starting to look into similarity-based recommendations, where we can show related products based on a bunch of variables that imply similarity.
My knowledge of this topic is very tangential since I haven't had a need to dive in, but isn't a completed/trained model just represented as a bunch of vectors?
We run Elixir so I could run the models in Bumblebee, but ideally we'd stick to Postgres only until vertical scaling starts to fall apart (which is double-digit terabytes away).
Typically you have (1) an ML model to generate embeddings from the items in your db and (2) vector support in the db to order/compare those rows by similarity (i.e. cosine distance between embeddings). So this just gives you the second part in a more convenient / efficient package. That's super cool, but only the second half.
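To make that second part concrete, here's a rough sketch in Python of what the ordering query can look like, assuming a hypothetical products table with an embedding vector(384) column and pgvector's <=> cosine-distance operator:

    # Sketch only: assumes pgvector is installed and a table like
    #   CREATE TABLE products (id bigint PRIMARY KEY, name text, embedding vector(384));
    import psycopg2

    def similar_products(conn, query_embedding, limit=10):
        """Return the products closest to query_embedding by cosine distance."""
        # pgvector accepts vectors as '[0.1,0.2,...]' literals; <=> is cosine distance
        vec = "[" + ",".join(str(x) for x in query_embedding) + "]"
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT id, name, embedding <=> %s::vector AS distance
                  FROM products
                 ORDER BY embedding <=> %s::vector
                 LIMIT %s
                """,
                (vec, vec, limit),
            )
            return cur.fetchall()

    conn = psycopg2.connect("dbname=shop")  # placeholder connection string
    # rows = similar_products(conn, some_embedding)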
For the encoding model, you could use any ML model you want, from cheap/less complex models to expensive/incredibly complex models like GPT-3. You could even use a face recognition model to encode face images and sort/compare them the same way, etc.
So this just makes it a lot easier to roll your own similarity system with an encoding model of your choice plugged in. If you have a lot of data to encode and aren't afraid of running your own model, it is a great part of a solution. But it is not an all-in-one solution.
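And a minimal sketch of the first half, using an arbitrary off-the-shelf text model via sentence-transformers to fill that embedding column (the model name, products, table and connection string are all placeholders):

    # Sketch: encode product text with an off-the-shelf model, then store the
    # vectors in a pgvector column. The model choice here is arbitrary.
    import psycopg2
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim text embeddings

    products = [
        (1, "Red running shoes, lightweight mesh upper"),
        (2, "Trail running shoes, waterproof"),
    ]
    embeddings = model.encode([text for _, text in products])

    conn = psycopg2.connect("dbname=shop")  # placeholder connection string
    with conn.cursor() as cur:
        for (pid, _), emb in zip(products, embeddings):
            vec = "[" + ",".join(str(x) for x in emb) + "]"
            cur.execute(
                "UPDATE products SET embedding = %s::vector WHERE id = %s",
                (vec, pid),
            )
    conn.commit()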
Probably better to use an off-the-shelf recommender or Amazon Personalize. Amazon Personalize will turn your product text data into embeddings and use that + collaborative filtering to generate various types of recommendations.
I’ve worked on e-commerce recs. Typically you would represent each product with a vector. Then finding similar products becomes a nearest-neighbour search over these vectors. Depending on your use case it’s feasible now to do this search in the db using pgvector, or using something like Solr/Elasticsearch, which both support vector search in recent releases. You could also use faiss, one of the many other nearest-neighbour libraries, or a dedicated vector search engine. (Since you are working with Elixir, you might find ExFaiss interesting [1][2][3].)
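If you'd rather keep the search outside the db, a brute-force faiss version is only a few lines. Rough sketch, assuming you already have a matrix of product vectors (the random matrix below is just a stand-in):

    # Sketch: exact nearest-neighbour search with faiss over product vectors.
    import faiss
    import numpy as np

    vectors = np.random.rand(10_000, 384).astype("float32")  # stand-in for real embeddings
    faiss.normalize_L2(vectors)                   # normalise so inner product == cosine similarity
    index = faiss.IndexFlatIP(vectors.shape[1])   # exact search; swap in an ANN index at scale
    index.add(vectors)

    query = vectors[:1]                           # pretend the first product is the query
    scores, neighbour_ids = index.search(query, 10)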
But I would say that for recommendations, searching the vectors is the easy part. The main work in getting good recommendations is generating a good set of product vectors in the first place: the quality of the recommendations is directly related to how you generate the vectors. You could use one of the many open-source language models to generate them, but that approach typically isn’t very good for product recommendations; it will just give you items that are textually similar, which usually doesn’t make for good recommendations.
To get good product recommendations you’d probably want to build a custom embedding that captures some notion of product similarity during training, using signals you get from user behaviour. E.g. products clicked in the same session, or added to the cart at the same time, give a signal on product similarity that you can use to train a product embedding for recommendations.
This is a bit more involved, but the main work is in generating the training data. Once you have that you can use open-source tools such as fasttext [4] to learn the embedding and output the product vectors. (Or if you want to avoid training your own embedding, I’d guess there are services that will take your interaction data and generate product vectors from it, but I’m not familiar with any.)
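Rough sketch of that idea with fasttext, treating each session as a "sentence" of product IDs (the session data and ID format here are made up):

    # Sketch: learn product vectors from co-occurrence within user sessions.
    import fasttext

    sessions = [
        ["prod_101", "prod_205", "prod_101"],   # products clicked in one session
        ["prod_205", "prod_330"],
        ["prod_101", "prod_330", "prod_412"],
    ]
    with open("sessions.txt", "w") as f:
        for session in sessions:
            f.write(" ".join(session) + "\n")

    # minn/maxn=0 disables subword n-grams, since product IDs aren't real words
    model = fasttext.train_unsupervised(
        "sessions.txt", model="skipgram", dim=64, minCount=1, minn=0, maxn=0
    )
    vector_101 = model.get_word_vector("prod_101")  # these can go into pgvector as usual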
Unfortunately, this is not available yet in AWS, Azure or GCP, where most PostgreSQL installs are happening thanks to integrated monitoring and one-click replication, backup/restore, scaling, etc.
I looked and DO doesn't offer this extension. Then I tried their "ideas" app, where users can vote for new ideas/suggestions, and the "pgvector extension" idea URL [1] is full of spam. Looks like they don't care anymore.
Are any of the extensions big competitors with their managed services? For example Timescale on AWS: I might not bother with Redshift depending on scale and needs. I could be wrong, but it seems like business pressure will prevent the spread of these most useful extensions on first-class managed Postgres.
As an aside, I question Timescale's use case. It’s highly niche: you need to store arrays of data at >>1k entries per day, you can’t use batch export of the data, and the export process is largely concerned with only recent data (i.e. it doesn’t need to incorporate future modifications of older entries).
One year's worth of 4-byte ints at 1-minute intervals is only ~2MB uncompressed. Do you really need second-level or sub-second resolution in the db? Because 2MB chunks (realistically much less) are very cheap: 2MB fits in 250 8KB pages, and many join + LIMIT 25 queries on unclustered tables use about that much. It’s so, so cheap to process.
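Back-of-the-envelope for those numbers:

    minutes_per_year = 365 * 24 * 60              # 525,600 samples at 1-minute resolution
    bytes_uncompressed = minutes_per_year * 4     # ~2.1 MB of 4-byte ints
    pages_8kb = bytes_uncompressed / 8192         # ~257 pages, i.e. the "250-ish" 8KB pages above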
Last I checked, Timescale also doesn’t make it easy to combine data at different resolutions. If you store 10ms, 1-minute and 1-day resolutions, you need to write 3 queries and combine them to get a real-time aggregate.
I hope I get owned and learn some stuff. So can you share more about your use case? It’d be interesting to see what is getting you interested in timescale.
The true killer feature in my book is the compression: I get ~93% for my largest datasets, which makes disk space requirements far more sane and dramatically speeds big analytical queries. You’re right that if you have a small (<1GB) table, Timescale is overkill, but shrinking a 70GB table to 5GB is quite a big deal if you’re trying to run on a budget VPS.
Continuous aggs and the job daemon are really nice bonuses too.
PG arrays are also compressed on disk, so it may not be all that different from Timescale. So yeah, if you store data in ts|id|value row format it’s 70GB, but in an id|values format the values array is all compressed. I’ve seen a 40x disk-size reduction in practice using this method.
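Sketch of the two layouts I mean (table and column names invented):

    # Sketch: row-per-sample vs. array-per-series layouts in plain Postgres.
    import psycopg2

    conn = psycopg2.connect("dbname=metrics")  # placeholder connection string
    with conn.cursor() as cur:
        # ts|id|value: one row per sample, lots of per-row overhead, poor compression.
        cur.execute("""
            CREATE TABLE samples_rows (
                ts        timestamptz NOT NULL,
                series_id bigint      NOT NULL,
                value     real        NOT NULL
            )
        """)
        # id|values: one row per series per day; the large real[] column gets
        # TOASTed and compressed once the row grows past ~2KB.
        cur.execute("""
            CREATE TABLE samples_arrays (
                series_id bigint NOT NULL,
                day       date   NOT NULL,
                vals      real[] NOT NULL,
                PRIMARY KEY (series_id, day)
            )
        """)
    conn.commit()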
I’m surprised by that 40x - I remember benchmarking TOAST compression and being very underwhelmed by the savings (I believe it took me from ~60GB to ~20GB).
What sorta impact does that schema have on insert performance? I would expect the DB to have to rewrite the entire array on every update, though that cost could be mitigated by chunking the arrays.
Are you ditching timestamps and assuming data is evenly spaced?
In the real-life situation where I tried this, the keys were very wide: compound, with text keywords. That's where most of the savings came from, since before, each row was mostly key and not very much actual associated data. Sorry to mislead you about TOAST compression.
It was daily data too, literally 10 or more orders of magnitude less ingestion than Timescale is built for, so the arrays fit inside 1-2 Postgres pages and writing was absolutely not a problem, even for 5-10 years of data.
Timescale may be the right solution when you need, I quote from their docs, "Insert rates of hundreds of thousands of writes per second", or somewhere within a few orders of magnitude of that. Good for the niche, but the niche is uncommon.
Yes, ditching timestamps. The full row configuration used Amazon keywords as keys, so rows looked something like start_date|end_date|company_id|keyword_match_type|keyword_text|total_sales_by_day. The match_type and keyword_text were like 40 bytes per row, so that's where the huge savings came from.
The data was assumed to be contiguous, i.e. there are end_date - start_date + 1 entries in the total_sales_by_day array.
If this data were in Timescale, the giant keys would be compressed on disk, but I believe (would need to check) that there would be a lot of noise in memory/caches/processing after decompressing while processing the rows.
Anyway, in conclusion, I do think Timescale has its niche uses, but I've seen a lot of people think they have time series data and need Timescale when they really just have data that is pre-aggregated on a daily or even hourly basis. For these situations Timescale is overkill.
Yeah, fully agreed with the overall thesis. Postgres is quite capable straight out of the box.
I really like the idea of TOAST compression as an archival format too for big aggregates - I’ll have to check out the performance of a (grouping_id, date, length_24_array_of_hourly_averages) schema next time I get an excuse.
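Rough sketch of what reading that back per hour could look like (hourly_aggs and the column names are just placeholders for that schema):

    # Sketch: expand the 24-entry hourly array back into one row per hour.
    import psycopg2

    conn = psycopg2.connect("dbname=metrics")  # placeholder connection string
    with conn.cursor() as cur:
        cur.execute("""
            SELECT grouping_id,
                   date + make_interval(hours => ord::int - 1) AS hour_start,
                   avg_value
              FROM hourly_aggs,
                   unnest(hourly_averages) WITH ORDINALITY AS u(avg_value, ord)
        """)
        rows = cur.fetchall()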
So this could be used to find similar images in a database?
For now I've tested a sort of naive method where I convert all images to 64x64 black and white and use a simple Levenshtein distance. It's not efficient, but for a not-too-large dataset it works.
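Roughly, the naive version looks like this (file names and the threshold are placeholders):

    # Sketch: downscale to 64x64, threshold to black/white, compare pixel strings.
    from PIL import Image
    import Levenshtein  # pip install python-Levenshtein

    def signature(path):
        img = Image.open(path).convert("L").resize((64, 64))
        # threshold each pixel so the image becomes a 4096-character '0'/'1' string
        return "".join("1" if p > 128 else "0" for p in img.getdata())

    dist = Levenshtein.distance(signature("a.jpg"), signature("b.jpg"))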
It is not only about visual similarity, but the semantics of the images as well (for example, all the dog photos should be close to each other, no matter the colours or the scene).
The simplest way to do that is probably to use one of the pretrained neural networks (like resnet), convert the images into embeddings, index them and use for search.
Have you looked at CLIP? You can use that to create a vector embedding for an image that includes semantic information (what's actually in the image - animals, colours, etc) - those could then be used with pgvector to find similar images.
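Rough sketch of that, using one of the public CLIP checkpoints via sentence-transformers (the model name and files are just examples):

    # Sketch: CLIP embeddings put images and text in the same vector space,
    # so you can search images by image or by text, then index with pgvector.
    from PIL import Image
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("clip-ViT-B-32")            # 512-dim output
    image_embedding = model.encode(Image.open("photo.jpg"))
    text_embedding = model.encode("a photo of a dog")       # comparable to image_embedding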
"Storing OpenAI embeddings in Postgres with pgvector"
https://supabase.com/blog/openai-embeddings-postgres-vector
Discussion: https://news.ycombinator.com/item?id=34684593
And this one:
"What's Postgres Got To Do With AI?"
https://www.crunchydata.com/blog/whats-postgres-got-to-do-wi...
Discussion: https://news.ycombinator.com/item?id=34903454