
From looking at this, I think it's a risky mental model for an engineer to start from.

Claims like "clustered by meaning" and "optimized for analytics" are questionable.

The clustering depends on the embedding you calculate. If you think the embedding is a good semantic approximation of the data, then maybe this is a fine way of thinking about it. But it's not hard to imagine embeddings that violate this. E.g., if I run an audio file and a text file that are identical in meaning through the same embedding process, then unless the model is multimodal they will likely end up distant in the embedding vector space.
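
To make that concrete, here's a toy sketch (the 4-d vectors are made up for illustration, not the output of any real model):

    import numpy as np

    def cosine_similarity(a, b):
        # 1.0 = pointing the same way, ~0.0 = unrelated directions
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Pretend these came from a text encoder and an audio encoder
    # that were never trained to share a vector space.
    text_vec  = np.array([0.9, 0.1, 0.0, 0.1])  # "dog barking", as text
    audio_vec = np.array([0.1, 0.0, 0.9, 0.2])  # a recording of a dog barking

    print(cosine_similarity(text_vec, audio_vec))  # low, despite identical meaning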

I fully expect to see embeddings that put things close together in the vector space based on utilization rather than semantic similarity. If I'm building a recommender system, I don't want to group different varieties of one-off purchases closely. For instance, the most semantically similar item to a flight is another flight to the same destination at a different time, or a flight to a nearby airport. But what I'd actually want to group with it are hotels often purchased by people who have previously bought that flight.
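
One way to get that behavior (a sketch only; the item IDs and sessions are hypothetical) is to train a word2vec-style model over purchase sessions instead of sentences, so frequently co-purchased items land near each other regardless of semantic similarity:

    from gensim.models import Word2Vec

    # Each "sentence" is one user's purchase session.
    sessions = [
        ["flight_JFK_LAX", "hotel_downtown_LA", "car_rental_LAX"],
        ["flight_JFK_LAX", "hotel_santa_monica"],
        ["flight_JFK_SFO", "hotel_union_square"],
        ["flight_JFK_LAX", "hotel_downtown_LA"],
    ]

    # Skip-gram: items that co-occur in sessions get nearby vectors.
    model = Word2Vec(sentences=sessions, vector_size=16, window=5,
                     min_count=1, sg=1, epochs=50, seed=42)

    # With enough real data, the flight's neighbors become the hotels
    # people book alongside it, not other flights.
    print(model.wv.most_similar("flight_JFK_LAX", topn=3))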

Vector databases also let you encode extra dimensions into the data, like time awareness. Nothing forces you to use a vector that encodes only semantic meaning.
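
For example, you could bolt a recency feature onto an otherwise semantic vector (a sketch; the half-life and weight are arbitrary tuning knobs):

    import numpy as np

    def with_recency(semantic_vec, age_days, half_life_days=30.0, weight=0.5):
        # Exponential decay from 1.0 (brand new) toward 0.0 (ancient).
        recency = np.exp(-age_days / half_life_days)
        return np.concatenate([semantic_vec, [weight * recency]])

    doc_new = with_recency(np.array([0.2, 0.7, 0.1]), age_days=1)
    doc_old = with_recency(np.array([0.2, 0.7, 0.1]), age_days=365)
    # Identical semantic content, but the extra dimension now
    # separates them by age in any distance-based search.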

And from this, you can see that they're optimized for lookups or searches based on an input vector. That is not analogous to OLAP queries; it's more akin to Elasticsearch than Snowflake. If you're reaching for a vector database expecting reporting or large-scale analytics over the vector space, AFAIK there isn't a readily available offering.
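
The access pattern looks like this (using FAISS here as one representative nearest-neighbor library; the data is a random placeholder):

    import numpy as np
    import faiss

    d = 64                                    # embedding dimensionality
    corpus = np.random.rand(10_000, d).astype("float32")

    index = faiss.IndexFlatL2(d)              # exact L2 nearest-neighbor index
    index.add(corpus)

    query = np.random.rand(1, d).astype("float32")
    distances, ids = index.search(query, 5)   # "the 5 nearest vectors to this one"
    print(ids)  # point lookups by similarity, not aggregate scans over a table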




Calculating the embeddings is still a mystery to me. I get going from a picture of an apple to a vector representing "appleness", and then comparing that vector to other vectors using all the usual math. What I don't get is: who/what takes the image as input and outputs the vector? Same goes for documents. Let's say I want to add a dimension (another number in the array): what part of the vector database do I modify to include this dimension in the vector calculation? Or is going from doc/image/whatever to the vector representation done outside the database in some other way?

edit: it seems like calculating embeddings would be something an ML model does, but then, again, you have to train that one first... it's training all the way down.


Yup, it happens outside of the system, but there are a number of perks to being able to store that data in a DB, including easily adding metadata, updating entries, etc.
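
The typical flow looks roughly like this (a minimal sketch; sentence-transformers and Chroma are just example choices of embedder and store, and the metadata fields are invented):

    from sentence_transformers import SentenceTransformer
    import chromadb

    # Step 1: the embedding is computed OUTSIDE the database, by a model you choose.
    model = SentenceTransformer("all-MiniLM-L6-v2")  # a common text embedder
    vec = model.encode("red trail-running shoe")     # -> 384-d vector

    # Step 2: the database just stores the vector plus whatever metadata you like.
    client = chromadb.Client()
    products = client.create_collection(name="products")
    products.add(
        ids=["sku-123"],
        embeddings=[vec.tolist()],
        documents=["red trail-running shoe"],
        metadatas=[{"price": 89.99, "in_stock": True}],  # hypothetical fields
    )

    # Step 3: queries embed the input with the SAME model, then search stored vectors.
    hits = products.query(
        query_embeddings=[model.encode("waterproof sneakers").tolist()],
        n_results=1,
    )
    print(hits["metadatas"])

So to the "add a dimension" question upthread: you'd change (or retrain) the model, or append features yourself before inserting; the database indexes whatever vectors it's handed.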

I think in 10 years we will see retail systems heavily utilizing vector DBs, and many embedding-as-a-service products that take into account things like conversion. In this model you can add metadata about products to the vector DB and direct program flow from it, instead of querying back out to one or more databases to retrieve relevant metadata.

They'll also start to enable things like search-via-image, for features like "show us your favorite outfit" pulling up a customized wardrobe based on individual items extracted from the photo and run through the embedder.

Just one of many ways these products will exist outside of RAG (retrieval-augmented generation). I think we'll actually see a lot of the opposite: GAR, generation-augmented retrieval.



