Great question. I can talk about how we handle the more challenging query: "List all YC founders that worked at Google and now have an AI startup."
For this we have a target dataset (the YC company directory) with around 100 questions over it. We have found that when we feed an entire company listing in along with a single question, we get an accurate single answer (the needle-in-a-haystack problem).
So to build our evaluation dataset, we feed each question paired with each sample into the cheapest LLM we can find that reliably handles the job, then aggregate the results.
This is not perfect, but it gives us a way to benchmark our knowledge graph construction and querying strategy so we can tune the system ourselves.
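Roughly, the generation step looks like this. A minimal sketch, not our production code: `call_llm`, the listing fields, and the prompt wording are all placeholders you'd swap for your own provider and schema.

```python
import json
from itertools import product

def call_llm(prompt: str) -> str:
    """Hypothetical helper: call whatever cheap model reliably handles
    the single-listing task. Wire up your provider's client here."""
    raise NotImplementedError("plug in your LLM provider of choice")

def build_eval_rows(questions: list[str], listings: list[dict]) -> list[dict]:
    """Run every (question, company listing) pair through the LLM and
    collect the single-document answers as ground-truth rows."""
    rows = []
    for question, listing in product(questions, listings):
        prompt = (
            "Answer the question using only the company listing below.\n\n"
            f"Listing:\n{json.dumps(listing)}\n\n"
            f"Question: {question}"
        )
        rows.append({
            "question": question,
            "company": listing.get("name"),
            "answer": call_llm(prompt),
        })
    return rows

def aggregate(rows: list[dict]) -> dict[str, list[dict]]:
    """Group per-listing answers by question, so a cross-document query
    like the Google-founders one becomes the union of the accurate
    single-listing answers."""
    by_question: dict[str, list[dict]] = {}
    for row in rows:
        by_question.setdefault(row["question"], []).append(row)
    return by_question
```

The point of the pair-wise loop is that each call stays inside the regime where the model is reliable (one document, one question); the cross-document answer only ever comes from aggregation, never from a single long-context call.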
OK, so you have a way to evaluate the accuracy and convince yourself that it probably works as expected. But what about me, a user? How can I check that the question I asked was answered correctly?