
I also work in LLM evaluation. My cynical take is that nobody is really using LLMs for stuff, and so benchmarks are mostly just made-up tasks (coding is probably the exception). If we had real, specific use cases, it would be easier to benchmark and know if one model is better, but it’s mostly all hypothetical.

The more generous take is that you can’t benchmark advanced intelligence very well, whether LLM or person. We don’t have good procedures for assessing a person's fitness for a purpose, e.g. for a job, and certainly no standardized question sets. Why would we expect to be able to do this with AI?

I think both of these takes are present to some extent in reality.



Do you not have massive volumes of customer queries to extract patterns of what people are actually doing?

We struggle a bit with processing and extracting this kind of insight in a privacy-friendly way, but there’s certainly a lot of data.


We have 20+ services in prod that use LLMs. So I have 50k (or more) data points per service per day to evaluate. The question is: do people actually evaluate properly?

And how do you do an apples to apples evaluation of such squishy services?
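Not from the thread, but one common pattern for making that comparison apples to apples is to run both services on the same sampled slice of production queries and have a judge (human raters, or an LLM-as-judge prompt) pick the better response per query. Here is a minimal Python sketch under those assumptions; call_service and judge are hypothetical placeholders, not real APIs.

    import random
    from collections import Counter

    # Hypothetical stand-ins: call_service() would hit one of the prod services,
    # and judge() would be a human rater or an LLM-as-judge scoring against a
    # task-specific rubric, returning "a", "b", or "tie".
    def call_service(service: str, query: str) -> str: ...
    def judge(query: str, response_a: str, response_b: str) -> str: ...

    def pairwise_eval(queries: list[str], service_a: str, service_b: str,
                      sample_size: int = 500, seed: int = 0) -> Counter:
        """Compare two services on the same sampled production queries,
        so wins and losses are counted over identical traffic."""
        rng = random.Random(seed)
        sample = rng.sample(queries, min(sample_size, len(queries)))
        tallies = Counter()
        for q in sample:
            a = call_service(service_a, q)
            b = call_service(service_b, q)
            # Randomize presentation order to reduce judge position bias.
            if rng.random() < 0.5:
                verdict = judge(q, a, b)
            else:
                flipped = judge(q, b, a)
                verdict = {"a": "b", "b": "a", "tie": "tie"}[flipped]
            tallies[verdict] += 1
        return tallies

Randomizing which response is shown first mitigates position bias in the judge, and fixing the seed keeps the sample comparable across runs; win rates per service come straight out of the returned Counter.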


You could have the world expert debate the thing. Someone who can be accused of knowing things. We have many such humans, at least as many as topics.

Publish the debate as-is so that others vaguely familiar with the topic can also be awed or disgusted.

We have many gradients of emotion. No need to try to quantify them. Just repeat the exercise.



