I also work in LLM evaluation. My cynical take is that nobody is really using LLMs for much yet, and so benchmarks are mostly just made-up tasks (coding is probably the exception). If we had real, specific use cases it would be easier to benchmark them and know whether one model is better, but it's mostly all hypothetical.
The more generous take is that you can't benchmark advanced intelligence very well, whether in an LLM or a person. We don't have good procedures for assessing a person's fitness for a purpose, e.g. for a job, and certainly not standardized question sets. Why would we expect to be able to do this with AI?
I think both of these takes are present to some extent in reality.
We have 20+ services in prod that use LLMs, so I have 50k (or more) data points per service per day to evaluate. The question is: do people actually evaluate properly?
And how do you do an apples-to-apples evaluation of such squishy services?
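One practical answer at that volume is to sample a fixed-size slice of each service's daily traffic and score it against a rubric, so review cost stays constant as traffic grows. Here's a minimal sketch of that idea; the log path, the JSONL record shape ("input"/"output" fields), and the scorer are all assumptions for illustration, not anyone's real pipeline. In practice the score() step would be a human rubric or an LLM-as-judge call.

```python
import json
import random

def load_logs(path):
    """Load one JSON object per line. Hypothetical format:
    each record has "input" and "output" fields."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def sample_for_review(records, k=200, seed=0):
    """Draw a fixed-size random sample so daily review cost
    stays flat regardless of traffic volume. A fixed seed
    makes the sample reproducible for audits."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))

def score(record):
    """Placeholder scorer, purely for illustration: passes if
    the output is non-empty. Swap in a rubric or judge model."""
    return 1.0 if record.get("output", "").strip() else 0.0

if __name__ == "__main__":
    # Hypothetical log file name; not a real path.
    logs = load_logs("service_a_2024-06-01.jsonl")
    sample = sample_for_review(logs)
    rate = sum(score(r) for r in sample) / len(sample)
    print(f"pass rate on {len(sample)} sampled records: {rate:.2%}")
```

The apples-to-apples problem is harder: if two services do different jobs, a shared score means little. The closest you get is holding the sampling procedure and rubric structure constant across services and comparing each service only against its own history.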