I also work in LLM evaluation. My cynical take is that nobody is really using LLMs for much yet, and so benchmarks are mostly just made-up tasks (coding is probably the exception). If we had real, specific use cases it would be easier to benchmark them and know whether one model is better, but it's mostly all hypothetical.
The more generous take is that you can't benchmark advanced intelligence very well, whether in an LLM or a person. We don't have good procedures for assessing a person's fitness for a purpose, e.g. for a job, and certainly not standardized question sets. Why would we expect to be able to do this with AI?
I think both of these takes are present to some extent in reality.
We have 20+ services in prod that use LLMs, so I have 50k (or more) data points per service per day to evaluate. The question is: do people actually evaluate properly?
And how do you do an apples-to-apples evaluation of such squishy services?
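One practical answer at that volume is to sample a fixed-size slice of each service's daily traffic and score it against a rubric, so review cost stays constant as traffic grows. Here's a minimal sketch of that idea; the log path, the JSONL record shape ("input"/"output" fields), and the scorer are all assumptions for illustration, not anyone's real pipeline. In practice the score() step would be a human rubric or an LLM-as-judge call.

```python
import json
import random

def load_logs(path):
    """Load one JSON object per line. Hypothetical format:
    each record has "input" and "output" fields."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def sample_for_review(records, k=200, seed=0):
    """Draw a fixed-size random sample so daily review cost
    stays flat regardless of traffic volume. A fixed seed
    makes the sample reproducible for audits."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))

def score(record):
    """Placeholder scorer, purely for illustration: passes if
    the output is non-empty. Swap in a rubric or judge model."""
    return 1.0 if record.get("output", "").strip() else 0.0

if __name__ == "__main__":
    # Hypothetical log file name; not a real path.
    logs = load_logs("service_a_2024-06-01.jsonl")
    sample = sample_for_review(logs)
    rate = sum(score(r) for r in sample) / len(sample)
    print(f"pass rate on {len(sample)} sampled records: {rate:.2%}")
```

The apples-to-apples problem is harder: if two services do different jobs, a shared score means little. The closest you get is holding the sampling procedure and rubric structure constant across services and comparing each service only against its own history.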