> A more robust approach would be to give the whole reasoning to an LLM and ask to grade according to a given criterion

We actually use a variant of this approach in our reasoning prompts. We use structured output to force the LLM to think for 15 steps, and at each step we force it to generate a self-assessed score and then decide whether to CONTINUE, ADJUST, or BACKTRACK.

  - Evaluate quality with reward scores (0.0-1.0)
  - Guide next steps based on rewards:
    • 0.8+ → CONTINUE current approach
    • 0.5-0.7 → ADJUST with minor changes
    • Below 0.5 → BACKTRACK and try different approach
I go into a bit more depth about it here, with an explicit example of its thinking at the end: https://bits.logic.inc/p/the-eagles-will-win-super-bowl-lix
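
Concretely, the structured output looks roughly like the sketch below (a minimal illustration using Pydantic; the field names, constraints, and the commented-out API call are illustrative assumptions, not our exact production schema):

  from enum import Enum
  from typing import List

  from pydantic import BaseModel, Field


  class Decision(str, Enum):
      CONTINUE = "CONTINUE"    # reward 0.8+      -> keep the current approach
      ADJUST = "ADJUST"        # reward 0.5-0.7   -> make minor changes
      BACKTRACK = "BACKTRACK"  # reward below 0.5 -> try a different approach


  class ReasoningStep(BaseModel):
      thought: str = Field(description="The reasoning carried out in this step")
      reward: float = Field(ge=0.0, le=1.0, description="Self-assessed quality score")
      decision: Decision


  class Reasoning(BaseModel):
      # Force exactly 15 steps; strict structured-output modes that don't support
      # length constraints need the count enforced in the prompt and re-checked
      # after parsing instead.
      steps: List[ReasoningStep] = Field(min_length=15, max_length=15)
      answer: str


  # Example usage with an OpenAI-style structured-output call (an assumption,
  # not necessarily the stack described in the post):
  # completion = client.beta.chat.completions.parse(
  #     model="...", messages=messages, response_format=Reasoning
  # )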



Every time I see these kinds of prompts that ask an LLM for a numeric ranking, I'm very skeptical that the numbers really mean anything to the model. How does it know what a 0.5 is supposed to be? With humans, you'd have them grade things and then correct the grades so they learn from experience what each score means. But unless you specifically fine-tune your LLM, this wouldn't apply.


I went through this with gemini-1.5, using it to evaluate responses. Almost everything was graded 8-9/10. To get useful results I did the following: 1. Created a long few-shot prompt with many examples of human-graded results. 2. Prompted it to write its review before its assessment. 3. Prompted it to include example quotes to justify its assessment. 4. Finally, had it produce a numeric score.

With gemini-2 I've been able to get similar results without the few-shot prompts, simply by prompting it not to be a sycophant, explaining why it was important to get realistic, even harsh scores, and that I expected most scores to be low in order for the high-scoring content to stand out.

In a recent test, I changed to using word scores: low, medium, high, and very high. Out of about 500 examples, none scored very high. I thought that was pretty cool, as when I do find one scoring high it will stand out and hopefully justify its score.
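
A rough sketch of the kind of prompt I mean (the wording, labels, and function names here are illustrative, not my exact prompt):

  # Grading prompt: review first, quoted evidence second, score last,
  # with an explicit anti-sycophancy instruction.
  GRADING_PROMPT = """You are evaluating a response against the criteria below.

  Do not be a sycophant. Realistic, even harsh scores are important: most
  responses should score low so that genuinely strong ones stand out.

  Criteria: {criteria}

  Response to evaluate:
  {response}

  Work in this order:
  1. Write a short critical review of the response.
  2. Quote specific passages that justify your judgement.
  3. Only then give a final score: one of low, medium, high, very high.
  """

  def build_grading_prompt(criteria: str, response: str) -> str:
      return GRADING_PROMPT.format(criteria=criteria, response=response)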


Yes, you are right.

If we ask an LLM to grade something, we must create a prompt with good instructions. Otherwise, we will have no idea what 0.5 means or whether it is given consistently.

(A rule of thumb: Is it likely that various people, not knowing the context of a given task, will give the same grade?)

The most robust approach is to ask it to rank things within a single task. That is, "given these blog post titles, grade them according to (criteria)" rather than asking about each title separately.
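
A minimal sketch of what grading within a task can look like, with all candidates in one prompt so the scores are relative to each other (the function name and wording are illustrative):

  def build_ranking_prompt(titles: list[str], criteria: str) -> str:
      # Present every candidate together so the model grades them relative
      # to one another rather than on an unanchored absolute scale.
      numbered = "\n".join(f"{i + 1}. {t}" for i, t in enumerate(titles))
      return (
          "Given the blog post titles below, grade each one from 0.0 to 1.0 "
          f"according to this criterion: {criteria}\n"
          "Grade them relative to one another, then return the titles ranked "
          "from best to worst with their grades.\n\n"
          f"{numbered}"
      )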



