> A more robust approach would be to give the whole reasoning to an LLM and ask to grade according to a given criterion

We actually use a variant of this approach in our reasoning prompts. We use structured output to force the LLM to think for 15 steps, and at each step we force it to generate a self-assessed score and then decide whether to CONTINUE, ADJUST, or BACKTRACK.

  - Evaluate quality with reward scores (0.0-1.0)
  - Guide next steps based on rewards:
    • 0.8+ → CONTINUE current approach
    • 0.5-0.7 → ADJUST with minor changes
    • Below 0.5 → BACKTRACK and try different approach
I go into a bit more depth about it here, with an explicit example of its thinking at the end: https://bits.logic.inc/p/the-eagles-will-win-super-bowl-lix
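
Concretely, the structured output looks roughly like the sketch below (a minimal illustration using Pydantic; the field names, constraints, and the commented-out API call are illustrative assumptions, not our exact production schema):

  from enum import Enum
  from typing import List

  from pydantic import BaseModel, Field


  class Decision(str, Enum):
      CONTINUE = "CONTINUE"    # reward 0.8+      -> keep the current approach
      ADJUST = "ADJUST"        # reward 0.5-0.7   -> make minor changes
      BACKTRACK = "BACKTRACK"  # reward below 0.5 -> try a different approach


  class ReasoningStep(BaseModel):
      thought: str = Field(description="The reasoning carried out in this step")
      reward: float = Field(ge=0.0, le=1.0, description="Self-assessed quality score")
      decision: Decision


  class Reasoning(BaseModel):
      # Force exactly 15 steps; strict structured-output modes that don't support
      # length constraints need the count enforced in the prompt and re-checked
      # after parsing instead.
      steps: List[ReasoningStep] = Field(min_length=15, max_length=15)
      answer: str


  # Example usage with an OpenAI-style structured-output call (an assumption,
  # not necessarily the stack described in the post):
  # completion = client.beta.chat.completions.parse(
  #     model="...", messages=messages, response_format=Reasoning
  # )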



Every time I see these kinds of prompts that ask an LLM for a numeric ranking, I'm very skeptical that the numbers really mean anything to the model. How does it know what a 0.5 is supposed to be? With humans, you'd have them grade things and then correct the grades so they learn from experience what each score means. But unless you specifically fine-tune your LLM, this wouldn't apply.


I went through this with gemini-1.5, using it to evaluate responses. Almost everything was graded 8-9/10. To get useful results I did the following: 1. Created a long few-shot prompt with many examples of human-graded results. 2. Prompted it to write its review before its assessment. 3. Prompted it to include example quotes to justify its assessment. 4. Finally, had it produce a numeric score.

With gemini-2 I've been able to get similar results without the few-shot prompts, simply by prompting it not to be a sycophant, explaining why it was important to get realistic, even harsh scores, and that I expected most scores to be low in order for the high-scoring content to stand out.

In a recent test, I changed to using word scores: low, medium, high, and very high. Out of about 500 examples, none scored very high. I thought that was pretty cool, as when I do find one scoring high it will stand out and hopefully justify its score.
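
A rough sketch of the kind of prompt I mean (the wording, labels, and function names here are illustrative, not my exact prompt):

  # Grading prompt: review first, quoted evidence second, score last,
  # with an explicit anti-sycophancy instruction.
  GRADING_PROMPT = """You are evaluating a response against the criteria below.

  Do not be a sycophant. Realistic, even harsh scores are important: most
  responses should score low so that genuinely strong ones stand out.

  Criteria: {criteria}

  Response to evaluate:
  {response}

  Work in this order:
  1. Write a short critical review of the response.
  2. Quote specific passages that justify your judgement.
  3. Only then give a final score: one of low, medium, high, very high.
  """

  def build_grading_prompt(criteria: str, response: str) -> str:
      return GRADING_PROMPT.format(criteria=criteria, response=response)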


Yes, you are right.

If we ask an LLM to grade something, we must create a prompt with good instructions. Otherwise, we will have no idea what 0.5 means or whether it is given consistently.

(A rule of thumb: Is it likely that various people, not knowing the context of a given task, will give the same grade?)

The most robust approach is to ask it to rank things within a single task. That is, "given these blog post titles, grade them according to (criteria)" rather than asking about each title separately.
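
A minimal sketch of what grading within a task can look like, with all candidates in one prompt so the scores are relative to each other (the function name and wording are illustrative):

  def build_ranking_prompt(titles: list[str], criteria: str) -> str:
      # Present every candidate together so the model grades them relative
      # to one another rather than on an unanchored absolute scale.
      numbered = "\n".join(f"{i + 1}. {t}" for i, t in enumerate(titles))
      return (
          "Given the blog post titles below, grade each one from 0.0 to 1.0 "
          f"according to this criterion: {criteria}\n"
          "Grade them relative to one another, then return the titles ranked "
          "from best to worst with their grades.\n\n"
          f"{numbered}"
      )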



