Also, using an OpenAI model to judge the performance of an OpenAI model seems pr...

LauraMedia · 2025-08-08T07:54:59 1754639699

Am I missing something? If LLM-1 is supposed to judge LLM-2, doesn't LLM-1 have to be better than LLM-2? If LLM-1 is only 40% as good at coding as LLM-2, why would you trust the LLM with the lesser knowledge?

BlindEyeHalo · 2025-08-08T08:02:19 1754640139

At the heart of the P vs NP problem lies the observation that solution verification seems to be much easier than solution generation. Whether that applies in this context is another question, but I think it is not unreasonable to assume that the judge can be less powerful than the performer.

Or in other words, I don't need to be a chef myself to decide if a meal is good or not.

rowanG077 · 2025-08-08T08:31:15 1754641875

That really doesn't hold for all problems. You can imagine any number of problems where a valid solution is easier, complexity-wise, to generate than it is to validate. A trivial example is semiprime factorization: it is easy to generate a semiprime but hard to factor one.
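The asymmetry in the semiprime example can be sketched in a few lines. This is only an illustration under stated assumptions: `is_probable_prime`, `random_prime`, and `make_semiprime` are hypothetical helper names, and the primality test is a probabilistic Miller-Rabin, so the result is a semiprime with very high probability rather than with certainty.

```python
import random


def is_probable_prime(n: int, rounds: int = 40) -> bool:
    """Miller-Rabin test: probabilistic, false positives are extremely unlikely."""
    if n < 2:
        return False
    # Quick exit for small primes and their multiples.
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # a is a witness that n is composite
    return True


def random_prime(bits: int) -> int:
    """Draw random odd numbers with the top bit set until one passes the test."""
    while True:
        candidate = random.getrandbits(bits) | (1 << (bits - 1)) | 1
        if is_probable_prime(candidate):
            return candidate


def make_semiprime(bits_per_factor: int = 256) -> tuple[int, int, int]:
    """Return (n, p, q) with n = p * q. Cheap for whoever picks p and q."""
    p = random_prime(bits_per_factor)
    q = random_prime(bits_per_factor)
    return p * q, p, q


n, p, q = make_semiprime()
print(n)  # a product of two 256-bit probable primes
```

Whoever generated `n` can confirm it with a single multiplication because they hold `p` and `q`. A judge that sees only `n` has no comparably cheap check.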

BlindEyeHalo · 2025-08-08T10:10:39 1754647839

Sure, it was never my intention to make it seem like a general statement, just highlighting that there is a large class of problems for which it is true.

As you point out, there are many problems that are in higher complexity classes than NP.

mcphage · 2025-08-08T12:46:42 1754657202

> That really doesn't hold for all problems.

But it does hold for this problem.

rowanG077 · 2025-08-08T13:07:58 1754658478

How so? Asking LLMs to solve a problem can be a problem of any form. For example I just asked this.

Can you give me a very large semiprime?

And Claude Opus answered:

Here's a very large semiprime:

N = 29927402397991286489627837734179186385188296382227646249397073654051914085318503794952624411151858464246403027505634195232053330357484129331920822220662818816547063469215394303721576869467659309978113411955550111870966028627418736664

This is a over 200-digit semiprime. Factoring semiprimes of this size is computationally intensive, which is why they form the basis of RSA encryption security.

---

Verifying whether this answer is correct is very hard, much harder than generating it.

Problems of this form come up very often, and not only in formal mathematics: some magic number in the code that you need to reverse engineer to tell it's correct, some library whose documentation you no longer have but that was available when the code was written, hidden intentions or even requirements that are not clear from the code itself. If a weaker LLM is validating a stronger LLM, the weaker LLM will simply not grasp the subtleties the stronger LLM created in its answer. In fact, it's a pretty common statement that writing code is easier than reading it, which is precisely about generation vs validation.

david_allison · 2025-08-08T13:52:23 1754661143

> Factoring semiprimes of this size is computationally intensive, which is why they form the basis of RSA encryption security.

Not if it's divisible by 2.

    from sympy import isprime
    num = 29927402397991286489627837734179186385188296382227646249397073654051914085318503794952624411151858464246403027505634195232053330357484129331920822220662818816547063469215394303721576869467659309978113411955550111870966028627418736664
    print(num//2) # 14963701198995643244813918867089593192594148191113823124698536827025957042659251897476312205575929232123201513752817097616026665178742064665960411110331409408273531734607697151860788434733829654989056705977775055935483014313709368332
    print(isprime(num//2)) # False

rowanG077 · 2025-08-08T14:25:58 1754663158

Indeed, that works for that case. But you can prompt it yourself; it will not always generate natural numbers that are easy to validate with such shortcuts. So I don't think it invalidates the point I'm making.
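The divisible-by-2 shortcut generalizes into a cheap rejection test, roughly sketched below. The name `cheap_semiprime_check` and the trial-division bound are assumptions for illustration, not anything from a library, and the primality test is a probabilistic Miller-Rabin. The sketch settles the easy cases and returns `None` when no small factor exists, which is the situation where a judge is stuck.

```python
import random
from typing import Optional


def is_probable_prime(n: int, rounds: int = 40) -> bool:
    """Miller-Rabin test: probabilistic, false positives are extremely unlikely."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # a is a witness that n is composite
    return True


def cheap_semiprime_check(n: int, bound: int = 10_000) -> Optional[bool]:
    """Return True/False when a cheap check settles it, None when undecided."""
    if n < 4 or is_probable_prime(n):
        return False  # too small or (probably) prime: not a product of two primes
    for p in range(2, bound):
        if n % p == 0:
            # The smallest factor p is prime, so n is a semiprime
            # exactly when the cofactor n // p is prime too.
            return is_probable_prime(n // p)
    return None  # no small factor: deciding would require real factoring


print(cheap_semiprime_check(15))                           # True (3 * 5)
print(cheap_semiprime_check(2 ** 10))                      # False (cofactor 512 is composite)
print(cheap_semiprime_check((2**61 - 1) * (2**89 - 1)))    # None (two large Mersenne primes)
```

On the number quoted earlier in the thread, this returns `False` as soon as it finds the factor 2, because the cofactor is still even. A product of two large primes gets `None`, so the shortcut never confirms a well-formed answer and only catches sloppy ones.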

jama211 · 2025-08-08T08:42:01 1754642521

Pretty sure they know that; their point still stands.

torginus · 2025-08-08T12:58:33 1754657913

It's a bit different for reasoning LLMs - they operate in a feedback loop, measuring the quality of the solution and iterating on it until either the quality meets a desired threshold, or all reasoning effort is expended.

This can correct for generation errors, but cannot correct for quality measurement errors, so the question is valid.

cubefox · 2025-08-08T09:56:01 1754646961

It's usually easier to create a false statement than to check whether it's false.

stingraycharles · 2025-08-08T10:39:27 1754649567

At least use something like Zen MCP’s Consensus tool to gain a consensus around a large variety of models.

mirekrusin · 2025-08-08T07:53:19 1754639599

Exactly, they should at least compare against judges drawn from the best models of other vendors, ideally verified against human review, ground truth, or tests.