RLHF works on problems that are difficult to specify yet easy to judge.
While RLHF will help improve systems, code correctness is not easy to judge outside of the simplest cases.
Note how in OpenAI's technical report they admit that performance on college-level tests comes almost exclusively from pre-training. Take the LSAT as an example: all of those questions were probably in the corpus.
>RLHF works on problems that are difficult to specify yet easy to judge.
But that's the thing: everyone here on HN (and elsewhere) seems to find it easy to judge the flaws of AI-generated code, and those critiques seem relatively consistent. So if we start feeding these critiques back as RLHF at scale, we should be able to bring the LLM output up to the level where further feedback is hard (or at least inconsistent) to give, right?
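For concreteness, "offering these critiques as RLHF" usually means collapsing them into pairwise preferences and training a reward model on them, which the policy is then optimized against. Here's a minimal sketch of that preference loss, assuming a PyTorch setup with made-up names and random embeddings standing in for encoded code samples, not any particular lab's pipeline:

    import torch
    import torch.nn as nn

    class RewardModel(nn.Module):
        """Scores a (prompt, completion) embedding with a single scalar."""
        def __init__(self, dim: int = 768):
            super().__init__()
            self.score = nn.Linear(dim, 1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.score(x).squeeze(-1)

    def preference_loss(rm: RewardModel,
                        chosen: torch.Tensor,
                        rejected: torch.Tensor) -> torch.Tensor:
        # A reviewer's critique is reduced to "chosen beats rejected";
        # the Bradley-Terry style loss pushes the reward of the preferred
        # snippet above the flawed one: -log sigmoid(r_chosen - r_rejected).
        return -torch.nn.functional.logsigmoid(rm(chosen) - rm(rejected)).mean()

    # Toy usage: random vectors in place of real (prompt, code) embeddings.
    rm = RewardModel()
    chosen, rejected = torch.randn(4, 768), torch.randn(4, 768)
    loss = preference_loss(rm, chosen, rejected)
    loss.backward()

The catch is the one the parent raises: the reward model is only as good as those pairwise judgments, and for code correctness beyond the simplest cases the judgments get inconsistent fast.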