
Are there examples of the outputs the LLMs under test generated? I couldn't find any detailed ones in the paper or code.

The result here seems to be "Our judge LLM gave another LLM a 21% grade for some code it generated", which, without seeing the actual outputs, is not qualitatively meaningful to me at all.





