
OpenAI/evals > Building an eval: https://github.com/openai/evals/blob/main/docs/build-eval.md

"Robustness of Model-Graded Evaluations and Automated Interpretability" (2023) https://www.lesswrong.com/posts/ZbjyCuqpwCMMND4fv/robustness... :

> The results inspire future work and should caution against unqualified trust in evaluations and automated interpretability.

From https://news.ycombinator.com/item?id=37451534 : additional benchmarks: TheoremQA, LegalBench



