OpenAI/evals > Building an eval: https://github.com/openai/evals/blob/main/docs/...

"Robustness of Model-Graded Evaluations and Automated Interpretability" (2023) https://www.lesswrong.com/posts/ZbjyCuqpwCMMND4fv/robustness... :

> The results inspire future work and should caution against unqualified trust in evaluations and automated interpretability.

From https://news.ycombinator.com/item?id=37451534 : additional benchmarks: TheoremQA, LegalBench