The point is to bootstrap self-improving AI. Once a measurement becomes a goal, ...

riku_iki · 2025-04-03T18:22:32 1743704552

> If f(paper) -> code is the weakest part of the chain, it makes sense to target that.

my point is that LLMs are already potentially seeing solution on github, so you can't use that benchmark as metric unless there is some explanation.

kelseyfrog · 2025-04-03T18:46:40 1743706000

How does that work with knowledge cutoff?

riku_iki · 2025-04-03T18:51:07 1743706267

It could work with knowledge cut off if they can reliably guarantee it, and also make sure LLMs are not searching github under the surface.

kelseyfrog · 2025-04-03T19:01:32 1743706892

What's the likelihood that the researchers have done this? It seems fairly easy.

riku_iki · 2025-04-03T19:04:08 1743707048

I honestly have no idea how OAI researchers can guarantee cut off date for Antropic models for example.