
I will repeat my question from one of the previous threads:

Can someone explain these Aider benchmarks to me? They pass the same 113 tests through the LLM every time. Why do they then extrapolate the ability of the LLM to pass these 113 basic Python challenges to a general ability to produce/edit code? Couldn't an LLM provider just fine-tune their model for these tasks specifically - since they are static - to get ad value?

Did anyone ever try to change the test cases or wiggle the conditions a bit to see if the models still hit the same %?
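
To make that concrete, here is a toy sketch of what "wiggling the conditions" could look like (the task, the names, and the perturbation scheme are all mine, nothing from the actual Aider harness): re-sample the inputs and recompute the expected output from a reference solution, so a memorized answer to the original instance no longer passes.

    import random

    def reference_solution(a: int, b: int) -> int:
        """Known-good implementation of the toy task, used to derive expected outputs."""
        return a + b

    def make_variant(seed: int) -> tuple[str, int]:
        """Build a perturbed test case: fresh operands, expected value recomputed
        from the reference solution instead of hard-coded from the original task."""
        rng = random.Random(seed)
        a, b = rng.randint(10, 99), rng.randint(10, 99)
        expected = reference_solution(a, b)
        prompt = f"Write sum_pair(a, b) such that sum_pair({a}, {b}) == {expected}."
        return prompt, expected

    for seed in range(3):
        print(make_variant(seed))

If a model's pass rate drops sharply on variants like these compared to the published tasks, that would hint at memorization rather than general coding ability.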



Indeed, test data like this constantly leaks into the training data, so these leaderboards are not necessarily representative of real-world problems. A better approach is to use variable evaluation like GSM-Symbolic (for evaluating mathematical reasoning): https://arxiv.org/abs/2410.05229
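
For context, the GSM-Symbolic idea is roughly this: turn each fixed problem into a template whose names and operands are re-sampled, and compute the ground-truth answer from the operands, so memorizing any single instance doesn't help. A toy sketch of that style of evaluation (my own illustration, not code from the paper):

    import random

    TEMPLATE = ("{name} has {a} apples and buys {b} more. "
                "How many apples does {name} have now?")

    def sample_instance(rng: random.Random) -> tuple[str, int]:
        """Instantiate the template with fresh surface details and operands;
        the ground-truth answer is computed from the operands, not hard-coded."""
        name = rng.choice(["Ava", "Noah", "Mia", "Liam"])
        a, b = rng.randint(2, 50), rng.randint(2, 50)
        return TEMPLATE.format(name=name, a=a, b=b), a + b

    rng = random.Random(0)
    for _ in range(3):
        question, answer = sample_instance(rng)
        print(question, "->", answer)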


> Couldn't an LLM provider just fine-tune their model for these tasks specifically - since they are static - to get ad value?

They could. They would easily be found out as they lose in real-world usage or on improved, new, unique benchmarks.

If you were in charge of a large and well-funded model, would you rather pay people to find and "cheat" on LLM benchmarks by training on them, or would you pay people to identify benchmarks and make reasonably sure they specifically get excluded from training data?

I would exclude them as well as possible so that I get feedback on how "real" any model improvement is. I need to deliver real-world improvements in the end, and any short-term gain in usage from cheating on benchmarks seems very foolish.
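
For what it's worth, the standard way to "make reasonably sure" of exclusion is n-gram overlap decontamination: drop (or at least flag) any training document that shares a long enough word n-gram with a benchmark item. A simplified sketch of that check (the window size and examples are illustrative, not any lab's actual pipeline):

    def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
        """Word-level n-grams of a text, lowercased."""
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def is_contaminated(train_doc: str, benchmark_items: list[str], n: int = 8) -> bool:
        """Flag a training document that shares any n-gram with any benchmark item."""
        doc_grams = ngrams(train_doc, n)
        return any(doc_grams & ngrams(item, n) for item in benchmark_items)

    benchmark = ["Write a function that returns the nth Fibonacci number using iteration only."]
    corpus = [
        "Tutorial: write a function that returns the nth Fibonacci number using iteration only.",
        "An unrelated blog post about gardening.",
    ]
    clean_corpus = [doc for doc in corpus if not is_contaminated(doc, benchmark)]
    print(len(clean_corpus))  # 1: the near-verbatim copy of the benchmark item is dropped

Even with a filter like this, paraphrased or translated copies of the tasks can slip through, which is part of why contamination keeps coming up.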


It sounds very nice, but at the same time very naive, sorry. Funding is not a gift, and they must make money. The more funding they get, the more pressure there is to make money.

When you're in charge of a billion-dollar valuation company that is expected to remain unprofitable through 2029, it's hard to find a topic more crucial and intriguing than growth and making more money.

And yes, it is a recurring theme for vendors to tune their products specifically for industry-standard benchmarks. I can't find any specific reason for them not to pay people to train their model to score 90% on these 113 Python tasks, as it directly drives profits up, whereas not doing it brings absolutely nothing to the table - surely they have their own internal benchmarks which they can exclude from training data.


> If you were in charge of a large and well-funded model, would you rather pay people to find and "cheat" on LLM benchmarks by training on them, or would you pay people to identify benchmarks and make reasonably sure they specifically get excluded from training data?

You should already know by now that economic incentives are not always aligned with science/knowledge...

This is the true alignment problem, not the AI alignment one hahaha


The AI alignment problem and the people alignment problem are actually the same problem! :D

One is just a bit harder due to the less familiar mind "design".


They cannot be found out as long as there is no better evaluation. Sure, they would be found out if they produced obvious nonsense, but the point of a systematic evaluation is exactly to overcome subjective impressions based on individual examples as a measure of quality.

Also, you are right that excluding test data from the training data helps you improve your model. However, given the insane amounts of training data, this requires significant effort. If that additionally leads to your model performing worse on existing leaderboards, I doubt that (commercial) organizations would pay for such an effort.

And again, as long as there is no better evaluation method, you still won't know how much it really helps.


This market is all about hype and mindshare; proper testing is hard and not performed by individuals, so there is no incentive not to train a bit on the test set.


And if there is a board that will fire you if expected profits do not increase, do you still maintain this stance?


> Couldn't an LLM provider just fine-tune their model for these tasks specifically - since they are static - to get ad value?

Yes, this is an inherent problem with the whole idea of LLMs. They're pattern-recognition "students", but the important thing that all the providers like to sell is their reasoning. A good test is a reasoning test. I'll try to find a link and update with a reference.


There is an opportunity to develop black-box benchmarks and offer them to LLM providers to support their testing phase. If I were in their place, I would find it incredibly valuable to have such tamper-proof testing before releasing a model.


Conveniently, the author of these benchmarks remains silent on the topic every time. Think about it :)



