
The new Sonnet tops aider's code editing leaderboard at 84.2%. Using aider's "architect" mode it sets the SOTA at 85.7% (with DeepSeek as the "editor" model).

  84% Claude 3.5 Sonnet 10/22
  80% o1-preview
  77% Claude 3.5 Sonnet 06/20
  72% DeepSeek V2.5
  72% GPT-4o 08/06
  71% o1-mini
  68% Claude 3 Opus
It also sets SOTA on aider's more demanding refactoring benchmark with a score of 92.1%!

  92% Sonnet 10/22
  75% o1-preview
  72% Opus
  64% Sonnet 06/20
  49% GPT-4o 08/06
  45% o1-mini
https://aider.chat/docs/leaderboards/
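
For anyone who wants to reproduce the "architect" setup locally: aider exposes it via the --architect and --editor-model flags. A rough sketch (the model name strings are illustrative; check the leaderboard docs above for the exact ones):

  # Sonnet plans the change, DeepSeek writes the actual edits.
  # Model names are illustrative; see the aider docs for exact strings.
  aider --architect \
        --model claude-3-5-sonnet-20241022 \
        --editor-model deepseek/deepseek-chat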



I will repeat my question from one of the previous threads:

Can someone explain these Aider benchmarks to me? They pass the same 113 tests through the LLM every time. Why do they then extrapolate an LLM's ability to pass these 113 basic Python challenges to a general ability to produce/edit code? Couldn't an LLM provider just fine-tune their model for these tasks specifically - since they are static - to get ad value?

Has anyone ever tried changing the test cases or wiggling the conditions a bit to see if it still hits the same %?


Indeed, test data like this constantly leaks into the training data, so these leaderboards are not necessarily representative of real-world problems. A better approach is to use a variable evaluation like GSM-Symbolic (for evaluating mathematical reasoning): https://arxiv.org/abs/2410.05229


> Couldn't an LLM provider just fine-tune their model for these tasks specifically - since they are static - to get ad value?

They could. They would easily be found out as they lose in real-world usage or on new, improved, unique benchmarks.

If you were in charge of a large and well funded model, would you rather pay people to find and "cheat" on LLM benchmarks by training on them, or would you pay people to identify benchmarks and make reasonably sure they specifically get excluded from training data?

I would exclude them as well as possible so I get feedback on how "real" any model improvement is. I need to develop real world improvements in the end, and any short term gain in usage by cheating in benchmarks seems very foolish.


It sounds very nice, but at the same time very naive, sorry. Funding is not a gift, and they must make money. The more funding they get, the more pressure there is to make money.

When you're in charge of a company with a billion-dollar valuation that is expected to remain unprofitable until 2029, it's hard to find a topic more crucial and intriguing than growth and making more money.

And yes, it is a recurring theme for vendors to tune their products specifically for industry-standard benchmarks. I can't find any specific reason for them not to pay people to train their model to score 90% on these 113 Python tasks, as it directly drives profits up, whereas not doing it brings absolutely nothing to the table - surely they have their own internal benchmarks which they can exclude from training data.


> If you were in charge of a large and well funded model, would you rather pay people to find and "cheat" on LLM benchmarks by training on them, or would you pay people to identify benchmarks and make reasonably sure they specifically get excluded from training data?

You should already know by now that economic incentives are not always aligned with science/knowledge...

This is the true alignment problem, not the AI alignment one hahaha


The AI alignment problem and the people alignment problem are actually the same problem! :D

One is just a bit harder due to the less familiar mind "design".


They cannot be found out as long as there is no better evaluation. Sure, they would be if they produced obvious nonsense, but the point of a systematic evaluation is exactly to overcome subjective impressions based on individual examples as a notion of quality.

Also, you are right that excluding test data from the training data ultimately helps you build a better model. However, given the insane amounts of training data, this requires significant effort. If that additionally leads to your model performing worse on existing leaderboards, I doubt that (commercial) organizations would pay for such an effort.

And again, as long as there is no better evaluation method, you still won't know how much it really helps.


This market is all about hype and mindshare; proper testing is hard and not performed by individuals, so there are no incentives not to train a bit on the test set.


And if there is a board that will fire you if expected profits do not increase, do you still maintain this stance?


> Couldn't an LLM provider just fine-tune their model for these tasks specifically - since they are static - to get ad value?

Yes, this is an inherent problem with the whole idea of LLMs. They're pattern-recognition "students", but the important thing that all the providers like to sell is their reasoning. A good test is a reasoning test. I'll try to find a link and update with a reference.


There is an opportunity to develop black-box benchmarks and offer them to LLM providers to support their testing phase. If I were in their place, I would find it incredibly valuable to have such tamper-proof testing before releasing a model.


Conveniently, the author of these benchmarks remains silent on this topic every time. Think about it :)


Thanks! I was waiting for your benchmarks. Do you plan to test Haiku 3.5 too? It would also be nice to show the API cost of running the whole benchmark, to get a better idea of how many internal tokens the o1 models consume.


Are these synthetic or real-world benchmarks?

Answering myself: "Aider's code editing benchmark asks the LLM to edit python source files to complete 133 small coding exercises from Exercism"

Not gonna start looking for a job any time soon


Example I chose at random:

> Convert a hexadecimal number, represented as a string (e.g. "10af8c"), to its decimal equivalent using first principles (i.e. no, you may not use built-in or external libraries to accomplish the conversion).

So it's fairly synthetic. It's also the sort of thing LLMs should be great at since I'm sure there's tons of data on this sort of thing online.
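
For concreteness, a first-principles solution is only a few lines of Python (my own sketch, not the benchmark's reference solution):

  def hex_to_decimal(hex_string):
      """Convert a hex string like "10af8c" to an int without int(s, 16)."""
      digits = "0123456789abcdef"
      value = 0
      for char in hex_string.lower():
          if char not in digits:
              raise ValueError(f"invalid hex digit: {char!r}")
          value = value * 16 + digits.index(char)
      return value

  print(hex_to_decimal("10af8c"))  # 1093516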


Yeah, but programming isn't about solving problems that have already been solved millions of times. I mean, web dev kind of is, but that's not the point. If a problem is solved, then it's just a matter of implementing the solution, and anyone can do that given the proper instructions (even without understanding how or why they solve the problem).

I've formalized a lot of stuff I didn't understand just by copying the formulas from Wikipedia.

As long as LLMs are not capable of proper reasoning, they will remain a gimmick in the context of programming.

They should really just focus on refactoring benchmarks across many languages. If an AI can refactor my complex code properly without changing the semantics, it's good enough for me. But that unfortunately requires such a high-level understanding of the codebase that with the current tech it's just impossible to get a half-decent result in any real-world scenario.


I use Claude for coding and it's fantastic. I definitely have outsourced a lot of my coding to it.


What's the (current) best way to integrate it? VS Code extension? Other IDE?


I'll throw this out here as well: Is there any decent alternative to GitHub Copilot when using Visual Studio? (Pretty happy with it to be fair, but would be open to trying others.)


Supermaven is really good. I am a paying user of Supermaven.


I use cursor (cursor.com) and it's fantastic


Fellow cursor user here, I'm very new to it. I am getting some very convenient and welcome autocomplete. I am also getting quite a lot of bad autocomplete suggestions, which require cognitive overhead and context switching to evaluate. So I am thus far not fully convinced. Any tips for getting the most out of cursor?


Huge seconding of cursor.


Aider, created by the originator of this very comment thread.


Sourcegraph Cody.


Sure, it can do coding, but can it do software engineering?


What exactly is left when we remove coding from software engineering? Could it be handled by a manager? Or perhaps by a single senior SWE who could now perform the work of an entire team using these rapidly advancing AI coders?


For a lot of tasks that aren't as cut and dried, I often find myself having to provide it pseudo code, which it can then one-shot into working code.

Don't get me wrong, it's still a massive upgrade from the pre-Sonnet era, but I still don't think it can take a high-level requirement and convert it into a working project... yet.


> but I still don't think it can take a high-level requirement and convert it into a working project.

It cannot; you need to hand-hold it. To make something larger than an (albeit good-looking) to-do app, you don't need to write code, but you do need to be able to review and debug code and make the architectural decisions. It'll simply loop forever otherwise.


It’s a good question. I would ask…

(1) Sure, it can tell you how to write new code in response to a prompt about your current local problem, but

(2) can it reason about an entire code base of known and unknown problems, and use that basis to figure out solutions to the unknowns such that you delete code and collapse complexity.

The software equivalent of realising that this:

  x^2 + 3xy + y^2
can be turned into a much neater version:

  (x + y)^2 + xy
…but doing that with 100k tokens of code instead of a handful of algebra tokens.
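
For anyone checking the algebra, expanding the neater form gives back the original:

  (x + y)^2 + xy = x^2 + 2xy + y^2 + xy = x^2 + 3xy + y^2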


I haven't had much luck with architecture stuff. Maybe I'm holding it wrong.


The new version is already in Cursor and it's outstanding.

Can code at mid-level now. Almost.


When using these models via the official Anthropic API, do I have to do anything to "opt in" to the new Sonnet, or am I switched over automatically?


That depends on the model ID you are using.

If you use "claude-3-5-sonnet-latest" you'll be upgraded to "claude-3-5-sonnet-20241022" already - I tested that this morning.

If you're on "claude-3-5-sonnet-20240620" you'll need to change that ID to either the -latest one or the -20241022 one.
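
For example, with the official Python SDK (a minimal sketch; assumes the anthropic package is installed and ANTHROPIC_API_KEY is set in your environment):

  import anthropic

  client = anthropic.Anthropic()  # picks up ANTHROPIC_API_KEY from the environment

  message = client.messages.create(
      # "-latest" follows upgrades automatically; pin "claude-3-5-sonnet-20241022"
      # if you want to stay on this exact snapshot.
      model="claude-3-5-sonnet-latest",
      max_tokens=256,
      messages=[{"role": "user", "content": "Say hello"}],
  )
  print(message.content[0].text)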


FWIW, the refactor benchmark is quite mechanical - it just stresses reliability of LLMs over long context windows:

Questions are variants of:

Refactor the _set_csrf_cookie method in the CsrfViewMiddleware class to be a stand alone, top level function. Name the new function _set_csrf_cookie, exactly the same name as the existing method. Update any existing self._set_csrf_cookie calls to work with the new _set_csrf_cookie function.
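
To make that concrete, here's a toy before/after of that kind of refactor (heavily simplified, with a hypothetical method body; the real benchmark edits the actual Django source):

  # Before: _set_csrf_cookie is a method on the middleware class.
  class CsrfViewMiddleware:
      def _set_csrf_cookie(self, request, response):
          response.set_cookie("csrftoken", request.META["CSRF_COOKIE"])

      def process_response(self, request, response):
          self._set_csrf_cookie(request, response)
          return response

  # After: same name, now a stand-alone top-level function,
  # and the call site drops the "self.".
  def _set_csrf_cookie(request, response):
      response.set_cookie("csrftoken", request.META["CSRF_COOKIE"])

  class CsrfViewMiddleware:
      def process_response(self, request, response):
          _set_csrf_cookie(request, response)
          return response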


Assuming that that is indeed what most of the benchmark does: if the LLMs are as bad at it as the numbers suggest, then it seems like a perfectly good benchmark. I would definitely want them to be able to do stuff like that when I let them write my code.


Anecdotally, I still get significantly better results from ChatGPT than Claude for coding.

Claude is way less controllable; it is difficult to get it to do exactly what I want. ChatGPT is way easier to control in terms of asking for specific changes.

Not sure why that is; maybe the chain of thought and instruction-tuning dataset has made theirs a lot better for interactive use.


For me it's the opposite; ChatGPT (o1-preview and 4o) keeps making very strange errors - errors that I tell it exactly how to fix, and it simply repeats the fundamental mistakes again. With Claude, I did not have that.

Example: I asked it to write some JS that finds a button on a page, clicks the button, then waits for a new element with some selector to appear and returns a ref to it. ChatGPT kept returning (pseudo code):

  while (true) {
    button.click()
    wait()
    oldItems = ...
    newItems = ...
    newItem = newItems - oldItems
    if (newItem) return newItem
    sleep(1)
  }

which obviously doesn't work. Claude understands to put oldItems outside the while loop; even when I tell ChatGPT to do that, it doesn't. Or it does it once, and with another change it moves it back in.


Try as I might, ChatGPT couldn't give me working code for a simple admin dash layout in Vue with a sidebar that can minimise. I had to correct it, it would say "my apologies" and provide new code with a different error. About 10 times in a row it got into a loop of errors and I gave up.

Do any of these actually help coding?


Prompting is a skill you can develop with practice and get better at. Also, some tasks just aren’t going to work well for various reasons.

Yes, LLMs can actually help with coding. But it’s not magic. There are limits. And you get better with practice.


Without people providing their prompts, it's impossible to say whether they are skilled or not, and their complaints or claims of "it worked with this prompt" without the output are also not possible to validate.

Maybe there's a clue in there as to why these experiences seem so different. I'm glad GPTs don't get frustrated.


I have a personal policy of sharing my prompts as openly as possible. I've shared hundreds at this point - for a bunch of recent examples see https://simonwillison.net/2024/Oct/21/claude-artifacts/ and https://simonwillison.net/tags/ai-assisted-programming/


I've spent thousands of hours, literally, learning the ropes, and I continue to hone it. There is a much higher skill ceiling for prompting than there was for Google-fu.


Back in the day, googling was a skill; now, with the rise of LLMs, prompting is a skill.


Literally ropes as in RoPE, rotary positional embeddings?


Give it one or two examples of what you want. Don't expect these things to perfectly solve every problem - they're transformation machines, so they can do pretty much anything if you figure out the right input.


Just tried it and it worked. Try this:

give me a vue js page. I want a sidebar that minimizes (if triggered). Make simple admin placeholder page.


This was about 6 months ago I think. I’ll happily give it another shot.


Maybe it's relative? Claude beats GPT-4/o by a wide margin for me, but I am mostly using them for Rust.


I also think there are subtle differences in how models like to be prompted, so some people will have more luck with one type of model.



