Benchmarks are pretty flawed IMO; in particular, their weakness here seems to be that they're poor at evaluating long, multi-turn conversations. 4o often gives a great first response, then spirals into repetition. Sonnet 3.5 is much better at seeing the big picture in a longer conversation.
I made a mobile app the other day using LLMs (I had never used React or TypeScript before, and I built the app with React Native). I was pretty disappointed: both Sonnet 3.5 and gpt-4-turbo performed pretty poorly, making mistakes like leaving out a closing bracket somewhere, which meant I had to revert because I had no idea where it was supposed to go.
They also did the thing junior developers tend to do: when there's a race condition of some sort, they just work around it by adding some if checks. The app is at around 400 lines right now; it works, but it feels pretty brittle. Adding a tiny feature here or there breaks something else, and GPT does the wrong thing half the time.
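To make the "just add if checks" thing concrete, here's roughly the shape of it in TypeScript. This is not my actual app code - fetchItems and the hook names are made up. The first hook papers over the race with a guard; the second actually handles out-of-order responses by only applying the latest one:

```ts
import { useRef, useState } from "react";

type Item = { id: string; label: string };

// Stand-in for a real network call (hypothetical endpoint).
async function fetchItems(query: string): Promise<Item[]> {
  const res = await fetch(`https://example.com/items?q=${encodeURIComponent(query)}`);
  return res.json();
}

// The "if check" workaround: try to dodge the race by refusing overlapping calls.
// It hides the symptom, and the captured `loading` value can be stale anyway.
export function useItemsWithGuard() {
  const [items, setItems] = useState<Item[]>([]);
  const [loading, setLoading] = useState(false);

  async function search(query: string) {
    if (loading) return; // just bail if a request "seems" to be in flight
    setLoading(true);
    try {
      setItems(await fetchItems(query));
    } finally {
      setLoading(false);
    }
  }

  return { items, loading, search };
}

// Actually handling the race: tag each request and only apply the newest result.
export function useItemsLatestWins() {
  const [items, setItems] = useState<Item[]>([]);
  const latest = useRef(0);

  async function search(query: string) {
    const id = ++latest.current;
    const result = await fetchItems(query);
    if (id === latest.current) {
      setItems(result); // ignore responses from superseded requests
    }
  }

  return { items, search };
}
```

Both models kept reaching for the first pattern; getting them to do something like the second took several rounds of prompting.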
All in all, I'm not complaining, because I made an app in two days, but it won't replace a developer yet, no matter how much I want it to.
Repetition in multi-turn conversations is actually Sonnet's fatal flaw, for both 3 and 3.5. 4o is also repetitive to an extent. Opus is way better than both at avoiding repetition.
Which benchmarks are you looking at? It's very competitive with GPT-4o in the table of metrics I just put together at work. Have you used it to code? Qualitatively it's much better - once it can execute Python it will be superb.
It's the consensus among people who use both that claude.ai is superior for practical use, despite benchmark results, which are mostly based on one-shot prompts.