Benchmarks are pretty flawed IMO; in particular, their weakness here seems to be that they're poor at evaluating long, multi-turn conversations. 4o often gives a great first response, then spirals into repetition. Sonnet 3.5 is much better at seeing the big picture in a longer conversation.
I made a mobile app the other day using LLMs (I had never used React or TypeScript before, and I built the app with React Native). I was pretty disappointed: both Sonnet 3.5 and gpt-4-turbo performed pretty poorly, making mistakes like leaving out a closing bracket somewhere, which meant I had to revert because I had no idea where it was supposed to go.
They also did the thing junior developers tend to do: when there's a race condition of some sort, they just work around it by adding some if checks. The app is at around 400 lines right now; it works, but it feels pretty brittle. Adding a tiny feature here or there breaks something else, and GPT does the wrong thing half the time.
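To make the "just add if checks" thing concrete, here's roughly the shape of it in TypeScript. This is not my actual app code - fetchItems and the hook names are made up. The first hook papers over the race with a guard; the second actually handles out-of-order responses by only applying the latest one:

```ts
import { useRef, useState } from "react";

type Item = { id: string; label: string };

// Stand-in for a real network call (hypothetical endpoint).
async function fetchItems(query: string): Promise<Item[]> {
  const res = await fetch(`https://example.com/items?q=${encodeURIComponent(query)}`);
  return res.json();
}

// The "if check" workaround: try to dodge the race by refusing overlapping calls.
// It hides the symptom, and the captured `loading` value can be stale anyway.
export function useItemsWithGuard() {
  const [items, setItems] = useState<Item[]>([]);
  const [loading, setLoading] = useState(false);

  async function search(query: string) {
    if (loading) return; // just bail if a request "seems" to be in flight
    setLoading(true);
    try {
      setItems(await fetchItems(query));
    } finally {
      setLoading(false);
    }
  }

  return { items, loading, search };
}

// Actually handling the race: tag each request and only apply the newest result.
export function useItemsLatestWins() {
  const [items, setItems] = useState<Item[]>([]);
  const latest = useRef(0);

  async function search(query: string) {
    const id = ++latest.current;
    const result = await fetchItems(query);
    if (id === latest.current) {
      setItems(result); // ignore responses from superseded requests
    }
  }

  return { items, search };
}
```

Both models kept reaching for the first pattern; getting them to do something like the second took several rounds of prompting.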
All in all, I'm not complaining, because I made an app in two days, but it won't replace a developer yet, no matter how much I want it to.
Repetition in multi-turn conversations is actually Sonnet's fatal flaw, for both 3 and 3.5. 4o is also repetitive to an extent. Opus is way better than both at avoiding repetition.
Which benchmarks are you looking at? It's very competitive with GPT-4o in the table of metrics I just put together at work. Have you used it to code? Qualitatively it's much better - once it can execute Python it will be superb.
It's the consensus among people who use both that claude.ai is superior for practical use, despite benchmark results, which are mostly based on one-shot prompts.