
I happened to be in the middle of a task in a production codebase that the various models struggled on so I can give a quick vibe benchmark:

opus 4.1: made weird choices, eventually got to a meh solution i just rolled back.

codex: took a disgusting amount of time but the result was vastly superior to opus. night and day superiority. output was still not what i wanted.

sonnet 4.5: not clearly better than opus. categorically worse decision-making than codex. very fast.

Codex was night and day the best. Codex scares me, Claude feels like a useful tool.





These reviews are pretty useless to other developers. Models perform vastly differently with each language, task type, and framework.

> These reviews are pretty useless to other developers.

Agreed. If these same models were used on a different codebase, language, etc., they would likely produce very different results.


And prompt and usage.


