
I just ran this on a simple change I’ve asked Sonnet 4 and Opus 4.1 to make, and it fails too.

It’s a simple substitution request where I provide a lint error that suggests the correct change. All the models fail. I could ask someone with no development experience to make this change and they could.

I worry everyone is chasing benchmarks to the detriment of general performance. Or the next-token weights for the incorrect change outweigh my simple but precise instructions. Either way it’s no good.

Edit: With a followup “please do what I asked” sort of prompt it came through, while Opus just loops. So there’s that at least.





> I worry everyone is chasing benchmarks to the detriment of general performance.

I've been worried about this for a while. I feel like Claude in particular took a step back in my own subjective performance evaluation in the switch from 3.7 to 4, while the benchmark scores leaped substantially.

To be fair, benchmarking has always been the most difficult problem to solve in this space, so it's not surprising that benchmark development isn't exactly keeping pace with all of the modeling/training development happening.


Not that it was better at programming, but I really miss Sonnet 3.5 for educational discussions. I've sometimes considered that what I actually miss was the improvement 3.5 delivered over other models at that time. Though given that my system message for Sonnet from 3.7 onward has primarily been instructing it to behave like a human and have a personality, I really think we lost something.

I still use 3.5 today in Cursor. It's still the best model they've produced for my workflow. It's twice as fast as 4 and doesn't vomit pointless comments all over my code.

> I worry everyone is chasing benchmarks to the detriment of general performance.

I’m not sure this is entirely what you’re driving at, but the example I always think of in my head is “I want an AI agent that will scan through my 20,000 to 30,000 photos, remove all the duplicates, then organize them all in some coherent fashion.” That’s the kind of service I need right now, and it feels like something AI should be able to do, yet I have not encountered anything that remotely accomplishes this task. I’m still using Dupe Guru and depending on the ref system to not scatter my stuff all over further.

Sidebar, if anybody has any recommendations for this, I would love to hear them lol


azure vision / "cognitive services" can do this for literally a few bucks

am i even on hacker news? how do people not know there are optimized models for specific use cases? not everything has to (nor should it) run through an LLM

https://azure.microsoft.com/en-us/pricing/details/cognitive-...


This is hardly the fluid, turnkey solution I am talking about, so I don’t know why you’re talking like this to me and acting like the answer is so obvious. Frankly your tone was rude and unnecessary. Not everyone on HN shares the same knowledge and experience about all the same subjects, let alone all the ones you expect all of us to know.

The reality of that specific ask is it would not be difficult to build, but I believe it would be extremely difficult to build and offer at a price that users would pay for. So you're unlikely to find a commercial offering that does that using a (V)LM.

Yeah I imagine so. Hell I would pay like $100 for them to just do it once. If they really could do it with like 99% accuracy I would pay upwards of $300 tbh. Still, that’s probably not good enough lol

Hey bro, I'd like to take this project using Claude for $300 :) Do you mind contacting me? stxcth9aoj at mozmail.com

I made this as a first step in the process of organizing large amounts of images. Once you have the keywords and descriptions in the metadata, it should be possible to have a more powerful text only LLM come up with an organizing scheme and enact it by giving it file or scripting access via MCP. Thanks for reminding me that I need to work on that step now since local LLMs are powerful enough.

* https://github.com/jabberjabberjabber/ImageIndexer
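
A very rough sketch of what that "enact it" step could look like, with an assumed JSON index format and a simple folder-per-keyword rule (neither is ImageIndexer's actual output):

    # Sketch only: the sidecar format and keyword-to-folder rule are assumptions,
    # not ImageIndexer's real output format.
    import json
    import shutil
    from pathlib import Path

    def organize(index_path: str, photo_dir: str, out_dir: str) -> None:
        """Move each photo into a folder named after its first keyword."""
        # Assumed format: {"IMG_0001.jpg": ["beach", "family"], ...}
        index = json.loads(Path(index_path).read_text())
        for filename, keywords in index.items():
            folder = Path(out_dir) / (keywords[0] if keywords else "unsorted")
            folder.mkdir(parents=True, exist_ok=True)
            src = Path(photo_dir) / filename
            if src.exists():
                shutil.move(str(src), str(folder / filename))

The interesting part would be replacing the hard-coded "first keyword" rule with whatever scheme the text-only LLM proposes from the full keyword list.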


Very cool, thanks for sharing!

Perceptual hash. I have a Python script I wrote a million years ago that does just this: https://gist.github.com/base698/42d24be9309520fe8ad768844868...

I used it to match frames between different-quality video streams. It operates on grayscale.
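
For reference, a minimal sketch of the same idea using the third-party imagehash and Pillow packages (subtracting two perceptual hashes gives their Hamming distance):

    # Minimal perceptual-hash duplicate finder. Assumes `pip install pillow imagehash`.
    from collections import defaultdict
    from pathlib import Path

    from PIL import Image
    import imagehash

    def find_duplicates(photo_dir: str, threshold: int = 5):
        """Group images whose perceptual hashes differ by at most `threshold` bits."""
        originals = {}               # path -> hash of the first image seen in a group
        groups = defaultdict(list)   # original path -> list of near-duplicates

        for path in Path(photo_dir).rglob("*"):
            if path.suffix.lower() not in {".jpg", ".jpeg", ".png", ".webp"}:
                continue
            try:
                h = imagehash.phash(Image.open(path))
            except OSError:
                continue  # unreadable or non-image file
            for orig_path, orig_hash in originals.items():
                if h - orig_hash <= threshold:   # Hamming distance between hashes
                    groups[orig_path].append(path)
                    break
            else:
                originals[path] = h
        return dict(groups)

    if __name__ == "__main__":
        for original, dupes in find_duplicates("photos").items():
            print(original, "->", [str(d) for d in dupes])

At 20,000-30,000 photos the pairwise comparison above gets slow; bucketing by hash prefix or a BK-tree over the hashes is the usual fix.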


More like churning benchmarks... Release a new model at max power, get all the benchmark glory, silently reduce model capability in the following weeks, repeat by releasing a newer, smarter model.

That (thankfully) can't compound, so it would never be more than a one-time offset. E.g. if you report a score of 60% on SWE-bench Verified for new model A, dumb A down until it scores 50%, then report a 20% improvement over A with new model B, that 20% relative gain on 50% lands right back at 60%, and it's pretty obvious when your last two model blog posts both say 60%.

The only way around this is to never report on the same benchmark versions twice, and they include too many benchmarks to realistically swap them all out every release.


The benchmarks are not typically ongoing; we do not often see comparisons between week 1 and week 8. Sprinkle in a bit of training on the benchmarks and you can ensure higher scores for the next model. A perfect scam loop to keep people happy until they wise up.

> The benchmarks are not typically ongoing; we do not often see comparisons between week 1 and week 8

You don't need to compare "A (Week 1)" to "A (Week 8)" to be able to show "B (Week 1)" is genuinely x% better than "A (Week 1)".


As I said, sprinkle a bit of benchmark data into the training and you have your loop. Each iteration will be better at the benchmarks if that's the goal, and that goal/context reinforces itself.

Sprinkling in benchmark training isn't a loop, it's just plain cheating. Regardless, not all of these benchmarks are public, and even with mass collusion across the board it wouldn't explain why open-weight LLMs have been improving as well.

At this point it would be an interesting idea to collect examples where LLMs miserably fail, in the form of a community database. I have examples myself...

Any such examples are often "closely guarded secrets" to prevent them from being benchmaxxed and gamed - which is absolutely what would happen if you consolidated them in a publicly available centralized repository.

Since such a database should evolve continuously, I wouldn't see that as a problem. The important thing is that each example is somehow verifiable, in the form of an unmodifiable test setup. So the LLM provides a solution, which is executed against the test to verify it. Something like the Acid3 test... But sure, it can probably be gamed somehow in all setups...
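
A rough sketch of what one entry plus its frozen check could look like (the format and names here are made up, not an existing project):

    # Sketch of a "verifiable example" harness: the test ships with the entry
    # and is never modified; the model's answer is run against it.
    import subprocess
    import sys
    import tempfile
    import textwrap

    def verify(candidate_code: str, test_code: str, timeout: int = 10) -> bool:
        """Write the model's code plus the frozen test to a file and run it."""
        program = candidate_code + "\n\n" + test_code
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(program)
            path = f.name
        try:
            result = subprocess.run([sys.executable, path],
                                    capture_output=True, text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0

    # Hypothetical database entry.
    example = {
        "prompt": "Write a function slugify(s) that lowercases s and replaces spaces with '-'.",
        "test": textwrap.dedent("""
            assert slugify("Hello World") == "hello-world"
            assert slugify("a b c") == "a-b-c"
        """),
    }

Running untrusted model output this way obviously wants a sandbox; the sketch only shows the pass/fail contract.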

This seems like a non-issue, unless I'm misunderstanding. If failures can be used to help game benchmarks, companies are doing so. They don't need us to avoid compiling such information, which would be helpful to actual users.

People might want to use the same test scenario in the future to see how much the models have improved. We can't do that if the example gets scraped into the training data set.

That's what I was thinking too; the models have the same data sources (they have all scraped the internet, GitHub, book repositories, etc.), and they all optimize for the same standardized tests. Other than marginally better scores on those tests (and they will cherry-pick them to look better), how do the various competitors still differentiate from each other? What's the USP?

The LLM (the model) is not the agent (Claude Code) that uses the LLM.

LLMs improve slowly, but the agents are where the real value is produced: when should it write tests, when should it try to compile, how to move forward from a compile error, can it click on your web app to test its own work, etc.
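
To make the distinction concrete, here's a minimal sketch of the loop an agent wraps around the model; call_llm and apply_patch are hypothetical stand-ins, not any vendor's API, and the build/test command is assumed:

    # Sketch of an agent loop: the harness, not the model, decides when to
    # compile/test and feeds the errors back in.
    import subprocess

    def call_llm(prompt: str) -> str:
        """Placeholder: send the prompt to whatever model you use, get a patch back."""
        raise NotImplementedError

    def apply_patch(patch: str) -> None:
        """Placeholder: apply the model's proposed edit to the working tree."""
        raise NotImplementedError

    def agent_loop(task: str, max_iterations: int = 5) -> bool:
        context = task
        for _ in range(max_iterations):
            apply_patch(call_llm(context))
            # Assumed project convention: `make test` builds and runs the suite.
            result = subprocess.run(["make", "test"], capture_output=True, text=True)
            if result.returncode == 0:
                return True  # tests pass, stop iterating
            context = f"{task}\n\nTests failed with:\n{result.stdout}\n{result.stderr}"
        return False

Most of the quality differences between agents live in exactly these decisions: what to run, what to feed back, and when to stop.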


Downvoted because you didn’t mention the prompt and the issue.

>It’s a simple substitution request where I provide a lint error that suggests the correct change. All the models fail. I could ask someone with no development experience to make this change and they could.

I don't understand why this kind of thing is useful. Do the thing yourself and move on. For every one problem like this, AI can do 10 better/faster than I can.


How can I trust it to do the complicated task well when it fails to do the simple thing?

The jagged edge effect: you can trust it to do some tasks extremely well, but a slightly different task might consistently fail. Your job as a tool user is to understand when it’ll work and when it won’t - it isn’t an oracle or a human.

It's not about simple vs. complex. It's about the types of tasks the AI has been trained on: pattern-matching, thinking, reasoning, research.

Tasks like linting and formatting a block of code are pretty simple, but also very specialized. You're much better off using formatters/linters than an AI.


I want the bot to do the drudge work, not me. I want the bot to fix lint errors the linter can't safely autofix, not me.

You're talking about designing a kitchen where robots do the cooking and humans do ingredient prep and dishwashing. We prefer kitchens where we do the cooking and use tools or machines to prep and wash dishes.

I don't want it to be an "architect" or "designer". I want it to write the annoying boilerplate. I don't want it to do the coding and me to do the debugging, I want to code while it debugs. Anything else and you are the bot's assistant, not vice-versa.


An agent being tasked to resolve simple issues from a compiler/test suite/linter/etc. is a pretty typical use case. It's not clear in this example if the linter was capable of auto-fixing the problem, so ordinarily this would be a case where you'd hope an LLM would shine, given specific, accurate context and a known solution.

One reason is to simply say “fix all lints” and have the model do it

You don't understand how complete unreliability is a problem?

So instead of just "doing things" you want a world where you try it the AI way, fail, then "do thing" 47 times in a row, then 3 AI-way attempts save you 5 minutes. Then 7 AI-way attempts fail, then you try to remember: hmm, did this work last time or not? The AI way fails another 3 times. "Do thing" 3 times. How many AI-way attempts failed today? Oh, it wasted 30% of the day and I forget which ways worked or not, I'd better start writing that all down. Let's call it the MAGIC TOME of incantations. Oh, I have to rewrite the tome again, the model changed.



