I like how critique of LLMs evolved on this site over the last few years. We are...

skyechurch · on March 26, 2025

The most straightforward way to measure the pace of AI progress is by attaching a speedometer to the goalposts.

kaliqt · on March 26, 2025

Oh, that's a good one. And it's true. There seems to be a massive inability for most people to admit the building impact of modern AI development on society.

benterix · on March 26, 2025

Oh, we do admit impact and even have a name for it: AI slop. (Speaking on LLMs now since AI is a broad term and it has many extremely useful applications in various areas)

Workaccount2 · on March 26, 2025

AI slop is soon to be "AI output that no one wanted to take credit for".

josefx · on March 26, 2025

They certainly seem to have moved from "it is literally skynet" and "FSD is just around the corner" in 2016 to "look how well it paces my first lady Trump/Musk slashfic" in 2025. Truly world changing.

orena · on March 28, 2025

I've asked claude to explain what you meant... https://claude.ai/share/391160c5-d74d-47e9-a963-0c19a9c7489a

dieortin · on March 28, 2025

I’m not sure outsourcing even the comprehension of HN comments to an LLM is going to work out well for your mind

etherealG · on March 31, 2025

I’m not sure lacking comprehension of a comment and choosing to ignore that lack is better. Or worse: asking everyone to manually explain every reference they make. The LLM seems a good choice when comprehension is lacking.

qnleigh · on April 3, 2025

This is so on-point. Many things that we now take for granted from LLMs would have been considered sufficient evidence for AGI not all that long ago. Likely the only test of AGI is whether we can still come up with new goalposts.

Nition · on March 26, 2025

Haha, so that's the first derivative of goalpost position. You could take the derivative of that to see if the rate of change is speeding up or slowing.

munksbeer · on March 26, 2025

I love this comment.

solardev · on March 26, 2025

It's not really passing the Turing Test until it outsells Harry Potter.

dragonwriter · on March 26, 2025

> It's not really passing the Turing Test until it outsells Harry Potter.

Most human-written books don't do that, so that seems to be a criterion for a very different test than a Turing test.

ZiiS · on March 26, 2025

Both books that have outsold the Harry Potter series claim divine authorship, not purely human. I am prepared to bet quite a lot that the next isn't human-written, either.

mirekrusin · on March 26, 2025

The joke is that the goalpost is constantly moving.

TeMPOraL · on March 26, 2025

This subgoal post can't move much further after it passes the "outsells the Bible" mark.

zimbatm · on March 26, 2025

Why would the book be worth buying, though, if AI can generate a fresh new one just for you?

TeMPOraL · on March 26, 2025

I don't know. It's a question relevant to all generative AI applications in entertainment - whether books, art, music, film or videogames. To the extent the value of these works is mostly in being social objects (i.e. shared experience to talk about with other people), being able to generate clones and personalized variants freely via GenAI destroys that value.

mirekrusin · on March 26, 2025

You may be right, on the other hand it always feels like the next goalpost is the final one.

I'm pretty sure if something like this happens some dude will show up from nowhere and claim that it's just parroting what other, real people have written, just blended together and randomly spat out – "real AI would come up with original ideas like a cure for cancer" he'll say.

After some form of that comes another dude will show up and say that this "alphafold while-loop" is not real AI because he just went for lunch and there was a guy flipping burgers – and that "AI" can't do it so it's shit.

https://areweagiyet.com should plot those future points as well with all those funky goals like "if Einstein had access to the Internet, Wolfram etc. he could have come up with it anyway so not better than humans per se", or "had to be prompted and guided by a human to find this answer so didn't really do it by itself" etc.

silveraxe93 · on March 26, 2025

From Gary Marcus' (notable AI skeptic) predictions of what AI won't do in 2027:

> With little or no human involvement, write Pulitzer-caliber books, fiction and non-fiction.

So, yeah. I know you made a joke, but you have the same issue as the Onion I guess.

tummler · on March 26, 2025

Let me toss a grenade in here.

What if we didn’t measure success by sales, but by impact on the industry (or society), or value to people’s lives?

Zooming out to AI broadly: what if we didn’t measure intelligence by (game-able, arguably meaningless) benchmarks, but real world use cases, adaptability, etc?

szatkus · on March 26, 2025

I recently watched some Claude Plays Pokemon and believe it's a better measure than all those AI benchmarks. The game could be beaten by an 8yo, who obviously doesn't have all the knowledge that even small local LLMs possess, but has actual intelligence and could figure out the game in < 100h. So far Claude can't even get past the first half and I doubt any other AI could get much further.

solardev · on March 26, 2025

Now I want to watch Claude play Pokemon Go, hitching a ride on self-driving cars to random destinations and then trying to autonomously interpret a live video feed to spin the ball at the right pixels...

2026 news feed: Anthropic cited as AI agents simultaneously block traffic across 42 major cities while trying to capture a not-even-that-rare pokemon

harrison_clarke · on March 26, 2025

the true measure of AI: does it have fun playing pokemon? did it make friends along the way?

etruong42 · on March 26, 2025

We humans love quantifiability. Since you used the word "measure", do you believe the measurement you're aspiring for is quantifiable?

I currently assert that it's not, but I would also say that trying to follow your suggestion is better than our current approach of measuring everything by money.

icrbow · on March 26, 2025

> We humans love quantifiability.

No. Screw quantifiability. I don't want "we've improved the sota by 1.931%" on basically anything that matters. Show me improvements that are obvious, improvements that stand out.

Claude Plays Pokemon is one of the few really important "benchmarks". No numbers, just the progress and the mood.

Workaccount2 · on March 26, 2025

This is difficult to do because one of the juiciest parts of AI is being able to take credit for its work.

ninetyninenine · on March 26, 2025

the goal posts will be moved again. Tons of people will clamor that the book is stupid and vapid and that only idiots bought it. When ai starts taking over jobs, which it already has, you’ll get tons of idiots claiming the same thing.

eru · on March 26, 2025

Well, strictly speaking, outselling Harry Potter would fail the Turing test: the Turing test is about passing for human (in an adversarial setting), not about surpassing humans.

Of course, this is just some pedantry.

I for one love that AI is progressing so quickly, that we _can_ move the goalposts like this.

jychang · on March 26, 2025

To be fair, pacing as a big flaw of LLMs has been a constant complaint from writers for a long time.

There were popular writeups about this from the Deepseek-R1 era: https://www.tumblr.com/nostalgebraist/778041178124926976/hyd...

newswasboring · on March 26, 2025

This was written on March 15. Deepseek came out in January. "Era" is not language I would use for something that happened a few days ago

krzat · on March 26, 2025

This either ends at "better than 50% of human novels" garbage or at unimaginably compelling works of art that completely obsolete fiction writing.

Not sure what is better for humanity in long term.

WindyMiller · on March 26, 2025

That could only obsolete fiction-writing if you take a very narrow, essentially commercial view of what fiction-writing is for.

I could build a machine that phones my mother and tells her I love her, but it wouldn't obsolete me doing it.

bergundytomato · on March 26, 2025

Ahh, now this would be a great premise for a short story (from the mom's POV).

ruraljuror · on March 26, 2025

We are, if this comment is the standard for all criticism on this site. Your comment seems harsh. Perhaps novel writing is too low-brow of a standard for LLM critique?

jorl17 · on March 26, 2025

I didn't quite read parent's comment like that. I think it's more about how we keep moving the goalposts or, less cynically, how the models keep getting better and better.

I am amazed at the progress that we are _still_ making on an almost monthly basis. It is unbelievable. Mind-boggling, to be honest.

I am certain that the issue of pacing will be solved soon enough. I'd give 99% probability of it being solved in 3 years and 50% probability in 1.

jiggawatts · on March 26, 2025

In my consulting career I sometimes get to tune database servers for performance. I have a bag of tricks that yield about +10-20% performance each. I get arguments about this from customers, typically along the lines of "that doesn't seem worth it."

Yeah, but 10% plus 20% plus 20%... next thing you know you're at +100% and your server is literally double the speed!
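
To make that arithmetic concrete, here is a minimal sketch. It assumes the tricks are independent, so their gains multiply rather than add; the list of per-trick gains is made up for illustration and is not a measurement from any real server.

```python
from math import prod


def combined_speedup(gains):
    """Return the overall speedup factor for a list of fractional gains.

    Each gain is a fraction (0.10 means +10%). Assumes the gains are
    independent and therefore compound multiplicatively.
    """
    return prod(1.0 + g for g in gains)


# Hypothetical bag of tricks, each worth +10% to +20% on its own.
tricks = [0.10, 0.20, 0.20, 0.15, 0.10]

total = combined_speedup(tricks)
print(f"{(total - 1) * 100:.0f}% faster overall")
```

Under that assumption, five modest tricks are enough to roughly double throughput, even though no single one looks worth the effort.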

AI progress feels the same. Each little incremental improvement alone doesn't blow my skirt up, but we've had years of nearly monthly advances that have added up to something quite substantial.

eru · on March 26, 2025

Yes, if you are Mary Poppins, each individual trick in your bag doesn't have to be large.

(For those too young or unfamiliar: Mary Poppins famously had a bag that she could keep pulling things out of.)

rafaelmn · on March 26, 2025

Except at some point the low hanging fruit is gone and it becomes +1%, +3% in some benchmarked use case and -1% in the general case, etc. and then come the benchmarking lies that we are seeing right now, where everyone picks a benchmark that makes them look good and its correlation to real world performance is questionable.

dalmo3 · on March 26, 2025

What exactly is the problem with moving the goalposts? Who is trying to win arguments over this stuff?

Yes, Z is indeed a big advance over Y, which was a big advance over X. Also yes, Z is just as underwhelming.

Are customers hurting the AI companies' feelings?

TeMPOraL · on March 26, 2025

> Are customers hurting the AI companies' feelings?

No. It's the critics' feelings that are being hurt by continued advances, so they keep moving goalposts so they can keep believing they're right.

HelloMcFly · on March 26, 2025

The goalposts should keep moving. That's called progress. Like you, I'm not sure why it seems to irritate or even amuse people.

rafaelmn · on March 26, 2025

People are trying to use gen AI in more and more use-cases. It used to fall flat on its face at trivial stuff; now it has gotten past the trivial stuff but is still scratching the boundaries of being useful. And that is not an attempt to make the gen AI tech look bad, it is really amazing what it can do - but it is far from delivering on the hype - and that is why people are providing critical evaluations.

Let's not forget the OpenAI benchmarks saying 4.0 can do better at college exams and such than most students. Yet real world performance was laughable on real tasks.

parineum · on March 26, 2025

> Let's not forget the OpenAI benchmarks saying 4.0 can do better at college exams and such than most students. Yet real world performance was laughable on real tasks.

That's a better criticism of college exams than of the benchmarks, and/or those exams likely have either the exact questions or very similar ones in the training data.

The list of things that LLMs do better than the average human tends to rest squarely in the "problems already solved by above average humans" realm.

stickfu · on March 28, 2025

I don’t know why I keep subjecting myself to hacker news but every few months I get the itch, and it only takes a few minutes to be turned off by the cynicism. I get that it’s from potentially wizened tech heads who have been in the trenches and are being realistic. It’s great for that, but any new bright eyed and bushy tailed dev/techy, whatever, should stay far away until much later in their journey.

ksec · on March 26, 2025

Do we have any simple benchmarks (and I know benchmarks are not everything) that test all the LLMs?

The pace is moving so fast I simply can't keep up. Or an ELI5 page which gives a 5 min explanation of LLMs from 2020 to this moment?

basch · on March 26, 2025

It’s more a bellwether or symptom of a flaw where the context becomes poisoned and continually regurgitates the same thought over and over.

leokennis · on March 26, 2025

Not really new is it? First cars just had to be approaching horse and cart levels of speed. Comfort, ease of use etc. were non-factors as this was "cool new technology".

In that light, even a 20 year old almost broken down crappy dinger is amazing: it has a radio, heating, shock absorbers, it can go over 500km on a tank of fuel! But are we fawning over it? No, because the goalposts have moved. Now we are disappointed that it takes 5 seconds for the Bluetooth to connect and the seats to auto-adjust to our preferred seating and heating setting in our new car.

ripped_britches · on March 26, 2025

lol wouldn’t that be great to read this comment in 2022