AI hype is built on flawed test scores (technologyreview.com)
204 points by antondd on Oct 10, 2023 | 230 comments



I don't think the "hype" is built on test scores.

It is built on the observation of how fast AI is getting better. If the speed of improvement stays anywhere near the level it was the last two years, then over the next two decades, it will lead to massive changes in how we work and which skills are valuable.

Just two years ago, I was mesmerized by GPT-3's ability to understand concepts:

https://twitter.com/marekgibney/status/1403414210642649092

Nowadays, using it daily in a productive fashion feels completely normal.

Yesterday, I was annoyed with how cumbersome it is to play long mp3s on my iPad. I asked GPT-4 something like "Write an html page which lets me select an mp3, play it via play/pause buttons and offers me a field to enter a time to jump to". And the result was usable out of the box and is my default mp3 player now.

Two years ago it didn't even dawn on me that this would be my way of writing software in the near future. I have been coding for over 20 years. But for little tools like this, it is faster to ask ChatGPT now.

It's hard to imagine where we will be in 20 years.


The article doesn't say that LLMs aren't useful - the "hype" they mean is overestimating their capabilities. An LLM may be able to pass a "theory of mind" test, or it may fail spectacularly, depending on how you prompt it. And that's because, despite all of its training data, it's not capable of actually reasoning. That may change in the future, but we're not there yet, and (AFAIK) nobody can tell how long it will take to get there.


> And that's because, despite all of its training data, it's not capable of actually reasoning. That may change in the future [...]

I don't think so. When you say "it's not capable of actually reasoning", that's because it's an LLM; and if that "changes in the future", it will be because the new system is no longer a pure LLM. The appearance of reasoning in LLMs is an illusion.


How is the illusion of reasoning different from say “actual reasoning?”


Because it literally can't reason, and it also has no innate agency. Even the most dedicated creators of LLM-based AI technology have clearly and repeatedly stated that these are very sophisticated stochastic parrots with no sense of self. How much easier could it be to see that LLMs like GPT aren't actual thinking machines in the way we humans are?

Yes, many people reason through pure pattern-matching and repeat opinions not because they've reasoned them out but because they've absorbed them from other sources. But even the world's most unreasoning human being with at least functional cognition still performs an enormous amount of constant, daily, hourly self-directed decision-making, across a vast variety of complex and simple, often completely spontaneous scenarios and tasks, in ways that no machine we've yet built on Earth does or could.

Moreover, even when some humans say or "believe" things based on nothing more than what they've absorbed from others without really considering it in depth, they almost always do so in a particularly selective way that fits their cognitive, emotional and personal predispositions. This very selectiveness is a distinctly conscious trait of a self-aware being. It's something LLMs don't have, as far as I've yet seen.


In the same way that illusions of anything else differ from the real thing. A wax apple is different from a real apple, even if it's hard to tell them apart sometimes. You may require further investigation to differentiate them (e.g., cutting open the apple or asking the AI to solve tricky reasoning questions), but if you can find a difference, there is a difference.


I have a hunch I am misunderstanding your argument, but does that mean the only way to build a "true reasoning machine" would be to just create a human?

I guess what I'm really asking, what would you expect to observe to make it not illusory?


To distinguish between "is an illusion" and "is not an illusion", you need evidence that isn't observational. The whole point of illusions is that observational evidence is unreliable.

A desert mirage in the distance is an illusion; to the observer, it's indistinguishable from an oasis. You can only tell that it's a mirage by investigating how the appearance was created (e.g. by dragging your thirsty ass through the sand, to the place where the oasis appeared to be).


If one has a reasonable understanding of two concepts that make up a larger system, and the system has little else in it besides those concepts, one should be able to come up with that system on one's own, even without ever having seen it or having had its composition explained prior to that logical process.

The illusion happens when the alleged reasoning behind how such a system comes to be is clearly based on prior knowledge of the system as a whole; meaning, its construction/source was within the training data.


That sounds like a good litmus test. Do you have a specific example you've tried?

My opinion is that it isn't binary; rather, it's a scale. Your example is a point on the scale higher than where LLMs are now.

But perhaps that's too liberal a definition of "reasoning", no idea.

We seem to move the goalposts on what constitutes human-level intelligence as we discover the various capabilities exhibited in the animal kingdom. I wonder if it is/will be the same with AI.


I'm really curious, are you able to demonstrate reasoning, not reasoning and the illusion of reasoning in a toy example? I'd like to see what each looks like.


Have you met someone who is full of bullshit? They sound REALLY convincing, except that if you know anything about the subject, their statements are just word salad.


Have you met someone who's good at bullshitting their way out of a tough spot? There may be a word salad involved, but preparing it takes some serious skill and brainpower, and perhaps a decent high-level understanding of a domain. At some point, the word salad stops being a chain of words, and becomes a product of strong reasoning - reasoning on the go, aimed at navigating a sticky situation, but reasoning nonetheless.


The finest bullshitter I knew had serious skill and brainpower; and he BS'd about stuff he was expert in. It was really a sort of party trick - he could leave his peer experts speechless (more likely, rolling on the floor laughing).

His output was indeed word-salad, but he was eloquent. His bullshit wasn't fallacious reasoning; it didn't even have the appearance of reasoning at all. He was just stringing together words and concepts that sound plausible. It was funny, because his audience knew (and were supposed to know) that it was nonsense.

LLMs are the same, except they're supposed to pretend that it isn't nonsense.


Which would be a good test - and by that test ChatGPT is not reasoning, since it can't get out of sticky situations.

Yeah, I think you've got a good example that improves the analogy.


Are you able to give some examples? I'd like to know what it looks like w.r.t. LLMs.



Bullshit has an illusion of reasoning instead of actual reasoning. Basically you give arguments that sound reasonable on the surface, but there is no actual reasoning behind them.


> Bullshit has an illusion of reasoning instead of actual reasoning.

Bullshit is a good case to consider, actually. What is the relationship between bullshit and reasoning? You could argue that bullshit is fallacious reasoning, "pseudo-reasoning" based on incorrect rules of inference.

But these models don't use any rules of inference; they produce output that resembles the result of reasoning, but without reasoning. They are trained on text samples that presumably usually are the result of human reasoning. If you trained them on bullshit, they'd produce output that resembled fallacious reasoning.

No, I don't think the touchstone for actual reasoning is a human mind. There are machines that do authentic reasoning (e.g. expert systems), but LLMs are not such machines.


> Bullshit is a good case to consider, actually. What is the relationship between bullshit and reasoning?

None in principle, at least if you take the common definition of bullshit as saying things for effect, without caring whether they're true or false.

Fallacious reasoning will make you wrong. No reasoning will make you spew nonsense. Truth, lies, and bullshit all require reasoning for the structure of what you're saying to make sense; otherwise it devolves into nonsense.

> But these models don't use any rules of inference

Neither do we. Rules of inference came from observation. Formal reasoning is a tool we can employ to do better, but it's not what we naturally do.


> None in principle, at least if you take the common definition of bullshit as saying things for effect, without caring whether they're true or false.

Maybe splitting hairs, but I’d argue that the bullshitter is reasoning about what sounds good, and what sounds good needs at least some shared assumptions and resulting logical conclusion to hang its hat on. Maybe not always, but enough of the time that I would still consider reasoning to be a key component of effective bullshit.


That's not the case. It's very much in the realm of "we don't know what's going on in the network."

Rather than a binary, it's much more likely a mix of factors going into the results: basic reasoning capabilities developed from the training data (much like the board representations and state-tracking abilities that developed from feeding board game moves into a toy model in Othello-GPT), as well as statistics-driven autocomplete.

In fact, often when I've seen GPT-4 get hung up on logic puzzle variations such as transparency, it tends to seem more like the latter is overriding the former. Changing tokens to emoji representations, or having it always repeat the adjectives attached to nouns so it preserves the variation's context, gets it over the hump to reproducible solutions (as would be expected from a network capable of reasoning), but by default it falls into the pattern of the normative cases.

For something as complex as SotA neural networks, sweeping binary statements seem rather unlikely to actually be representative...


As a PhD student in NLP who's graduating soon, my perspective is that language models do not demonstrate "reasoning" in the way most people colloquially use the term.

These models have no capacity to plan ahead, which is a requirement for many "reasoning" problems. If it's not in the context, the model is unlikely to use it for predicting the next token. That's why techniques like chain-of-thought are popular; they cause the model to parrot a list of facts before making a decision. This increases the likelihood that the context might contain parts of the answer.
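(For anyone unfamiliar with the technique: a rough sketch of chain-of-thought prompting. The example question and the ask_llm helper are just illustrative, assuming the pre-1.0 openai Python client; nothing here is from a specific paper or product.)

    import openai  # reads OPENAI_API_KEY from the environment (pre-1.0 client)

    def ask_llm(prompt):
        resp = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp["choices"][0]["message"]["content"]

    question = ("A bat and a ball cost $1.10 together. "
                "The bat costs $1.00 more than the ball. How much does the ball cost?")

    # Direct prompting: the model must emit the answer with nothing helpful in context.
    direct = ask_llm(question)

    # Chain-of-thought prompting: asking for the steps first makes the model write out
    # intermediate facts, so the final answer tokens are conditioned on them.
    cot = ask_llm(question + " Let's think step by step, then state the final answer.")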

Unfortunately, this means the "reasoning" exhibited by language models is limited: if the training data does not contain a set of generalizable text applicable to a particular domain, a language model is unlikely to make a correct inference when confronted with a novel version of a similar situation.

That said, I do think adding reasoning capabilities is an active area of research, but we don't have a clear time horizon on when that might happen. Current prompting approaches are stopgaps until research identifies a promising approach for developing reasoning, e.g. combining latent space representations with planning algorithms over knowledge bases, constraining the logits based on an external knowledge verifier, etc. (These are just random ideas - not saying they are what people are working on, rather examples of possible approaches to the problem.)

In my opinion, language models have been good enough since the GPT-2 era, but have been held back by a lack of reasoning and efficient memory. Making the language models larger and trained on more data helps make them more useful by incorporating more facts with increased computational capacity, but the approach is fundamentally a dead end for higher level reasoning capability.


Congrats on the upcoming PhD!

I'm curious where you are drawing your definition or scope for 'reasoning' from?

For example, in Shuren's The Neurology of Reasoning (2002), the definition selected was "the ability to draw conclusions from given information."

While I agree that LLMs can only process token to token and that juggling context is critical to effective operation such that CoT or ToT approaches are necessary to maximize the ability to synthesize conclusions, I'm not quite sure what the definition of reasoning you have in mind is such that these capabilities fall outside of it.

The typical lay audience suggestion that LLMs cannot generate new information or perspectives outside of the training data isn't the case, as I'm sure you're aware, and synthesizing new or original conclusions from input is very much within their capabilities.

Yes, this has to happen within a context window and occurs on a token by token basis, but that seems like a somewhat arbitrary distinction. Humans are unquestionably better at memory access and running multiple subprocesses on information than an LLM.

But if anything, this simply suggests that continuing to move in the direction of multiple pass processing of NLP tasks with selective contexts and a variety of fine tuned specializations of intermediate processing is where practical short term gains might lie.

As for the issue of new domains outside of training data, I'm somewhat surprised by your perspective. Hasn't one of the big research trends over the past twelve months been that in-context learning has proven more capable than was previously expected? I'd agree that a zero-shot evaluation of a problem type that isn't represented in an LLM's training data is setting it up for failure, but the capacity to extend in-context examples outside of training data has proven relatively more successful, no?


> These models have no capacity to plan ahead, which is a requirement for many "reasoning" problems. If it's not in the context, the model is unlikely to use it for predicting the next token. That's why techniques like chain-of-thought are popular; they cause the model to parrot a list of facts before making a decision. This increases the likelihood that the context might contain parts of the answer.

Is it not possible that this is essentially how our brains do it too? Attempt to plan by branching out to related ideas until they contain an answer. Any of these statements that AI can't be on track to reason like a human because of X seem to come with an implication that we have such a good model of the human brain that we know it doesn't X. But I'm not an expert on neuroscience so in many of these cases maybe that implication is true.


>Is it not possible that this is essentially how our brains do it too?

Is that how you think? Just curious


I think the word "essentially" is important here. I don't think we can observe how we think. How it appears in consciousness is not necessarily real - it might be just a model constructed ex-post.

I do not know that much about AI, but I know at least something about cognitive psychology, and it seems to me that a lot of claims about LLMs "not actually reasoning" and the like are probably made by CS graduates with unexamined assumptions about how human thinking works.

I don't claim to know how human thinking works but if there is one thing I would conclude from studying psychology and knowing at least some basics about neuroscience, it would be that "it's not how it appears to us".

Nobody knows how human reasoning actually works but if I had to guess (based on my amateurish mental model of the functioning of the human brain), I would say that it is probably a lot closer to LLMs and a lot less rational than is commonly assumed in discussions like this one.


Maybe don't assume that PhD-level NLP researchers are out of touch on cognitive neuroscience topics related to language understanding. The latest research seems to indicate that language production and understanding exist separately from other forms of cognitive capacity. This includes people with global aphasia (no language ability) being able to do math, understand social situations, appreciate music, etc.

If you want to follow this more closely, I'd recommend the work of Evelina Fedorenko, a cognitive neuroscientist at MIT who specializes in language understanding.

Check out these talks for more details: https://youtu.be/TsoQFZxrv-I?t=580 https://youtu.be/qublpBRtN_w

What this means in the context of LLMs is that next word prediction alone does not provide the breadth of cognitive capacity humans exhibit. Again, I'd posit GPT-2 is plenty capable as an LM, if combined with an approach to perform higher-level reasoning to guide language generation. Unfortunately, what that system is and how to design it currently eludes us.


First, you are right I should not assume anyone's knowledge (or lack thereof). It just popped into my mind as something that could explain the thing that's been puzzling me for months - what are people talking about when they say that LLMs are not actually reasoning, or Stable Diffusion is not actually creating? I wish I had not included that assumption and was inquisitive instead. Let me try again.

Maybe I diverted your focus the wrong way when I used LLMs as an example - what if I used more general term "neural network"? I said LLMs because this thread is about LLMs but let me clarify what I meant:

The thing that interests me in this thread is the claim that LLMs are "not capable of actually reasoning". Whether you agree with it depends on your mental model of actual reasoning, right?

My model of reasoning: the fundamental thing about it is that I have a network of things. The signal travels through the network guided by the weight of connections between them and fires some pattern of the things. That pattern represents something. Maybe it is a word in the case of LLMs (or syllable or whatever the token actually is - let's ignore those details for now) or a thought in the case of my brain (I was not saying people reason in language) - the resulting "token" can be many things, I imagine (like some mental representation of objects and their positions in spatial reasoning) - those are the specifics, but "essentially", the underlying mechanism is the same.

In my mental model, there is nothing fundamental that distinguishes what LLMs do from the "actual reasoning". If you have enough compute and good enough training data, you can create LLM reasoning as well as humans - that is my default hypothesis.

If I understand your position, you would not agree with that, correct? I am not claiming you are wrong - I know way too little for that. I would just be really curious - what is your mental model of actual reasoning? What does it have that LLMs do not have?

I know you mentioned that "these models have no capacity to plan ahead" - I am not sure I understand what you mean by that. Is this not just a matter of training?

BTW, I have talked about this topic before and some people apparently see consciousness as a necessary part of actual reasoning. I do not - do you?


I don’t think we are conscious about how the language center correlates with our memories and then predicts the strings of words coming out.


> if the training data does not contain a set of generalizable text applicable to a particular domain, a language model is unlikely to make a correct inference when confronted with a novel version of a similar situation.

True. But look at the Phi-1.5 model - it punches 5x above its weight. The trick is in the dataset:

> Our training data for phi-1.5 is a combination of phi-1’s training data (7B tokens) and newly created synthetic, “textbook-like” data (roughly 20B tokens) for the purpose of teaching common sense reasoning and general knowledge of the world (science, daily activities, theory of mind, etc.). We carefully selected 20K topics to seed the generation of this new synthetic data. In our generation prompts, we use samples from web datasets for diversity. We point out that the only non-synthetic part in our training data for phi-1.5 consists of the 6B tokens of filtered code dataset used in phi-1’s training (see [GZA+ 23]).

> We remark that the experience gained in the process of creating the training data for both phi-1 and phi-1.5 leads us to the conclusion that the creation of a robust and comprehensive dataset demands more than raw computational power: It requires intricate iterations, strategic topic selection, and a deep understanding of knowledge gaps to ensure quality and diversity of the data. We speculate that the creation of synthetic datasets will become, in the near future, an important technical skill and a central topic of research in AI.

https://arxiv.org/pdf/2309.05463.pdf

Synthetic data has its advantages - less bias, more diversity, scalability, higher average quality. But more importantly, it can cover all the permutations and combinations of skills, concepts, and situations. That's why a small model of just 1.5B parameters like Phi was able to work like a 7B model. Usually, models at that scale are not even coherent.


Are you going to school in Langley, Virginia?


NSA is more commonly associated with Fort Meade, MD, for what that's worth.


> These models have no capacity to plan ahead

How would you describe the behavior of "GPT Advanced Data Analysis"?


> it's not capable of actually reasoning

Define reasoning. Because by my definition GPT-4 can reason, without a doubt. It definitely can't reason better than experts in the field, but it can reason better than, say, interns.


I don't have access to GPT 4 but I'd be interested to see how it does on a question like this:

"Say I have a container with 50 red balls and 50 blue balls, and every time I draw a blue ball from the container, I add two white balls back. After drawing 100 balls, how many of each different color ball are left in the container? Explain why."

... because on GPT 3.5 the answer begins like the below and then gets worse:

"Let's break down the process step by step:

Initially, you have 50 red balls and 50 blue balls in the container.

1) When you draw a blue ball from the container, you remove one blue ball, and you add two white balls back. So, after drawing a blue ball, you have 49 blue balls (due to removal) and you add 2 white balls, making it a total of 52 white balls (due to addition) ..."

If I was hiring interns this dumb, I'd be in trouble.

EDIT: judging by the GPT-4 responses, I remain of the opinion I'd be in trouble if my interns were this dumb.


This is such a flawed puzzle. And GPT-4 answers it correctly. It is a long answer, but the last sentence is "This is one possible scenario. However, there could be other scenarios based on the order in which balls are drawn. But in any case, the same logic can be applied to find the number of each color of ball left in the container."


The ability to identify that there isn't a simple closed form result is actually a key component of reasoning. Can you stick the answer it gives on a gist or something? The GPT 3.5 response is pure, self-contradictory word salad and of course delivered in a highly confident tone.


> The ability to identify that there isn't a simple closed form result is actually a key component of reasoning.

If that's the case, then most humans alive would fail to meet this threshold. Finding a general solution to a specific problem, identifying whether or not there exists a closed-form solution, and even knowing these terms are skills you're taught in higher education, and even the people who went through it are prone to forget all this unless they're applying those skills regularly in their lives, which is a function of specific occupations.


https://pastebin.com/r9bNi8GD

GPT-4 goes into detail about one example scenario, which most humans won't do, but it is a technically correct answer, as it said it depends on the order.


Its answer isn't correct; this isn't a possible ending scenario:

*Ending Scenario:*
- Red Balls (RB): 0 (all have been drawn)
- Blue Balls (BB): 50 (none have been drawn)
- White Balls (WB): 0 (since no blue balls were drawn, no white balls were added)
- Total Balls: 50


> but it is a technically correct answer, as it said it depends on the order.

It should give you pause that you had to pick not only the line by which to judge the answer but the part of the line. The sentence immediately before that is objectively wrong:

> This is one possible scenario.


But the reasoning is total garbage, right?

It says the number of blue balls drawn is x and the number of red balls drawn is y, and then asserts x + y = 100, which is wrong (white balls added along the way can also be drawn, so x + y can be less than 100).

Then it proceeds to "solve" an equation which reduces to x = x to conclude x = 0.

It then uses that to "prove" that y = 100, which is a problem as there are only 50 red balls in the container and nothing causes any more to be added.

It's like "mistakes bad students make in Algebra 1".


I asked GPT-4 and it gave a similar response. So then I asked my wife and she said, "do you want more white balls at the end or not?" And I realized that as a CS or math question we assume that the draw is random. Other people assume that you're picking which ball to draw.

So I clarified to ChatGPT that the drawing is random. And it replied: "The exact numbers can vary based on the randomness and can be precisely modeled with a simulation or detailed probabilistic analysis."

I asked for a detailed probabilistic analysis and it gave a very simplified analysis. And then it basically said that a Monte Carlo approach would be easier. That actually sounds more like most people I know than like a machine. :-)
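(For reference, a rough sketch of the kind of Monte Carlo simulation being discussed - my own code, not ChatGPT's output - assuming each draw is uniformly random:)

    import random
    from collections import Counter

    def simulate():
        # 50 red + 50 blue to start; draw 100 balls at random,
        # putting 2 white balls back in for every blue ball drawn.
        balls = ["red"] * 50 + ["blue"] * 50
        for _ in range(100):
            drawn = balls.pop(random.randrange(len(balls)))
            if drawn == "blue":
                balls += ["white", "white"]
        return Counter(balls)

    runs = [simulate() for _ in range(10_000)]
    for color in ("red", "blue", "white"):
        avg = sum(r[color] for r in runs) / len(runs)
        print(color, round(avg, 1))  # average count left in the container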


I don't understand the question. Surely the answer depends which order you withdraw balls in? Is the idea that you blindly withdraw a ball at every step, and you are asking for the expected value of each number of ball at the end of the process?

Seems like quite a difficult question to compute exactly.

I reworded the question to make it clearer and then it was able to simulate a bunch of scenarios as a monte carlo simulation. Was your hope to calculate it exactly with dynamic programming? GPT-4 was not able to do this, but I suspect neither could a lot of your interns.


>I don't understand the question. Surely the answer depends which order you withdraw balls in? Is the idea that you blindly withdraw a ball at every step, and you are asking for the expected value of each number of ball at the end of the process?

These are very good questions that anyone with the ability to reason would ask if given this problem.


"You're holding it wrong."

You're asking GPT to do maths in its head, the AI equivalent of a person standing in the middle of the room with no tools and getting grilled in an oral examination of their knowledge.

Instead, collaborate with it, while giving it the appropriate tools to help you.

I asked it to write a Monte Carlo simulation of the problem in Wolfram Mathematica script. It did this about 10-100x faster than I would have been able to. It made a few small mistakes with the final visualisation, but I managed to get it to output a volumetric plot showing the 3D scatter plot of the histogram of possible outcomes.

I even got it to save a video of the plot rotating: https://streamable.com/2aphbz


AI can reason! Just not reasonably!


It can reason better than most humans put into the same situation.

This problem doesn't result in a constant value, it results in a 3D probability distribution! Very, very few humans could work that out without tools. (I'm including pencil and paper in "tools" here.)

With only a tiny bit of coaxing, GPT 4 produced an animated video of the solution!

Try to guess what fraction of the general population could do that at all. Also try to estimate what fraction of general software developers could solve it in under an hour.


A human could get a valid end state most of the time; GPT-4 seems to mess up more than it gets it right, based on the examples posted here. So to me it seems like GPT-4 is worse than humans.

GPT-4 with help from a competent human will of course do better than most humans, but that isn't what we are discussing.


> valid end state most of the time

I disagree. Don't assume "most humans" are anything like Silicon Valley startup developers. Most developers out there in the wild would definitely struggle to solve problems like this.

For example, a common criticism of AI-generated code is the risk of introducing vulnerabilities.

I just sat in a meeting for an hour, literally begging several developers to stop writing code vulnerable to SQL injection! They just couldn't understand what I was even talking about. They kept trying to use various ineffective hacky workarounds ("silver bullets") because they just didn't grok the problem.

I've found GPT 4 outperforms median humans.


>It can reason better than most humans put into the same situation.

On what basis do you allege this? People say the most unhinged stuff here about AI, and it so often goes completely unchallenged. This is a huge assertion that you are making.


The equivalent of what current-gen LLMs do is an oral examination. Picture standing in the middle of a room surrounded by subject matter experts grilling you for your knowledge of various random topics. You have no tools, no calculator, no pencil and paper.

You’re asked a question and you just have to spit out the answer. No option to backtrack, experiment, or self correct.

“Translate this to Hebrew”.

“Is this a valid criticism of this passage from a Platonic perspective?”

“Explain counterfactual definiteness in Quantum Mechanics.”

“What is the cube root of 74732?”

You would fail all of these. The AI gets 3 of 4 correct.

Tell me who’s smarter?

You, because of your preconceptions, or the AI, because of real superiority?

Your model for human intelligence is probably more like this scene: https://youtu.be/KvMxLpce3Xw?si=Suy0Cj_pL0vru5Uj

The reality is the opposite. The AI could answer questions in this scenario, but no real human could.


It's just a completely baseless comparison the way you are going about it, and you are mistaking intelligence for the recitation of facts.

>“Is this a valid criticism of this passage from a Platonic perspective?”

I haven't seen AI answering questions like this correctly at all


This is what I got on a basically brand new OpenAI account: https://chat.openai.com/share/5199c972-478d-406f-9092-061a6b...

All told, I'd say it's a decent answer.

Edit: I took it to completion: https://chat.openai.com/c/6cdd92f1-487a-4e1c-ab94-f2bdbf282d...

These were the first responses each time, with no massaging/retries/leading answers. I will say it's not entirely there. I re-ran the initial question a few times afterwards and one was basically gibberish.


GPT-4:

e composition of what's left in the container.

There's a couple of scenarios, which depend on when you run out of blue balls:

1) If you draw all 50 blue balls first, for every blue ball you draw, you are adding 2 white balls. Hence, once you've drawn all the blue balls, you have replaced them with 100 white balls. This means you've drawn 50 balls total and there's now 100 white balls and 50 red balls (150 balls) in the container. Then, you continue and draw the next 50 balls. These could be any combination of white and red balls, we don't know which color is preferred when you continue drawing.

2a) If you draw all red balls in these next 50 draws, you would have 50 red balls out and 50 blue balls out. So, you would end with 100 white balls in the container.

2b) If you draw all white balls in these next 50 draws you would have 50 red balls, 50 blue balls, and 50 white balls out. So, you would end with 50 white balls and 50 red balls in the container.

2c) If you draw a mixture of red and white balls in these next 50 draws, the number leftover in the container would vary according to the specific combination drawn.

Remember, the order in which the balls are drawn matters to this question since we have a rule in place that changes the overall total (or population) of balls within the container. In each of these scenarios, you've drawn 50 times initially (all blue) and then 50 times again (in some unknown combination of red and white). You've drawn 100 times total and changed the number of white balls in the container from zero to an amount dependent on how many times you drew a white ball on your second round of 50 draws.


Yeah, that's still pretty much nonsense isn't it?

2b) If you draw all white balls in these next 50 draws you would have 50 red balls, 50 blue balls, and 50 white balls out. So, you would end with 50 white balls and 50 red balls in the container.

... so after removing 100 balls, I've removed 150 balls? And the 150 balls that I've removed are red, white and blue despite the fact that I removed 50 blue balls initially and then 50 white ones.


Just because it fails one test in a particular way doesn't mean it lacks reasoning entirely. It clearly does have reasoning, based on all the benchmarks it passes.

You are really trying to make it not have reasoning for your own benefit


> You are really trying to make it not have reasoning for your own benefit

This whole thread really seems like it's the other way around. It's still very easy to make ChatGPT spit out obviously wrong answers depending on the prompt. If it had an actual ability to reason, as opposed to just generating a continuation of your prompt, the quality of the prompt wouldn't matter as much.


Then why does it do so well on all the reasoning benchmarks?


GPT 3.5 is VERY dumb when compared to GPT 4. Like, the difference is massive.


GPT-4 still does a lot of dumb stuff on this question; you see several people post an outright wrong answer and say "Look how GPT-4 solved it!". That happens quite a lot in these discussions, so it seems like the magic to get GPT-4 to work is that you just don't check its answers properly.


It's still a tool after all.

I've had to work with imperfect machines a lot in my recent past. Just because sometimes it breaks, doesn't mean it's useless. But you do have to keep your eyes on the ball!


> It's still a tool after all.

I think that's the crux of the whole argument. It's an imperfect (but useful) tool, which sometimes produces answers that make it seem like it can reason, but it clearly can't reason on its own in any meaningful way


A smart hammer that sometimes unavoidably hits your thumb. How smart!


There's a reason you see people walking around in hard hats and steel toed boots in some companies. It's not because everything works perfectly all the time!


Yeah, but let's not pretend regular hammers don't exist and probably already did the job fine and more safely.


An argument as old as rocks! ;-)

https://www.youtube.com/watch?v=nyu4u3VZYaQ


I ran this through the GPT-4 Advanced Data Analysis version: https://chat.openai.com/share/b84feb03-22ed-4231-be41-cdb725...

Seems like it reasons its way to this answer at the end, to me: "Mind you, while averages are insightful, they don't capture the delightful unpredictability of each individual run. Would you like to explore this delightful chaos further, or shall we move on to other intellectual pursuits?"


https://chat.openai.com/share/a9806bd1-e5a9-4fea-981b-2843e6...

Took a bit of massaging, and I enabled the Data Analysis plugin, which lets it write Python code and run it. It looks like the simulation code is correct, though.


>Let's assume you draw x blue balls in 100 draws. Then you would have drawn 100−x red balls.

Uhm.


I came at it from a different angle. The simulation code in my case had a bug which I needed to point out. Then it got a similar final answer.


It's not reasoning. It's word prediction. At least at the individual model level. OpenAI is likely using a collection of models.


ChatGPT is trained on text that includes most reasoning problems that people come up with.

You see reasoning issues when you use more real world examples, rather than theoretical tests.

I had 4 failure states.

1) Summarization: It summarized 3 transcripts correctly; for the fourth, it described the speaker as a successful VC. The speaker was a professor.

2) It was to act as a classifier, with a short list of labels. Depending on the length of text, the classifier would swap over to text gen. Other issues included novel labels, new variations of labels, and so on.

3) Agents - This died on the vine. Leave aside having to learn async, vector DBs, or whatever. You can never trust the output of an LLM, so you can never chain agents.

4) I focused on using ChatGPT to complete a project. I hadn't touched HTML ever - the goal was to use ChatGPT to build the site. This would cover design, content, structure, development, hosting, and improvements.

I still have trauma. Wrong code and bad design were the baseline issues. If the code was correct, it simply meant I had dug a deeper grave. I had anticipated 70% of the work being handled by ChatGPT; it ended up at 30% at most.

ChatGPT is great IF you already are a subject expert - you can brush over the issues and move on.

"Hallucinations" is the little bit of string that you pull on, and the rest unravels. There are no hallucinations, only humans can hallucinate - because we have an actual ground truth to work with.

LLMs are only creating the next token. For them to reason, they would have to hold structures and proxies in some data store and actively alter them.

It's easier to see once you deal with hallucinations.


What is your definition?


If it can solve basic logic problems, then it can reason. And if it can write the code for a new game with new logic, then it can reason for sure.

Example of basic problem: In a shop, there are 4 dolls of different heights P,Q,R and S. S is neither as tall as P nor as short as R. Q is shorter than S but taller than R. If Kittu wants to purchase the tallest doll, which one should she purchase? Think step by step.
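(A quick brute-force check of that puzzle - my own sketch - just to show what a correct chain of reasoning has to arrive at:)

    from itertools import permutations

    # Try every ordering of the dolls (tallest first) and keep the ones
    # consistent with the constraints: S < P, S > R, R < Q < S.
    valid = []
    for order in permutations("PQRS"):
        h = {d: len(order) - i for i, d in enumerate(order)}  # bigger = taller
        if h["S"] < h["P"] and h["S"] > h["R"] and h["R"] < h["Q"] < h["S"]:
            valid.append(order)

    print(valid)  # [('P', 'S', 'Q', 'R')] -> Kittu should purchase doll P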



Really?


> And that's because, despite all of its training data, it's not capable of actually reasoning.

Your conclusion doesn't follow from your premise.

None of these models are trained to do their best on any kind of test. They're just trained to predict the next word. The fact that they do well at all on tests they haven't seen is miraculous, and demonstrates something very akin to reasoning. Imagine how they might do if you actually trained them or something like them to do well on tests, using something like RL.


> None of these models are trained to do their best on any kind of test

How do you know GPT-4 wasn't trained to do well on these tests? They didn't disclose what they did for it, so you can't rule out that it was optimized for them. That could be the magic sauce.


They are trained to predict next tokens in a stream.

That is the learning algorithm.
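(Concretely, the training objective is just next-token cross-entropy. A minimal sketch, assuming a PyTorch-style autoregressive model; "model" here is a stand-in, not any specific system:)

    import torch.nn.functional as F

    def next_token_loss(model, tokens):
        # tokens: (batch, seq_len) integer token ids
        logits = model(tokens[:, :-1])   # predictions from every prefix
        targets = tokens[:, 1:]          # the "label" is simply the next token
        return F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            targets.reshape(-1),
        )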

The algorithm they learn in response is quite different, since that learned algorithm is based on the training data.

In this case the models learn to sensibly continue text or conversations. And they are doing it so well it’s clear they have learned to “reason” at an astonishing level.

Sometimes, not as good as a human.

But in a tremendous number of ways they are better.

Try writing an essay about the many-worlds interpretation of the quantum field equation, from the perspective of Schrödinger, with references to his personal experiences, using analogies with medical situations, formatted as a brief for the Supreme Court, in Dr. Seuss prose, in a random human language of choice.

In real time.

While these models have some trouble with long chains of reasoning, and with reasoning about things they don't have experience of (different modalities, although sometimes they are surprisingly good), it is clear that they can also reason by combining complex information drawn from their whole knowledge base much faster and more sensibly than any human has ever come close to.

Where they exceed us, they trounce us.

And where they don’t, it’s amazing how fast they are improving. Especially given that year to year, biological human capabilities are at a relative standstill.

——

EDIT: I just tried the above test. The result was wonderful whimsical prose and references that made sense at a very basic level, which a Supreme Court of 8-year-olds would likely enjoy, especially if served along with some Dr. Seuss art! In about 10-15 seconds.

Viewed as a solution to an extremely complex constraint problem, that is simply amazing. And far beyond human capabilities on this dimension.


You are right that the process involves predicting words from training data. But you can still make training data focused on passing these tests. Adding millions of test questions to the training data to optimize for answering test questions is perfectly doable when you have the resources OpenAI has.

A strong hint as to what they focused on in their training process is which metrics they used in their marketing of the model. You should always bet on models being optimized to perform on whatever metrics they themselves give you when they market the model. Look at the GPT-4 announcement: what metrics did they market? So what metrics should we expect they optimized the model for?

Exam results are the first metric they mention, so exams were probably one of their top priorities when they trained GPT-4.

https://openai.com/research/gpt-4


Yes, absolutely. They can adjust performance priorities.

By the relative mix of training data, additional fine-tuning phases, and/or pre-prompts that give the model extra guidance relative to particular task types.


>The fact that they do well at all on tests they haven't seen

Haven't they seen these tests?

We know little to nothing of how these models get trained.


LLMs are trained to predict text, and one result of this is that the LLM has as many "faces" as exist in the training data, so it's going to be _very_ different depending on the prompt. It's not a consistent entity like a human. RLHF is an attempt to mitigate this, but it doesn't work perfectly.


I'm often confused by claims about the reasoning capabilities. It is often mentioned in debates as a clear and undeniable issue with current LLMs. Since this claim keeps being made, where are these tests of reasoning skills that GPT-4 fails?

If it’s a debate on the illusion of reasoning, I’d be careful how I step here, because it’s been found these things probably work so well because the human brain is also a biological real-time prediction machine and “just” guessing too: https://www.scientificamerican.com/article/the-brain-guesses...


Isn't that the same as for humans? If you are speaking with me (prompting), my answers will be different, based on how you prompted me for an answer.


> I was mesmerized by GPT-3's ability to understand concepts

This language embodies the anthropomorphic assumptions that the author is attacking.


Or the corollary: that there's really no such thing as anthropomorphic. There are inputs and outputs, and an observer's opinion on how well the outputs relate to the inputs. The thing producing the outputs, and the observer, can each be human or not human. Same difference.


It absolutely is anthropomorphizing to claim "GPT-3's ability to understand concepts" rather than simply calling it "reproduce, mix and match text from an enormous corpus". And we can totally legitimately compare to a jury of human observers' opinions on how well(/badly) the output generated relates to the inputs.

For the specific example the OP cited "War: like being eaten by a dragon and then having it spit you out"

then unless its answer to "Where were you in between being eaten by a dragon and before it spat you out?" is "in the dragon's digestive system" that isn't understanding.

And I'm curious to see it answer "Dragons only exist in mythology; does your analogy mean war doesn't exist either? Why not compare to an animal that exists?"


> "War: like being eaten by a dragon and then having it spit you out"

This exact text, and the response (several attempts), is flagged and censored with the ChatGPT-4 web interface. :-|


Anyone who has used GPT-4 for any amount of time knows it is definitely not "reproduce, mix and match text from an enormous corpus". Also what's the point of the question. GPT has been specifically trained to not answer this. But in any case here is the answer with instruct model which generally is more open ended:

Sure, I can compare it to a real animal such as bears. Just like dragons are depicted as fearsome and destructive creatures in mythology, bears are also portrayed as powerful and dangerous animals in many cultures. However, just as the existence of dragons is a matter of fiction, the concept of a "war on bears" would also be nonsensical.

In reality, the issue of war is complex and multifaceted, involving political, economic, and societal factors. It cannot be reduced to a simplistic battle between humans and a particular animal species.

Furthermore, the use of animals in analogies can be problematic as it may trivialize the gravity and complexity of real-world issues. Wars and conflicts have serious consequences for human lives, whereas animals do not possess the same agency or ability to wage war.

In conclusion, while dragons may serve as a metaphor for war in mythology, comparing war to a fictional creature diminishes the seriousness of the topic and fails to capture the complexities of real conflicts.


>Also what's the point of the question. GPT has been specifically trained to not answer this. But in any case here is the answer with instruct model which generally is more open ended:

It would demonstrate basic reasoning skills that weren't things one would "reproduce, mix and match text from an enormous corpus". Like the response you provided, which is meaningless word salad. It's a prima facie takedown of your post.

This is like people who hate poetry insisting their bad poetry is good poetry. Why? Because who else is to say otherwise! Well, the good poets. The people who appreciate poetry will know the difference. Everyone else won't care, save for those invested in having to sell their bad poetry as good.


What has poetry to do with reasoning? You should think of GPT as a terse person who refuses this kind of thing. Certainly there are people like that who have good reasoning skills but can't answer your question in a poetic way (I being one).


It's an analogy - something someone who hates poetry would hate


Can AI people stop with the defense of "what if thing really is not a thing?" "what if thing is really what humans do?" These aren't answers to questions. It's deflecting nonsense posed as philosophical thought.


This.

We are in a Cambrian Explosion on the software side and hardware hasn’t yet reacted to it. There’s a few years of mad discovery in front of us.

People have different impressions as to the shape of the curve that's going up and to the right, but only a fool would not stop and carefully take in what is happening.


Exactly and things are actually getting crazy now. Pardon the tangent but for some reason this hasn't reached the frontpage on HN yet: https://github.com/OpenBMB/ChatDev

Making your own "internal family system" of AIs is making this exponential (and frightening) - like an ensemble on top of the ensemble, with specific "mindsets" that, with shared memory, can build and do stuff continuously. Found this from a comp sci professor on TikTok, so be warned: https://www.tiktok.com/@lizthedeveloper/video/72835773820264...

I remember a couple of comments here on HN when the hype began about how some dude thought he had figured out how to actually make an AGI - can't find it now, but it was something about having multiple AIs with different personalities discoursing with a shared memory - and now it seems to be happening.

This, coupled with access to Linux containers that can be spawned on demand, means we are in for a wild ride!


I saw chatdev on hn and have been pretty disappointed with it :(

Haven’t had it make anything usable that’s more complicated than a mad lib yet


> If the speed of improvement stays anywhere near the level it was the last two years, then over the next two decades, it will lead to massive changes in how we work and which skills are valuable.

That's a big assumption to make. You can't assume that the rate of improvement will stay the same, especially over a period of 2 decades, which is a very long time. Every advance in technology hits diminishing returns at some point.


Why do you think so?

Technological progress seems to be accelerating rather than diminishing, to me.

Computers are a great example: they have been getting exponentially more capable over the last decades.

In terms of performance (memory, speed, bandwidth) and in terms of impact. First we had calculators, then we had desktop applications, then the Internet and now we have AI.

And AI will help us get to the next stage even faster.


I'm not putting my coins on these advances.

More likely this will become the new “search” technology and get polluted with ads. People will lose trust and it will decay.


That is certainly where the economic incentives appear to be.


A lot of the progress in the last 3-4 years was predictable from GPT-2 and especially GPT-3 onwards - combining instruction following and reinforcement learning with scaling GPT. With research being more closed, this isn't so true anymore. The mp3 case was predictable in 2020 - some early twitter GIFs showed vaguely similar stuff. Can you predict what will happen in 2026/7 though, with multimodal tech?

I simply don't see it as being the same today. The obvious element of scaling, or techniques that imply a useful overlap, isn't there. Whereas before, researchers brought together excellent and groundbreaking performance on different benchmarks and areas as they worked on GPT-3, since 2020, except for instruction following, less has been predictable.

Multimodal could change everything (things like the ScienceQA paper suggest so), but it also might not shift benchmarks. It's just not so clear that the future is as predictable, or will move faster, than the last few years. I do have my own beliefs, similar to Yann LeCun's, about what architecture - or rather infrastructure - makes most sense intuitively going forward, and there's not really the openness we used to have from top labs to know if they are going these ways or not. So you are absolutely right that it's hard to imagine where we will be in 20 years, but in a strange way, because it is much less clear than it was in 2020 where we will be three years out, I would say progress is much less guaranteed than many feel...


I was also thinking about how quickly AI may progress and am curious for your or other people's thoughts. When estimating AI progress, estimating orders of magnitude seems like the most plausible way to do it, just as Moore's law has guessed the magnitude correctly for years. For AI, it is known that performance increases roughly linearly when the model size increases exponentially. Funding currently increases exponentially, meaning that performance will increase linearly. So, AI performance will keep improving linearly only as long as funding keeps growing exponentially. On top of this, algorithms may be made more efficient, which may occasionally yield an order-of-magnitude improvement. Does this reasoning make sense? I think it does, but I could be completely wrong.
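(As a toy illustration of that assumption - the numbers and the log curve are made up, purely for illustration: if capability tracks roughly log(compute), a budget that grows 10x per year buys only a constant increment per year.)

    import math

    def toy_capability(compute_flops):
        # Hypothetical scaling curve: capability ~ log10(compute), arbitrary units.
        return math.log10(compute_flops)

    budget = 1e21
    for year in range(5):
        print(year, round(toy_capability(budget), 1))  # up by a constant 1.0 each year
        budget *= 10  # exponential funding growth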


You can check my post history to see how unpopular this point of view is, but the big "reveal" that will come up is as follows:

The way that LLMs and humans "think" is inherently different. Giving an LLM a test designed for humans is akin to giving a camera a 'drawing test.'

A camera can make a better narrow final output than a human, but it cannot do the subordinate tasks that a human illustrator could, like changing shadings, line width, etc.

An LLM can answer really well on tests, but it often fails at subordinate tasks like 'applying symbolic reasoning to unfamiliar situations.'

Eventually the thinking styles may converge in a way that makes the LLMs practically more capable than humans on those subordinate tasks, but we are not there yet.


Most of the improvements apparently come from training larger models with more data. Which is part of the problem mentioned in the article - the probability that the model just memorizes the answers to the tests is greatly increased.

AI is getting subjectively better, and we need better tests to figure out if this improvement is objectively significant or not.


> Most of the improvements apparently come from training larger models with more data.

OpenAI is reportedly losing 4 cents per query. With a thousandfold increase in model size, and assuming cost scales linearly, that's a problem (4 cents per query becomes roughly $40 per query). Training time is going to go up too. Moore's law isn't going to help any more. Algorithmic improvements may help... if any significant ones can be found.


That’s backwards.

Training a model on more data improves generalization, not memorization.

To store more information in the same number of parameters requires the commonality between examples to be encoded.

In contrast, training on less data, especially if it is repeated, lets the network learn to provide good answers for that limited set without generalizing - i.e., memorizing.

——

It’s the same as with people. The more variations people see of something, the more likely they intuit the underlying pattern.

The fewer examples, the more likely they just pattern match.


> It’s the same as with people. The more variations people see of something, the more likely they intuit the underlying pattern.

> The fewer examples, the more likely they just pattern match.

A kid who uses a calculator and just fills in the answer to every question will see a lot more examples than a kid that learned by starting from simple concepts and understanding each step. But the kid who focused on learning concepts and saw way fewer problems will obviously have a better understanding here.

So no, you are clearly wrong here; humans don't learn that way at all. These models learn that way, you are right about that, but humans don't.


I have no idea where your calculator came from.

In neither case did I introduce one.

And since the calculator itself already has a general understanding, it would seem completely counterproductive to start training a computer or a child by first giving them a machine that has already solved the problem.

Also, for what it’s worth, I am speaking from many years experience not just training models but creating the algorithms that train them.


Replace "uses calculator" to "looks through solved problems", same thing. Not sure what you don't understand. Humans don't build understanding by seeing a lot of solved examples.

To make a human understand, we need to explain how things work to them. You don't just show examples. A human who is just shown a lot of examples won't understand much at all, even if he tries to replicate them.

> Also, for what it’s worth, I am speaking from many years experience not just training models but creating the algorithms that train them.

What does this have to do with how humans learn?


Humans learn vast amounts of information from examples.

They learn their first words, how to walk, what a cat looks like from many perspectives, how to parse a visual scene, how to parse the spoken word, interpret facial expressions and body language, how different objects move, how different creatures behave, different materials feel, what things cause pain, what things taste like and how they make them feel, how to get what they want, how to climb, how not to fall, all by trial & example. On and on.

And yes, as we get older we get better and better at learning 2nd hand from others verbally, and when people have the time to show us something, or with tools other people already invented.

Like how a post-trained model picks up on something when we explain it via a prompt.

But that is not the kind of training being done by models at this stage. And yet they are learning concepts (pre-prompt) that, as you point out, you & I had to have explained to us.


> Like how a post-trained model picks up on something when we explain it via a prompt.

Models don't learn by you telling them something; the model doesn't update itself. A human updates their model when you explain how something works to them, that is the main way we teach humans. Models don't update themselves when we explain how something works to them; that isn't how we train these models, so the model isn't learning, it's just evaluating. It would be great if we could train models that way, but we can't.

> Humans learn vast amounts of information from examples.

Yes, but to understand things in school, those examples come with an explanation of what happens. That explanation is critical.

For example, a human can learn to perform legal chess moves in minutes. You tell them the rules each piece has to follow, and then they will make legal moves in almost every case. You don't do it by showing them millions of chess boards and moves; all you have to do is explain the rules, and the human then knows how to play chess. We can't teach AI models that way, and that makes human learning and machine learning fundamentally different still.

And you can see how teaching rules creates a more robust understanding than just showing millions of examples.
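As a toy illustration of what "explaining the rules" looks like when encoded directly (my own sketch, nothing to do with how these models are actually trained): a couple of lines capture a rule that would otherwise have to be inferred from piles of example games.

  # Knight-move legality on an empty board, stated as a rule rather than learned from examples.
  def knight_move_is_legal(frm, to):
      file_diff = abs(ord(to[0]) - ord(frm[0]))
      rank_diff = abs(int(to[1]) - int(frm[1]))
      return sorted((file_diff, rank_diff)) == [1, 2]

  print(knight_move_is_legal("g1", "f3"))  # True
  print(knight_move_is_legal("g1", "g3"))  # False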


> you explain how something works to them, that is the main way we teach humans

I am curious who taught you to recognize sounds, before you understood language, or how to interpret visual phenomena, before you were capable of following someone’s directions.

Or recognize words independent of accent, speed, pitch, or cadence. Or even what a word was.

Humans start out learning to interpret vast amounts of sensory information, and to predict the results of their physical motor movements, from a constant stream of examples.

Over time they learn the ability to absorb information indirectly from others too.

This is no different from models, except that, it turns out, they can learn more things, at a higher degree of abstraction, just from examples than we can.

And work on their indirect learning (i.e. long-term retention of information we give them via prompts) is just beginning.

But even as adults, our primary learning mode is experience: the example situations we encounter non-stop as we navigate life.

Even when people explain things, we generalize a great deal of nuance and related implications beyond what is said.

“Show, don’t tell”, isn’t common advice for no reason. We were born example generalizers.

Then we learn to incorporate indirect information.


You are right, but I think it is really important to have this difference in learning in mind, because not being able to learn rules during training is the main weakness in these models currently. Understanding that weakness and how that makes their reasoning different from humans is key both to using these models and for any work on improving them.

For example, you shouldn't expect it to be able to make valid chess moves reliably; that requires reading and understanding rules, which it can't do during training. It can get some understanding at evaluation time, but we really want to be able to encode that understanding into the model itself rather than have to supply it at eval time.


Yes, agreed, you are right too.

There is a distinction between reasoning skills learned inductively (generalizing from examples), and reasoning learned deductively (via compact symbols or other structures).

The former is better at recognition of complex patterns, but can only incorporate some basic deduction steps.

But explicit deduction, once it has been learned, is a far more efficient method of reasoning, and opens up our minds to vast quantities of indirect information we would never have the time or resources to experience directly.

Given how well models can do at the former, it’s going to be extremely interesting to see how quickly they exceed us at the latter - as algorithms for longer chains of processing, internal “whiteboarding” as a working memory tool for consistent reasoning over many steps and many facts, and long term retention of prompt dialogs, get developed!


I pretty much want the LLM to be great at memorizing things. That's what I'm not great at.

If it had perfect recall I would be so thrilled.

And just because it's memorized the data--as all intelligences would need to do to spit data out--doesn't mean it can't still do useful operations on the data, or explain it in different words, or whatever a human might do with it.


Do we? I use gpt-4 daily and it matters not to me what the source of the "intelligence" is. It's subjective what "intelligence" even means. It's subjective how the brain works. Almost by definition AI is "things that can't be objectively measured".


What's the benefit of doing this vs copying one of the many (far superior) Javascript mp3 players on the internet, such as here?

https://freefrontend.com/javascript-music-players/


It'd be a bit faster to get up and running with ChatGPT. With the AI, you'd have to phrase the instruction and copy the output into a file. With search, you have to do both of those things and also learn a UI that wasn't built to your taste.


Almost nothing happened in AI for about 50 years. That's the norm in the field.


I got curious and did this myself. Needed a bit of nudging to get where I wanted, but I even had it make an Electron wrapper:

https://chat.openai.com/share/29d695e6-7f23-4f03-b2be-29b7c9...


This is awesome, thanks for sharing.

Do you (or anyone) know of any products that allow for iterating on the generated output through further chatting with the AI? What I mean is that each subsequent prompt here either generated a whole new output or new chunks to add to the output. Ideally, whether generating code or prose, I’d want to keep prompting about the generated output, and the AI would further modify the existing output until it’s refined to the degree I want.

Or is that effectively what Copilot/Cursor do and I’m just a bad operator?


> Do you (or anyone) know of any products that allow for iterating on the generated output through further chatting with the AI? What I mean is that each subsequent prompt here either generated a whole new output or new chunks to add to the output. Ideally, whether generating code or prose, I’d want to keep prompting about the generated output, and the AI would further modify the existing output until it’s refined to the degree I want.

ChatGPT does this.


No problem, it was a fun morning exercise for me :)

Copilot, at least from what little I did in VS Code, isn't as powerful as this. I think there's a GPT-4 mode for it that I haven't played with that'd be a lot closer to this.


I used GPT-4 to write a script that I can trigger over SSH from my iPhone to an M1 Mac, which downloads the MP3 from a YouTube URL on my iPhone clipboard. The only thing I am missing is automating the sync step so that, when the iPhone is on the same home Wi-Fi, the MP3 gets added to the Music app.
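For anyone curious, a minimal sketch of what such a script might look like (my guess, not the poster's actual code; it assumes yt-dlp is installed on the Mac and the URL is passed as an argument over SSH):

  import os, subprocess, sys

  url = sys.argv[1]  # e.g. invoked as: ssh mac 'python3 grab_mp3.py "<url>"'
  out = os.path.join(os.path.expanduser("~/Music"), "%(title)s.%(ext)s")
  # yt-dlp: -x extracts audio, --audio-format mp3 converts it, -o sets the output template
  subprocess.run(["yt-dlp", "-x", "--audio-format", "mp3", "-o", out, url], check=True)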


> Two years ago it didn't even dawn on me that this would be my way of writing software in the near future

So you were ignorant two years ago: GitHub Copilot was already available to users back then. The only big new thing in the past two years was GPT-4, and nothing suggests anything similar will come in the next two years. There are no big new things on the horizon; we knew for quite a while that GPT-4 was coming, but there isn't anything like that this time.


Copilot was not around when I wrote the Tweet.

But when Copilot came out, I was indeed ignorant! I remember when a friend showed it to me for the first time. I was like "Yeah, it outputs almost-correct boilerplate code for you. But thankfully I code in a way that I don't have to write boilerplate". I didn't expect it to be able to write fully functional tools and understand them well enough to actually produce pretty nice code!

Regarding "there isn't anything like that this time": quite the opposite! We have not figured out where using larger models and throwing more data at them will level off. This could go on for quite a while. With FSD 12, Tesla is already testing self-driving with a single large neural net, without any glue code. I am super curious how that will turn out.

The whole thing is just starting.


Well, my point is that you perceive progress to be fast since you went from not understanding what existed to later getting in on it. That doesn't mean progress was that fast, it means that you just discovered a new domain.

Trying to extrapolate actual progress is bad in itself, but trying to extrapolate your perceived progress is even worse.


Yeah, you have hit the nail on the head here. A lot was predictable once you saw that GPT-2 could reasonably stay within a language and generate early coherent structures; that, coming at the same time as instruction-following with the T5 work and the widespread use of embeddings from BERT, told us this direction was likely. It's just that for many people this came to awareness in 2021/22 rather than during the 2018-2020 ramp-up the field and hobbyists experienced.


Whisper, Stable Diffusion, Voicebox, GPT-4 Vision, DALL·E 3

Other breakthroughs in graph machine learning https://towardsdatascience.com/graph-ml-in-2023-the-state-of...


Those are image/voice generation; the topic is the potential replacement of knowledge workers such as coders. The discussion about image/voice generation is a very different topic, since nobody thinks those are moving towards AGI and nobody argued they were "conscious", etc.


AI hype is really problematic in Enterprise. Big companies are now spending C-suite executive time figuring out a company "AI strategy". This is going to be another cycle of money wasted and business upset, very similar to what I have seen with Big Data. The thing in Enterprise is that everyone serious about business operations knows AI test scores and AI quality are not there, but very few are able to communicate these concerns in a constructive way; rather, everyone is embracing the hype because, well, maybe they get a promotion? Tech, as usual, is very happy to feed the hype and never, as usual, tells businesses honestly that, at best, this is an incremental productivity improvement, nothing life-changing. I think the issue is an overall lack of honesty, professionalism, and accountability across the board, with tech leading this terrible way of pushing product and "adding value".


> Tech, as usual, is very happy to feed the hype

I agree completely with you on this.

In defence of the executives, however, some businesses will be seriously affected. Call centres and plagiarism scanners have already been affected, but it’s unclear which other industries will be too. Maybe the probability is low, but the impact could be very high. I think this reasoning is driving the executives.


Look, I am going to wait and see on this; maybe new facts will make me reconsider. In the meanwhile, GitHub Copilot is just a cost to my company; I haven't seen much additional productivity. I guess my concern, given how hard it is to hire developers and technologists, is replacing simpler job roles, like a customer service representative, with complicated new ones, like "MLOps Engineer".


Copilot's cost is a joke unless you're running a sweatshop - it only has to boost productivity by minutes per month to justify the cost, considering dev salaries.

Personally I think it's priced perfectly - it's a really good typing assistant for obvious code, and helps me stay in flow longer.

In fact I'd pay double for a version with half the latency.


No, it's a joke that you tell me it's OK to pay for something with no clear ROI. Feel free to live in fantasy land, but when you run a business, costs count. This is exactly what I dislike about tech: a lot of talk but zero accountability when it comes to how they actually impact the bottom line.


>it's a joke that you tell me it's ok to pay for something with no clear ROI.

Can you measure the bottom line impact of using CI/CD, IDEs, static code analysis, source control, whatever tool? If you don't know the exact numbers and are just guesstimating - are you actually accounting for the costs or just moaning because you don't like the tool? Who even works with exact ROI numbers for these kinds of decisions? I can't think of a scenario where accurately determining the ROI of any one thing is possible and it doesn't reduce to gut checks. Pretending it can be measured sounds as naive as people trying to measure developer productivity with fixed metrics.

The cost of Copilot is so low that it falls under discretionary spending - it would take more time to figure out the actual value than to just pay for the people who want it. People have already figured out that it's better to allocate a budget to individuals, let them decide which tools work for them, and go through the purchase requisition and approval dance only for big-ticket/external-dependency items where the impact is worth the time spent on making the decision.


Isn't this the same failing that prevents us from funding basic research or infrastructure? It obviously has positive ROI, but because you can't estimate it more narrowly than between "big" and "huge", you assume it's negative and reject the idea?


It’s rational herd dynamics for the execs. Going against the herd and being wrong is a career ender. Going with the herd and being wrong will be neutral at worst.


Going against the herd and being right can also be a career ender.


Blindly following a trend will likely not end well. But even with previous hype cycles, those companies that identified good use cases, validated those use cases, and executed the projects solidly leaped ahead. Big Data was genuinely of value to plenty of organizations, and a waste of time for others. IoT was crazy for plenty of orgs ... but was also really valuable to certain segments. Gartner's hype cycle ends with the plateau of productivity for a reason ... you just have to go through the trough of disillusionment first, which is going to come from the great multitude of failed and ill-conceived projects.


Identifying an “AI strategy” seems backwards. What they should be doing is identifying the current problems and goals of the company and reassessing how best to accomplish them given the new capabilities which have surfaced. Perhaps “AI” is the best way. Or maybe simpler ways are better.

I’ve said it before, but as someone to whom “AI” means something more than making API calls to some SAAS, I look forward to the day they hire me at $300/hour to replace their “AI strategy” with something that can be run locally off of a consumer-grade GPU or cheaper.


Agreed. I think there is a FOMO phenomenon among C-level execs that is generating a gigantic waste of money and time, creating distractions around “AI strategy”.

It started a few years back and is now really inflamed by LLMs, because of the consumer-level hype and general media reporting about them.

You can see it in the multiple AI startups capturing millions in VC capital for absolutely bogus value propositions. Bizarre!


The problem with your premise is that you're already drawing conclusions about the potential of AI and deciding it is hype. Perhaps decades ago someone could have equally criticised "Internet hype" and "mobile hype" and would look foolish now.


Also, decades ago someone criticised "big data hype" and "microservices hype" and looks right now. Doing things just out of FOMO is rarely a good business decision. It can pay off - even a broken clock is right twice a day - but it's definitely bad to follow every new thing just because Gartner mentioned it. I'm not giving advice, of course, but having seen enterprises bet good money even on NFTs, I tend to treat every new enterprise PowerPoint idea with a certain dose of skepticism.


Businesses can work on more than one thing at once. They typically take on any number of risks with the things they invest in. Proper risk management ensures you've not overcommitted assets to the point of an unrecoverable loss.

Some businesses in some industries can follow a strategy of "never do anything until it's a well established process", others cannot.


Yes, hype exists and some things we thought were promising turned out not to be. However, if anyone is making the case that we know enough today to claim that AI is mostly hype, I think that's foolish.


Given the adoption of microservices and tech like Kubernetes, I’d say you’re pretty wrong in judging that one.


> because, maybe they get a promotion?

While I agree with you in general, I don't think this bit is particularly fair. I'd say we know the limitations, and we also know that using LLMs might bring some advantage, and the companies that are able to use it properly will have a better position, so it makes sense to at least investigate the options.


> AI hype is really problematic in Enterprise.

This only appears so because we here have some insight into the domain. But there have always been hype cycles. We just didn't notice them so readily.

The speed with which this happens makes me suspect there is a hidden "generic hype army" that was already in place, presumably hyping the last thing, and ready to jump on this thing.


In consulting all we hear is sell, sell, sell AI, so I'm sure my industry isn't helping at all. I'm not on board yet; I just don't see a use case in enterprise beyond learning a knowledge base to make a more conversational self-help search and things like that. It's great that it can help write a function in JavaScript, but that's not a watershed moment... yet. Curious to see AI project sales at the end of 2024 (everything in my biz is measured in units of $$).


In my case, executives were more focussed on how it could be built into new projects, presales, etc., rather than internal efficiency improvements. A lot of people were amazed to see someone getting value out of it (efficiency gains) without building stuff around it. Blew my mind that this was the case.


Publicly listed companies whose traditional business model is under pressure are incentivized to hype, because if they don’t sell their wary investors on the idea of sustained growth, the cautionary tale of Twitter (a valuation low enough to lose control) awaits.

In capitalism, you grow or you die, and sometimes you need to bullshit people about growth potential to buy yourself time.


Yes, sad but true.


This is exactly correct.


I remember watching a documentary about an old blues guitar player from the 1920's. They were trying to learn more about him and track down his whereabouts during certain periods of his life.

At one point, they showed some old footage which featured a montage of daily life in a small Mississippi town. You'd see people shopping for groceries, going on walks, etc. Some would stop and wave at the camera.

In the documentary, they noted that this footage exists because at the time, they'd show it on screen during intermission at movie theaters. Film was still in its infancy at that time, and so novel that people loved seeing themselves and other people on the big screen. It was an interesting use of a new technology, and today it's easy to understand why it died out. Of course, it likely wasn't obvious at the time.

I say all that because I don't think we can know at this point what AI is capable of, and how we want to use it, but we should expect to see lots of failure while we figure it out. Over the next decade there's undoubtedly going to be countless ventures similar to the "show the townspeople on the movie screen" idea, blinded by the novelty of technological change. But failed ventures have no relevance to the overall impact or worth of the technology itself.


> it's easy to understand why it died out

I think it's probably more sociological than technical. People love to see themselves and their friends/family. My work has screens that show photos of events and it always causes a bit of a stir ("Did you see X's photo from the summer picnic?") Yearbooks are perennially popular and there's a whole slew of social media.

However, for this to be "fun", there must be a decent chance that most people in the audience know a few people in a few of the pictures. I can't imagine this working well in a big city, for example, or a rural theatre that draws from a huge area.


Selfies and 15 second videos still exist as shorts and tiktoks.


What died out? Film?


The practice of filming a montage around your local neighborhood or town to play during intermission. Though you could say intermission as well, since that was a legacy concept that was inherited from plays and eventually died out as well.


The idea of an intermission is something that I think should be entertained. 2001: A Space Odyssey has a 10-minute intermission, for instance. It can also split up potentially longer pieces.

“The length of a film should be directly related to the endurance of the human bladder.” - Alfred Hitchcock


IIRC Tarantino experimented with it in The Hateful Eight, in some theaters.


>What died out?

The custom of showing film consisting of footage of the general public in movie theaters.


Showing locals little movie clips of themselves in intermissions at the local theater.


The debate over whether LLMs are "intelligent" seems a lot like the old debate among NLP experts over whether English must be modeled as a context-free grammar (push-down automaton) or can be handled by finite-state machines (regular expressions). Yes, you can approximate any language with regular expressions; you just need an insane number of FSM states (perhaps billions). And that seems to be the model that LLMs are using to model cognition today.

LLMs seem to use little or no abstract reasoning (is-a) or hierarchical perception (has-a), as humans do -- both of which are grounded in semantic abstraction. Instead, LLMs can memorize a brute-force explosion of finite state machines (interconnected with Word2Vec-like associations) and then traverse those machines and associations as some kind of mashup, akin to a coherent abstract concept. Then as LLMs get bigger and bigger, they just memorize more and more mashup clusters of FSMs augmented with associations.

Of course, that's not how a human learns, or reasons. It seems likely that synthetic cognition of this kind will fail to enable various kinds of reasoning that humans perceive as essential and normal (like common sense based on abstraction, or physically-grounded perception, or goal-based or counterfactual reasoning, much less insight into the thought processes / perceptions of other sentient beings). Even as ever-larger LLMs "know more" by memorizing ever more FSMs, I suspect they'll continue to surprise us with persistent cognitive and perceptual deficits that would never arise in organic beings that do use abstract reasoning and physically grounded perception.
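A toy version of the old FSM-vs-grammar point (my own example, not from the article): a regex with a fixed amount of state only copes with nesting up to a hard-coded depth, while a short general rule handles any depth.

  import re

  depth2 = re.compile(r"^\((?:\([^()]*\))*\)$")   # hand-unrolled pattern: only copes with two levels of nesting

  def balanced(s):                                # the general "rule": an unbounded counter
      depth = 0
      for c in s:
          depth += {"(": 1, ")": -1}.get(c, 0)
          if depth < 0:
              return False
      return depth == 0

  print(bool(depth2.match("(()())")), balanced("(()())"))  # True True
  print(bool(depth2.match("((()))")), balanced("((()))"))  # False True  <- depth 3 breaks the regex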


> LLMs can memorize a brute force explosion in finite state machines (interconnected with Word2Vec-like associations) and then traverse those machines and associations as some kind of mashup, akin to a coherent abstract concept.

That's actually the closest thing to a working definition of what a concept is. The discussion about language representation has little bearing on humans or intelligence, because it's not how we learn and use language. Similarly, the more people - be they armchair or diploma-carrying philosophers - try to find the essence of the meaning of some word, the more they fail, because it seems that the meaning of any concept is defined entirely through associations with other concepts and some remembered experiences. Which again seems pretty similar to how LLMs encode information through associations in high-dimensional spaces.


Can you recommend any books on this?


This really is a good article, and is seriously researched. But the conclusion in the headline - “AI hype is built on flawed test scores” - feels like a poor summary of the article.

It _is_ correct to say that an LLM is not ready to be a medical doctor, even if it can pass the test.

But I think a better conclusion is that test scores don’t help us understand LLM capabilities like we think they do.

Using a human test for an LLM is like measuring a car’s “muscles” and calling it horsepower. They’re just different.

But the AI hype is justified, even if we struggle to measure it.


Two years ago I didn't use AI at all. Now I wouldn't go without it; I have Copilot integrated with Emacs, VSCode, and Rider. I consider it a ground-breaking productivity accelerator, a leap similar to when I transitioned from Turbo Pascal 2 to Visual C 6.

That's why I'm hyped. If it's that good for me, and it's generalizable, then it's going to rock the world.


Lifelong programmer here, and same sentiments; I use it everywhere I can.

I am currently transliterating a language PDF into a formatted lexicon. I wouldn't even be able to do this without Copilot; it has turned a seemingly impossibly arduous task into a pleasurable one.


Coding on something without Copilot these days feels like having my hands tied. I'm looking at you, Xcode and Colab...


I don't think test scores have anything to do with the hype. Most people don't even realize test scores exist.

One part is just the wow factor. It will be short-lived. A bit like VR, which is awesome when you first try it, but wears off quickly. Here, you can have a bot write convincing stories and generate nice-looking images, which is awesome until you notice that the story doesn't make sense and that the images have many details wrong. This is not just a score; it is something you can see and experience.

And there is also the real thing. People have started using GPT for real work. I have used it to document my code, for instance, and it works really well; with it I can do a better job than without, and I can do it faster. Many students use it to do their homework, which may not be something you want, but it is no less of a real use. Many artists are strongly protesting against generative AI; this in itself is telling, because it means it is taken seriously. And at the same time, other artists are making use of it.

It is even used to great effect where you don't notice it. Phone cameras are a good example: by enhancing details using AI, they give you much better pictures than what the optics are capable of. Some people don't like that because the pictures are "not real", but most enjoy the better perceived quality. Then there are image classifiers, speech-to-text and OCR, fuzzy searching, the content-ranking algorithms we love to hate, etc... that all make use of AI.

Note: here AI = machine learning with neural networks, which is what the hype is about. AI is a vague term that can mean just about anything.


> I don't think test scores have anything to do with the hype. Most people don't even realize test scores exist.

They put the test scores front and center in the initial announcement, with a huge image showing improvements on AP exams. It was the main thing people talked about during the announcement and the first thing anyone who read anything about GPT-4 sees.

I don't think many who are hyped about these things missed that.

https://openai.com/research/gpt-4


It is what they talk about during announcements because people like numbers. It looks more serious than "hey look, GPT-4 smart" with some example quotes that everyone knows are cherry-picked. But the real hype comes from people trying it for themselves.

I seriously don't remember hearing these test results being mentioned in any casual conversation, and I heard a lot of casual conversations about AI. The majority of these center around personal experiences ("I asked ChatGPT this and I got that..."), homework is another common topic. When we compare systems, we won't say "this one got a 72 and the other got a 94", but more like "I asked new system to give me a specific piece of code (or cocktail recipe, or anything) and the result is much better". Again, personal experience and anecdotes before scores.

Maybe people in the field hype themselves with scores, but not the general public, and probably not the investors either, who will most likely look at the financial performance of the likes of OpenAI instead.


If you followed the initial announcement, then you were presumably already hyped. The novel thing about ChatGPT has been the sheer number of people who hadn't heard about generative AI in the past glomming onto the technology. Most of these people heard about it via word of mouth. They then tried it themselves and told people about it. They never even heard of the tests, let alone based their perception on them.


This video from Yann LeCun gives a great summary of where things stand. https://www.youtube.com/watch?v=pd0JmT6rYcI

He is of the opinion that the current generation of transformer architectures is flawed and that it will take a new generation of models to get close to the hype.


It's not built on high test scores - while academics do benchmark models on various tests, all the many people who built up the hype mostly did it based on their personal experience with a chatbot, not by running some long (and expensive) tests on those datasets.

The tests are used (and, despite their flaws, useful) to compare various facets of model A to model B - however, the validation of whether a model is good now comes from users, and that validation really can't be flawed much. If it's helpful (or not) to someone, then it is what it is; the proof of the pudding is in the eating.


This article is absurd.

> But when a large language model scores well on such tests, it is not clear at all what has been measured. Is it evidence of actual understanding? A mindless statistical trick? Rote repetition?

It is measuring how well it does _at REPLACING HUMANS_. It is hard to believe that the author does not seem to understand this. I don't care how it obtains its results.

GPT-4 is like a hyperspeed entry-to-mid-level dev that has almost no ability to contextualize. Tools built on top of the 32k context window will allow repo ingestion.

This is the worst it will ever be.


>It is measuring how well it does _at REPLACING HUMANS_

It's possible to do well on a test and have no ability to do the job the test is meant to screen for.

GPT-4 scores well on an advanced sommelier exam, but obviously cannot replace a human sommelier, because it does not have a mouth.


Which tests test specifically for “replacing humans”? That seems like a wild metric to try to capture in a test.

Also an aside:

> This is the worst it will ever be.

I hear this a lot and it really bothers me. Just because something is the worst it’ll ever be doesn’t mean it’ll get much better. There could always be a plateau on the horizon.

It’s akin to “just have faith.” A real weird sentiment that I didn’t notice in tech before 2021.


GPT passed a test on the theoretical fundamentals of selling and serving wine in fancy restaurants. In a human, passing such a test provides a useful signal of job suitability, because people who pass it are often also capable of the physical bits, like theatrically opening wine bottles. But obviously that doesn't work for an AI.

Lots of things usefully correlate with test scores in humans but might not in an AI.


It is measuring how well it does replacing humans - in those tests.


I note something very interesting in the AI hype, and I would like someone to help explain it.

Whenever there's a news item or article noting the limits of current LLM tech (especially the GPT class of models from OpenAI), there's always a comment that says something along the lines of "ah, did you test it on GPT-4?"

Or if it's clear that it's a limitation of GPT-4, then you have comments along the lines of "what's the prompt?", or "the prompt is poor". Usually it's someone who hasn't in the past indicated that they understand that prompt engineering is model-specific, and that the papers' point is to make a more general claim as opposed to a claim about one model.

Can anyone explain this? It's like the mere mention of LLMs being limited in X, Y, Z fashion offends their lifestyle/core beliefs. Or perhaps it's a weird form of astroturfing. To which, I ask, to what end?


> there's always a comment that says something along the lines of "ah did you test it on GPT-4"?

Perhaps because whenever there's "a news or article noting the limits of current LLM tech", it's a bit like someone tried to play a modern game on a machine they found in their parents' basement, and the only appropriate response to this is, "have you tried running it on something other than a potato"? This has been happening so often over the past few months that it's the first red flag you check for.

GPT-4 is still qualitatively ahead of all other LLMs, so outside of articles addressing specialized aspects of different model families, the claims are invalid unless they were tested on GPT-4.

(Half the time the problem is that the author used the ChatGPT web app and did not even realize there are two models, and that they've been using the toy one.)


As someone who has this instinct myself: there is a line of reactionism to modern AI/ML that says, "this is just a toy, look, it can't do something simple." But often it turns out it _can_ do that thing, with either a more advanced model or a more built-out system. So the instinct is to try to explain that the pessimism is wrong. That we really can push the boundary and do more, even if it isn't going to work out of the box yet. I react that way against all forms of poppy snipping.


Hyping up tech based on what you think it will be able to do in the future is the misplaced overhyping that is the problem. The issues people say are easy to fix aren't easy to fix.

Expect the model to continue to perform like it does today, with lots of dumb integrations added on top, and you will get a very accurate prediction of how most new tech hype turns out. Dumb integrations can't add intelligence, but they can add a lot of value, so the rational hype still sees this as a very valuable and exciting thing - it just isn't a complete revolution in its current form.


The output of any model is essentially random and whether it is useful or impressive is a coin flip. While most people get a mix of heads and tails, there are a few people at any time that are getting streaks of one head after another or vice versa.

So my perception is that this leads to people who have good luck and perceive LLMs as near-AGI, because the model arrives at a useful answer more often than not, and these people cannot believe there are others who have bad luck and get worthless output from their LLM - like someone at a roulette table exhorting, "have you tried betting it all on black? Worked for me!"


1. Just like it's frustrating when a paper is published making claims that are hard to verify, it's frustrating when somebody says "x can't do y" in a way that is hard to verify^^

2. LLMs, in spite of the complaints about the research leaders, are fairly democratic. I have access to several of the best LLMs currently in existence and the ones I can't access haven't been polished for general usage anyway. If you make a claim with a prompt, it's easy for me to verify it

3. I've been linked legitimate ChatGPT prompts where someone gets incorrect data from ChatGPT - my instinct is to help them refine their prompt to get correct data

4. If you make a claim about these cool new tools (not making a claim about what they're good for!) all of these kick in. I want to verify, refine, etc.

Of course some people are on the bandwagon and it is akin to insulting their religion (it is with religious fervor they hold their beliefs!) but at least most folks on hn are just excited and trying to engage

^^ I actually think making this claim is in bad form generally. It's like looking for the existence of aliens on a planet. Absence of evidence is not evidence of absence


If someone comes here and says "<insert programming language> cannot do X" and that is wrong, or perhaps outdated, don't you feel that the reaction would be similar?

If you are trying to make categorical statements about what AI is unable to do, at the very least you should use a state-of-the-art system, which conveniently is easily available for everyone.


Because they're saying it can't do something when they're holding it wrong.

It's a weird thing to get hung up on if you ask me.


Perhaps they are trying to help people get the best out of a tool which they themselves find very useful?


I think ironically there has been an "AI-anti-hype hype", with people like Gary Marcus trying to blow up every single possible issue into a deal breaker. Most of the claims in this article are based on tests performed only on GPT-3, and researchers often seem to make tests in a way that proves their point - see an earlier comment from me here with an example: https://news.ycombinator.com/item?id=37503944

I agree there have been many attention-grabbing headlines that are due to simple issues like contamination. However, I think AI has already proved its business value far beyond those issues, as anyone using ChatGPT with a code base not present in its dataset can attest.


I think some amount of that is necessary, though no? We have people claiming that this generation of AI will replace jobs - and plenty of companies have taken the bait and tried to get started with LLM-based bots. We even had a pretty high-profile case of a Google AI engineer going public with claims that their LaMDA AI was sentient. Regardless of what you think of that individual or Google's AI efforts, this resonates with the public. Additionally a pretty common sentiment I've seen has been non-tech people suggesting AI should handle content moderation - the idea being that since they're not human and don't have "feelings" they won't have biases and won't attempt to "silence" any one political group (without realising that bias can be built in via the training data).

It seems pretty important to counter that and to debunk any wild claims such as these. To provide context and to educate the world on their shortcomings.


I think skepticism is always welcome, and we should continue to explore what LLMs can and cannot do. However, what I'm referring to is trying to get a quick win by defeating some inferior version of GPT, or trying to apply a test which you don't even expect most humans to pass.

The article is actually fine and pretty balanced, but it is a bit unfortunate that 80% of their examples are not illustrative of current capabilities. At least for me, most of my optimism about the utility of LLMs comes from GPT-4 specifically.


>But there’s a problem: there is little agreement on what those results really mean. Some people are dazzled by what they see as glimmers of human-like intelligence; others aren’t convinced one bit.

I find the whole hype & anti-hype dynamic so tiresome. Some are over-hyping, others are responding with over-anti-hyping. Somewhere in between are many reasonable, moderate, and caveated opinions, but neither the hypesters nor the anti-hypesters will listen to these (considering all of them to come from people at the opposite extreme), nor will outside commentators (somehow being unable to categorize things as anything more complicated than this binary).


Depends on whether the hype is invalid - let's remember that "There will be a computer in every home!" was once considered hype.

There is a possible world where AI will be a truly transformative technology in ways we can't possibly understand.

There is a possible world where this tech fizzles out.

So one of the reasons that there is a broad 'hype' dynamic here is because the range of possibilities is broad.

I sit firmly in the first camp though - I believe it's truly a transformative technology, and struggle to see the perspective of the 'anti-hype' crowd.


I’m in the second camp. To every hyped up tech, all I can say is “prove it”. Give me actual real world results.

There are millions of hustlers out there pushing snake oil. The probability that something is the real deal and not snake oil is small. Better to assume the glass is half empty.


There will be millions of hustlers regardless of if the technology is transformative or not.

The invention of the PC market was filled with hustlers but that doesn't mean that the PC didn't match the hype.

The .com boom was filled with hustlers, but that doesn't mean that the Internet wasn't transformative.

Actual real-world results... well, the technology is already responsible for c. 40% of code on GitHub. Image recognition technologies are soaring and self-driving feels within reach. Few people doubt that a real-world Jarvis will be in your home within 12 months. The Turing test is smashed, and LLMs are already replacing live chat operatives. And this is just the start of the technology...


> The .com boom was filled with hustlers, but that doesn't mean that the Internet wasn't transformative.

But a lot of .com projects were BS. If you were to pick at random, the probability you got a winner was low. Thus it's wise to be skeptical of all hyped stuff until it has proven itself.

> Actual real world results... well the technology is already responsible for c40% of code on Github.

Quite sure you misread that article. It says 40% of the code checked in by people who use Copilot is AI-generated. Not 40% of all code.

That's how some programmers are, I guess. I have heard of people copy-pasting code directly from Stack Overflow without a second thought about how it works. That's probably Copilot's audience.


I think your reasoning is flawed - the fact a lot of .com projects were BS does not imply that the underlying technology (the internet) wasn't transformative.

Are we really saying that people who said the internet was a transformative technology in the mid-1990s were wrong? It was transformative, but it was hard to see which parts of the technology would stick around. Of course, that doesn't mean every single company and investment was going to be profitable; that's not true of anything ever. People investing in Amazon and Google were winners though - these are companies that have in many ways reinvented the markets they operate in.

> Quite sure you misread that article. It says 40% of the code checked in by people who use Copilot is AI-generated. Not 40% of all code.

OK, I'll take it that it's 40% for Copilot users. That's still 40% of some programmers' code!


"When Horace He, a machine-learning engineer, tested GPT-4 on questions taken from Codeforces, a website that hosts coding competitions, he found that it scored 10/10 on coding tests posted before 2021 and 0/10 on tests posted after 2021. Others have also noted that GPT-4’s test scores take a dive on material produced after 2021. Because the model’s training data only included text collected before 2021, some say this shows that large language models display a kind of memorization rather than intelligence."

I'm sure that is just a matter of prompt engineering, though.


But it got 10/10 on pre-2021 questions, with the same prompting method...


> AI hype is built on high test scores

No, it's built on people using DALLE and Midjourney and ChatGPT.


Exactly. ChatGPT is double-checking my homework problems and pointing out my errors; it's teaching me the material better than any of my lectures. It's writing tons of code I'm getting paid for, with way less overhead than trying to explain the problem to a junior, fewer mistakes, and faster iteration. Test scores? Ridiculous.


Related paper https://arxiv.org/pdf/2309.08632.pdf

‘Pre-training on the Test Set Is All You Need‘

GPT-4 is really good at digging up information it has seen before, but please don’t use it for any serious reasoning. Always take its answers with a grain of salt.


This is my favorite new AI argument, took me a few months to see it. Enjoyed it at first.

You start with everyone knows there's AI hype from tech bros. Then you introduce a PhD or two at institutions with good names. Then they start grumbling about anthropomorphizing and who knows what AI is anyway.

Somehow, if it's long enough, you forget that this kind of has nothing to do with anything. There is no argument. Just imagining other people must believe crazy things and working backwards from there to find something to critique.

Took me a bit to realize it's not even an argument, just parroting "it's a stochastic parrot!" It assumes other people are dunces who genuinely believe it's a mini-human. I can't believe MIT Tech Review is going for this; the only argument here is that the tests are flawed if you think they're supposed to show the AI model is literally human.


I disagree entirely.

The hype is based entirely on the fact that I can talk (in text) to a machine and it responds like a human. It might sometimes make stuff up, but so do humans, so I don't consider that a significant downside or problem. In the end, ChatGPT is still... a baby.

The hype builds around the fact that I can run a language model that fits into my graphics cards and responds at faster-than-typing speed, which is sufficient.

The hype builds around the fact that it can create and govern whole text based games for me, if I just properly ask it to do so.

The hype builds around the fact that I can have this everywhere with me, all day long, whenever I want. It never grows tired, it never stops answering, it never scoffs at me, it never hates me, it never tells me that I'm stupid, it never tells me that I'm not capable of doing something.

It always teaches me, always offers me more to learn, it always is willingly helping me, it never intentionally tries to hide the fact that it doesn't know something and never intentionally tries to impress me just to get something from me.

Can it get things wrong? Sure! Happens! Happens to everyone. Me, you, your neighbour, parents, teachers, plumbers.

Not a single minute did I, or dozens of millions of others, give a single flying fuck about test scores.


The only test I need is the amount of time it takes me to do common tasks with and without ChatGPT. I’m aware it’s not perfect but perfect was never necessary.


This was interesting to me, but mostly because of a question I thought it was going to focus on, which is: how should we interpret these tests when a human takes them?

I wasn't sure that the phenomenon they discussed was as relevant to the question of whether AI is overhyped as they made it out to be, but I did think a lot of the questions about the meaning of the performances were important.

What's interesting to me is you could flip this all on its head and, instead of asking "what can we infer about the machine processes these test scores are measuring?", we could ask "what does this imply about the human processes these test scores are measuring?"

A lot of these tests are well-validated but overinterpreted, I think, and leaned on too heavily to make inferences about people. If a machine can pass a test, for instance, what does that say about the test as used on people? Should we be putting as much weight on them as we do?

I'm not arguing these tests are useless or something, just that maybe we read into them too much to begin with.


AI is honestly the wrong word to use. These are ML models, and they are only able to do the task they have been specifically trained for (not saying the results aren't impressive!). There really isn't competition either, as the only people who can train these giant models are those who have the cash.


> These are ML models and they are able to only do the task they have been specifically trained for

Yes, but the models we're talking about have been trained specifically on the task of "complete arbitrary textual input in a way that makes sense to humans", and then further tuned to "complete it as if you were a person having a conversation with a human", again for arbitrary text input - and trained until they could do so convincingly.

(Or, you could say that with instruct fine-tuning, they were further trained to behave as if they were an AI chatbot - the kind of AI people know from sci-fi. Fake it 'till you make it, via backpropagation.)

In short, they've been trained on an open-ended, general task of communicating with humans using plain text. That's very different from typical ML models, which are tasked with predicting some very specific data in a specialized domain. It's like comparing a Python interpreter to Notepad - both are just regular software, but there's a meaningful difference in capabilities.

As for seeing glimpses of understanding in SOTA LLMs - this makes sense under the compression argument: understanding is lossy compression of observations, and this is what the training process is trying to force to happen, squeezing more and more knowledge into a fixed set of model weights.
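For concreteness, a toy numpy sketch of the pre-training objective being described (my own illustration, nothing like production code): at every position the model is scored on how much probability it assigned to the token that actually came next.

  import numpy as np

  vocab = {"the": 0, "cat": 1, "sat": 2}
  tokens = [0, 1, 2]                                       # "the cat sat"
  logits = np.random.randn(len(tokens) - 1, len(vocab))    # stand-in for the model's output per prefix

  log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
  targets = tokens[1:]                                     # each prefix must predict the next token
  loss = -log_probs[np.arange(len(targets)), targets].mean()
  print(loss)                                              # training pushes this down over a huge corpus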


Yes, this is why I think the LLM and image generation models are still impressive. Knowing they are ML models in the end and still produce results that surprise us makes you wonder what we are, in the end. Could we essentially simulate something similar to us, given enough inputs and parameters in the network, enough memory and computing power, and a training process that aimed to simulate a human with emotions? I would imagine the training process alone would need a bunch of other models to teach the final model "concepts" and from there perhaps "reasoning".

The reason I think AI is not the appropriate term is that if it were AI, it would have already figured everything out for us (or for itself). An LLM can only chain text; it does not really understand the content of the text and can't come up with novel solutions (or, if it accidentally does, it's due to hallucination). This can be easily confirmed by giving current LLMs some simple puzzles, math problems, and so on. Image models have similar issues.


>AI is honestly wrong word to use

https://en.wikipedia.org/wiki/AI_effect

Just because you don't like how poorly the term AI is defined, doesn't mean it is the wrong term.

AI can never be well defined because the word intelligence itself is not well defined.


As a developer, when I work with ChatGPT I can see it eventually taking over my JIRA stories. Then ChatGPT will take over management: creating product roadmaps, prioritizing, and assigning tasks to itself, all dictated by customer feedback. The clock is ticking. But reasoning like a human? No.


Counterpoint: Journalism is dead and has been replaced with algorithms that supply articles on a supply and demand basis.

"25% of the potential target audience dislikes AI and do not have their opinion positively represented in the media they consume. The potential is unsaturated. Maximum saturation estimated at 15 articles per week."

A bit more serious: AI hasn't even scratched the surface. Once we apply LLMs to speech synth and improve the visual generators by just a tiny bit, to fix faces, we can basically tell the AI to "create the best romantic comedy ever made".

"Oh, and repeat 1000 times, please".


Most of the hype comes from the AI grifters who need to find the next sucker to dump their VC shares onto, or the next greater fool to purchase their ChatGPT-wrapper snake-oil project at an overvalued asking price.

The ones who have to dismantle the hype are the proper technologies such as Yann LeCun and Grady Booch who know exactly what they are talking about.


*technologists


  “People have been giving human intelligence tests—IQ tests and so on—to machines since the very beginning of AI,” says Melanie Mitchell, an artificial-intelligence researcher at the Santa Fe Institute in New Mexico. “The issue throughout has been what it means when you test a machine like this. It doesn’t mean the same thing that it means for a human.”
The last sentence above is an important point that most people don't consider.


It seems a bit like having a human face off in a race against a car and then concluding that cars have exceeded human physical dexterity.

It's not an apples/apples comparison. The nature of the capability profile of a human vs. any known machine is radically different. Machines are intentionally designed to have extreme peaks of performance in narrow areas. Present-generation AI might be wider in its capabilities than what we've previously built, but it's still rather narrow as you quickly discover if you start trying to use it on real tasks.


Only idiots are basing their excitement about what's possible on those test scores. They're just an attempt to measure one bot against another. There is a strong possibility that they are only measuring how well the bot takes the test, and nothing at all about what the tests themselves purport to measure. I mean, those tests are probably similar to stuff that's in the training data.


Yeah... there's a lot of idiots out there.


Any task that gets solved with AI retroactively becomes something that doesn't require reasoning.


I wouldn’t say that. Chess certainly requires reasoning even if that reasoning is minimax.

I suppose in the context of this article “AI” means statistical language models.


Why does chess require reasoning? Do all of these [1] "reason"? ChatGPT-4 is supposedly rated worse than 500, in this list (1400 or so, although I think a recent update improved it a bit).

[1] https://ccrl.chessdom.com/ccrl/4040/


Within the domain of chess, searching the domain of possible future positions is synonymous with reasoning. It requires an explicit understanding of latent board representations and an explicit understanding of possible actions and the consequences of those actions.

Whether ChatGPT has any of those things is questionable. From what I’ve seen, it has an unreliable latent representation, an unreliable understanding of possible moves, and a positional understanding that’s little better than a coin flip.
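To make "searching the space of future positions" concrete, here is a minimal generic minimax sketch over a toy game tree (my own illustration, not a chess engine):

  def minimax(state, maximizing, children, score):
      kids = children(state)
      if not kids:                          # leaf: just evaluate the position
          return score(state)
      values = [minimax(k, not maximizing, children, score) for k in kids]
      return max(values) if maximizing else min(values)

  # Tiny hard-coded tree: internal nodes are lists, leaves are scores.
  tree = [[3, 5], [2, 9]]
  best = minimax(tree, True,
                 children=lambda s: s if isinstance(s, list) else [],
                 score=lambda s: s)
  print(best)  # 3: the best outcome you can force against a minimizing opponent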


> Within the domain of chess, searching the domain of possible future positions is synonymous with reasoning.

So, with this definition, these chess engines already exhibit (fairly substantial) reasoning? Or, are you saying it would be required, in the context of an LLM?


Didn't it perform well on both the SAT and LSAT though?


This was 2 months ago, irrelevant in AI time



