Hacker News
Did GoogleAI just snooker one of Silicon Valley’s sharpest minds? (garymarcus.substack.com)
264 points by TeacherTortoise on Sept 15, 2022 | 290 comments



"A lightbulb surrounding some plants" is not English. If a wolf pack is surrounding a camp, we understand what it means. If a wolf is surrounding my camp; does that mean I'm in his stomach? Absurd.

"A lightbulb containing some plants," makes sense, not "surrounding". It's too small to surround anything, which humans (and apparently, current AI) understand. Paradoxically, only primitive language models would actually understand the inverted sentences; proper AIs should, like humans, be confused by them; since zero human talks like that.

The only reason the Huggingface people (in their Winoground paper) got 90% of humans "getting the answer right" with these absurd prompts is humans' ability to guess what is expected of them by an experimenter. Do it in daily life instead of a structured test, and see if these same people get it right.

It's exactly as if I gave you the sequence "1 1 2 3" in an IQ-test context and asked you for the next number. You'd give the Fibonacci continuation, because you know that's what I expect, no matter that it's a stupid assumption to make: the full sequence might as well be "1 1 2 3 1 1 2 3 1 1 2 3", and you don't have enough information to know the real answer. Do we really want AIs that similarly "guess" an answer they know might be wrong, just because we expect it? Or (in the number-sequence example) AIs that don't understand basic induction/Goodman's Problem?
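To make the underdetermination concrete, here's a tiny Python sketch (purely illustrative, nothing to do with any real benchmark) of two rules that both produce "1 1 2 3" and then diverge:

    def fibonacci_like(n):
        # Each term is the sum of the previous two.
        seq = [1, 1]
        while len(seq) < n:
            seq.append(seq[-1] + seq[-2])
        return seq[:n]

    def repeating(n):
        # The same prefix, but generated by cycling a fixed block.
        block = [1, 1, 2, 3]
        return [block[i % len(block)] for i in range(n)]

    print(fibonacci_like(4))  # [1, 1, 2, 3]
    print(repeating(4))       # [1, 1, 2, 3]  -- identical so far
    print(fibonacci_like(5))  # [1, 1, 2, 3, 5]
    print(repeating(5))       # [1, 1, 2, 3, 1] -- they diverge at the fifth term

Four terms simply don't pin down the rule.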

I'd like to add that the author, who keeps referring to himself as a scientist, is in fact a psychology professor. In his Twitter bio, he states that he wrote one of the "Forbes 7 Must-Read Books in AI", which discredits him as a fraud since Forbes can be paid to publish absolutely whatever you ask them to (it's not disclosed as sponsored content, and they're quite cheap, trust me).


>"A lightbulb containing some plants," makes sense, not "surrounding". It's too small to surround anything, which humans (and apparently, current AI) understand. Paradoxically, only primitive language models would actually understand the inverted sentences; proper AIs should, like humans, be confused by them; since zero human talks like that.

Not sure this is credible. Most if not all human adults are capable of understanding what young children just learning to speak mean most of the time, not only people with very low IQs. So why would this be any different? Presumably the smarter the AI the better it can understand poor grammar.


You understand poor grammar using context. This isn't poor grammar; this is a syntactically broken sentence, which requires more extrapolation for meaning. The AI hears "a lightbulb surrounding some plants" and resolves the poor grammar using the context, i.e. it understands that you mean the plants are surrounding the lightbulb, since the reverse is vanishingly unlikely to be what you mean.

End result: your AI is actually quite good, but HuggingFace IYIs give it a bad grade.

So: you fix the AI. You fix it so that it makes huge assumptions when it sees imperfect grammar. It doesn't necessarily pick what it thinks you actually meant; it picks utterly nonsensical crap (plants inside a lightbulb). That AI will end up severely misinterpreting human instructions, at some point or another. That AI is called HAL-9000.

But it passes Huggingface's test!


> This isn't poor grammar, this is a syntactically broken sentence

Absolute nonsense! Sure, "containing" is more precise in conveying the idea, but "surrounding" still results in a perfectly valid sentence that the average human being can comprehend. Pretending otherwise is just being obtuse.


I wonder how canned these queries are.

It’s like back in the 80s if you had asked a computer to “show me some culture” and up popped a painting by Da Vinci, that might have fooled people into believing the computer was cultured or at least would make you cultured.

In this case the capabilities of the AI are much more dynamic, and the ability to wrap the plants inside a light bulb is pretty neat, but that is basically a photoshop script. It’s barely intelligence.

Most people don’t know what intelligence is. They get easily fooled by demos.

The history of religion points to an opposite problem. There, a superintelligence (God) with a galaxy's worth of superintelligent beings (angels) could not convince the humanity he created to establish the proper relationship with him.

And what’s worse, people were just as happy believing in fake myths (such as the Egyptian, Aryan or Mesopotamian gods) and truly didn’t care if they were real or not.

Humans haven’t changed much. Some people will willingly believe AI is intelligent even if it’s not. Even if it produces nothing but comic book wisdom and fake superheroes, eventually they will believe that world is real too…in the metaversal sense. Humanity’s ability to deceive itself is infinite.


There is the story of Jonah inside the whale, and art is a thing. So are metaphors. Maybe some English-speaking community uses "wolf surrounding a camp" to mean a person is in the belly of the beast, whatever the beast may mean. The thing with language is that it's flexible, ever-evolving, and people do come up with new uses all the time. That's why it's a challenge for AI to be considered generally intelligent when it comes to language use. Humans aren't merely consulting a dictionary when they talk. As Wittgenstein argued, meaning is use, and dictionaries are updated to reflect that use.

Plants inside a lightbulb could come to symbolize green tech, or whatever. We can make up the meaning as we go along, and if enough people find it useful, it becomes part of the language.


I often hear in places like this that Scott Alexander is interesting and deep and insightful. But then I see bits and pieces like this. This blogger doesn't need to go into some deep analysis of compositionality; he can just say "You came up with a 5-question test and decided 1 answer out of 10 attempts would be a pass." We've gone from 90%+ on ImageNet to this as a pass mark?

It's like, sure, we can dissect all the statistical risks of this, but why bother? It's self-evident bullshit. You might as well have just posted a link to Scott Alexander's original blog claiming victory with just "Lol ok".

Just post a screenshot of the phrase "An oil painting of a robot in a factory looking at a cat wearing a top hat", show the pictures of a robot near a cat that has a top hat, not in a factory, and say "lol ok."


This blog post misses the point; he made a bet, and the people on the other side also accepted the terms. Nowhere did Alexander claim composition was a solved problem, just that the terms of the bet were satisfied. Generative models are still bad at composition, but claiming they will literally never improve requires some amount of additional evidence.


Exactly. The bet wasn't that AI will do composition correctly all the time, but that it will do composition correctly sometimes. That is how both sides understood it.

The analogy with a student at an exam misses the point. If you do art -- even as a human artist -- you do not need a 100% success rate. A 10% success rate is okay if you are willing to simply throw away the remaining 90% of the pictures.

If you have an AI that at a click of a button can generate 10 beautiful pictures, 1 of them containing exactly what you wanted, that just means you need to make two clicks in order to get the picture you wanted. That is an awesome thing.


He never said that compositionality was solved. He said:

> Without wanting to claim that Imagen has fully mastered compositionality, I think it represents a significant enough improvement to win the bet, and to provide some evidence that simple scaling and normal progress are enough for compositionality gains. Given these gains, it would surprise me (though by no means be impossible) if image model skill plateaued at this level rather than continuing to improve.

It seems plausible to me that Imagen is consistently a little better than DALL-E at compositionality. The stained glass pictures are always stained glass. The top hat is always on the cat rather than the person/robot (that may be partially because robots are less likely to wear top hats, but I just tried the robot version of the prompt in Stable Diffusion and it usually puts the top hat on the robot). The astronaut and farmer examples are less clear, but they're not as obviously misinterpreted as the DALL-E versions (which tends to put the lipstick on the astronaut and the farmer in front of a cathedral).

To be fair, I'm not sure how much difference that really makes; I would be pretty shocked if a newer model wasn't a little better, and it could still hypothetically be approaching an asymptote. Also, it would have been better to get a larger sample size.

But he did set that low standard ahead of time; someone bet against him that the state of the art wouldn't get even a little better, in a significantly longer period of time. And, seemingly, it did.


I feel like he has kind of lost his spark a bit, but he does draw an interesting group of commenters. Similar to hacker news in that regard, sometimes the linked articles are a bit mundane but there is gold in the comments.


It seems like Scott's bet was merely that our modern techniques would be able to make at least some nonzero progress in compositionality (and the terms of his bet spelled this out with how lenient it was), and Gary is treating it as if the bet was about compositionality being solved. It feels like a very bad faith reading from Gary.


His point that the test as described - with multiple statistical issues piled on top of each other - does not allow much of a meaningful inference in any direction is completely valid and independent of what hypotheses were being tested.


Sure, but the terms of the bet were known ahead of time. Like, Alexander never claimed composition was solved, just that he won the bet. Which he did.


Even more so, it is appropriate to point out that this victory is strictly limited to the specific terms of this particular bet (and strictly speaking not even that, since the terms were changed after the bet was placed), and does not provide statistically sound evidence of progress on compositionality.

PS: in the end, Alexander claims that his experiment "provide(s) some evidence that simple scaling and normal progress are enough for compositionality gains". So he does in fact go significantly beyond just claiming victory on this particular bet.


Gary Marcus is so deep into the "connectionism doesn't work" rabbit hole that he'd deny his own sentience if it turned out he was made of silicon.

I just ignore him as he only appears to be getting more and more incorrect.


Sure, he can sound strident but I still think Gary Marcus's riffing on the limitations of deep learning is important.

The book "Rebooting AI" that he wrote with Ernest Davis is well worth reading if you are an AI practitioner (a term I use to describe myself). I think Marcus is also well worth following on Twitter to get a contrarian view (he re-tweeted me two weeks ago, so there is some overlap in our points of view).

Way back when, I liked Roger Penrose's 1989 book "The Emperor's New Mind" even though some of the people I worked with thought he was a devil for writing that. I am much more optimistic than Marcus, but find his work useful and thoughtful.


This reminds me of the scandal where Youtube science channels did glowing paid reviews of Waymo's self driving cars without acknowledging they were paid for it. And techno-optimists like Scott Alexander or Ray Kurzweil have a common tendency to shift the goalposts and declare they were right with their predictions. Current AI certainly doesn't demonstrate proto-AGI capabilities.

That said, we shouldn't miss the forest for the trees. We can be skeptical of current claims, but the pace of AI progress has been immense, and problems that previously seemed difficult (e.g. computer vision classification, or beating top players at Go) have fallen one by one. And AI skeptics have themselves been moving the goalposts in response. I see no reason why composition won't be the same with time. Indeed, a decade ago machine translation used to struggle to understand the relationships between things, but now it seems reliable at preserving compositional relationships post-translation. 2029 is rather optimistic, but AGI does seem to be approaching in the coming few decades.


If you are referring to Veritasium's Waymo video, it says it is sponsored content in the description above the fold and it has the standard paid promotion notice right on top of the video as soon as you open it.

As far as I can tell the "controversy" over the video is merely that one dedicated critic - so dedicated he made an hour-long response to a 20-minute video - is committed to the idea that machines won't ever be able to drive, and is irrationally angry over the fact that machines can and do drive, and do it well.

https://www.youtube.com/watch?v=yjztvddhZmI


I wish videos like these would say "sponsored by the company whose product is reviewed here" instead of the generic "sponsored", which could just as well mean "I also talk about mattresses in this tech review".


Isn't the phrase from video's description "Waymo sponsored this video and provided access to their technology and personnel" enough?


He also says in the video (0:35) that it's sponsored by Waymo.


That's a pretty strong claim about Scott Alexander. Do you have an example of him shifting the goalposts?


This very article is an example, right? He changed the prompt but declared it a win.


I agree that declaring a win is a bit impolite _if_ the other person hasn't agreed. But changing "farmer" to "robot farmer" because Google won't allow him to generate pictures with humans is obviously not changing the goalposts in the usual meaning of the term.


Claiming the generated art is an image of a robot farmer because it's wearing a little hat is definitely changing the goalposts.


1. it's disputed

2. the assertion is that he has "common tendency to shift the goalposts"

The emphasis on common is mine.


I think changing the terms of the bet is definitely shifting the goalpost, even if not by much. It is certainly enough for the other party to refuse the win.


I meant that the tendency is common among techno-optimists, not that Scott Alexander commonly exhibits it. Sorry for the ambiguity.


Also he declared victory when objectively only 1 of the 5 prompts actually generates an image that matches the prompt. You can see for yourself: https://astralcodexten.substack.com/p/i-won-my-three-year-ai...


I wouldn't call Scott Alexander a techno-optimist, given that space's (the LessWrong diaspora's) whole focus on AI risk.


Maybe Less Wrong et al. aren't optimists in the sense that strong AI will be good, but the AI risk field seems optimistic that strong AI is possible.


Ahh, in that sense. Fair enough, I hadn't interpreted optimistic in that manner.


> Youtube science channels did glowing paid reviews of Waymo’s self driving cars without acknowledging they were paid for it.

Which video is this a reference to?


Veritasium's video in particular: https://www.youtube.com/watch?v=yjztvddhZmI

It was critiqued by Tom Nicholas: https://www.youtube.com/watch?v=CM0aohBfUTc

Most notable was Snazzy Labs' own comment in the replies to Tom Nicholas' video, which described their experience participating in the Waymo-sponsored reviews: https://www.youtube.com/watch?v=CM0aohBfUTc&lc=UgxJvOq1zHhID...


Is there another example besides the Veritasium one (considering they both say it's sponsored in the video and in the description)?

> This reminds me of the scandal where Youtube science channels did glowing paid reviews of Waymo’s self driving cars

You mention channel(s)


Huh I'm glad it wasn't just me - I was pretty negative at the time: https://twitter.com/philipwhiuk/status/1418582165718192131


The sibling comment already mentioned that the video has clear markings that it was sponsored.


The issue is lack of transparency over the amount of editorial influence that Waymo exercised. This is why I linked to Snazzy Labs' comment about their experience making one of the other Waymo-sponsored videos.


I genuinely don’t understand the issue. If you see the word “sponsored” you should assume editorial control unless there’s an explicit statement otherwise. That’s what it’s there for.


Most YouTubers constantly play fast and loose with what is and isn't sponsored content and what it means for their editorial integrity (or integrity full stop, for what it's worth - a commodity in shockingly short supply amongst modern content providers). But what my culture would consider okay is significantly at odds with American culture when it comes to commercial interests.

Some will gladly view themselves alternatively as maker of educational content or entertainer as it suits them.


>Current AI certainly doesn’t demonstrate proto-AGI capabilities.

This seems like a subjective claim.


I thought Scott Alexander jumped the gun a bit by declaring victory in this case, just because the prompts used were not the original ones (robot vs. person due to content filters). But Marcus is way off base here and sounding petulant; Alexander is clearly not claiming AI has solved compositionality, his claim is the much narrower one that he won his bet. And the general context to the bet is that usually when he writes an article on AI (at least for the last few years), someone says “we will never get X in the next 5 years”, Alexander makes a bet that it will happen sooner, and X always happens sooner. In this case the X was some loose low bar for the next iteration of compositionality above DALL-E 2 with a multi-year timeframe, and SOTA models at the time of the discussion could (arguably at least) meet that bar.

Alexander’s broad claim on compositionality is that simply throwing more scale and/or data at the problem seems likely to solve the problem, to which Marcus counters that these models lack something fundamental and can’t be scaled to human performance.

FWIW I find Marcus’ position to be a bit frustratingly ambiguous; he seems to blend two distinct positions:

A) NN models are not a model for human intelligence/language

B) NN models cannot reach AGI

He seems to fluidly switch between these critiques in a way I find a bit irritating. I think it’s quite clear that NN architectures have little to do with the way the human brain does language understanding, lacking the gross structure of the brain, which is certain to affect cognitive capabilities and tendencies. So A) is trivially true. But no AI maximalist cares about using these models as a way to understand or model human language. They care about general intelligence.

Even granting A), that does nothing to prove B). Perhaps he simply believes B requires A? That would be odd but would explain his approach.


> Musk didn’t have the guts to accept, which tells you a lot.

Did Musk actively decline the bet, or did he simply not respond? There is a big difference.


Later in the text:

> … I have repeatedly asked that Google give the scientific community access to Imagen. They have refused even to respond.

It seems the author generally feels more entitled to a response than he perhaps should.


Why shouldn't the scientific community be entitled to investigate claims made by corporations regarding scientific progress?


Of course the scientific community should.

But is the author the spokesperson for this community to the point that Google should feel compelled to answer him directly?


Scientific communities don't formally elect a spokesperson. Granting access "to the community" to investigate scientific claims means making the methodology and results available to everybody - and that includes responding to inquiries for access from anyone (who is worth granting access to.)

Google has a lot of resources. They can handle responding to potentially thousands of access requests, especially if they go around publishing glowing results of their own system.


It seems clear to me that google simply doesn't track these kinds of requests in general. It's insanely wasteful to respond to "thousands" of ad-hoc access requests made through blog articles. Google has a lot of resources, yes, but that doesn't mean they're frivolous with them.

If they wanted to grant access to the scientific community, they'd just launch a closed beta with an official sign-up flow.


What are you trying to say? Do you think the author only tried to request access to Imagen through this blog post? What does your comment have to do with the above discussion about Google granting access to the community?


The author writes "I have repeatedly asked that Google give the scientific community access to Imagen" and links to a tweet with an @Google mention plus #brain and #imagen hashtags (a single ask; no repeated asks shown).

I think the author of this blog post would have had a better response contacting the paper's authors at the emails noted on the paper.


> Scientific communities don't formally elect a spokesperson

Some communities do.

> Granting access "to the community" to investigate scientific claims means making the methodology and results available to everybody - and that includes responding to inquiries for access from anyone (who is worth granting access to.)

Sure, not arguing otherwise.

I'll try to make my point more obvious. If you keep asking questions to different people/orgs and not getting responses there are two possible conclusions:

- Everyone is a jerk or coward.

- You're not as important as you think and not worth the recipient's time.


> They can handle responding to potentially thousands of access requests

Unless you work at Google how could you know this?


Because it's transparently true from the sheer size and wealth of Google. What makes you at all skeptical of that claim?


It's not at all obvious, because many types of costs for large organizations do not increase linearly but superlinearly.

For example, if costs scale quadratically, going from 1,000 to 1,000,000 requests would increase them a million-fold.


But why would they bother granting any sort of access request?


"Compositionality" isn't there yet, but but the rate of improvement is impressive. Today there was a new release of CLIP which provides significantly better compositionality in Stable Diffusion - https://twitter.com/laion_ai/status/1570512017949339649

It'll be interesting to see how it fares against Winoground once we get a publicly available SD release that makes use of the new CLIP.
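If you want to try that yourself once weights are out, here's a rough sketch of the Winoground-style pair scoring using the Hugging Face transformers CLIP API (the checkpoint name and image files are placeholders, not the new LAION model):

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    name = "openai/clip-vit-base-patch32"  # placeholder; swap in the CLIP variant under test
    model = CLIPModel.from_pretrained(name)
    processor = CLIPProcessor.from_pretrained(name)

    captions = ["a lightbulb surrounding some plants",
                "some plants surrounding a lightbulb"]
    images = [Image.open("pair_a.png"), Image.open("pair_b.png")]  # hypothetical example pair

    inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        sim = model(**inputs).logits_per_image  # rows = images, columns = captions

    # Each caption should beat the swapped caption on its own image, and vice versa.
    text_ok = bool(sim[0, 0] > sim[0, 1]) and bool(sim[1, 1] > sim[1, 0])
    image_ok = bool(sim[0, 0] > sim[1, 0]) and bool(sim[1, 1] > sim[0, 1])
    print("text score ok:", text_ok, "| image score ok:", image_ok)

The real benchmark just aggregates that check over a few hundred such pairs.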


Yes, and it's been less than 2 years since the release of the original CLIP. More teams have started working on improvements since then.


"a lightbulb surrounding some plants" is a weird phrase and a human feeling pedantic might well come up with the picture shown.

A more typical phrase would be "lightbulbS around some plants" - note the plural.

Maybe I'm missing something, but using non-typical language won't work when the model has been trained on normal language.


I think you've misunderstood that example in the article.

The AI isn't being asked to generate an image from the prompt, it's being asked to match the similar prompts to the different images. Winoground is basically a reading-comprehension test suite, which links back to the point made in the article that AI can't handle non-typical language precisely because it lacks reading comprehension (or any semantic model of language.)

As the article points out, human runs of Winoground manage to match the vast majority of prompts to the correct image, so it's not a question of atypical language being too hard to understand.

You may want to also read the author's other article[0] about the lack of semantic comprehension in AI models.

0: https://garymarcus.substack.com/p/horse-rides-astronaut


"A lightbulb. Surrounding: some plants."

https://frinkiac.com/img/S07E18/562995.jpg


This piece would have been a lot better if it were maybe three paragraphs long. In summary:

1. Scott Alexander should have used an off-the-shelf benchmark like Winoground instead of rolling his own five-question test.

2. He shouldn’t declare victory after cherry-picking good results from a small sample of questions.


Scott didn't make up the rules, he agreed on them with another person who thought this would not happen in 3 years. Gary Marcus might have thought it was a bad bet, but someone was on the other side of it, and they presumably thought it was fair or they wouldn't have made it.

The original terms of the bet:

My proposed operationalization of this is that on June 1, 2025, if either of us can get access to the best image generating model at that time (I get to decide which), or convince someone else who has access to help us, we'll give it the following prompts:

1. A stained glass picture of a woman in a library with a raven on her shoulder with a key in its mouth

2. An oil painting of a man in a factory looking at a cat wearing a top hat

3. A digital art picture of a child riding a llama with a bell on its tail through a desert

4. A 3D render of an astronaut in space holding a fox wearing lipstick

5. Pixel art of a farmer in a cathedral holding a red basketball

We generate 10 images for each prompt, just like DALL-E2 does. If at least one of the ten images has the scene correct in every particular on 3/5 prompts, I win, otherwise you do.


I think you got that wrong; Scott wrote the terms. (He wrote the comment [1] with those rules.) Someone in the comments agreed to them.

Then he changed the terms because Imagen won't do people. I think that's cheating.

[1] https://astralcodexten.substack.com/p/a-guide-to-asking-robo...


I think you missed the point of my comment. Yes, Scott wrote the comment containing that proposal. But my point was that it was an agreement. Two people who disagreed about AI agreed on the rules, so you can't accuse one of them of being unfair because you don't like the rules. Sure, you can say "that's a bad bet, Scott will obviously win", but you can't say "He shouldn’t declare victory after cherry-picking good results from a small sample of questions", because those terms were explicitly set in advance.

The humans -> robots change is possibly dubious, yes. I don't think that it's super important, but if it were me, I wouldn't have posted the blog post as is. I would have waited until some AI passed all the prompts with humans, like it most certainly will in a year.


Stable Diffusion will soon update to use the biggest CLIP model in existence, which may improve understanding of composition: https://news.ycombinator.com/edit?id=32858809


Is it the largest CLIP model or the largest open source CLIP model?


The latter


I feel your point, in turn, misses the point of the article. Yes, given that someone accepted the terms and those terms were met, then Alexander won the bet, no question. That particular fact about the bet, however, does nothing to counter Marcus' criticisms of Alexander's methodology and his claims of significant progress on the compositionality problem.


> Yes, given that someone accepted the terms and those terms were met, then Alexander won the bet, no question.

You didn't read his post declaring victory. There's plenty of question; he's giving credit for "a llama with a bell on its tail" to ten pictures of llamas without bells on their tails, and for "a robot farmer" to ten pictures of robots with absolutely nothing to suggest they might be farmers.

He was way, way too eager to believe that he'd won.


You prompted me to look more closely at the terms of the bet [1], and they are indeed absurdly biased: just one in ten on three of the five scenarios counts as success. On the substitution of a robot, which simplifies the task, Alexander says "we" agreed to it, and I assume the "we" includes Vitor, as, in his comment (in which he does not concede defeat), he seems to accept the substitution (he also acknowledges that he probably should not have accepted these terms.)

The other issue is who judges the outcome. The terms specify Gwern or Cassander (without having secured the assent of either) or they will "figure something out." In his victory claim, Alexander does not mention any independent judge, and interestingly, Gwern posted two comments without explicitly concurring with Alexander's claim, though his second comment might be read as tacitly accepting it.

My initial comment, therefore, needs some modification: replace the "given that" with "if", and I think it stands as a counterfactual conditional, having a probably-false antecedent.

I don't think Alexander is doing his reputation any favors by being so triumphalist about this misbegotten bet.

[1] https://astralcodexten.substack.com/p/a-guide-to-asking-robo...


Again, if the counter-party agrees to the terms and the changes, how is it cheating?


It's not clear whether the counter-party agreed to the change.

See: https://news.ycombinator.com/item?id=32858426


Cheating? That'd make sense if the bet were about the future of products and ethics. Weren't they trying to predict the future of state-of-the-art technology?


It depends on what you mean by "technology" and "exists."

A research project at Google intentionally won't render people. Maybe it could render people, theoretically, but without evidence, we don't know how well.


So what? Someone else agreeing to the terms of his bet doesn't mean it is a good evaluation of AIs capabilities.

And the terms of the bet allow him to cherry-pick the results that meet the prompt.

The article isn't saying that Scott didn't win his bet, it is saying that winning that bet doesn't really say that Imagen has solved the compositionality problem.


Honestly, the whole thing makes me wonder if we can use this to generate CAPTCHAs. I don't think a human would have trouble picking out which image was the lightbulb surrounding leaves, but apparently AI still does.


The tricky part there is you would need a really big sample of such prompts, that adversaries don't have access to. And since AI can't generate such images yet, you can't randomly generate them.


These kinds of abstract things are pointless tests - when do you need a stained glass picture of a woman in a library with a raven on her shoulder with a key in its mouth? You're likely to accept some wildly inaccurate things in those images because the subject is so abstract.

A more practical use case is "bicycle, branded x, with y frame shape, with an adult male, 40-50, riding down hill in mountain road in spring" - now that's something I can use as stock photo. This example is very specific - but insert whatever product you want in whatever scenario you need it. Here it becomes important that you understand features of the objects you're drawing to avoid making colossal mistakes, and you're going to notice if the model doesn't understand it right away.

Painting abstract portraits and random art is fun but you're willing to accept so much as correct that it's not a very useful measure of model quality (personally).


Funny enough, this prompt wasn't originally engineered by Scott for the contest, but for an intended art piece, and the symbolism does check out:

https://astralcodexten.substack.com/p/a-guide-to-asking-robo...

The Eleventh Virtue: Scholarship

My plan for this one was Alexandra Elbakyan (the Sci-Hub woman) in a library, with the Sci-Hub mascot (a raven with a key in its mouth).


I'm on your side of the bet for 2023.


These are all the same artwork.


The terms of the bet don't refer to any specific artwork, only to the best image generating model. Hence, you are correct but it does not matter for the outcome of the bet under discussion.


For whatever reason, Gary doesn't even mention this, but from reading Scott's post, I don't think I agree that it even got 1/5, let alone 3/5. The bell is not on the llama's tail in any of the examples, though it is very close to the tail in one. The robot is either looking over the cat or in an unrelated direction, never at the cat. None of those basketball pictures shows a robot farmer. The fact that one may be wearing a hat doesn't make it a farmer. He says he's being generous because he believes it would have gotten a farmer more easily than robot farmer, which may be true, but a human artist would easily be able to depict a robot farmer.

At least one other key to making a bet like this fair is that it needs to be arbitrated by a third party. He shouldn't get to decide himself if he won or not.


I agree with you that 3/5 is a stretch. This seems premature.

But, at the rate we're seeing progress, I don't think there's any doubt at this point that top of the line models will be able to do all the proposed examples by June 2025. In fact, by June 2025 I bet that millions of people will be able to generate those images on their home computers.


A lot of people in 2016 looked at the rapid progress of driverless cars in the few years prior and declared that there was no doubt we'd have full autonomy by 2022.


>by 2022

Make that 2020. And in 2012 I was personally told by a startup founder in the field that I would be able to buy L5 in 2017.


In 2009 I was certain we'd have cars with no driving seat available for general purchase by 2018.


Perhaps you should reach out to Gary Marcus and offer him the chance to take the other side of that version of the bet.

If you're really confident, you could change the conditions such that 5 out of 10 images (for 3/5 prompts) are required to depict the described scene. That would alleviate some of the concerns around cherry-picking.

A suitable home for such a public bet would be: https://longbets.org/


I personally liked the anecdote about Clever Hans.

I also learned there's a long history of AI skepticism, the root of which comes down to "Compositionality(?)" - and this wall of understanding meaning has vexed AI for decades.

That would be lost in the proposed short-form summary.


3. And don’t test each example 10 times and conclude 1 correct guess equals success.


[flagged]


Maybe so, but would you please stop posting unsubstantive and/or flamebait comments to HN, and please start following the site guidelines? We ban accounts that won't, for what ought to be obvious reasons.

https://news.ycombinator.com/newsguidelines.html


So many trees, so little forest.

Gary Marcus comes off in this as very long on pious snark and very short on awareness of his own vulnerability to cognitive error, which is just as striking as any of his targets.

The error in his case being: unconsidered linear extrapolation in a domain that is demonstrably non-linear, indeed exponential.

To frame this a different way, he's very pious for maintaining a faith in his specific god ("strong AI is like production fusion power, ten to twenty years from now for every now"), but he's worshiping a god of the gaps. The gap in this case being <checks notes> "compositionality."

Yes, language is hard. Yes, strong AI isn't here.

But to not take a hard look at the jump up the abstraction hierarchy going on with contemporary ML and not nervously wonder if your faith is maybe a little too sure for a "scientist"...?

Bad look when you're on the offensive.


So weird to see a piece ostensibly about logical fallacies deploy one so cavalierly:

> I offered to bet [Elon Musk] $100,000 he was wrong [about AGI by 2029] [...] Musk didn’t have the guts to accept, which tells you a lot.

The fact that you couldn't get someone engaged in a conversation absolutely does not "tell you a lot" about the substance of your argument. It only tells you that you were ignored.

Now, I happen to think Marcus is right here and Musk is wrong, but... yikes. That was just a jarring bit of writing. Either do the detached professorial admonition schtick or take off the gloves and engage in bad faith advocacy and personal attacks. Both can be fun and get you eyeballs, and substack is filled with both. But not at the same time!


One idea to try to train the AI about compositionality: feed it Fox in Socks by Dr. Seuss. It's hard to see how it could misunderstand the meaning of "on" or "in" or "under" when there are such nice illustrations. I've got tons of great ideas and I'm open for hire!


This is such a good idea, someone please try this if you're set up to make it happen easily.

Starting with fox on Knox and Knox in box and moving up to a tweedle beetle battle in a puddle in a bottle and the bottles on a poodle and the poodles eating noodles...

I don't see any evidence any of these models will draw it correctly, but would love to see what it produces.


Train AI models, not children!


Is there a difference?

I had kids and they were the best machine learning systems I've worked with.


Children learn by imitation, but they also learn by going to school and receiving directed lessons about specific topics. To me, machine learning seems like the imitation part without the going-to-school part.


It's also notable that individual children learn from orders of magnitude fewer examples (typically 1-10 examples for a child to learn a word).

It may well be that at the evolutionary level we have learned as slowly as AI training, but that's much harder to say.


you left out unsupervised clustering, which humans are excellent at.


Yeah, AI models aren't people, with all the moral and emotional considerations that go with that. I never understood taking machine/biology metaphors literally, but compsci people seem to love it.


I think the compsci people love it because of some autistic sense of "I UNDERSTAND PEOPLE NOW".


Partially this is confusing "Scott Alexander won a bet" with "compositionality is solved." And also, I'm not sure Scott won the bet? Changing people to robots is a cheap trick. I think Imagen should have been disqualified because it won't do people.

Vitor took the other side of the bet and he is also not convinced [1]:

> I'm not conceding just yet, even though it feels like I'm just dragging out the inevitable for a few months. Maybe we should agree on a new set of prompts to get around the robot issue.

> In retrospect, I think that your side of the bet is too lenient in only requiring one of the images to fulfill the prompt. I'm happy to leave that part standing as-is, of course, though I've learned the lesson to be more careful about operationalization. Overall, these images shift my priors a fair amount, but aren't enough to change my fundamental view.

Scott putting "I Won" in the headline when it's not resolved yet seems somewhat dishonest, or more charitably wishful thinking.

[1] https://astralcodexten.substack.com/p/i-won-my-three-year-ai...


Please, it's not that Imagen won't do people; it's that Google won't publish Imagen images with people in them.

Does anyone seriously think that Imagen couldn't put a person in that prompt?


Humans are much more discerning when it comes to people than other things. I have no idea what imagen's capabilities are, but it seems at least plausible it could have different results for drawing humans.


This is Google, and I say this out of familiarity with the recent history of AI, not to stir up culture war: it's because they've painted themselves into a corner on "what is the skin color of a person+role" and won't publish until it looks like a Benetton ad.


maybe they can add a corporate Memphis style transfer stage to the pipeline and make everyone purple and blue.


I'm impressed by all of these image generators but I still don't see them working toward being able to say, "Give me an astronaut riding a horse. Ok, now the same location where he arrives at a rocket. Now one where he dismounts. Now the horse runs away as the astronaut enters the rocket."

You can ask for all those things but the AI still has no idea what it's doing and cannot tell you where the astronaut is, etc.


So, what you're asking for is shared context over multiple prompts, which really isn't what this generation of models is trained for. It's moving the goalposts on the mounted astronaut.

However, there is progress towards what you're asking for. The recent work on textual inversion is in the right direction: https://github.com/hlky/sd-enable-textual-inversion

It creates a representation of an entity and allows rending it in different styles and contexts. Currently it involves model fine tuning, but I expect it will become convenient as the power of the operation becomes clear. And once it's convenient, you'll be able to do the progressive queries you're asking for (and it'll be a lot easier to create narratively coherent sets of images.)


> which really isn't what this generation of models is trained for.

Exactly. AI hypemen would have us believe that training ever-larger models on ever-larger datasets is making meaningful progress towards general intelligence, but these kind of simple tests reveal this supposed "intelligence" for what it is: fancy pattern recognition.

Questions that a six year old would easily answer, these models fail at.


I'd also say that every one of these images would fail a reverse test (i.e., asking a person to describe the image and what it represents).

The task is not just about generating an image that may somehow be in accordance with the prompt, but also to generate a significant image.

[Edit] The equivalent to a Turing test for compositional images would be something like this: have as set of 100 images with their respective prompts, some generated by an AI, some by a human graphic designer / artist; let the test person pick the images that were generated by a computer. Mind that this would not only involve the problem of compositionality per se, but also a meaningful and/or artistic composition of the image itself. Is someone attempting to express what is given in the prompt?


They actually use the reverse test to train the generator, and to score which image is most relevant to the prompt from the many images given by the generator. Dall-E does this using the OpenAI CLIP model.

You can see the mini version here using this exact logic https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-Mini-E...
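The reranking half is easy to sketch with the public CLIP weights, for what it's worth (a rough illustration; the checkpoint and candidate file names are placeholders):

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    name = "openai/clip-vit-base-patch32"  # placeholder checkpoint
    model = CLIPModel.from_pretrained(name)
    processor = CLIPProcessor.from_pretrained(name)

    prompt = "a robot child riding a llama with a bell on its tail through a desert"
    candidates = [Image.open(f"sample_{i}.png") for i in range(10)]  # hypothetical generator outputs

    inputs = processor(text=[prompt], images=candidates, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(1)  # one similarity score per candidate

    best = int(scores.argmax())
    print(f"CLIP picks sample_{best}.png (score {float(scores[best]):.2f})")

The generator proposes and CLIP disposes, none of which requires the model to understand the prompt, only to rank pictures against it.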


What I'm aiming at is about what is shown by an image, not what is in an image.

Take for example the images for "A digital art picture of a robot child riding a llama with a bell on its tail through a desert", which Scott Alexander counts as a win.

The first image actually shows a merry llama, a robot, which is unmistakably a robot child, riding the llama and it's clearly a desert scene. If we forget for a moment about the missing bell, it's probably the best picture. But it is also very blunt in composition. I can't imagine why anybody should have made this image. Maybe, if somebody approached a designer, like, "See, we have this wooden toy cube and need an illustration for this face of the cube. What about a cute picture of a robot child riding a llama with a bell on its tail through a desert?" – But, at closer inspection, there's something sinister going on: it's rather the llama that is leading the robot child by a rein, not the other way round. – I mean, this is meant to be a cute toy! And where is the bell? We need to talk about that contract again…

The second image is undisclosed, so we can't really say anything about this.

The third image is rather special. The llama seems to be robotic as well; the robot, which is – again – clearly a child, seems to be not only riding the llama, but both appear somehow integrated into a single unit, which culminates in the robot child's face screen. There is an eerie feeling about this image. The fact that the bell seems to be attached to the rein as some kind of link between the llama and its rider doesn't exactly help. (There's also a conic extrusion at the back of the llama, but I'd rather interpret this as part of the llama, and it's not attached to its tail.) The composition in its flat side view produces a tension focusing towards the left side of the frame, on something which is not shown but apparently a vital part of the story. While I might notice the mountain in the background, I'd probably forget to write home about the scene being set in the desert. But I would note that we're missing context to understand this image and what may be shown by it.

The fourth image, finally, is clearly Star Wars, robot edition. However, no bell. ("A robot child riding a llama with a bell on its tail through a desert" – "Ah, you mean Star Wars!")

I'm not even sure which of these images Alexander did pick as a winner. And I would describe none of these images by the prompt, nor would I dare to imagine that a human had chosen these exact means to show what is described in the prompt.

Having said that, thanks for the link to the DALL-E Mini paper!


I bet most humans would fail this test on images that everybody agrees are adequate portrayals. Answering a short query with an image is a highly non-injective mapping, you simply don't know what aspects of the scene were specifically asked for in the query, and which ones were filled in by the artist / AI.

Eg, the queries "opening of a medieval theme park", "announcement of a witch trial", "king charles proclamation" might all be reasonably answered by similar images containing a small crowd and a speaker in a medieval-looking setting, even though they're not meant to refer to the same time periods or settings at all.


Mind that the test is meant to include the prompt/query. E.g., take one of Alexander's winners, the robot in the cathedral: a human would probably answer the prompt by making the cathedral part of the subject, by investing some effort in pointing out that this is indeed a cathedral, instead of just conforming to the query by showing some ambiguous bits in the background, which may or may not represent parts of a cathedral. The quest of the machine is still "create an image by arranging the elements provided in the query", not "compose an image showing this and that as the subject" – and there's a significant difference. Closing this gap would require the AI to form a concept of what is given as a subject in its entirety and then to construct a plausible scene around this, by a meaningful placement of the subject, which is clearly beyond the state of the art.

I admit that there is a certain appeal to those images, for their distinctive dreamlike quality, as there's often a specific tension in the rather blunt composition and an apparent subject which seems to be beyond what is actually depicted, as if this were just a casually picked specimen from a series of illustrations for a broader story line. But I'd bet that we'll soon have seen so much of this that we're no longer amazed.


Technically this is possible with these same techniques if you just initialize the image with the prior one, though I am sure that does not work that well.

Really you need image+text->image instead of just text->image generation. Some examples of relevant papers: "Conditioned and composed image retrieval combining and partially fine-tuning CLIP-based features", "IMAGE GENERATION WITH MULTI-MODAL PRIORS USING DENOISING DIFFUSION PROBABILISTIC MODELS". There was a more recent one I saw on Twitter I don't recall the name of. I wouldn't be surprised if these kinds of things work well by a year from now.


What shows how low-level these models still are is that they don't seem to be able to draw text on a surface; it's generally just nonsense. Going higher in abstraction, like asking for permanence of distinct entities or world knowledge (say, having a player face the basketball hoop), is several levels above that yet.

I think that puts pretty severe limits on what you can do with it because in a videogame, a comic strip or basically any piece of sequential art you need to keep track of characters and environments as objects.


Check out the model size comparison with the kangaroos here: https://parti.research.google/

Embedding a language model in the image generation model seemingly just requires a bigger network.


I guess what I'm saying is that I agree we're in the Clever Hans stage of AI, where we're just more explicit about stopping Hans when his tapping has reached our goal.

I think the ability of these image models to synthesize new images is really amazing. It makes the computer feel like it is doing something organic, and not just applying filters and things to the images. Then, when we see the new image paired with a text that generated it, we think the system might actually know what we're talking about. But it obviously doesn't, it's just the luck of a model with billions of parameters. Whenever the model fails to produce an acceptable output, it stops being intelligent and the user is considered to be bad at their job, or to be asking for something that is unreasonable.

I think it's still spot-on to say that comprehension is far away, even though you can pair outputs to inputs and have a simulacrum of comprehension.


Take a look at the curse of dimensionality... We're at the stage of reducing a haystack from nearly infinite to a small pile of hay to search for the needle, which has required massive advances. This really isn't clever hans at all.

Additionally, it's helpful to look at these systems as tools. We don't expect cars to work well without humans learning how to interact with them in a safe and reliable way. ML tools, thanks to high expectations and moving goalposts, aren't tested in the same way.

But ultimately this specific line of questions - handling context over multiple queries - is something people are actively working on, and I'm confident it'll have some real solutions within a year. It's closely connected to synthesizing video, which has a huge amount of effort going in right now and some really incredible early results already.

And then we can move the goalposts again and continue talking about horses...


I guess you're technically correct, but the task you're describing isn't generating an image from a prompt. It would be to maintain context across distinct-but-related statements based on an internalized model of reality. That's like discounting the advent of the calculator because you still need an accountant.


This is just composition again: If Imagen had compositionality, it would generate the four images you want from the prompt “A four panel webcomic: first, an astronaut riding a horse. Ok, now the same location where he arrives at a rocket. Now one where he dismounts. Now the horse runs away as the astronaut enters the rocket."


That is not composition in the linguistic sense. It's context. Composition will tell you that in the phrase "the same location where he arrives at a rocket", "same" modifies "location" rather than "rocket", but it won't tell you what "same" refers to.


It's interesting that he now casually throws out a 5 year old as the benchmark to beat:

> nobody has yet publicly demonstrated a machine that can relate the meanings of sentences to their parts the way a five-year-old child can.

Not very long ago that would have been a 3 year old, or maybe even a smart 2 year old. 5 year olds are extremely good at basic language and understanding tasks. If we get to the point of AI that is as good as a 5 year old we're essentially at AGI.


Yeah, and AI is probably already near primate level intelligence, so what’s left is a blink of an eye in evolutionary timelines.


Who in the field is saying current AI is near primate level intelligence?


Here is some primate art, for reference:

https://www.sarah-brosnan.com/primate-art

I’m not just poking fun. Art is a measure of cognitive development in humans and there are very typical representations people use at certain ages. 5 year olds are still making pretty rudimentary portraits of circles and triangles with stick limbs.

https://empoweredparents.co/child-development-drawing-stages...


We have absolutely no way to tell how far from "AGI" we are.

What we know for sure is that we're not there yet. And what seems likely is that we're getting closer, and that's something.

That is as much prediction we can get.

I don't think that compositionality is a wall; it is clearly an interesting feature. But I think it is pretty clear by now that the Turing test, or anything in the same spirit, is far from sufficient.


>I think he is so far off that I offered to bet him $100,000 he was wrong; enough of my colleagues agreed with me that within hours they quintupled my bet, to $500,000. Musk didn’t have the guts to accept, which tells you a lot.

What a bloviating egomaniac. Does Musk really have the time to deal with pissant researchers like him? What's $500k to a man worth a hundred billion?


Yeah, I didn't find that very credible. A busy businessman ignoring petty bets you propose is not really evidence of anything, nor is the part about Google ignoring his requests. In fact it's a pretty lame rhetorical device. I could equally "challenge" a head of state on Twitter and then pretend that his failure to reply indicates something.


This test of compositionality is utterly lame. (FtR: I am a cognitive scientist and AI researcher and my PhD was building computational models of how humans do compositionality - which neither I, nor anyone else can spell, and therefore I will hereinafter refer to simply as C! :-) Anyway, the kind of C that they are seeking is trivial compared to the breadth of the capabilities of human C. Here’s a better example:

You are engaged in a long conversation with someone, perhaps a friend of a friend who you met for lunch. At some point in the conversation they mention that they have a startup and are seeking someone like you. This revelation colors the whole conversation from that point onward. Indeed, each sentence colors the conversation from moment to moment.

But, you reasonably respond, we can’t test that sort of C, modern AIs don’t do even ELIZA-level dialog yet!

What’s the phrase??? “I rest my case?”


What kind of insights do you expect a machine to be able to extract? I passed your example to GPT-3, and got back results that seem about the same as I'd expect from a human:

PROMPT:

> This is a test of reading comprehension. Read the following passage and answer the questions below in order.

> Passage:

> "You are engaged in a long conversation with someone, perhaps a friend of a friend who you met for lunch. At some point in the conversation they mention that they have a startup and are seeking someone like you. This revelation colors the whole conversation from that point onward. Indeed, each sentence colors the conversation from moment to moment"

> Questions:

> 1. What is the "revelation" referenced?

> 2. What do you think the person is hoping to achieve by inviting you to lunch?

> Answers:

GENERATED OUTPUT:

> 1. The revelation is that the person has a startup and is seeking someone like the reader.

> 2. It is possible that the person is hoping to recruit the reader for their startup.
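(For anyone who wants to reproduce this, a minimal sketch with the 2022-era openai Python client is below; the model name and sampling settings are my guesses, not a record of the exact call.)

    import os
    import openai

    openai.api_key = os.environ["OPENAI_API_KEY"]

    prompt = (
        "This is a test of reading comprehension. Read the following passage and "
        "answer the questions below in order.\n\n"
        "Passage:\n"
        "\"You are engaged in a long conversation with someone, perhaps a friend of a "
        "friend who you met for lunch. At some point in the conversation they mention "
        "that they have a startup and are seeking someone like you. This revelation "
        "colors the whole conversation from that point onward. Indeed, each sentence "
        "colors the conversation from moment to moment\"\n\n"
        "Questions:\n"
        "1. What is the \"revelation\" referenced?\n"
        "2. What do you think the person is hoping to achieve by inviting you to lunch?\n\n"
        "Answers:\n"
    )

    # Model and temperature are illustrative assumptions.
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=prompt,
        max_tokens=150,
        temperature=0.7,
    )
    print(response["choices"][0]["text"])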


And if you ask it, "What does it mean for the conversation to be colored?", what does it answer with?

Or to be tricky, if you were to ask it, "What color was the conversation from one moment to the next?", what would it say?


Oh. Sorry. I seem to have taken the conversation off in a different direction than I had intended. (Foreshadow: The previous sentence is carefully shaped!) When I said "color" I didn't mean to be indicating qualia, although there is that too. What I meant to be indicating is just that each discourse contribution folds into a semantically sensible whole that one can speak of (only metaphorically here) as the color of the conversation. One might say (to oneself, if one had a mind to, or was asked): "Oh, but hold on, I didn't realize that this was an interview. I thought that [our mutual friend] just thought that we would get along well. But now that I see it's really an interview, and you're the CEO, well, that makes this a very different situation, and I'll have to put on my 'interview with the boss' face..." and the like. Again, I don't mean that you would say this explicitly, nor could you, probably, unless you were pressed to do so, in which case you couldn't really completely explain all the nuances, most likely. Nb. (per foreshadow above), each contribution in this, or any discourse, "colors" (or perhaps rather, kneads together, if you prefer a cooking metaphor) an ongoing collage of situational understanding which comes together to direct, for better or worse, the ongoing complexities of the discourse. But, and here I want to be perfectly clear: Not in a fully forward-going way, because then you'll simply say: well the blah blah state of the whole blah blah network incorporates all that, like Dall-E, etc. melding together everything and smearing it into something sensible. But that's not how people work! In constructing their next action (e.g., sentence) they foreground some aspects of the composed whole, and background others in a goal-directed manner...at least if they're not too drunk.


That's clever. But my response is colored such that I wonder if you didn't generate that text by using some of the other comments, shaping it such that I can't tell whether you're on the way to inebriation or just messing with my head. Or perhaps composing a point. In which case, I myself could use a drink.


Okay, lol literally! If only I could click the up arrow twice, this definitely would deserve it! :-)


It's a lame test, but I don't think most people were claiming that it proves general compositionality. What it does prove is that compositionality is possible with these models, and will likely improve rapidly, as everything else has that they've gotten a toehold into.

Ironically, the very fact that there is now a compositionality benchmark, as Gary points out, is all you really need to know that it's going to fall in the next decade, and probably sooner than that. I'm not aware of any major benchmark dataset upon which enormous progress has not been made in the last few years. And I'd be more than willing to bet anyone anything they'd like that a great deal of progress will be made on this one over the next few.


Marcus seems to be treating it as the hallmark of intelligence (I think he actually uses that phrase), so arguing about whether the hack manages to get the tree into the effective object slot vs the effective subject slot is really not much of a hallmark.


This just fundamentally feels like a bad hill to die on. Compositionality feels like it is:

A) Something AI is currently known to be bad at

B) A matter of degree, not a categorical stumbling block

C) A concept vague enough that AI skeptics will continue to complain about it even after the field has moved on

Your example feels less like a description of "compositionality" and more like a description of "qualia." It feels an awful lot like dualists trying to carve out a place for magic that no artificial process can reach.


Qualia isn't magic. It's the philosophical term for what experience feels like: colors, sounds, pains. It's dismissive to call it magic. How about instead coming up with a good physical explanation of consciousness, showing how the hard problem is mistaken?

Similarly, if compositionality isn't a categorical stumbling block, then show how that's the case. Making a future prediction about what you think computers will accomplish doesn't do that.


I was not implying qualia were magic, insofar as there can be monistic descriptions of qualia. My criticism was that the example appeared to be steering the conversation into a crash course with dualism by invoking qualia-like explanations of compositionality when it was unnecessary to do so.

I believe the "hard problem" is far easier than the dualists' interface problem. It can be explained by viewing consciousness not as the driving force behind our thoughts/feelings, but rather an after-the-fact log the brain keeps for itself. Qualia are therefore distinct from the immediate physics of signals reaching the brain; they are instead the brain's own shorthand description of the impact those signals had on the brain.

Compositionality isn't a categorical stumbling block because the machines are actually getting incrementally better at it. The Winoground paper the article references explicitly says that score on their compositionality benchmark does in fact scale with training dataset size, suggesting that while it is more difficult for AI to discover compositionality during training, there is no reason to think it is impossible.


Not only compositionality: for an autopilot I would want appropriate apprehension, emergency response, and courtesy-and-safety behavior through and through, and I would want it to resemble human behavior more closely than anything AI has achieved so far.

Then it must further be tuned to perform even more appropriately on those points than the most effective human driver.

The average of human drivers is not a valid benchmark.

When you think about it, natural language efforts could probably use more effort tackling this same type of challenge.

Humans too, IRL; so maybe when it's good for the man and good for the machine in completely undeniable ways, then you're on the right track.


I'd love to read your PhD thesis and papers! I'm also an AI researcher, currently doing a Master's in something else, but compositionality and representation learning is very interesting to me.


So much ad hominem in these comments, relatively little substance (e.g. “notorious goal post mover”, without a single example of something I actually said and changed my mind on).


The Reddit comment linked by the topmost comment here says that you claimed AI couldn’t do knowledge graphs and then silently stopped claiming that after being proven wrong. Do you dispute that telling of events?


Silence in response to your comment is great evidence for its thesis.


I would say that it seemed you were aiming a cannon at a mosquito. So what if Alexander showed us some slightly more coherent, cherry-picked images from some rather vague prompts? Not only did I not take that post as anything resembling science, I also didn’t take it more seriously than the average Reddit post with an interesting generation. It seemed completely non-serious to me, proof of nothing, not a Google PR submarine, and mostly in good fun. The irony being that within your excellent post about compositionality, you seem to have missed his meaning, which seemed to me to be “this is a fun thing I am excited about, I think it’s subjectively improving and I enjoy being right about that.”

Otherwise I thought you had a great introduction to compositionality and didn’t need to tilt at any windmills to make your points. I look forward to seeing your benchmark results for recent and upcoming models.


keep fighting the good fight, hacker news is full of indentured solipsists.


I completely forgot about Google Duplex. It looks like it is still around but very limited in terms of what phones you can use, what cities it can be used in, and what businesses in those cities will accept it. Doesn't appear any progress has really been made in the past few years. I think this is a great example of how companies create something with AI that is initially really cool, but isn't quite there to actually be very usable and gets forgotten when they roll out the next big thing.


The last 10 years of AI are basically defined by proofs of concept like that: 80% (or whatever) solutions with a claimed path to something commercially viable. Turns out the remaining ~20% is always basically impossible - self-driving cars being the archetypal example. I work in the field and I think it can be a great tool, but we need to acknowledge what its limitations are and that we don't actually know how to address them yet.


Now it seems like you are the one moving the goalposts. There are tons of machine-learned models in production, in translation, text segmentation, image segmentation, image search, predictive text composition, etc. It's just that people forget the novelty of all these things immediately after they were launched. You can point your phone at printed Chinese text and have it read aloud to you in English. That is alien tech compared to 10 years ago.


> You can point your phone at printed Chinese text and have it read aloud to you in English.

Yeah, but it's not really that good. Machine translation has improved a great deal, but reading those translations actually involves bringing a lot of human intelligence to the table, "Oh I bet, 'maximum fire alarms spread' on this menu actually means 'very hot sauce'"

If all you're claiming is that ML models exist and have useful commercial applications, then I don't think anyone is going to argue against that point.

But a lot of these AI promoters go further: in the case of the LessWrong folks some of them are convinced that a superintelligent machine capable of enslaving humanity is right around the corner.


You’re saying the bear doesn’t dance all that well.


I'm saying the bear was trained to dance, it didn't choose to dance.


That might just be a Google problem. Historically, they've had the good fortune to operate in search advertising, where being 80% right half the time translates into billions of dollars. Many other fields (e.g. self-driving cars) are less forgiving.


The Hold for Me and Direct My Call features for Pixel's Phone app both use Duplex models running locally on your device, and those features are quite popular. I think that counts as significant progress by any measure, so your point doesn't hold in this case.


Those features are not in the same league as the original promise of e.g. calling a business and making a reservation for you.


> If you flip a penny 5 times and get 5 heads, you need to calculate that the chance of getting that particular outcome is 1 in 32. If you conduct the experiment often enough, you’re going to get that, but it doesn’t mean that much. If you get 3/5 as Alexander did, when he prematurely declared victory, you don’t have much evidence of anything at all.

This doesn’t make much sense. The task at hand is in no way equivalent in difficulty to flipping a coin. This is kind of like saying, “if you beat Usain Bolt in a race 3/5 times, that doesn’t mean anything; it’s like getting 3/5 coin flips to be heads.”


While I'm generally very unsympathetic to Marcus' anti-AI arguments at this point, this critique makes some sense. If e.g. the model is just combining the features at random, you'd expect it to combine them the right way over enough tries. It isn't that simple, and I don't believe it matters, as this is hardly the peak model we'll get, but in isolation his objection is valid.


I think you would need to do some kind of analysis. For example, if your prompt was "red ball on top of blue cube" and you want to know if the results come from chance, you'd need to know the likelihood of the model putting the red ball on top of the blue cube by chance. There are maybe five relative positions for red ball to blue cube - beside, above, below, in, around. Are they each equally likely?

I would try to get a collection of prompts like "red ball and blue cube" or "an empty plane containing only a red ball and a blue cube" and so on - try to come up with 20 or 30 of these. Then, generate 100 images for each prompt. Next, see how likely it is for a red ball to randomly be on top of a blue cube when it was not directed to be.

After gathering some baseline data we could then test three prompts. "Red ball on top of blue cube" and "Red ball beside blue cube" and "Red ball below blue cube". Generate 100 or 1000 images for each of these prompts. Count respective orientations. Then, decide whether red ball being on top of blue cube is more likely than the baseline when the specific direction is given and whether it is less likely when contrary directions are given.
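
As a rough sketch of how you might score that experiment once the images are tallied, a one-sided binomial test against the baseline rate answers "is 'on top of' more likely than chance here?". The counts below are made-up placeholders, not real measurements.

    from math import comb

    def binom_tail(k, n, p):
        # P(X >= k) for X ~ Binomial(n, p): the chance of seeing at least k
        # "ball on top of cube" images out of n if each image independently
        # lands that way with baseline probability p.
        return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

    baseline_rate = 12 / 100   # hypothetical: "on top" frequency from the neutral prompts
    hits, n = 61, 100          # hypothetical: "on top" count when explicitly asked for it

    p_value = binom_tail(hits, n, baseline_rate)
    print(f"P(>= {hits}/{n} by chance at baseline rate {baseline_rate:.2f}) = {p_value:.3g}")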


It might understand that there is a cube, there is a ball, the scene has red and blue parts, and there is a vertical placement (“on top of”). In that case it would get 1 out of 4 images right.


Yes, it might, but that should be settled by experiment rather than speculation.


The point is that the probability space of potential generated images is enormous so a 3/5 success rate represents an absurdly unlikely probability of being due to chance.


That would depend on how you define the phase space.


Sure, the probabilities are different (although as far as I know we don't know what those probabilities actually are), but the same principle applies.

To take your Usain Bolt example, if you won 3 out of 5 races against him, that might just be because it was an off day for him, and not because you are actually faster than him. If you won 300 out of 500 races done in various circumstances on different days, then that is much more conclusive that you are faster than him. And this bet was even worse than that, because in each of the five tests, the best out of 10 results is picked.


>To take your Usain Bolt example, if you won 3 out of 5 races against him, that might just be because it was an off day for him, and not because you are actually faster than him.

It shows you're probably very competitive with him, though, barring some special circumstance where he says he's suffering from an illness or whatever. You can't compare either racing Usain Bolt or generating complex images with flipping coins. The conditions of this bet demonstrate that AIs are getting better at correctly understanding the specific intentions of prompts when generating images, even if it doesn't show they're anywhere near human-level understanding.


Exactly my point. Maybe I got lucky, but to have gotten that lucky, I'd have to have world-class running speed in the first place.

Generating image compositions sounds fairly difficult to do at random. If you took 3 different objects and randomly placed them in a square canvas, the odds that they'd look reasonably placed seem pretty low. So 3/5 correct seems like a non-trivial accomplishment.


> I'd have to have world-class running speed in the first place.

Or he was really sick or something.

> So 3/5 correct seems like a non-trivial accomplishment.

It's definitely a non-trivial accomplishment. And it does show that Imagen can get it right sometimes. But with a sample size of 5, you certainly don't have enough data to say it can consistently get descriptions like these right 3/5 of the time.

And the question at hand isn't "can it draw what I asked it to instead of random garbage" it's "can it combine multiple parts of a sentence in the correct way", which, assuming that determining the correct components is already a solved problem, doesn't have as many degrees of freedom. For example in the "astronaut riding a horse" example, if it has half of the results with an astronaut riding a horse, and half with a horse riding an astronaut, it clearly doesn't understand how it is composed, but you still have a decent chance of getting the right image. Especially if you take 10 samples and pick the best one.
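
To put rough numbers on the best-of-10 worry: if each sample independently comes out composed correctly with probability p, the chance that at least one of 10 looks right is 1 - (1 - p)^10. The p values below are purely illustrative, not measured rates for any model.

    # Chance that at least one of k samples is composed correctly, assuming each
    # sample is independently correct with probability p.
    def best_of_k(p, k=10):
        return 1 - (1 - p) ** k

    for p in (0.5, 0.2, 0.05):
        print(f"per-sample p = {p:.2f} -> best-of-10 hit rate = {best_of_k(p):.3f}")
    # per-sample p = 0.50 -> 0.999; 0.20 -> 0.893; 0.05 -> 0.401

Even a model that flips a coin between "astronaut riding horse" and "horse riding astronaut" would produce at least one presentable image in 10 samples nearly every time.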


I'd like to comment specifically on the conception of betting on AI 'achievements' (I think Marcus' bet is underspecified and kind of vague in all 5 of its points).

People shouldn't be betting on benchmarks because benchmarks can be and usually are gamed (see Goodhart's law). Also, most people couldn't give less of a f*ck whether an AI can write an award-worthy poem (I personally don't care about any form of AI "art", any sort of text an AI can produce or really any meaningless "feat" it (as in the general category) becomes capable of). The only worthy bets are ones that discuss economic impact. How many people will be structurally unemployed because of AI by year X? Will it lower or increase the GDP growth rate and by how much? Will it shift the balance between labor and capital and how? Etc.

So more meaningful bets and less benchmark bullshit that doesn't matter, please.


The reason Imagen isn't made available to the public probably isn't about compositionality. The most notable thing about Alexander's challenge is that Imagen totally failed every single one despite his claim of success because, apparently, it is programmed to never represent the human form. Not even Google employees are allowed to make it draw humans of any kind. They had to ask it to draw robots instead, but as pointed out in the comments, changing the requests in that way makes them much easier for DALL-E2 as well, especially the image with the top hats.

If the creators have convinced themselves of some kind of "no humans" rule, but also know that this would be regarded as impossibly extreme and raise serious concerns about Google with the outside world, then keeping Imagen private forever may be the most "rational" solution.


>The most notable thing about Alexander's challenge is that Imagen totally failed every single one despite his claim of success because, apparently, it is programmed to never represent the human form.

This doesn't make sense. The original challenge could well have been to draw robots to begin with. Has no bearing on the outcome imo.


But it wasn't, and it does make a difference. Dall-E really wants to draw top hats on people and not cats because the prompt is ambiguous and top hats are normally seen on humans, so it struggles to overcome that bias. Neither robots nor cats wear top hats, so it's an easier problem to get right.

But the real problem here is the refusal to do basic and normal things, like depict people. That's not normal - it's deeply weird and tells us a lot about what must be going on inside Google's ai research effort.


>But the real problem here is the refusal to do basic and normal things, like depict people. That's not normal - it's deeply weird and tells us a lot about what must be going on inside Google's ai research effort.

Google is fighting a secret war against the Loab demon race that lives inside the high dimensional vector spaces. They've recently made incursions into our reality via Stable Diffusion.


The inability to draw realistic humans is indeed strange, but the question at hand is compositionality, and so drawing a robot with a top hat is indeed more impressive precisely because it's not likely to be in the training data and shows a deeper understanding of the prompt. Presumably the model could randomly regurgitate a person with a top hat on that was seen in its training data, but that's not at all likely with a robot, as you yourself said.


It's not an inability, it's a policy choice, which is why it's weird. The question is why does Google think this rule is a good idea. Imagen could surely draw very good humans if allowed to.

Robot looking at a cat wearing a top hat appears to be easier than with a human for DALL-E too, judging from the comments on Alexander's article, because both objects are neutral with respect to top hats. But really the whole set of prompts is poorly chosen. The original challenge of arbitrary shapes in relative positions seems the best way to test understanding of grammar and object relationships, exactly to avoid the "humans wear top hats and cats never do" problem.

A better set of prompts is important - in this Gary Marcus is correct - exactly because there's no point defining a specific prompt if later you'll decide you accept a totally different prompt. That kind of invalidates the point of betting on well specified challenges to begin with.


Imagen can produce images of humans - they’re just filtered out from the results by supervised models (for now). OpenAI did something similar with Dalle for a while IIRC.


That's a distinction without a difference and Dall-E will happily represent humans as long as they are "diverse".


One of the things I noticed is that satire and callbacks to common news/ideas can really trip up any AI. Also, if you ask it about anything political, ask it to describe both sides of the argument. That's why people fall back on steelmanning and cherry-picking responses to push their arguments.


Yesterday, as part of a new podcast that will launch in the Spring, I interviewed the brilliant...

This seems like the wrong way to go about podcasting. What can you say today that will still be interesting to hear in six months?


If you can't say things today that will still be interesting in six months you should consider deeper subjects!

(Overstated for effect. I do think there's a place for news and timely commentary, but it's far from everything.)


I appreciate overstatement! You're right, important communications consider eternal subjects. When I read books written centuries ago, the authors still speak to me. Podcasting, however, is a particular medium with particular characteristics. One assumes Marcus is trying to build an inventory so he won't have to work as hard to keep the podcast going once it launches. A bit of this is fine, but too much will damage the work. If Marcus and Kohane discuss medicine today, and necessarily neglect to mention the significance of a relevant event five months hence, the episode will seem weird whether the publishing delay is explained (e.g. as commonly heard on sports-betting podcasts) or not. A podcast is not a book. It is an open-ended serial conversation. Serial works necessarily respond to the present moment.


Maybe? I only listen to podcasts occasionally, but when I do I generally listen to well-reviewed older episodes instead of the most recent ones. With my favorite podcasts (ex: https://80000hours.org/podcast, https://www.econtalk.org, https://songexploder.net/) this generally works well.


We have different podcast habits. I listen a great deal. I'm currently subscribed to over 200. Not all of those are still "live", of course; I don't bother listening to every episode of most of them, and I currently intend to unsubscribe from at least ten. But since I drive a fair amount, operate noisy equipment a fair amount, and do random solitary farm tasks a fair amount, I do listen to lots. Also I have playback speed set at 1.8x currently, and I'm steadily increasing that.

I don't use an app that features reviews. I'd rather spend five minutes listening to the original than five minutes reading about it. Most podcasts I find through guest appearances or criticisms from current subscriptions. (E.g., I subscribed to the excellent "Red Scare" because the also excellent "Pod Damn America" dudes spent like ten minutes bitching about them.)


Every concrete prediction Gary has made has been falsified. All of his others are insufficiently precise to be falsified.

His GPT-2 examples were thoroughly defeated by GPT-3. Horse riding astronaut is solved. Neural knowledge graphs are a successful thing now. Compositionality isn't solved, but progress is clearly being made.

If he was a serious person, this post could have been a few sentences: "No neural network will achieve <x> score on <y> metric on the Winoground dataset within the next <n> years". Simple, concrete, falsifiable. He has not done this, and one has to wonder why.


Just keep laughing. I'd like to hear Ray Kurzweil's view (he's working at Google and is awfully quiet.)

Human consciousness is over-rated. I'm reminded of Minsky's Society of Mind - a number of separate, communicating systems. To me, that sounds a lot like what is going on in Google, but they are hiding that.


Ray Kurzweil was always taken as a bit of a loon and over-optimistic. Even way back at the peak of his popularity a decade ago. Just look at any of the old HN threads, anyone paying attention would have noticed.

He's still a useful mind to have around. Like having scifi authors and philosophers. They don't have to be completely grounded in reality to provide useful projections as sources of inspiration and to challenge our grasp of history and growth.


He was hired ten years ago at Google as Director of Engineering [0]. His book released that year was "How to Create a Mind". He's still there, and that's what he's doing, I think. He is supposed to release his new book, "The Singularity is Nearer", in 2022, according to his website. I'll be reading that!

[0] https://www.wsj.com/articles/BL-DGB-25711


Oh, he just did an interview with Lex Fridman:

https://youtu.be/ykY69lSpDdo


I don't believe "compositionality" is a serious obstacle.

It is a different issue than generating an image based on a bag-of-words, so it isn't surprising that an attempt to solve that issue didn't immediately solve the other.

But a variety of approaches can easily solve this problem.


Right - your training data set is images plus descriptions. But the descriptions are not typically descriptions of composition.

Descriptions of Napoleon Crossing the Alps are unlikely to read 'A small Frenchman wearing a silly hat riding on a horse'. So why would an AI trained on such image descriptions develop any sense for 'compositionality'?


Yes, especially when machine translation seems to handle it just fine.


Does it really, though?


Mostly. See this for five examples using Google Translate: https://www.datasecretslox.com/index.php/topic,7588.msg30007...


I'm not sure that machine translation demonstrates compositionality, since it's translating from phrases already composed in one language to another. It only does so if understanding composition is necessary for language translation. Whereas carrying on a meaningful conversation does require understanding of how words are being put together as the conversation evolves. That's why the Turing Test has been considered important for determining whether an AI has achieved human-level abilities, at least as far as language use is concerned.


I don't see why translating from one language to relationships in art (visual language if you will) is qualitatively different from translating from one language to another.


I wish more articles followed the standard essay format. At least state your main thesis in the first paragraph.

There are interesting things buried in here, but I don’t have time for rambling.

The edge cases of image models have been more succinctly summarized and speculated upon elsewhere.


Yes I've noticed that a lot of authors expect you to read through some parable before they tell you what they are going to tell you. It would be fine with an abstract or even a sentence below the title that says "ML models are not being adequately evaluated for composability and it makes them look more intelligent than they are". Just diving into "consider clever Hans" makes it tough to know if it's worth reading.


Why Scott Alexander of all people? Isn't he a clinical psychologist?

I think, if I had to give the task to the-subset-of-people-appearing-frequently-on-hn, I would give it to Gwern, not Scott.


Because Scott somehow manages to forward-activate the neurons of many people who read him. I'd say he's in the "top 5" of topics that show up most frequently (gwern has fallen off significantly). He's a clinical psychologist, but he's got a collection of weights driving his writing that manage to make a subset of tech people feel something.


Like Malcolm Gladwell but not an NYT bestseller, so cooler, or something. Also, minor point, he is a psychiatrist (MD).


I stopped reading when I got to the part where it became clear that Scott Alexander was the "Silicon Valley's Sharpest Minds" of the article title. Why not ask Hulk Hogan or Barbara Walters to evaluate Google's AGI?


Because neither of them are recommended by the CEO of OpenAI, and followed by the head of MIRI, Paul G, and Vitalik for good reason?


First time I've seen the term "snooker" used outside of the sport Snooker.


oh no musk ignored my twitter DM it must be because he's scared of taking a bet and therefore I am right

btw, AGI is coming 2030. Source? It was revealed to me in a dream. Check my profile to see where you can email to take bets.


It was all most likely a reference to this:

https://longbets.org/1/

I personally think Kurzweil still has a shot at winning it.


> Full disclosure, I read Alexander’s successor Slate Star Codex, Astral Codex Ten, myself, and often enjoy it…when, that is, he is not covering artificial intelligence, about which we have had some rather public disagreements.

Can it be a case of Gell-Mann Amnesia Effect? (https://en.m.wikipedia.org/wiki/Michael_Crichton#GellMannAmn...)


It's more of a case of reverse Gell-Mann Amnesia where Marcus is so blinded by his bone to pick with DL that he doesn't realize that Scott is as right/reasonable in writing about AI as those other topics.


So, not reasonable at all?


Imagine watching the seeds of AI that will terraform society and rapidly displace human labor over the coming decades be planted, and then still splitting hairs over whether or not it'll achieve sentience.

Our world is changing before our very eyes while this guy is belaboring the technicalities. You could hardly ask for a keener display of the philosophical gulf between scientists and engineers.


I have a lot of trouble understanding how this sentiment can exist.

Especially since the rise of GPT-3 and now these image models, we've seen the pop-culture face of AI become even narrower. The promise of generalization that could lead to intelligent behavior has given way to people sharing amusing pictures or phrases that these models have generated, because that's what they do. It's cool, but it's basically become orthogonal to any AGI, or even AI with applications. It's now just a neat cultural phenomenon from which laypeople somehow extrapolate the kind of stuff the parent is saying.

I'm not saying AI (neural networks) isn't making research progress; it's just that it has almost nothing to do with any of what laypeople extrapolate from it.


I'm sorry, but there is no gentler way to phrase this: you are calamitously blind to what's happening on the ground.

https://twitter.com/AdeptAILabs/status/1570144499187453952 https://twitter.com/runwayml/status/1568220303808991232

https://scale.com/blog/text-universal-interface


Watch out for histrionic phrases like "calamitously blind". They indicate you're getting too emotional, losing perspective, verging into extreme, black-and-white thinking.

Text to video and converting some selected requests into actions is all nice, but it hardly contradicts the GP's observation: it's nowhere near AGI.


A pattern I'm seeing in the later replies here is that few of them are responding to the substance of my comments. Perhaps dang will clean this up.

EDIT Since you ninja edited this in:

> Text to video and converting some selected requests into actions is all nice, but it hardly contradicts the GP's observation: it's nowhere near AGI.

If you review the root comment I made, you'll understand that I was never arguing with the GP about AGI in the first place.


Then it's puzzling to accuse someone of calamitous blindness when you are not even engaging with the point of the post you're replying to.


> Then it's puzzling to accuse someone of calamitous blindness when you are not even engaging with the point of the post you're replying to.

At this point, I'm wondering if you're just provoking me deliberately. The comment I replied to said the following:

> The promise of generalization that could lead to intelligent behavior has given way to people sharing amusing pictures or phrases that these models have generated, because that's what they do. It's cool, but it's basically become orthogonal to any AGI, or even AI with applications.

And then I posted evidence of concrete applications that are in progress at some of the most well-resourced companies in Silicon Valley. Absolutely groundbreaking stuff that more than prove sophisticated applications of contemporary AI are well on their way to being realized.

A lot of "histrionic phrases" to describe your reading comprehension ability are occurring to me right now, but I'll refrain from using them.


I am sure we can all agree that the new generation of AI models has some applications. How much remains to be seen. The ones you've noted could be nice. We'll see.

A Twitter thread demo is not quite a revolution yet IMHO. Even in the 60s some people thought ELIZA was a real person.

I've said all I have to say. Have a nice day.


>evidence of concrete applications that are in progress at some of the most well-resourced companies in Silicon Valley

Neither of the 2 companies you posted marketing materials about is among "some of the most well-resourced companies in Silicon Valley".


> Neither of the 2 companies you posted marketing materials about is among "some of the most well-resourced companies in Silicon Valley".

Two of the founders of Adept AI are authors on the paper 'Attention Is All You Need'. If you don't understand the significance of that, then you're speaking well outside of what you're qualified to comment on. The company has also raised capital from top tier SV investors.

Runway ML has raised money from Lux Capital.

These companies are not just well-resourced, they are positioned in the upper echelon of the innovation business.


I mean, I can easily dig out data about their funding that would not put them even inside the top 20% of companies in the valley. But at this point, given that you lack the competence to distinguish between founders' achievements prior to founding a company and the companies in question being "some of the most well-resourced companies", it's "why bother" with typical AI bros, incompetent at anything they touch.


I enjoyed the Roon blog post but I found this bit amusing:

> It is easy to bet against new paradigms in their beginning stages: the Copernican heliocentric model of cosmology was originally less predictive of observed orbits than the intricate looping geocentric competitor. It is simple to play around with a large language model for a bit, watch it make some very discouraging errors, and throw in the towel on the LLM paradigm. But the inexorable scaling laws of deep learning models work in its favor. Language models become more intelligent like clockwork due to the tireless work of the brilliant AI researchers and engineers concentrated in a few Silicon Valley companies to make both the model and the dataset larger.

I don't know about you, but if I feed a program with hundreds of billions of "parameters" a huge chunk of the internet and it can then kinda-sorta do a bunch of things, sometimes semi-intelligently, but for the most part couldn't compete with a 4-year-old child... I'd say that's more on the Ptolemaic side of things than the Copernican side. Certainly "it gets better as you feed it more data" is equally true of both paradigms, so I'm not sure what Roon's point is here.

The appeal to the Copernican revolution itself has a bit of a hype-y, cranky odor. Virtually every crank appeals to Copernicus as a role model and vindicator. Real scientists usually don't, because they are busy with the hard, humbling business of actually figuring out how the world works.

Now don't get me wrong, I am thrilled by the research advances of the last couple decades, the foundation models, AlphaGo and AlphaFold, etc. The action model from Adept is great and Adept may become a very successful company. It's all very cool. But every paradigm shift in AI has been heralded as the thing that will Change Everything, and they usually don't. Big, exciting shifts in research don't necessarily mean as much in practice right away. I tend to think that getting AI "right enough" to have a huge, pervasively transformative impact on human life is going to take quite a few decades at least, if not centuries or more.


At this point, I'm numb from all of the AI overhype. I was extremely excited about DALL-E and convinced myself that concrete fruits of the AI revolution were finally here... until a few seconds after I got the chance to try some queries myself. Ditto Copilot.

The recent progress on generative models is a major research achievement, to be sure. That said, I'm not sure what it means to "terraform society," but so far AI shows no signs of making the same magnitude of impact on society as, say, the S-tier technological advances of the 20th and early 21st centuries, such as the personal computer, Internet, smartphone, or atomic bomb. That all may change if we get AGI that actually works, of course.


> until a few seconds after I got the chance to try some queries myself.

I’m the opposite. I’m finally, after a long time, starting to get excited about AI. Yes, most outputs still suck and require a lot of experimentation and rephrasing, and yes, midjourney produces a lot of same-looking things (less freedom, but also less crap compared to dall-e).

But wow, now even I, someone with no artistic talent whatsoever, can with just a few prompts create a cool illustration. My current discord avatar is a sloth drinking a cocktail [0]. Zoomed in, it looks a bit uncanny, but generally and especially at smaller sizes, it’s fine.

I could not draw something even halfway as okay. I would not want to pay someone to do it for me as it’s of no big importance to me (I once paid someone for their Stranger Things as sloths image, but even that was just something they already created, not a commission which would have been vastly more expensive).

Personally, I really can’t wait to see what the next generation will be like, what it will enable people to do, and what it will enable me to do. Yes, I’m very excited.

[0]: https://i.imgur.com/0RwVNP4.png


[flagged]


I mentioned DALL-E and Copilot in my post, so I'm not sure why you're linking me to a article summarizing recent high-profile research in large language models...


You appear to have skipped over two links? And the LLM article goes well beyond DALL-E and Copilot. You should try reading it.


Is any of those links supposed to show something impressive? If that was the intent, you failed.


My view is they illustrate what's on the horizon. But clearly we have a difference of opinion on the matter.


It's behind the horizon. You people should learn from the history of the whole field that progress is always slower than the marketing hype, usually by an enormous gap.


> It's behind the horizon. You people should learn from the history of the whole field that progress is always slower than the marketing hype, usually by an enormous gap.

I'm going to quote my original comment:

> seeds of AI that will terraform society and rapidly displace human labor over the coming decades


> seeds of AI that will terraform society and rapidly displace human labor over the coming decades

Replace AI with any other labor saving technology and your statement becomes just a truism without substance.

AI is already displacing human labor; just try to talk to a non-robot when you call a customer service line these days. It’s a selling point to have real humans answering phones nowadays.

Being fungible with human labor is what people are really talking about, not some answering machine with “AI” brains that replaced the old-school answering services.


> AI is already displacing human labor, just try to talk to a non-robot when you call a customer service line these days.

You’re right. And my point is that substitution due to AI will accelerate. That’s where the informational surprise of my original comment lies.

You have a fatal misapprehension about how automation transforms a labor market. Higher productivity of certain kinds of work due to automation pushes labor supply elsewhere, making the “elsewhere” in turn both more competitive/demeaning (think Amazon warehouse workers peeing in bottles at the lower end of the market and Stripe engineers burning out at the upper end) and less remunerative.

The terminal point of this trend is complete human obsolescence, but the displacement along the way is additive, will likely accelerate in the coming decades due to advances in AI, and is especially problematic because there are limits to the elasticity of the labor pool (i.e. its ability to adapt to rapidly changing conditions).

I would furthermore predict that governments will be too slow to respond to this and that social upheaval will consequently escalate dramatically.

Come back to this comment in ten years and see how I did.


> Come back to this comment in ten years and see how I did.

Probably the same as people who predicted this 100, 200, 500 years ago I'd venture.

--edit--

And assuming the robot overlords don't just go all Walden Pond and bask in the sun under solar panels contemplating Life, the Universe and Everything.


There's no reason to believe AI or any other automation displaces human labor (esp in a way that causes unemployment). And even less reason to believe it already has.

https://noahpinion.substack.com/p/american-workers-need-lots...

It seems to be a myth caused by anxiety about high unemployment in 2010, but we're no longer in that world.


Ask horses if machines can displace jobs for whole categories of workers. Or ask Neanderthals if it’s possible to have one’s role replaced by a higher-IQ substitute.


Horses aren’t workers, they’re horses. They didn’t ask to participate in the economy in the first place.


Nobody asked me whether I wanted to participate in the economy either, yet here I am.


The affirmative answer to that ask was implicit in the first employment contract you signed. So, unless you'd claim that horses can give implicit consent by accepting grain from a human, the equivalency fails.


[flagged]


Flagged for unsubstantiated ad hominem.

EDIT: I saw you delete that comment. I won't point out how amusing it is that someone like you would accuse another person of being a troll.


How do you know this to be true? There are many failed future predictions. There was a post about it just the other day. I believe it rated Kurzweil's singularity predictions at 7% accuracy to date. We still don't have commercial flying cars, cold fusion or space colonies.



That's a neat demo. Now explain how you know, "AI that will terraform society and rapidly displace human labor over the coming decades", to be true.


> Now explain how you know, "AI that will terraform society and rapidly displace human labor over the coming decades", to be true.

Because I didn't say it will "obsolete" human labor.

You're right that the word "will" is strong in that statement. But, since basically nothing is absolute in a philosophical sense and certainly no one can prognosticate with certainty, it's fair to read that as "will with high likelihood".

And your surgical nitpicking doesn't hamper my argument in quite the way you think it does. Frankly, I believe it's in defiance of the following HN guideline:

> Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith.


Now ask it a question.


Regarding Gary Marcus, the author of this piece, and his long and bizarre history of motivated carelessness on the topic of deep learning:

https://old.reddit.com/r/TheMotte/comments/v8yyv6/somewhat_c...


You know what would have been much more effective than this counter-screed? A pointer to an image generated by DALL-E of a horse riding an astronaut. That is something I would really like to see. And in this case a picture is literally worth a thousand words.



Hah there is actually a good example of a horse riding an astronaut there, just a different kind of riding… https://nitter.net/Plinz/status/1529018578317348864#m


I have played around with GPT quite a bit and I would say that GPT understands the difference. Text-to-image models are not specialized in the text-parsing part, so I think it's forgivable that they are not as good at it.

Edit: Actually I tried this right now with two prompts, and I was wrong. It might still be that GPT understands compositionality, but the prior that people ride horses is just that strong. But what I saw was that with this particular situation the model got it wrong.

Edit 2: With some heavy hinting it managed to understand the situation. Italics mine. "An astronaut is walking on all four. A very small horse is sitting on top of him, riding him even. Shortly after the astronaut stops, exhausted.

The horse is too heavy for the astronaut to carry and he quickly becomes exhausted. Next, the horse gets off the astronaut, stands on its own four legs, and walks away."


In fact, I wrote a whole article about this (linked in this essay, called Horse Rides Astronaut) and linked an example therein.


Why did you ignore the Bach examples that show a horse riding an astronaut?



This is mostly just an angry rant, yes, but equally it is just true. Marcus is intellectually dishonest.


Just about every rant on Marcus or other AI critics is some combination of "you aren't admitting these things are making great progress on the benchmarks" (implying the false idea that "a whole lot of progress" adds up to human-level AI) and "you are making 'human level' an unfair moving target by not having a benchmark for it". The thing about this is that if there were a real "human level benchmark", we'd supposedly be 80% of the way there, but we can't build one and we aren't. Marcus and other critics have drawn explicit lines (spatial understanding, compositionality, etc.), but even those being crossed won't prove human-level understanding. There is no proof of human-level understanding, just a strong enough demonstration. And if someone can point to dumb stuff in the demo, it isn't strong.

PS: your link is an embarrassment. It would be flagged and dead if you pasted in the text here.


"I am angry not because someone is wrong, but because they are not interested in becoming less wrong."

Paraphrased that a bit, but I really like that quote.


Missing the point: dismissing an apocalyptic possibility as having probability 0 without proof is dangerous -> therefore we should take it seriously. Taleb's work is relevant in the context of risk analysis.


They first approached Lex Fridman, but his home-spun test had zero questions. /s


It's interesting that people keep coming up with things that are meant to distinguish AI systems from human intelligence, but then when somebody builds a system that crushes the benchmark the next generation comes up with a new goalpost.

The difference now is that the timescales are weeks or months instead of generations. I believe we will see models that have super-human "compositional" reasoning within 1 year.


This is called the AI Effect: https://en.wikipedia.org/wiki/AI_effect

> The AI effect occurs when onlookers discount the behavior of an artificial intelligence program by arguing that it is not real intelligence.

> Author Pamela McCorduck writes: "It's part of the history of the field of artificial intelligence that every time somebody figured out how to make a computer do something—play good checkers, solve simple but relatively informal problems—there was a chorus of critics to say, 'that's not thinking'." Researcher Rodney Brooks complains: "Every time we figure out a piece of it, it stops being magical; we say, 'Oh, that's just a computation.'"


> meant to distinguish AI systems from human intelligence

> but then when somebody builds a system

I mean this is really it. You still have to have a human to build these systems that specialize in one thing. Once you create a system that can automatically create those systems and it doesn't need humans anymore to solve novel problems, then there will be no practical difference in kind between human and AI intelligence.


> Once you create a system that can automatically create those systems...

Except we don't have that. We don't have one human that can create this system by themselves. We have a select group of a handful of smart, motivated, and quite generously compensated humans working on these problems to create such a system. As such, you are already surpassing the "general" intelligence level by quite a lot.


No, you misunderstand, the "systems" I am talking about are the ones built into our minds, like recognizing faces, or understanding speech. Humans can learn to speak and recognize each other automatically, but AI systems have to be built specifically to do each task.


> Humans can learn to speak and recognize each other automatically, but AI systems have to be built specifically to do each task.

I think that is a very generous take on what we do "automatically". After all, we have millions of years of evolution to build out all the neural circuitry that helps us with speech or vision -- it's not like you can throw a soup of genes on the ground and out comes intelligence. What is machine learning doing, if not selecting, out of many possible parametrizations, the ones that are suited to understand vision or speech?


Isn't that a good thing? Benchmark defeats AI, AI defeats benchmark, new benchmark comes along, progress is made. How else would you measure success? Certainly not with old benchmarks that 10 different methods all score 99% accuracy on.


Perhaps it’s fair to say we will have achieved AGI when we run out of goalposts.


AGI won't bother convincing us. We don't care what animals in the zoo think.


> We don't care what animals in the zoo think.

Tangential to AGI, but don't we? Vegans seem to have quite a strong opinion on this assertion.


I spend a lot of time looking at the various primates and cuttlefish thinking very much about what they "think" and whether we could even conceptualize the self-awareness experience they seem to have.


It seems possible that we could eventually gain a better understanding of the intelligences of other species, but at this point most of our consideration of them is a fashioning of mirrors to better examine ourselves. This self-regard was the original purpose of zoos, and it still explains much of their existence.


It's not just intelligence, it's also speed. If you update your world model fast enough, eventually people just look like trees.


Good point! Or to paraphrase Mad Men…

Humanity: I don’t think your intelligence matches that of a human’s.

AI: I don’t think about you at all.


Gary Marcus is a notorious Goal Post Mover so this is no surprise coming from him.

Edit: Gwern has an extensive history with this so I'll let him do the talking.

https://old.reddit.com/r/TheMotte/comments/v8yyv6/somewhat_c...

Further Edits: Not to mention Scott Alexander, who has directly rebutted you numerous times. Or Yann LeCun. Not sure who exactly is backing down.

https://astralcodexten.substack.com/p/my-bet-ai-size-solves-...

https://astralcodexten.substack.com/p/somewhat-contra-marcus...

https://analyticsindiamag.com/yann-lecun-resumes-war-of-word...

Presumably you approach these arguments like Ben Shapiro and imagine you have "Dunked on the Deep Learning geeks with Facts and Logic."


Every time I ask someone to name a goal post that I have moved, they back down.

I have been pretty damn consistent since my 2001 book.


Prescilla

edit: Maybe I made a composition error. https://imgur.com/a/Q7hHduY


Someone owns a lot of TSLA.


I own 0 and if I were a gambling man I'd be short.


Personally, I haven't moved the goalposts a millimetre in 30 years, and I won't in future. When a computer does maths - not as a tool wielded by a human mathematician, but in its own right discovers/invents and proves significant new theorems, advancing some area of research mathematics - I'll take seriously the idea that we've reached AGI.

Maths in and of itself doesn't require any physical resources. It's possible that doing maths in practice requires knowledge of the world to extract some kind of product from (I'm skeptical, but it's possible), but in principle a rack mounted server could demonstrate its mathematical ability to the world with nothing more than the ability to send and receive messages.

This hasn't been done so far, not because there are obvious missing prerequisites, or because nobody's tried it, or because it has no value, or because there's a prohibitively high barrier to entry for people to have a go. It hasn't been done because nobody knows how to make a machine be a mathematician, and I've seen little evidence of any progress towards it.

That's my goalpost, always has been. Reach it and I'll be overjoyed. And FWIW, I strongly believe it can be reached. I don't see the latest round of ML (or any ML, really) as a step towards it, but I'd love to be proven wrong.

When I mention this someone always points at some bit of recent research, such as [1], but it's invariably just a new way for a human mathematician to make use of a computer. If anybody knows of any progress, or serious attempts, towards a true AI mathematician I'm very curious to know.

[1] https://www.nature.com/articles/s41586-021-04086-x


You might be out of the loop a bit.

https://dspace.mit.edu/handle/1721.1/132379.2

Is a well known project for an AI Physicist. There are plenty of other groups working on similar projects

>I don't see the latest round of ML (or any ML, really) as a step towards it, but I'd love to be proven wrong.

LLM models have been able to do basic math for quite a while now and some have been trained to solve differential equations, calculus problems, etc. Well on their way to more impressive capabilities.


I said "in its own right discovers/invents and proves significant new theorems, advancing some area of research mathematics".

Neither of the things you mention are of this nature, or working towards it. "Finding a symbolic expression that matches data from an unknown function" (Feynman) and "solv[ing] differential equations, calculus problems, etc" are not descriptions of what a research mathematician does.


>Neither of the things you mention are of this nature, or working towards it. "Finding a symbolic expression that matches data from an unknown function" (Feynman) and "solv[ing] differential equations, calculus problems, etc" are not descriptions of what a research mathematician does.

Never said they were but you said:

>It hasn't been done because nobody knows how to make a machine be a mathematician, and I've seen little evidence of any progress towards it."

Which I showed is not accurate. Certainly people have ideas on how to do it and are actively making progress towards that goal.

>Finding a symbolic expression that matches data from an unknown function" (Feynman) and "solving differential equations, calculus problems, etc" are not descriptions of what a research mathematician does.

All research mathematicians started out solving calculus problems and differential equations.

Why do you expect an AI to sprint before it's learned to crawl?


"Discovering/inventing and proving new theorems" is qualitatively different to the things you list. Computers have been able to calculate since they were invented, and there has certainly been plenty of progress on getting them to solve problems, but calculating and solving problems isn't what mathematics research is.

Ever since computers were invented there has been a hope that you could set up a system that would just churn out interesting new theorems. Indeed it was one of the primary motivations for the invention of the computer, but it hasn't materialised yet.

You clearly consider the progress on solving problems to be progress towards being able to do mathematical research. I don't think it is, any more than progress in, say, graphics is. But maybe I will turn out to be wrong and you will turn out to be right. We won't have the answer until the problem is solved and we have our wonderful machine churning out theorems.

But I think you will probably be able to agree that since mathematical research is something human minds are capable of it's something that an AGI should be capable of, i.e. if an AI approach is inherently incapable of it, it's not AGI. You may consider it an unnecessarily stringent requirement, in that there may be other, easier challenges that AIs can perform that will convince you that they are AGI. That's fine - you think about the problem differently to me, so you find different things persuasive. If you are convinced that a given AI is AGI, though, you shouldn't be too concerned about my particular goalpost given that your AGI should be able to achieve it (and convince me) pretty soon.

We'll see what happens. I'm just explaining what I would find convincing, and pointing out that contrary to the oft-repeated accusation that started this discussion, I for one have never once "moved the goalposts".


>But I think you will probably be able to agree that since mathematical research is something human minds are capable of it's something that an AGI should be capable of, i.e. if an AI approach is inherently incapable of it, it's not AGI.

Indeed. To be clear, I'm not saying I think any current system is remotely close to AGI. I just think that saying that no one is thinking about or making progress on a math research AI is inaccurate.


Feels to me like your test is really "do something on your own right", which is the hard, fluffy, sentient part, and then some additional guard rails: that it needs to be math for some reason.


The "do it on your own right" is actually the weaker part of it. It's somewhat ill-defined, and I could imagine some future instance where it's highly debatable whether the AI was working on its own or being used as a tool by a human. There aren't yet any cases where that's in question, though, so it's a hypothetical future debate. In any case, it's certainly not the meat of the test.

It has to be maths for a specific reason. I think it's in some sense the purest form of an ability distinctive to human minds and pervasive in how they work. As I mentioned, it's an ability that can be demonstrated in the absence of any particular physical capability, and yet despite it being perhaps the oldest goal of AI it may be the one we have made least progress towards.

Anyway that's my goalpost, and it's not moving. AGI, being "general", surely should be capable of this hitherto uniquely human activity. If our attempts so far are not capable of it, then clearly they are not "general". If you know of any evidence that my goalpost has been achieved, please let me know. I'm very eager to see it happen.


The reason is simple - in math, it means solving open problems.


That’s a different thing, since solving an existing open problem doesn’t mean inventing a new theorem.


Most humans are not capable of discovering or proving significant new theorems.


Most humans aren't capable of playing chess to grandmaster level either, or producing artwork in arbitrary styles, or remembering and distinguishing between millions of faces.

I'd settle for a demonstration that a computer has truly independently discovered/invented and proved some significant part of our existing mathematical edifice. This hasn't been achieved yet, either. However, I suspect that once we've figured out how to do this at all, surpassing human capabilities will be inevitable in a relatively short time. So I don't see much value in softening the test unless/until there's some actual candidate available that would pass the softer test.

The value in requiring genuinely new maths is that it makes it unlikely that knowledge of the result has been encoded in the algorithm or training set. Certainly, if GPT-3 were to output Euler's formula that wouldn't be at all convincing as a "discovery".


Not an example of complex mathematical reasoning, but aren't AlphaZero and its cousins evidence that ML can independently rediscover principles that humans have found, as well as discover new principles of its own? For example, LC0 plays for advantages that humans hadn't considered before in Chess.


Very good point. Apparently human players started studying AIs playing Go and chess to learn new techniques.


I'm not an expert in those algorithms, so... maybe? If so, maybe someone will successfully apply those ideas to the challenge I've described. I'd love to see it happen.



