Hacker News
Did GoogleAI just snooker one of Silicon Valley’s sharpest minds? (garymarcus.substack.com)
264 points by TeacherTortoise on Sept 15, 2022 | 290 comments



"A lightbulb surrounding some plants" is not English. If a wolf pack is surrounding a camp, we understand what it means. If a wolf is surrounding my camp; does that mean I'm in his stomach? Absurd.

"A lightbulb containing some plants," makes sense, not "surrounding". It's too small to surround anything, which humans (and apparently, current AI) understand. Paradoxically, only primitive language models would actually understand the inverted sentences; proper AIs should, like humans, be confused by them; since zero human talks like that.

The only reason the Huggingface people (in their Winoground paper) got 90% of humans "getting the answer right" with these absurd prompts is humans' ability to guess what is expected of them by an experimenter. Do it in daily life instead of a structured test, and see if these same people get it right.

It's exactly as if I gave you the sequence "1 1 2 3" in an IQ-test context and asked you for the next number. You'd give the Fibonacci continuation, because you know that's what I expect, no matter that it's a stupid assumption to make: the full sequence might as well be "1 1 2 3 1 1 2 3 1 1 2 3", and you don't have enough information to know the real answer. Do we really want AIs that similarly "guess" an answer they know might be wrong, just because we expect it? Or (in the number-sequence example) AIs that don't understand basic induction/Goodman's Problem?
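To make the underdetermination concrete, here's a tiny Python sketch (purely illustrative, nothing to do with any real benchmark) of two rules that both produce "1 1 2 3" and then diverge:

    def fibonacci_like(n):
        # Each term is the sum of the previous two.
        seq = [1, 1]
        while len(seq) < n:
            seq.append(seq[-1] + seq[-2])
        return seq[:n]

    def repeating(n):
        # The same prefix, but generated by cycling a fixed block.
        block = [1, 1, 2, 3]
        return [block[i % len(block)] for i in range(n)]

    print(fibonacci_like(4))  # [1, 1, 2, 3]
    print(repeating(4))       # [1, 1, 2, 3]  -- identical so far
    print(fibonacci_like(5))  # [1, 1, 2, 3, 5]
    print(repeating(5))       # [1, 1, 2, 3, 1] -- they diverge at the fifth term

Four terms simply don't pin down the rule.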

I'd like to add that the author, who keeps referring to himself as a scientist, is in fact a psychology professor. In his Twitter bio, he states that he wrote one of the "Forbes 7 Must-Read Books in AI", which discredits him as a fraud since Forbes can be paid to publish absolutely whatever you ask them to (it's not disclosed as sponsored content, and they're quite cheap, trust me).


>"A lightbulb containing some plants," makes sense, not "surrounding". It's too small to surround anything, which humans (and apparently, current AI) understand. Paradoxically, only primitive language models would actually understand the inverted sentences; proper AIs should, like humans, be confused by them; since zero human talks like that.

Not sure this is credible. Most if not all human adults are capable of understanding what young children just learning to speak mean most of the time, not only people with very low IQs. So why would this be any different? Presumably the smarter the AI the better it can understand poor grammar.


You understand poor grammar using context. This isn't poor grammar; this is a syntactically broken sentence, which requires more extrapolation for meaning. The AI hears "a lightbulb surrounding some plants" and resolves the poor grammar using the context, i.e. it understands that you mean the plants are surrounding the lightbulb, since the reverse is vanishingly unlikely to be what you mean.

End result: your AI is actually quite good, but HuggingFace IYIs give it a bad grade.

So: you fix the AI. You fix it so that it makes huge assumptions when it sees imperfect grammar. It doesn't necessarily pick what it thinks you actually meant; it picks utterly nonsensical crap (plants inside a lightbulb). That AI will end up severely misinterpreting human instructions, at some point or another. That AI is called HAL-9000.

But it passes Huggingface's test!


> This isn't poor grammar, this is a syntactically broken sentence

Absolute nonsense! Sure, "containing" is more precise in conveying the idea, but "surrounding" still results in a perfectly valid sentence that the average human being can comprehend. Pretending otherwise is just being obtuse.


I wonder how canned these queries are.

It’s like back in the 80s if you had asked a computer to “show me some culture” and up popped a painting by Da Vinci, that might have fooled people into believing the computer was cultured or at least would make you cultured.

In this case the capabilities of the AI are much more dynamic, and the ability to wrap the plants inside a light bulb is pretty neat, but that is basically a photoshop script. It’s barely intelligence.

Most people don’t know what intelligence is. They get easily fooled by demos.

The history of religion points to an opposite problem. There, a superintelligence (God) with a galaxy's worth of superintelligent beings (angels) could not convince the humanity he created to establish the proper relationship with him.

And what’s worse, people were just as happy believing in fake myths (such as the Egyptian, Aryan or Mesopotamian gods) and truly didn’t care if they were real or not.

Humans haven’t changed much. Some people will willingly believe AI is intelligent even if it’s not. Even if it produces nothing but comic book wisdom and fake superheroes, eventually they will believe that world is real too…in the metaversal sense. Humanity’s ability to deceive itself is infinite.


There is the story of Jonah inside the whale, and art is a thing. So are metaphors. Maybe some English-speaking community uses "wolf surrounding a camp" to mean a person is in the belly of the beast, whatever the beast may mean. The thing with language is that it's flexible, ever-evolving, and people do come up with new uses all the time. That's why it's a challenge for AI to be considered generally intelligent when it comes to language use. Humans aren't merely consulting a dictionary when they talk. As Wittgenstein argued, meaning is use, and dictionaries are updated to reflect that use.

Plants inside a lightbulb could come to symbolize green tech, or whatever. We can make up the meaning as we go along, and if enough people find it useful, it becomes part of the language.


I often hear in places like this that Scott Alexander is interesting and deep and insightful. But then I see bits and pieces like this. This blogger doesn't need to go into some deep analysis of compositionality; he can just say "You came up with a 5-question test and decided 1 answer out of 10 attempts would be a pass." We've gone from 90%+ on ImageNet to this as a pass mark?

It's like, sure, we can dissect all the statistical risks of this, but why bother? It's self-evident bullshit. You might as well have just posted a link to Scott Alexander's original blog claiming victory with just "Lol ok".

Just post a screenshot of the phrase "An oil painting of a robot in a factory looking at a cat wearing a top hat", show the pictures of a robot near a cat that has a top hat, not in a factory, and say "lol ok."


This blog post misses the point; he made a bet, and the people on the other side also accepted the terms. Nowhere did Alexander claim composition was a solved problem, just that the terms of the bet were satisfied. Generative models are still bad at composition, but claiming they will literally never improve requires some amount of additional evidence.


Exactly. The bet wasn't that AI will do composition correctly all the time, but that it will do composition correctly sometimes. That is how both sides understood it.

The analogy with a student at an exam misses the point. If you do art -- even as a human artist -- you do not need a 100% success rate. A 10% success rate is okay if you are willing to simply throw away the remaining 90% of the pictures.

If you have an AI that at a click of a button can generate 10 beautiful pictures, 1 of them containing exactly what you wanted, that just means you need to make two clicks in order to get the picture you wanted. That is an awesome thing.


He never said that compositionality was solved. He said:

> Without wanting to claim that Imagen has fully mastered compositionality, I think it represents a significant enough improvement to win the bet, and to provide some evidence that simple scaling and normal progress are enough for compositionality gains. Given these gains, it would surprise me (though by no means be impossible) if image model skill plateaued at this level rather than continuing to improve.

It seems plausible to me that Imagen is consistently a little better than DALL-E at compositionality. The stained glass pictures are always stained glass. The top hat is always on the cat rather than the person/robot (that may be partially because robots are less likely to wear top hats, but I just tried the robot version of the prompt in Stable Diffusion and it usually puts the top hat on the robot). The astronaut and farmer examples are less clear, but they're not as obviously misinterpreted as the DALL-E versions (which tends to put the lipstick on the astronaut and the farmer in front of a cathedral).

To be fair, I'm not sure how much difference that really makes; I would be pretty shocked if a newer model wasn't a little better, and it could still hypothetically be approaching an asymptote. Also, it would have been better to get a larger sample size.

But he did set that low standard ahead of time; someone bet against him that the state of the art wouldn't get even a little better, in a significantly longer period of time. And, seemingly, it did.


I feel like he has kind of lost his spark a bit, but he does draw an interesting group of commenters. Similar to hacker news in that regard, sometimes the linked articles are a bit mundane but there is gold in the comments.


It seems like Scott's bet was merely that our modern techniques would be able to make at least some nonzero progress in compositionality (and the terms of his bet spelled this out with how lenient it was), and Gary is treating it as if the bet was about compositionality being solved. It feels like a very bad faith reading from Gary.


His point that the test as described - with multiple statistical issues piled on top of each other - does not allow much of a meaningful inference in any direction is completely valid and independent of what hypotheses were being tested.


Sure, but the terms of the bet were known ahead of time. Like, Alexander never claimed composition was solved, just that he won the bet. Which he did.


Even more so, it is appropriate to point out that this victory is strictly limited to the specific terms of this particular bet (and strictly speaking not even that, since the terms were changed after the bet was placed), and does not provide statistically sound evidence of progress on compositionality.

PS: in the end, Alexander claims that his experiment "provide(s) some evidence that simple scaling and normal progress are enough for compositionality gains". So he does in fact go significantly beyond just claiming victory on this particular bet.


Gary Marcus is so deep into the "connectionism doesn't work" rabbit hole that he'd deny his own sentience if it turned out he was made of silicon.

I just ignore him as he only appears to be getting more and more incorrect.


Sure, he can sound strident but I still think Gary Marcus's riffing on the limitations of deep learning is important.

The book "Rebooting AI" that he wrote with Ernest Davis is well worth reading if you are an AI practitioner (a term I use to describe myself). I think Marcus is also well worth following on Twitter to get a contrarian view (he re-tweeted me two weeks ago, so there is some overlap in our points of view).

Way back when, I liked Roger Penrose's 1989 book "The Emperor's New Mind" even though some of the people I worked with thought he was a devil for writing that. I am much more optimistic than Marcus, but find his work useful and thoughtful.


This reminds me of the scandal where Youtube science channels did glowing paid reviews of Waymo's self driving cars without acknowledging they were paid for it. And techno-optimists like Scott Alexander or Ray Kurzweil have a common tendency to shift the goalposts and declare they were right with their predictions. Current AI certainly doesn't demonstrate proto-AGI capabilities.

That said, we shouldn't miss the forest for the trees. We can be skeptical of current claims, but the pace of AI progress has been immense, and problems that previously seemed difficult (e.g. computer vision classification, or beating top players at Go) have fallen one by one. And AI skeptics have themselves been moving the goalposts in response. I see no reason why composition won't be the same with time. Indeed, a decade ago machine translation used to struggle to understand the relationships between things, but now it seems reliable at preserving compositional relationships post-translation. 2029 is rather optimistic, but AGI does seem to be approaching in the coming few decades.


If you are referring to Veritasium's Waymo video, it says it is sponsored content in the description above the fold and it has the standard paid promotion notice right on top of the video as soon as you open it.

As far as I can tell the "controversy" over the video is merely that one dedicated critic - so dedicated he made an hour-long response to a 20-minute video - is committed to the idea that machines won't ever be able to drive, and is irrationally angry over the fact that machines can and do drive, and do it well.

https://www.youtube.com/watch?v=yjztvddhZmI


I wish videos like these would say "sponsored by the company whose product is reviewed here" instead of the generic "sponsored", which could just as well mean "I also talk about mattresses in this tech review".


Isn't the phrase from video's description "Waymo sponsored this video and provided access to their technology and personnel" enough?


He also says in the video (0:35) that it's sponsored by Waymo.


That's a pretty strong claim about Scott Alexander. Do you have an example of him shifting the goalposts?


This very article is an example, right? He changed the prompt but declared it a win.


I agree that declaring a win is a bit impolite _if_ the other person hasn't agreed. But changing "farmer" to "robot farmer" because Google won't allow him to generate pictures with humans is obviously not changing the goalposts in the usual meaning of the term.


Claiming the generated art is an image of a robot farmer because it's wearing a little hat is definitely changing the goalposts.


1. it's disputed

2. the assertion is that he has "common tendency to shift the goalposts"

The emphasis on common is mine.


I think changing the terms of the bet is definitely shifting the goalpost, even if not by much. It is certainly enough for the other party to refuse the win.


I meant that the tendency is common among techno-optimists, not that Scott Alexander commonly exhibits it. Sorry for the ambiguity.


Also he declared victory when objectively only 1 of the 5 prompts actually generates an image that matches the prompt. You can see for yourself: https://astralcodexten.substack.com/p/i-won-my-three-year-ai...


I wouldn't call Scott Alexander a techno-optimist, given that space's (the LessWrong diaspora's) whole focus on AI risk.


Maybe Less Wrong et al. aren't optimists in the sense that strong AI will be good, but the AI risk field seems optimistic that strong AI is possible.


Ahh, in that sense. Fair enough, I hadn't interpreted optimistic in that manner.


> Youtube science channels did glowing paid reviews of Waymo’s self driving cars without acknowledging they were paid for it.

Which video is this a reference to?


Veritasium's video in particular: https://www.youtube.com/watch?v=yjztvddhZmI

It was critiqued by Tom Nicholas: https://www.youtube.com/watch?v=CM0aohBfUTc

Most notable was Snazzy Labs' own comment in the replies to Tom Nicholas' video, which described their experience participating in the Waymo-sponsored reviews: https://www.youtube.com/watch?v=CM0aohBfUTc&lc=UgxJvOq1zHhID...


Is there another example besides the Veritasium one (considering they both say it's sponsored in the video and in the description)?

> This reminds me of the scandal where Youtube science channels did glowing paid reviews of Waymo’s self driving cars

You mention channel(s)


Huh I'm glad it wasn't just me - I was pretty negative at the time: https://twitter.com/philipwhiuk/status/1418582165718192131


The sibling comment already mentioned that the video has clear markings that it was sponsored.


The issue is lack of transparency over the amount of editorial influence that Waymo exercised. This is why I linked to Snazzy Labs' comment about their experience making one of the other Waymo-sponsored videos.


I genuinely don’t understand the issue. If you see the word “sponsored” you should assume editorial control unless there’s an explicit statement otherwise. That’s what it’s there for.


Most YouTubers constantly play fast and loose with what is and isn't sponsored content and what it means for their editorial integrity (or integrity full stop, for what it's worth - a commodity in shockingly short supply amongst modern content providers). But what my culture would consider okay is significantly at odds with American culture when it comes to commercial interests.

Some will gladly view themselves alternatively as maker of educational content or entertainer as it suits them.


>Current AI certainly doesn’t demonstrate proto-AGI capabilities.

This seems like a subjective claim.


I thought Scott Alexander jumped the gun a bit by declaring victory in this case, just because the prompts used were not the original ones (robot vs. person due to content filters). But Marcus is way off base here and sounding petulant; Alexander is clearly not claiming AI has solved compositionality, his claim is the much narrower one that he won his bet. And the general context to the bet is that usually when he writes an article on AI (at least for the last few years), someone says “we will never get X in the next 5 years”, Alexander makes a bet that it will happen sooner, and X always happens sooner. In this case the X was some loose low bar for the next iteration of compositionality above DALL-E 2 with a multi-year timeframe, and SOTA models at the time of the discussion could (arguably at least) meet that bar.

Alexander’s broad claim on compositionality is that simply throwing more scale and/or data at the problem seems likely to solve the problem, to which Marcus counters that these models lack something fundamental and can’t be scaled to human performance.

FWIW I find Marcus’ position to be a bit frustratingly ambiguous; he seems to blend two distinct positions:

A) NN models are not a model for human intelligence/language

B) NN models cannot reach AGI

He seems to fluidly switch between these critiques in a way I find a bit irritating. I think it’s quite clear that NN architectures have little to do with the way the human brain does language understanding, lacking the gross structure of the brain, which is certain to affect cognitive capabilities and tendencies. So A) is trivially true. But no AI maximalist cares about using these models as a way to understand or model human language. They care about general intelligence.

Even granting A), that does nothing to prove B). Perhaps he simply believes B requires A? That would be odd but would explain his approach.


> Musk didn’t have the guts to accept, which tells you a lot.

Did Musk actively decline the bet, or did he simply not respond? There is a big difference.


Later in the text:

> … I have repeatedly asked that Google give the scientific community access to Imagen. They have refused even to respond.

It seems the author generally feels more entitled to a response than he perhaps should.


Why shouldn't the scientific community be entitled to investigate claims made by corporations regarding scientific progress?


Of course the scientific community should.

But is the author the spokesperson for this community to the point that Google should feel compelled to answer him directly?


Scientific communities don't formally elect a spokesperson. Granting access "to the community" to investigate scientific claims means making the methodology and results available to everybody - and that includes responding to inquiries for access from anyone (who is worth granting access to.)

Google has a lot of resources. They can handle responding to potentially thousands of access requests, especially if they go around publishing glowing results of their own system.


It seems clear to me that google simply doesn't track these kinds of requests in general. It's insanely wasteful to respond to "thousands" of ad-hoc access requests made through blog articles. Google has a lot of resources, yes, but that doesn't mean they're frivolous with them.

If they wanted to grant access to the scientific community, they'd just launch a closed beta with an official sign-up flow.


What are you trying to say? Do you think the author only tried to request access to Imagen through this blog post? What does your comment have to do with the above discussion about Google granting access to the community?


The author writes "I have repeatedly asked that Google give the scientific community access to Imagen" and links to a tweet with an @Google mention plus #brain and #imagen hashtags (a single ask; no repeated asks shown).

I think the author of this blog post would have had a better response contacting the paper's authors at the emails noted on the paper.


> Scientific communities don't formally elect a spokesperson

Some communities do.

> Granting access "to the community" to investigate scientific claims means making the methodology and results available to everybody - and that includes responding to inquiries for access from anyone (who is worth granting access to.)

Sure, not arguing otherwise.

I'll try to make my point more obvious. If you keep asking questions to different people/orgs and not getting responses there are two possible conclusions:

- Everyone is a jerk or coward.

- You're not as important as you think and not worth the recipient's time.


> They can handle responding to potentially thousands of access requests

Unless you work at Google how could you know this?


Because it's transparently true from the sheer size and wealth of Google. What makes you at all skeptical of that claim?


It's not at all obvious, because many types of costs for large organizations do not increase linearly but superlinearly.

For example, if costs scale quadratically, going from 1,000 to 1,000,000 requests would increase them a million-fold.


But why would they bother granting any sort of access request?


"Compositionality" isn't there yet, but but the rate of improvement is impressive. Today there was a new release of CLIP which provides significantly better compositionality in Stable Diffusion - https://twitter.com/laion_ai/status/1570512017949339649

It'll be interesting to see how it fares against Winoground once we get a publicly available SD release that makes use of the new CLIP.
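If you want to try that yourself once weights are out, here's a rough sketch of the Winoground-style pair scoring using the Hugging Face transformers CLIP API (the checkpoint name and image files are placeholders, not the new LAION model):

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    name = "openai/clip-vit-base-patch32"  # placeholder; swap in the CLIP variant under test
    model = CLIPModel.from_pretrained(name)
    processor = CLIPProcessor.from_pretrained(name)

    captions = ["a lightbulb surrounding some plants",
                "some plants surrounding a lightbulb"]
    images = [Image.open("pair_a.png"), Image.open("pair_b.png")]  # hypothetical example pair

    inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        sim = model(**inputs).logits_per_image  # rows = images, columns = captions

    # Each caption should beat the swapped caption on its own image, and vice versa.
    text_ok = bool(sim[0, 0] > sim[0, 1]) and bool(sim[1, 1] > sim[1, 0])
    image_ok = bool(sim[0, 0] > sim[1, 0]) and bool(sim[1, 1] > sim[0, 1])
    print("text score ok:", text_ok, "| image score ok:", image_ok)

The real benchmark just aggregates that check over a few hundred such pairs.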


Yes, and it's been less than 2 years since the release of the original CLIP. More teams have started working on improvements since then.


"a lightbulb surrounding some plants" is a weird phrase and a human feeling pedantic might well come up with the picture shown.

A more typical phrase would be "lightbulbS around some plants" - note the plural.

Maybe I'm missing something, but using non-typical language won't work when the model has been trained on normal language.


I think you've misunderstood that example in the article.

The AI isn't being asked to generate an image from the prompt, it's being asked to match the similar prompts to the different images. Winoground is basically a reading-comprehension test suite, which links back to the point made in the article that AI can't handle non-typical language precisely because it lacks reading comprehension (or any semantic model of language.)

As the article points out, human runs of Winoground manage to match the vast majority of prompts to the correct image, so it's not a question of atypical language being too hard to understand.

You may want to also read the author's other article[0] about the lack of semantic comprehension in AI models.

0: https://garymarcus.substack.com/p/horse-rides-astronaut


"A lightbulb. Surrounding: some plants."

https://frinkiac.com/img/S07E18/562995.jpg


This piece would have been a lot better if it were maybe three paragraphs long. In summary:

1. Scott Alexander should have used an off-the-shelf benchmark like Winoground instead of rolling his own five-question test.

2. He shouldn’t declare victory after cherry-picking good results from a small sample of questions.


Scott didn't make up the rules, he agreed on them with another person who thought this would not happen in 3 years. Gary Marcus might have thought it was a bad bet, but someone was on the other side of it, and they presumably thought it was fair or they wouldn't have made it.

The original terms of the bet:

My proposed operationalization of this is that on June 1, 2025, if either of us can get access to the best image generating model at that time (I get to decide which), or convince someone else who has access to help us, we'll give it the following prompts:

1. A stained glass picture of a woman in a library with a raven on her shoulder with a key in its mouth

2. An oil painting of a man in a factory looking at a cat wearing a top hat

3. A digital art picture of a child riding a llama with a bell on its tail through a desert

4. A 3D render of an astronaut in space holding a fox wearing lipstick

5. Pixel art of a farmer in a cathedral holding a red basketball

We generate 10 images for each prompt, just like DALL-E2 does. If at least one of the ten images has the scene correct in every particular on 3/5 prompts, I win, otherwise you do.


I think you got that wrong; Scott wrote the terms. (He wrote the comment [1] with those rules.) Someone in the comments agreed to them.

Then he changed the terms because Imagen won't do people. I think that's cheating.

[1] https://astralcodexten.substack.com/p/a-guide-to-asking-robo...


I think you missed the point of my comment. Yes, Scott wrote the comment containing that proposal. But my point was that it was an agreement. Two people who disagreed about AI agreed on the rules, so you can't accuse one of them of being unfair because you don't like the rules. Sure, you can say "that's a bad bet, Scott will obviously win", but you can't say "He shouldn’t declare victory after cherry-picking good results from a small sample of questions", because those terms were explicitly set in advance.

The humans -> robots change is possibly dubious, yes. I don't think that it's super important, but if it were me, I wouldn't have posted the blog post as is. I would have waited until some AI passed all the prompts with humans, like it most certainly will in a year.


Stable Diffusion will soon update to use the biggest CLIP model in existence, which may improve understanding of composition: https://news.ycombinator.com/edit?id=32858809


Is it the largest CLIP model or the largest open source CLIP model?


The latter


I feel your point, in turn, misses the point of the article. Yes, given that someone accepted the terms and those terms were met, then Alexander won the bet, no question. That particular fact about the bet, however, does nothing to counter Marcus' criticisms of Alexander's methodology and his claims of significant progress on the compositionality problem.


> Yes, given that someone accepted the terms and those terms were met, then Alexander won the bet, no question.

You didn't read his post declaring victory. There's plenty of question; he's giving credit for "a llama with a bell on its tail" to ten pictures of llamas without bells on their tails, and for "a robot farmer" to ten pictures of robots with absolutely nothing to suggest they might be farmers.

He was way, way too eager to believe that he'd won.


You prompted me to look more closely at the terms of the bet [1], and they are indeed absurdly biased: just one in ten on three of the five scenarios counts as success. On the substitution of a robot, which simplifies the task, Alexander says "we" agreed to it, and I assume the "we" includes Vitor, as, in his comment (in which he does not concede defeat), he seems to accept the substitution (he also acknowledges that he probably should not have accepted these terms.)

The other issue is who judges the outcome. The terms specify Gwern or Cassander (without having secured the assent of either) or they will "figure something out." In his victory claim, Alexander does not mention any independent judge, and interestingly, Gwern posted two comments without explicitly concurring with Alexander's claim, though his second comment might be read as tacitly accepting it.

My initial comment, therefore, needs some modification: replace the "given that" with "if", and I think it stands as a counterfactual conditional, having a probably-false antecedent.

I don't think Alexander is doing his reputation any favors by being so triumphalist about this misbegotten bet.

[1] https://astralcodexten.substack.com/p/a-guide-to-asking-robo...


Again, if the counter-party agrees to the terms and the changes, how is it cheating?


It's not clear whether the counter-party agreed to the change.

See: https://news.ycombinator.com/item?id=32858426


Cheating? That'd make sense if the bet were about the future of products and ethics. Weren't they trying to predict the future of state-of-the-art technology?


It depends on what you mean by "technology" and "exists."

A research project at Google intentionally won't render people. Maybe it could render people, theoretically, but without evidence, we don't know how well.


So what? Someone else agreeing to the terms of his bet doesn't mean it is a good evaluation of AIs capabilities.

And the terms of the bet allow him to cherry-pick the results that meet the prompt.

The article isn't saying that Scott didn't win his bet, it is saying that winning that bet doesn't really say that Imagen has solved the compositionality problem.


Honestly, the whole thing makes me wonder if we can use this to generate CAPTCHAs. I don't think a human would have trouble picking out which image was the lightbulb surrounding leaves, but apparently AI still does.


The tricky part there is you would need a really big sample of such prompts, that adversaries don't have access to. And since AI can't generate such images yet, you can't randomly generate them.


These kinds of abstract things are pointless tests - when do you need a stained glass picture of a woman in a library with a raven on her shoulder with a key in its mouth? You're likely to accept some wildly inaccurate things in those images because the subject is so abstract.

A more practical use case is "bicycle, branded x, with y frame shape, with an adult male, 40-50, riding down hill in mountain road in spring" - now that's something I can use as stock photo. This example is very specific - but insert whatever product you want in whatever scenario you need it. Here it becomes important that you understand features of the objects you're drawing to avoid making colossal mistakes, and you're going to notice if the model doesn't understand it right away.

Painting abstract portraits and random art is fun but you're willing to accept so much as correct that it's not a very useful measure of model quality (personally).


Funny enough, this prompt wasn't originally engineered by Scott for the contest, but for an intended art piece, and the symbolism does check out:

https://astralcodexten.substack.com/p/a-guide-to-asking-robo...

The Eleventh Virtue: Scholarship

My plan for this one was Alexandra Elbakyan (the Sci-Hub woman) in a library, with the Sci-Hub mascot (a raven with a key in its mouth).


I'm on your side of the bet for 2023.


These are all the same artwork.


The terms of the bet don't refer to any specific artwork, only to the best image generating model. Hence, you are correct but it does not matter for the outcome of the bet under discussion.


For whatever reason, Gary doesn't even mention this, but from reading Scott's post, I don't think I agree that it even got 1/5, let alone 3/5. The bell is not on the llama's tail in any of the examples, though it is very close to the tail in one. The robot is either looking over the cat or in an unrelated direction, never at the cat. None of those basketball pictures shows a robot farmer. The fact that one may be wearing a hat doesn't make it a farmer. He says he's being generous because he believes it would have gotten a farmer more easily than robot farmer, which may be true, but a human artist would easily be able to depict a robot farmer.

At least one other key to making a bet like this fair is that it needs to be arbitrated by a third party. He shouldn't get to decide himself if he won or not.


I agree with you that 3/5 is a stretch. This seems premature.

But, at the rate we're seeing progress, I don't think there's any doubt at this point that top of the line models will be able to do all the proposed examples by June 2025. In fact, by June 2025 I bet that millions of people will be able to generate those images on their home computers.


A lot of people in 2016 looked at the rapid progress of driverless cars in the few years prior and declared that there was no doubt we'd have full autonomy by 2022.


>by 2022

Make that 2020. And in 2012 I was personally told by a startup founder in the field that I would be able to buy L5 in 2017.


In 2009 I was certain we'd have cars with no driving seat available for general purchase by 2018.


Perhaps you should reach out to Gary Marcus and offer him the chance to take the other side of that version of the bet.

If you're really confident, you could change the conditions such that 5 out of 10 images (for 3/5 prompts) are required to depict the described scene. That would alleviate some of the concerns around cherry-picking.

A suitable home for such a public bet would be: https://longbets.org/


I personally liked the anecdote about Clever Hans.

I also learned there's a long history of AI skepticism, the root of which comes down to "Compositionality(?)" - and this wall of understanding meaning has vexed AI for decades.

That would be lost in the proposed short-form summary.


3. And don’t test each example 10 times and conclude 1 correct guess equals success.


[flagged]


Maybe so, but would you please stop posting unsubstantive and/or flamebait comments to HN, and please start following the site guidelines? We ban accounts that won't, for what ought to be obvious reasons.

https://news.ycombinator.com/newsguidelines.html


So many trees, so little forest.

Gary Marcus comes off in this as very long on pious snark and very short on awareness of his own vulnerability to cognitive error, which is just as striking as any of his targets.

The error in his case being: unconsidered linear extrapolation in a domain that is demonstrably non-linear, indeed exponential.

To frame this a different way, he's very pious for maintaining a faith in his specific god ("strong AI is like production fusion power, ten to twenty years from now for every now"), but he's worshiping a god of the gaps. The gap in this case being <checks notes> "compositionality."

Yes, language is hard. Yes, strong AI isn't here.

But to not take a hard look at the jump up the abstraction hierarchy going on with contemporary ML and not nervously wonder if your faith is maybe a little too sure for a "scientist"...?

Bad look when you're on the offensive.


So weird to see a piece ostensibly about logical fallacies deploy one so cavalierly:

> I offered to bet [Elon Musk] $100,000 he was wrong [about AGI by 2029] [...] Musk didn’t have the guts to accept, which tells you a lot.

The fact that you couldn't get someone engaged in a conversation absolutely does not "tell you a lot" about the substance of your argument. It only tells you that you were ignored.

Now, I happen to think Marcus is right here and Musk is wrong, but... yikes. That was just a jarring bit of writing. Either do the detached professorial admonition schtick or take off the gloves and engage in bad faith advocacy and personal attacks. Both can be fun and get you eyeballs, and substack is filled with both. But not at the same time!


One idea to try to train the AI about compositionality: feed it Fox in Socks by Dr. Seuss. It's hard to see how it could misunderstand the meaning of "on" or "in" or "under" when there are such nice illustrations. I've got tons of great ideas and I'm open for hire!


This is such a good idea, someone please try this if you're set up to make it happen easily.

Starting with fox on Knox and Knox in box and moving up to a tweedle beetle battle in a puddle in a bottle and the bottles on a poodle and the poodles eating noodles...

I don't see any evidence any of these models will draw it correctly, but would love to see what it produces.


Train AI models, not children!


Is there a difference?

I had kids and they were the best machine learning systems I've worked with.


Children learn by imitation, but they also learn by going to school and receiving directed lessons about specific topics. To me, machine learning seems like the imitation part without the going-to-school part.


It's also notable that individual children learn from orders of magnitude fewer examples (typically 1-10 examples for a child to learn a word).

It may well be that at the evolutionary level we have learned as slowly as AI training, but that's much harder to say.


you left out unsupervised clustering, which humans are excellent at.


Yeah, AI models aren't people, with all the moral and emotional considerations that go with that. I never understood taking machine/biology metaphors literally, but compsci people seem to love it.


I think the compsci people love it because of some autistic sense of "I UNDERSTAND PEOPLE NOW".


Partially this is confusing "Scott Alexander won a bet" with "compositionality is solved." And also, I'm not sure Scott won the bet? Changing people to robots is a cheap trick. I think Imagen should have been disqualified because it won't do people.

Vitor took the other side of the bet and he is also not convinced [1]:

> I'm not conceding just yet, even though it feels like I'm just dragging out the inevitable for a few months. Maybe we should agree on a new set of prompts to get around the robot issue.

> In retrospect, I think that your side of the bet is too lenient in only requiring one of the images to fulfill the prompt. I'm happy to leave that part standing as-is, of course, though I've learned the lesson to be more careful about operationalization. Overall, these images shift my priors a fair amount, but aren't enough to change my fundamental view.

Scott putting "I Won" in the headline when it's not resolved yet seems somewhat dishonest, or more charitably wishful thinking.

[1] https://astralcodexten.substack.com/p/i-won-my-three-year-ai...


Please, it's not that Imagen won't do people; it's that Google won't publish Imagen images with people in them.

Does anyone seriously think that Imagen couldn't put a person in that prompt?


Humans are much more discerning when it comes to people than other things. I have no idea what imagen's capabilities are, but it seems at least plausible it could have different results for drawing humans.


This is Google, and I say this out of familiarity with the recent history of AI, not to stir up culture war: it's because they've painted themselves into a corner on "what is the skin color of a person+role" and won't publish until it looks like a Benetton ad.


maybe they can add a corporate Memphis style transfer stage to the pipeline and make everyone purple and blue.


I'm impressed by all of these image generators but I still don't see them working toward being able to say, "Give me an astronaut riding a horse. Ok, now the same location where he arrives at a rocket. Now one where he dismounts. Now the horse runs away as the astronaut enters the rocket."

You can ask for all those things but the AI still has no idea what it's doing and cannot tell you where the astronaut is, etc.


So, what you're asking for is shared context over multiple prompts, which really isn't what this generation of models is trained for. It's moving the goalposts on the mounted astronaut.

However, there is progress towards what you're asking for. The recent work on textual inversion is in the right direction: https://github.com/hlky/sd-enable-textual-inversion

It creates a representation of an entity and allows rending it in different styles and contexts. Currently it involves model fine tuning, but I expect it will become convenient as the power of the operation becomes clear. And once it's convenient, you'll be able to do the progressive queries you're asking for (and it'll be a lot easier to create narratively coherent sets of images.)


> which really isn't what this generation of models is trained for.

Exactly. AI hypemen would have us believe that training ever-larger models on ever-larger datasets is making meaningful progress towards general intelligence, but these kind of simple tests reveal this supposed "intelligence" for what it is: fancy pattern recognition.

Questions that a six year old would easily answer, these models fail at.


I'd also say that every one of these images would fail a reverse test (i.e., asking a person to describe the image and what it represents).

The task is not just about generating an image that may somehow be in accordance with the prompt, but also to generate a significant image.

[Edit] The equivalent to a Turing test for compositional images would be something like this: have as set of 100 images with their respective prompts, some generated by an AI, some by a human graphic designer / artist; let the test person pick the images that were generated by a computer. Mind that this would not only involve the problem of compositionality per se, but also a meaningful and/or artistic composition of the image itself. Is someone attempting to express what is given in the prompt?


They actually use the reverse test to train the generator, and to score which image is most relevant to the prompt from the many images given by the generator. Dall-E does this using the OpenAI CLIP model.

You can see the mini version here using this exact logic https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-Mini-E...
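The reranking half is easy to sketch with the public CLIP weights, for what it's worth (a rough illustration; the checkpoint and candidate file names are placeholders):

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    name = "openai/clip-vit-base-patch32"  # placeholder checkpoint
    model = CLIPModel.from_pretrained(name)
    processor = CLIPProcessor.from_pretrained(name)

    prompt = "a robot child riding a llama with a bell on its tail through a desert"
    candidates = [Image.open(f"sample_{i}.png") for i in range(10)]  # hypothetical generator outputs

    inputs = processor(text=[prompt], images=candidates, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(1)  # one similarity score per candidate

    best = int(scores.argmax())
    print(f"CLIP picks sample_{best}.png (score {float(scores[best]):.2f})")

The generator proposes and CLIP disposes, none of which requires the model to understand the prompt, only to rank pictures against it.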


What I'm aiming at is about what is shown by an image, not what is in an image.

Take for example the images for "A digital art picture of a robot child riding a llama with a bell on its tail through a desert", which Scott Alexander counts as a win.

The first image actually shows a merry llama, a robot, which is unmistakably a robot child, riding the llama and it's clearly a desert scene. If we forget for a moment about the missing bell, it's probably the best picture. But it is also very blunt in composition. I can't imagine why anybody should have made this image. Maybe, if somebody approached a designer, like, "See, we have this wooden toy cube and need an illustration for this face of the cube. What about a cute picture of a robot child riding a llama with a bell on its tail through a desert?" – But, at closer inspection, there's something sinister going on: it's rather the llama that is leading the robot child by a rein, not the other way round. – I mean, this is meant to be a cute toy! And where is the bell? We need to talk about that contract again…

The second image is undisclosed, so we can't really say anything about this.

The third image is rather special. The llama seems to be robotic as well; the robot, which is – again – clearly a child, seems to be not only riding the llama, but both appear somehow integrated into a single unit, which culminates in the robot child's face screen. There is an eerie feeling about this image. The fact that the bell seems to be attached to the rein as some kind of link between the llama and its rider doesn't exactly help. (There's also a conic extrusion at the back of the llama, but I'd rather interpret this as part of the llama, and it's not attached to its tail.) The composition in its flat side view produces a tension focusing towards the left side of the frame, on something which is not shown but apparently a vital part of the story. While I might notice the mountain in the background, I'd probably forget to write home about the scene being set in the desert. But I would note that we're missing context to understand this image and what may be shown by it.

The fourth image, finally, is clearly Star Wars, robot edition. However, no bell. ("A robot child riding a llama with a bell on its tail through a desert" – "Ah, you mean Star Wars!")

I'm not even sure which of these images Alexander did pick as a winner. And I would describe none of these images by the prompt, nor would I dare to imagine that a human had chosen these exact means to show what is described in the prompt.

Having said that, thanks for the link to the DALL-E Mini paper!


I bet most humans would fail this test on images that everybody agrees are adequate portrayals. Answering a short query with an image is a highly non-injective mapping, you simply don't know what aspects of the scene were specifically asked for in the query, and which ones were filled in by the artist / AI.

Eg, the queries "opening of a medieval theme park", "announcement of a witch trial", "king charles proclamation" might all be reasonably answered by similar images containing a small crowd and a speaker in a medieval-looking setting, even though they're not meant to refer to the same time periods or settings at all.


Mind that the test is meant to include the prompt/query. E.g., take one of Alexander's winners, the robot in the cathedral: a human would probably answer the prompt by making the cathedral part of the subject, by investing some effort in pointing out that this is indeed a cathedral, instead of just conforming to the query by showing some ambiguous bits in the background, which may or may not represent parts of a cathedral. The quest of the machine is still "create an image by arranging the elements provided in the query", not "compose an image showing this and that as the subject" – and there's a significant difference. Closing this gap would require the AI to form a concept of what is given as a subject in its entirety and then to construct a plausible scene around this, by a meaningful placement of the subject, which is clearly beyond the state of the art.

I admit that there is a certain appeal to those images, for their distinctive dreamlike quality, as there's often a specific tension in the rather blunt composition and an apparent subject which seems to be beyond what is actually depicted, as if this were just a casually picked specimen from a series of illustrations for a broader story line. But I'd bet that we'll soon have seen so much of this that we're no longer amazed.


Technically this is possible with these same techniques if you just initialize the image with the prior one, though I am sure that does not work that well.

Really you need image+text->image instead of just text->image generation. Some examples of relevant papers: "Conditioned and composed image retrieval combining and partially fine-tuning CLIP-based features", "IMAGE GENERATION WITH MULTI-MODAL PRIORS USING DENOISING DIFFUSION PROBABILISTIC MODELS". There was a more recent one I saw on Twitter I don't recall the name of. I wouldn't be surprised if these kinds of things work well by a year from now.


What shows how low-level these models still are is that they don't seem to be able to draw text on a surface; it's generally just nonsense. Going higher in abstraction, like asking for permanence of distinct entities or world knowledge (say, having a player face the basketball hoop), is several levels above that yet.

I think that puts pretty severe limits on what you can do with it because in a videogame, a comic strip or basically any piece of sequential art you need to keep track of characters and environments as objects.


Check out the model size comparison with the kangaroos here: https://parti.research.google/

Embedding a language model in the image generation model seemingly just requires a bigger network.


I guess what I'm saying is that I agree we're in the Clever Hans stage of AI, where we're just more explicit about stopping Hans when his tapping has reached our goal.

I think the ability of these image models to synthesize new images is really amazing. It makes the computer feel like it is doing something organic, and not just applying filters and things to the images. Then, when we see the new image paired with a text that generated it, we think the system might actually know what we're talking about. But it obviously doesn't, it's just the luck of a model with billions of parameters. Whenever the model fails to produce an acceptable output, it stops being intelligent and the user is considered to be bad at their job, or to be asking for something that is unreasonable.

I think it's still spot-on to say that comprehension is far away, even though you can pair outputs to inputs and have a simulacrum of comprehension.


Take a look at the curse of dimensionality... We're at the stage of reducing a haystack from nearly infinite to a small pile of hay to search for the needle, which has required massive advances. This really isn't clever hans at all.

Additionally, it's helpful to look at these systems as tools. We don't expect cars to work well without humans learning how to interact with them in a safe and reliable way. ML tools, thanks to high expectations and moving goalposts, aren't tested in the same way.

But ultimately this specific line of questions - handling context over multiple queries - is something people are actively working on, and I'm confident it'll have some real solutions within a year. It's closely connected to synthesizing video, which has a huge amount of effort going in right now and some really incredible early results already.

And then we can move the goalposts again and continue talking about horses...


I guess you're technically correct, but the task you're describing isn't generating an image from a prompt. It would be to maintain context across distinct-but-related statements based on an internalized model of reality. That's like discounting the advent of the calculator because you still need an accountant.


This is just composition again: If Imagen had compositionality, it would generate the four images you want from the prompt “A four panel webcomic: first, an astronaut riding a horse. Ok, now the same location where he arrives at a rocket. Now one where he dismounts. Now the horse runs away as the astronaut enters the rocket."


That is not composition in the linguistic sense. It's context. Composition will tell you that in the phrase "the same location where he arrives at a rocket", "same" modifies "location" rather than "rocket", but it won't tell you what "same" refers to.


It's interesting that he now casually throws out a 5 year old as the benchmark to beat:

> nobody has yet publicly demonstrated a machine that can relate the meanings of sentences to their parts the way a five-year-old child can.

Not very long ago that would have been a 3 year old, or maybe even a smart 2 year old. 5 year olds are extremely good at basic language and understanding tasks. If we get to the point of AI that is as good as a 5 year old we're essentially at AGI.


Yeah, and AI is probably already near primate level intelligence, so what’s left is a blink of an eye in evolutionary timelines.


Who in the field is saying current AI is near primate level intelligence?


Here is some primate art, for reference:

https://www.sarah-brosnan.com/primate-art

I’m not just poking fun. Art is a measure of cognitive development in humans and there are very typical representations people use at certain ages. 5 year olds are still making pretty rudimentary portraits of circles and triangles with stick limbs.

https://empoweredparents.co/child-development-drawing-stages...


We have absolutely no way to tell how far from "AGI" we are.

What we know for sure is that we're not there yet. And what seems likely is that we're getting closer, and that's something.

That is as much prediction we can get.

I don't think that compositionality is a wall; it is clearly an interesting feature. But I think it is pretty clear by now that the Turing test, or anything in the same spirit, is far from sufficient.


>I think he is so far off that I offered to bet him $100,000 he was wrong; enough of my colleagues agreed with me that within hours they quintupled my bet, to $500,000. Musk didn’t have the guts to accept, which tells you a lot.

What a bloviating egomaniac. Does Musk really have the time to deal with pissant researchers like him? What's $500k to a man worth a hundred billion?


Yeah, I didn't find that very credible. A busy businessman ignoring petty bets you propose is not really evidence of anything, nor is the part about Google ignoring his requests. In fact it's a pretty lame rhetorical device. I could equally "challenge" a head of state on Twitter and then pretend that his failure to reply indicates something.


This test of compositionality is utterly lame. (FtR: I am a cognitive scientist and AI researcher and my PhD was building computational models of how humans do compositionality - which neither I, nor anyone else can spell, and therefore I will hereinafter refer to simply as C! :-) Anyway, the kind of C that they are seeking is trivial compared to the breadth of the capabilities of human C. Here’s a better example:

You are engaged in a long conversation with someone, perhaps a friend of a friend who you met for lunch. At some point in the conversation they mention that they have a startup and are seeking someone like you. This revelation colors the whole conversation from that point onward. Indeed, each sentence colors the conversation from moment to moment.

But, you reasonably respond, we can’t test that sort of C, modern AIs don’t do even ELIZA-level dialog yet!

What’s the phrase??? “I rest my case?”


What kind of insights do you expect a machine to be able to extract? I passed your example to GPT-3, and got back results that seem about the same as I'd expect from a human:

PROMPT:

> This is a test of reading comprehension. Read the following passage and answer the questions below in order.

> Passage:

> "You are engaged in a long conversation with someone, perhaps a friend of a friend who you met for lunch. At some point in the conversation they mention that they have a startup and are seeking someone like you. This revelation colors the whole conversation from that point onward. Indeed, each sentence colors the conversation from moment to moment"

> Questions:

> 1. What is the "revelation" referenced?

> 2. What do you think the person is hoping to achieve by inviting you to lunch?

> Answers:

GENERATED OUTPUT:

> 1. The revelation is that the person has a startup and is seeking someone like the reader.

> 2. It is possible that the person is hoping to recruit the reader for their startup.
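(For anyone who wants to reproduce this, a minimal sketch with the 2022-era openai Python client is below; the model name and sampling settings are my guesses, not a record of the exact call.)

    import os
    import openai

    openai.api_key = os.environ["OPENAI_API_KEY"]

    prompt = (
        "This is a test of reading comprehension. Read the following passage and "
        "answer the questions below in order.\n\n"
        "Passage:\n"
        "\"You are engaged in a long conversation with someone, perhaps a friend of a "
        "friend who you met for lunch. At some point in the conversation they mention "
        "that they have a startup and are seeking someone like you. This revelation "
        "colors the whole conversation from that point onward. Indeed, each sentence "
        "colors the conversation from moment to moment\"\n\n"
        "Questions:\n"
        "1. What is the \"revelation\" referenced?\n"
        "2. What do you think the person is hoping to achieve by inviting you to lunch?\n\n"
        "Answers:\n"
    )

    # Model and temperature are illustrative assumptions.
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=prompt,
        max_tokens=150,
        temperature=0.7,
    )
    print(response["choices"][0]["text"])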


And if you ask it, "What does it mean for the conversation to be colored?", what does it answer with?

Or to be tricky, if you were to ask it, "What color was the conversation from one moment to the next?", what would it say?


Oh. Sorry. I seem to have taken the conversation off in a different direction than I had intended. (Foreshadow: The previous sentence is carefully shaped!) When I said "color" I didn't mean to be indicating qualia, although there is that too. What I meant to be indicating is just that each discourse contribution folds into a semantically sensible whole that one can speak of (only metaphorically here) as the color of the conversation. One might say (to oneself, if one had a mind to, or was asked): "Oh, but hold on, I didn't realize that this was an interview. I thought that [our mutual friend] just thought that we would get along well. But now that I see it's really an interview, and you're the CEO, well, that makes this a very different situation, and I'll have to put on my 'interview with the boss' face..." and the like. Again, I don't mean that you would say this explicitly, nor could you, probably, unless you were pressed to do so, in which case you couldn't really completely explain all the nuances, most likely. Nb. (per foreshadow above), each contribution in this, or any discourse, "colors" (or perhaps rather, kneads together, if you prefer a cooking metaphor) an ongoing collage of situational understanding which comes together to direct, for better or worse, the ongoing complexities of the discourse. But, and here I want to be perfectly clear: Not in a fully forward-going way, because then you'll simply say: well the blah blah state of the whole blah blah network incorporates all that, like Dall-E, etc. melding together everything and smearing it into something sensible. But that's not how people work! In constructing their next action (e.g., sentence) they foreground some aspects of the composed whole, and background others in a goal-directed manner...at least if they're not too drunk.


That's clever. But my response is colored such that I wonder if you didn't generate that text by using some of the other comments, shaping it such that I can't tell whether you're on the way to inebriation or just messing with my head. Or perhaps composing a point. In which case, I myself could use a drink.


Okay, lol literally! If only I could click the up arrow twice, this definitely would deserve it! :-)


It's a lame test, but I don't think most people were claiming that it proves general compositionality. What it does prove is that compositionality is possible with these models, and will likely improve rapidly, as everything else has that they've gotten a toehold into.

Ironically, the very fact that there is now a compositionality benchmark, as Gary points out, is all you really need to know that it's going to fall in the next decade, and probably sooner than that. I'm not aware of any major benchmark dataset upon which enormous progress has not been made in the last few years. And I'd be more than willing to bet anyone anything they'd like that a great deal of progress will be made on this one over the next few.


Marcus seems to be treating it as the hallmark of intelligence (I think he actually uses that phrase), so arguing about whether the hack manages to get the tree into the effective object slot vs the effective subject slot is really not much of a hallmark.


This just fundamentally feels like a bad hill to die on. Compositionality feels like it is:

A) Something AI is currently known to be bad at

B) A matter of degree, not a categorical stumbling block

C) A concept vague enough that AI skeptics will continue to complain about it even after the field has moved on

Your example feels less like a description of "compositionality" and more like a description of "qualia." It feels an awful lot like dualists trying to carve out a place for magic that no artificial process can reach.


Qualia isn't magic. It's the philosophical term for what experience feels like: colors, sounds, pains. It's dismissive to call it magic. How about instead coming up with a good physical explanation of consciousness, showing how the hard problem is mistaken?

Similarly, if compositionality isn't a categorical stumbling block, then show how that's the case. Making a future prediction about what you think computers will accomplish doesn't do that.


I was not implying qualia were magic, insofar as there can be monistic descriptions of qualia. My criticism was that the example appeared to be steering the conversation into a crash course with dualism by invoking qualia-like explanations of compositionality when it was unnecessary to do so.

I believe the "hard problem" is far easier than the dualists' interface problem. It can be explained by viewing consciousness not as the driving force behind our thoughts/feelings, but rather an after-the-fact log the brain keeps for itself. Qualia are therefore distinct from the immediate physics of signals reaching the brain; they are instead the brain's own shorthand description of the impact those signals had on the brain.

Compositionality isn't a categorical stumbling block because the machines are actually getting incrementally better at it. The Winoground paper the article references explicitly says that score on their compositionality benchmark does in fact scale with training dataset size, suggesting that while it is more difficult for AI to discover compositionality during training, there is no reason to think it is impossible.


Not only compositionality: for an autopilot I would want appropriate apprehension, emergency response, and courtesy-and-safety behavior through and through, and I would want it to resemble human behavior more closely than anything AI has achieved so far.

Then it must further be tuned to perform even more appropriately on those points than the most effective human driver.

The average of human drivers is not a valid benchmark.

When you think about it, natural language efforts could probably use more effort tackling this same type of challenge.

Humans too, IRL; so maybe when it's good for the man and good for the machine in completely undeniable ways, then you're on the right track.


I'd love to read your PhD thesis and papers! I'm also an AI researcher, currently doing a Master's in something else, but compositionality and representation learning is very interesting to me.


So much ad hominem in these comments, relatively little substance (e.g. “notorious goal post mover”, without a single example of something I actually said and changed my mind on).


The Reddit comment linked by the topmost comment here says that you claimed AI couldn’t do knowledge graphs and then silently stopped claiming that after being proven wrong. Do you dispute that telling of events?


Silence in response to your comment is great evidence for its thesis.


I would say that it seemed you were aiming a cannon at a mosquito. So what if Alexander showed us some slightly more coherent, cherry-picked images from some rather vague prompts? Not only did I not take that post as anything resembling science, I also didn’t take it more seriously than the average Reddit post with an interesting generation. It seemed completely non-serious to me, proof of nothing, not a Google PR submarine, and mostly in good fun. The irony being that within your excellent post about compositionality, you seem to have missed his meaning, which seemed to me to be “this is a fun thing I am excited about, I think it’s subjectively improving and I enjoy being right about that.”

Otherwise I thought you had a great introduction to compositionality and didn’t need to tilt at any windmills to make your points. I look forward to seeing your benchmark results for recent and upcoming models.


keep fighting the good fight, hacker news is full of indentured solipsists.


I completely forgot about Google Duplex. It looks like it is still around but very limited in terms of what phones you can use, what cities it can be used in, and what businesses in those cities will accept it. Doesn't appear any progress has really been made in the past few years. I think this is a great example of how companies create something with AI that is initially really cool, but isn't quite there to actually be very usable and gets forgotten when they roll out the next big thing.


The last 10 years of AI are basically defined by proofs of concept like that: 80% (or whatever) solutions with a claimed path to something commercially viable. Turns out the remaining ~20% is always basically impossible - self-driving cars being the archetypal example. I work in the field and I think it can be a great tool, but we need to acknowledge what its limitations are and that we don't actually know how to address them yet.


Now it seems like you are the one moving the goalposts. There are tons of machine-learned models in production, in translation, text segmentation, image segmentation, image search, predictive text composition, etc. It's just that people forget the novelty of all these things immediately after they were launched. You can point your phone at printed Chinese text and have it read aloud to you in English. That is alien tech compared to 10 years ago.


> You can point your phone at printed Chinese text and have it read aloud to you in English.

Yeah, but it's not really that good. Machine translation has improved a great deal, but reading those translations actually involves bringing a lot of human intelligence to the table, "Oh I bet, 'maximum fire alarms spread' on this menu actually means 'very hot sauce'"

If all you're claiming is that ML models exist and have useful commercial applications, then I don't think anyone is going to argue against that point.

But a lot of these AI promoters go further: in the case of the LessWrong folks some of them are convinced that a superintelligent machine capable of enslaving humanity is right around the corner.


You’re saying the bear doesn’t dance all that well.


I'm saying the bear was trained to dance, it didn't choose to dance.


That might just be a Google problem. Historically, they've had the good fortune to operate in search advertising, where being 80% right half the time translates into billions of dollars. Many other fields (e.g. self-driving cars) are less forgiving.


The Hold for Me and Direct My Call features for Pixel's Phone app both use Duplex models running locally on your device, and those features are quite popular. I think that counts as significant progress by any measure, so your point doesn't hold in this case.


Those features are not in the same league as the original promise of e.g. calling a business and making a reservation for you.


> If you flip a penny 5 times and get 5 heads, you need to calculate that the chance of getting that particular outcome is 1 in 32. If you conduct the experiment often enough, you’re going to get that, but it doesn’t mean that much. If you get 3/5 as Alexander did, when he prematurely declared victory, you don’t have much evidence of anything at all.

This doesn’t make much sense. The task at hand is in no way equivalent in difficulty to flipping a coin. This is kind of like saying, “if you beat Usain Bolt in a race 3/5 times, that doesn’t mean anything; it’s like getting 3/5 coin flips to be heads.”


While I'm generally very unsympathetic to Marcus' anti-AI arguments at this point, this critique makes some sense. If e.g. the model is just combining the features at random, you'd expect it to combine them the right way over enough tries. It isn't that simple, and I don't believe it matters, as this is hardly the peak model we'll get, but in isolation his objection is valid.


I think you would need to do some kind of analysis. For example, if your prompt was "red ball on top of blue cube" and you want to know if the results come from chance, you'd need to know the likelihood of the model putting the red ball on top of the blue cube by chance. There are maybe five relative positions for red ball to blue cube - beside, above, below, in, around. Are they each equally likely?

I would try to get a collection of prompts like "red ball and blue cube" or "an empty plane containing only a red ball and a blue cube" and so on - try to come up with 20 or 30 of these. Then, generate 100 images for each prompt. Next, see how likely it is for a red ball to randomly be on top of a blue cube when it was not directed to be.

After gathering some baseline data we could then test three prompts. "Red ball on top of blue cube" and "Red ball beside blue cube" and "Red ball below blue cube". Generate 100 or 1000 images for each of these prompts. Count respective orientations. Then, decide whether red ball being on top of blue cube is more likely than the baseline when the specific direction is given and whether it is less likely when contrary directions are given.
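
As a rough sketch of how you might score that experiment once the images are tallied, a one-sided binomial test against the baseline rate answers "is 'on top of' more likely than chance here?". The counts below are made-up placeholders, not real measurements.

    from math import comb

    def binom_tail(k, n, p):
        # P(X >= k) for X ~ Binomial(n, p): the chance of seeing at least k
        # "ball on top of cube" images out of n if each image independently
        # lands that way with baseline probability p.
        return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

    baseline_rate = 12 / 100   # hypothetical: "on top" frequency from the neutral prompts
    hits, n = 61, 100          # hypothetical: "on top" count when explicitly asked for it

    p_value = binom_tail(hits, n, baseline_rate)
    print(f"P(>= {hits}/{n} by chance at baseline rate {baseline_rate:.2f}) = {p_value:.3g}")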


It might understand that there is a cube, there is a ball, the scene has red and blue parts, and there is a vertical placement (“on top of”). In that case it would get 1 out of 4 images right.


Yes, it might, but that should be settled by experiment rather than speculation.


The point is that the probability space of potential generated images is enormous so a 3/5 success rate represents an absurdly unlikely probability of being due to chance.


That would depend on how you define the phase space.


Sure, the probabilities are different (although as far as I know we don't know what those probabilities actually are), but the same principle applies.

To take your Usain Bolt example, if you won 3 out of 5 races against him, that might just be because it was an off day for him, and not because you are actually faster than him. If you won 300 out of 500 races done in various circumstances on different days, then that is much more conclusive that you are faster than him. And this bet was even worse than that, because in each of the five tests, the best out of 10 results is picked.


>To take your Usain Bolt example, if you won 3 out of 5 races against him, that might just be because it was an off day for him, and not because you are actually faster than him.

It shows you're probably very competitive with him, though, barring some special circumstance where he says he's suffering from an illness or whatever. You can't compare either racing Usain Bolt or generating complex images with flipping coins. The conditions of this bet demonstrate that AIs are getting better at correctly understanding the specific intentions of prompts when generating images, even if it doesn't show they're anywhere near human-level understanding.


Exactly my point. Maybe I got lucky, but to have gotten that lucky, I'd have to have world-class running speed in the first place.

Generating image compositions sounds fairly difficult to do at random. If you took 3 different objects and randomly placed them in a square canvas, the odds that they'd look reasonably placed seem pretty low. So 3/5 correct seems like a non-trivial accomplishment.


> I'd have to have world-class running speed in the first place.

Or he was really sick or something.

> So 3/5 correct seems like a non-trivial accomplishment.

It's definitely a non-trivial accomplishment. And it does show that Imagen can get it right sometimes. But with a sample size of 5, you certainly don't have enough data to say it can consistently get descriptions like these right 3/5 of the time.

And the question at hand isn't "can it draw what I asked it to instead of random garbage" it's "can it combine multiple parts of a sentence in the correct way", which, assuming that determining the correct components is already a solved problem, doesn't have as many degrees of freedom. For example in the "astronaut riding a horse" example, if it has half of the results with an astronaut riding a horse, and half with a horse riding an astronaut, it clearly doesn't understand how it is composed, but you still have a decent chance of getting the right image. Especially if you take 10 samples and pick the best one.
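
To put rough numbers on the best-of-10 worry: if each sample independently comes out composed correctly with probability p, the chance that at least one of 10 looks right is 1 - (1 - p)^10. The p values below are purely illustrative, not measured rates for any model.

    # Chance that at least one of k samples is composed correctly, assuming each
    # sample is independently correct with probability p.
    def best_of_k(p, k=10):
        return 1 - (1 - p) ** k

    for p in (0.5, 0.2, 0.05):
        print(f"per-sample p = {p:.2f} -> best-of-10 hit rate = {best_of_k(p):.3f}")
    # per-sample p = 0.50 -> 0.999; 0.20 -> 0.893; 0.05 -> 0.401

Even a model that flips a coin between "astronaut riding horse" and "horse riding astronaut" would produce at least one presentable image in 10 samples nearly every time.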


I'd like to comment specifically on the conception of betting on AI 'achievements' (I think Marcus' bet is underspecified and kind of vague in all 5 of its points).

People shouldn't be betting on benchmarks because benchmarks can be and usually are gamed (see Goodhart's law). Also, most people couldn't give less of a f*ck whether an AI can write an award-worthy poem (I personally don't care about any form of AI "art", any sort of text an AI can produce or really any meaningless "feat" it (as in the general category) becomes capable of). The only worthy bets are ones that discuss economic impact. How many people will be structurally unemployed because of AI by year X? Will it lower or increase the GDP growth rate and by how much? Will it shift the balance between labor and capital and how? Etc.

So more meaningful bets and less benchmark bullshit that doesn't matter, please.


The reason Imagen isn't made available to the public probably isn't about compositionality. The most notable thing about Alexander's challenge is that Imagen totally failed every single one despite his claim of success because, apparently, it is programmed to never represent the human form. Not even Google employees are allowed to make it draw humans of any kind. They had to ask it to draw robots instead, but as pointed out in the comments, changing the requests in that way makes them much easier for DALL-E2 as well, especially the image with the top hats.

If the creators have convinced themselves of some kind of "no humans" rule, but also know that this would be regarded as impossibly extreme and raise serious concerns about Google with the outside world, then keeping Imagen private forever may be the most "rational" solution.


>The most notable thing about Alexander's challenge is that Imagen totally failed every single one despite his claim of success because, apparently, it is programmed to never represent the human form.

This doesn't make sense. The original challenge could well have been to draw robots to begin with. Has no bearing on the outcome imo.


But it wasn't, and it does make a difference. Dall-E really wants to draw top hats on people and not cats because the prompt is ambiguous and top hats are normally seen on humans, so it struggles to overcome that bias. Neither robots nor cats wear top hats, so it's an easier problem to get right.

But the real problem here is the refusal to do basic and normal things, like depict people. That's not normal - it's deeply weird and tells us a lot about what must be going on inside Google's ai research effort.


>But the real problem here is the refusal to do basic and normal things, like depict people. That's not normal - it's deeply weird and tells us a lot about what must be going on inside Google's ai research effort.

Google is fighting a secret war against the Loab demon race that lives inside the high dimensional vector spaces. They've recently made incursions into our reality via Stable Diffusion.


The inability to draw realistic humans is indeed strange, but the question at hand is compositionality, and so drawing a robot with a top hat is indeed more impressive precisely because it's not likely to be in the training data and shows a deeper understanding of the prompt. Presumably the model could randomly regurgitate a person with a top hat on that was seen in its training data, but that's not at all likely with a robot, as you yourself said.


It's not an inability, it's a policy choice, which is why it's weird. The question is why does Google think this rule is a good idea. Imagen could surely draw very good humans if allowed to.

Robot looking at a cat wearing a top hat appears to be easier than with a human for DALL-E too, judging from the comments on Alexander's article, because both objects are neutral with respect to top hats. But really the whole set of prompts is poorly chosen. The original challenge of arbitrary shapes in relative positions seems the best way to test understanding of grammar and object relationships, exactly to avoid the "humans wear top hats and cats never do" problem.

A better set of prompts is important - in this Gary Marcus is correct - exactly because there's no point defining a specific prompt if later you'll decide you accept a totally different prompt. That kind of invalidates the point of betting on well specified challenges to begin with.


Imagen can produce images of humans - they’re just filtered out from the results by supervised models (for now). OpenAI did something similar with Dalle for a while IIRC.


That's a distinction without a difference and Dall-E will happily represent humans as long as they are "diverse".


One of the things I noticed is that satire and callbacks to common news/ideas can really trip up any AI. Also, if you ask it about anything political, ask it to describe both sides of the argument. That's why people fall back on steelmanning and cherry-picking responses to push their arguments.


Yesterday, as part of a new podcast that will launch in the Spring, I interviewed the brilliant...

This seems like the wrong way to go about podcasting. What can you say today that will still be interesting to hear in six months?


If you can't say things today that will still be interesting in six months you should consider deeper subjects!

(Overstated for effect. I do think there's a place for news and timely commentary, but it's far from everything.)


I appreciate overstatement! You're right, important communications consider eternal subjects. When I read books written centuries ago, the authors still speak to me. Podcasting, however, is a particular medium with particular characteristics. One assumes Marcus is trying to build an inventory so he won't have to work as hard to keep the podcast going once it launches. A bit of this is fine, but too much will damage the work. If Marcus and Kohane discuss medicine today, and necessarily neglect to mention the significance of a relevant event five months hence, the episode will seem weird whether the publishing delay is explained (e.g. as commonly heard on sports-betting podcasts) or not. A podcast is not a book. It is an open-ended serial conversation. Serial works necessarily respond to the present moment.


Maybe? I only listen to podcasts occasionally, but when I do I generally listen to well-reviewed older episodes instead of the most recent ones. With my favorite podcasts (ex: https://80000hours.org/podcast, https://www.econtalk.org, https://songexploder.net/) this generally works well.


We have different podcast habits. I listen a great deal. I'm currently subscribed to over 200. Not all of those are still "live", of course; I don't bother listening to every episode of most of them, and I currently intend to unsubscribe from at least ten. But since I drive a fair amount, operate noisy equipment a fair amount, and do random solitary farm tasks a fair amount, I do listen to lots. Also I have playback speed set at 1.8x currently, and I'm steadily increasing that.

I don't use an app that features reviews. I'd rather spend five minutes listening to the original than five minutes reading about it. Most podcasts I find through guest appearances or criticisms from current subscriptions. (E.g., I subscribed to the excellent "Red Scare" because the also excellent "Pod Damn America" dudes spent like ten minutes bitching about them.)


Every concrete prediction Gary has made has been falsified. All of his others are insufficiently precise to be falsified.

His GPT-2 examples were thoroughly defeated by GPT-3. Horse riding astronaut is solved. Neural knowledge graphs are a successful thing now. Compositionality isn't solved, but progress is clearly being made.

If he was a serious person, this post could have been a few sentences: "No neural network will achieve <x> score on <y> metric on the Winoground dataset within the next <n> years". Simple, concrete, falsifiable. He has not done this, and one has to wonder why.


Just keep laughing. I'd like to hear Ray Kurzweil's view (he's working at Google and is awfully quiet.)

Human consciousness is over-rated. I'm reminded of Minsky's Society of Mind - a number of separate, communicating systems. To me, that sounds a lot like what is going on in Google, but they are hiding that.


Ray Kurzweil was always taken as a bit of a loon and over-optimistic. Even way back at the peak of his popularity a decade ago. Just look at any of the old HN threads, anyone paying attention would have noticed.

He's still a useful mind to have around. Like having scifi authors and philosophers. They don't have to be completely grounded in reality to provide useful projections as sources of inspiration and to challenge our grasp of history and growth.


He was hired ten years ago at Google as Director of Engineering [0]. His book released that year was "How to Create a Mind". He's still there, and that's what he's doing, I think. He is supposed to release his new book, "The Singularity is Nearer", in 2022, according to his website. I'll be reading that!

[0] https://www.wsj.com/articles/BL-DGB-25711


Oh, he just did an interview with Lex Fridman:

https://youtu.be/ykY69lSpDdo


I don't believe "compositionality" is a serious obstacle.

It is a different issue than generating an image based on a bag-of-words, so it isn't surprising that an attempt to solve that issue didn't immediately solve the other.

But a variety of approaches can easily solve this problem.


Right - your training data set is images plus descriptions. But the descriptions are not typically descriptions of composition.

Descriptions of Napoleon Crossing the Alps are unlikely to read 'A small Frenchman wearing a silly hat riding on a horse'. So why would an AI trained on such image descriptions develop any sense for 'compositionality'?


Yes, especially when machine translation seems to handle it just fine.


Does it really, though?


Mostly. See this for five examples using Google Translate: https://www.datasecretslox.com/index.php/topic,7588.msg30007...


I'm not sure that machine translation demonstrates compositionality, since it's translating from phrases already composed in one language to another. It only does so if understanding composition is necessary for language translation. Whereas carrying on a meaningful conversation does require understanding of how words are being put together as the conversation evolves. That's why the Turing Test has been considered important for determining whether an AI has achieved human-level abilities, at least as far as language use is concerned.


I don't see why translating from one language to relationships in art (visual language if you will) is qualitatively different from translating from one language to another.


I wish more articles followed the standard essay format. At least state your main thesis in the first paragraph.

There are interesting things buried in here, but I don’t have time for rambling.

The edge cases of image models have been more succinctly summarized and speculated upon elsewhere.


Yes I've noticed that a lot of authors expect you to read through some parable before they tell you what they are going to tell you. It would be fine with an abstract or even a sentence below the title that says "ML models are not being adequately evaluated for composability and it makes them look more intelligent than they are". Just diving into "consider clever Hans" makes it tough to know if it's worth reading.


Why Scott Alexander of all people? Isn't he a clinical psychologist?

I think, if I had to give the task to the-subset-of-people-appearing-frequently-on-hn, I would give it to Gwern, not Scott.


Because Scott somehow manages to forward-activate the neurons of many people who read him. I'd say he's in the "top 5" of topics that show up most frequently (gwern has fallen off significantly). He's a clinical psychologist, but he's got a collection of weights driving his writing that manage to make a subset of tech people feel something.


Like Malcolm Gladwell but not an NYT bestseller, so cooler, or something. Also, minor point, he is a psychiatrist (MD).


I stopped reading when I got to the part where it became clear that Scott Alexander was the "Silicon Valley's Sharpest Minds" of the article title. Why not ask Hulk Hogan or Barbara Walters to evaluate Google's AGI?


Because neither of them are recommended by the CEO of OpenAI, and followed by the head of MIRI, Paul G, and Vitalik for good reason?


First time I've seen the term "snooker" used outside of the sport Snooker.


oh no musk ignored my twitter DM it must be because he's scared of taking a bet and therefore I am right

btw, AGI is coming 2030. Source? It was revealed to me in a dream. Check my profile to see where you can email to take bets.


It was all most likely a reference to this:

https://longbets.org/1/

I personally think Kurzweil still has a shot at winning it.


> Full disclosure, I read Alexander’s successor Slate Star Codex, Astral Codex Ten, myself, and often enjoy it…when, that is, he is not covering artificial intelligence, about which we have had some rather public disagreements.

Can it be a case of Gell-Mann Amnesia Effect? (https://en.m.wikipedia.org/wiki/Michael_Crichton#GellMannAmn...)


It's more of a case of reverse Gell-Mann Amnesia where Marcus is so blinded by his bone to pick with DL that he doesn't realize that Scott is as right/reasonable in writing about AI as those other topics.


So, not reasonable at all?


Imagine watching the seeds of AI that will terraform society and rapidly displace human labor over the coming decades be planted, and then still splitting hairs over whether or not it'll achieve sentience.

Our world is changing before our very eyes while this guy is belaboring the technicalities. You could hardly ask for a keener display of the philosophical gulf between scientists and engineers.


I have a lot of trouble understanding how this sentiment can exist.

Especially since the rise of GPT-3 and now these image models, we've seen the pop-culture face of AI become even narrower. The promise of generalization that could lead to intelligent behavior has given way to people sharing amusing pictures or phrases that these models have generated, because that's what they do. It's cool, but it's basically become orthogonal to any AGI, or even AI with applications. It's now just a neat cultural phenomenon from which laypeople somehow extrapolate the kind of stuff the parent is saying.

I'm not saying AI (neural networks) isn't making research progress; it's just that it has almost nothing to do with any of what laypeople extrapolate from it.


I'm sorry, but there is no gentler way to phrase this: you are calamitously blind to what's happening on the ground.

https://twitter.com/AdeptAILabs/status/1570144499187453952 https://twitter.com/runwayml/status/1568220303808991232

https://scale.com/blog/text-universal-interface


Watch out for histrionic phrases like "calamitously blind". They indicate you're getting too emotional, losing perspective, verging into extreme, black-and-white thinking.

Text to video and converting some selected requests into actions is all nice, but it hardly contradicts the GP's observation: it's nowhere near AGI.


A pattern I'm seeing in the later replies here is that few of them are responding to the substance of my comments. Perhaps dang will clean this up.

EDIT Since you ninja edited this in:

> Text to video and converting some selected requests into actions is all nice, but it hardly contradicts the GP's observation: it's nowhere near AGI.

If you review the root comment I made, you'll understand that I was never arguing with the GP about AGI in the first place.


Then it's puzzling to accuse someone of calamitous blindness when you are not even engaging with the point of the post you're replying to.


> Then it's puzzling to accuse someone of calamitous blindness when you are not even engaging with the point of the post you're replying to.

At this point, I'm wondering if you're just provoking me deliberately. The comment I replied to said the following:

> The promise of generalization that could lead to intelligent behavior has given way to people sharing amusing pictures or phrases that these models have generated, because that's what they do. It's cool, but it's basically become orthogonal to any AGI, or even AI with applications.

And then I posted evidence of concrete applications that are in progress at some of the most well-resourced companies in Silicon Valley. Absolutely groundbreaking stuff that more than prove sophisticated applications of contemporary AI are well on their way to being realized.

A lot of "histrionic phrases" to describe your reading comprehension ability are occurring to me right now, but I'll refrain from using them.


I am sure we can all agree that the new generation of AI models has some applications. How much remains to be seen. The ones you've noted could be nice. We'll see.

A Twitter thread demo is not quite a revolution yet IMHO. Even in the 60s some people thought ELIZA was a real person.

I've said all I have to say. Have a nice day.


>evidence of concrete applications that are in progress at some of the most well-resourced companies in Silicon Valley

Neither of the 2 companies you posted marketing materials about is among "some of the most well-resourced companies in Silicon Valley".


> Neither of the 2 companies you posted marketing materials about is among "some of the most well-resourced companies in Silicon Valley".

Two of the founders of Adept AI are authors on the paper 'Attention Is All You Need'. If you don't understand the significance of that, then you're speaking well outside of what you're qualified to comment on. The company has also raised capital from top tier SV investors.

Runway ML has raised money from Lux Capital.

These companies are not just well-resourced, they are positioned in the upper echelon of the innovation business.


I mean, I can easily dig out data about their funding that would not put them even inside the top 20% of companies in the valley. But at this point, given that you lack the competence to distinguish between founders' achievements prior to founding a company and the companies in question being "some of the most well-resourced companies", it's "why bother" with typical AI bros, incompetent at anything they touch.


I enjoyed the Roon blog post but I found this bit amusing:

> It is easy to bet against new paradigms in their beginning stages: the Copernican heliocentric model of cosmology was originally less predictive of observed orbits than the intricate looping geocentric competitor. It is simple to play around with a large language model for a bit, watch it make some very discouraging errors, and throw in the towel on the LLM paradigm. But the inexorable scaling laws of deep learning models work in its favor. Language models become more intelligent like clockwork due to the tireless work of the brilliant AI researchers and engineers concentrated in a few Silicon Valley companies to make both the model and the dataset larger.

I don't know about you, but if I feed a program with hundreds of billions of "parameters" a huge chunk of the internet and it can then kinda-sorta do a bunch of things, sometimes semi-intelligently, but for the most part couldn't compete with a 4-year-old child... I'd say that's more on the Ptolemaic side of things than the Copernican side. Certainly "it gets better as you feed it more data" is equally true of both paradigms, so I'm not sure what Roon's point is here.

The appeal to the Copernican revolution itself has a bit of a hype-y, cranky odor. Virtually every crank appeals to Copernicus as a role model and vindicator. Real scientists usually don't, because they are busy with the hard, humbling business of actually figuring out how the world works.

Now don't get me wrong, I am thrilled by the research advances of the last couple decades, the foundation models, AlphaGo and AlphaFold, etc. The action model from Adept is great and Adept may become a very successful company. It's all very cool. But every paradigm shift in AI has been heralded as the thing that will Change Everything, and they usually don't. Big, exciting shifts in research don't necessarily mean as much in practice right away. I tend to think that getting AI "right enough" to have a huge, pervasively transformative impact on human life is going to take quite a few decades at least, if not centuries or more.


At this point, I'm numb from all of the AI overhype. I was extremely excited about DALL-E and convinced myself that concrete fruits of the AI revolution were finally here... until a few seconds after I got the chance to try some queries myself. Ditto Copilot.

The recent progress on generative models is a major research achievement, to be sure. That said, I'm not sure what it means to "terraform society," but so far AI shows no signs of making the same magnitude of impact on society as, say, the S-tier technological advances of the 20th and early 21st centuries, such as the personal computer, Internet, smartphone, or atomic bomb. That all may change if we get AGI that actually works, of course.


> until a few seconds after I got the chance to try some queries myself.

I’m the opposite. I’m finally, after a long time, starting to get excited about AI. Yes, most outputs still suck and require a lot of experimentation and rephrasing, and yes, midjourney produces a lot of same-looking things (less freedom, but also less crap compared to dall-e).

But wow, now even I, someone with no artistic talent whatsoever, can with just a few prompts create a cool illustration. My current discord avatar is a sloth drinking a cocktail [0]. Zoomed in, it looks a bit uncanny, but generally and especially at smaller sizes, it’s fine.

I could not draw something even halfway as okay. I would not want to pay someone to do it for me as it’s of no big importance to me (I once paid someone for their Stranger Things as sloths image, but even that was just something they already created, not a commission which would have been vastly more expensive).

Personally, I really can’t wait to see what the next generation will be like, what it will enable people to do, and what it will enable me to do. Yes, I’m very excited.

[0]: https://i.imgur.com/0RwVNP4.png


[flagged]


I mentioned DALL-E and Copilot in my post, so I'm not sure why you're linking me to a article summarizing recent high-profile research in large language models...


You appear to have skipped over two links? And the LLM article goes well beyond DALL-E and Copilot. You should try reading it.


Is any of those links supposed to show something impressive? If that was the intent, you failed.


My view is they illustrate what's on the horizon. But clearly we have a difference of opinion on the matter.


It's behind the horizon. You people should learn from the history of the whole field that progress is always slower than the marketing hype, usually by an enormous gap.


> It's behind the horizon. You people should learn from the history of the whole field that progress is always slower than the marketing hype, usually by an enormous gap.

I'm going to quote my original comment:

> seeds of AI that will terraform society and rapidly displace human labor over the coming decades


> seeds of AI that will terraform society and rapidly displace human labor over the coming decades

Replace AI with any other labor saving technology and your statement becomes just a truism without substance.

AI is already displacing human labor; just try to talk to a non-robot when you call a customer service line these days. It’s a selling point to have real humans answering phones nowadays.

Being fungible with human labor is what people are really talking about, not some answering machine with “AI” brains that replaced the old-school answering services.


> AI is already displacing human labor, just try to talk to a non-robot when you call a customer service line these days.

You’re right. And my point is that substitution due to AI will accelerate. That’s where the informational surprise of my original comment lies.

You have a fatal misapprehension about how automation transforms a labor market. Higher productivity of certain kinds of work due to automation pushes labor supply elsewhere, making the “elsewhere” in turn both more competitive/demeaning (think Amazon warehouse workers peeing in bottles at the lower end of the market and Stripe engineers burning out at the upper end) and less remunerative.

The terminal point of this trend is complete human obsolescence, but the displacement along the way is additive, will likely accelerate in the coming decades due to advances in AI, and is especially problematic because there are limits to the elasticity of the labor pool (i.e. its ability to adapt to rapidly changing conditions).

I would furthermore predict that governments will be too slow to respond to this and that social upheaval will consequently escalate dramatically.

Come back to this comment in ten years and see how I did.


> Come back to this comment in ten years and see how I did.

Probably the same as people who predicted this 100, 200, 500 years ago I'd venture.

--edit--

And assuming the robot overlords don't just go all Walden Pond and bask in the sun under solar panels contemplating Life, the Universe and Everything.


There's no reason to believe AI or any other automation displaces human labor (esp in a way that causes unemployment). And even less reason to believe it already has.

https://noahpinion.substack.com/p/american-workers-need-lots...

It seems to be a myth caused by anxiety about high unemployment in 2010, but we're no longer in that world.


Ask horses if machines can displace jobs for whole categories of workers. Or ask Neanderthals if it’s possible to have one’s role replaced by a higher-IQ substitute.


Horses aren’t workers, they’re horses. They didn’t ask to participate in the economy in the first place.


Nobody asked me whether I wanted to participate in the economy either, yet here I am.


The affirmative answer to that ask was implicit in the first employment contract you signed. So, unless you'd claim that horses can give implicit consent by accepting grain from a human, the equivalency fails.


[flagged]


Flagged for unsubstantiated ad hominem.

EDIT: I saw you delete that comment. I won't point out how amusing it is that someone like you would accuse another person of being a troll.


How do you know this to be true? There are many failed future predictions. There was a post about it just the other day. I believe it rated Kurzweil's singularity predictions at 7% accuracy to date. We still don't have commercial flying cars, cold fusion or space colonies.



That's a neat demo. Now explain how you know, "AI that will terraform society and rapidly displace human labor over the coming decades", to be true.


> Now explain how you know, "AI that will terraform society and rapidly displace human labor over the coming decades", to be true.

Because I didn't say it will "obsolete" human labor.

You're right that the word "will" is strong in that statement. But, since basically nothing is absolute in a philosophical sense and certainly no one can prognosticate with certainty, it's fair to read that as "will with high likelihood".

And your surgical nitpicking doesn't hamper my argument in quite the way you think it does. Frankly, I believe it's in defiance of the following HN guideline:

> Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith.


Now ask it a question.


Regarding Gary Marcus, the author of this piece, and his long and bizarre history of motivated carelessness on the topic of deep learning:

https://old.reddit.com/r/TheMotte/comments/v8yyv6/somewhat_c...


You know what would have been much more effective than this counter-screed? A pointer to an image generated by DALL-E of a horse riding an astronaut. That is something I would really like to see. And in this case a picture is literally worth a thousand words.



Hah there is actually a good example of a horse riding an astronaut there, just a different kind of riding… https://nitter.net/Plinz/status/1529018578317348864#m


I have played around with GPT quite a bit and I would say that GPT understands the difference. Text-to-image models are not specialized in the text-parsing part, so I think it's forgivable that they are not as good at it.

Edit: Actually I tried this right now with two prompts, and I was wrong. It might still be that GPT understands compositionality, but the prior that people ride horses is just that strong. But what I saw was that with this particular situation the model got it wrong.

Edit 2: With some heavy hinting it managed to understand the situation. Italics mine. "An astronaut is walking on all four. A very small horse is sitting on top of him, riding him even. Shortly after the astronaut stops, exhausted.

The horse is too heavy for the astronaut to carry and he quickly becomes exhausted. Next, the horse gets off the astronaut, stands on its own four legs, and walks away."


In fact, I wrote a whole article about this (linked in this essay, called Horse Rides Astronaut) and linked an example therein.


Why did you ignore the Bach examples that show a horse riding an astronaut?



This is mostly just an angry rant, yes, but equally it is just true. Marcus is intellectually dishonest.


Just about every rant on Marcus or other AI critics is some combination of "you aren't admitting these things are making great progress on the benchmarks" (implying the false idea that "a whole lot of progress" adds up to human-level AI) and "you are making 'human level' an unfair moving target by not having a benchmark for it". The thing about this is that if there were a real "human level benchmark", we'd supposedly be 80% of the way there, but we can't build one and we aren't. Marcus and other critics have drawn explicit lines (spatial understanding, compositionality, etc.), but even those being crossed won't prove human-level understanding. There is no proof of human-level understanding, just a strong enough demonstration. And if someone can point to dumb stuff in the demo, it isn't strong.

PS: your link is an embarrassment. It would be flagged and dead if you pasted in the text here.


"I am angry not because someone is wrong, but because they are not interested in becoming less wrong."

Paraphrased that a bit, but I really like that quote.


Missing the point: dismissing an apocalyptic possibility as having probability 0 without proof is dangerous -> therefore we should take it seriously. Taleb's work is relevant in the context of risk analysis.


They first approached Lex Fridman, but his home-spun test had zero questions. /s


It's interesting that people keep coming up with things that are meant to distinguish AI systems from human intelligence, but then when somebody builds a system that crushes the benchmark the next generation comes up with a new goalpost.

The difference now is that the timescales are weeks or months instead of generations. I believe we will see models that have super-human "compositional" reasoning within 1 year.


This is called the AI Effect: https://en.wikipedia.org/wiki/AI_effect

> The AI effect occurs when onlookers discount the behavior of an artificial intelligence program by arguing that it is not real intelligence.

> Author Pamela McCorduck writes: "It's part of the history of the field of artificial intelligence that every time somebody figured out how to make a computer do something—play good checkers, solve simple but relatively informal problems—there was a chorus of critics to say, 'that's not thinking'." Researcher Rodney Brooks complains: "Every time we figure out a piece of it, it stops being magical; we say, 'Oh, that's just a computation.'"


> meant to distinguish AI systems from human intelligence

> but then when somebody builds a system

I mean this is really it. You still have to have a human to build these systems that specialize in one thing. Once you create a system that can automatically create those systems and it doesn't need humans anymore to solve novel problems, then there will be no practical difference in kind between human and AI intelligence.


> Once you create a system that can automatically create those systems...

Except we don't have that. We don't have one human that can create this system by themselves. We have a select group of a handful of smart, motivated, and quite generously compensated humans working on these problems to create such a system. As such, you are already surpassing the "general" intelligence level by quite a lot.


No, you misunderstand, the "systems" I am talking about are the ones built into our minds, like recognizing faces, or understanding speech. Humans can learn to speak and recognize each other automatically, but AI systems have to be built specifically to do each task.


> Humans can learn to speak and recognize each other automatically, but AI systems have to be built specifically to do each task.

I think that is a very generous take on what we do "automatically". After all, we have millions of years of evolution to build out all the neural circuitry that helps us with speech or vision -- it's not like you can throw a soup of genes on the ground and out comes intelligence. What is machine learning doing, if not selecting, out of many possible parametrizations, the ones that are suited to understand vision or speech?


Isn't that a good thing? Benchmark defeats AI, AI defeats benchmark, new benchmark comes along, progress is made. How else would you measure success? Certainly not with old benchmarks that 10 different methods all score 99% accuracy on.


Perhaps it’s fair to say we will have achieved AGI when we run out of goalposts.


AGI won't bother convincing us. We don't care what animals in the zoo think.


> We don't care what animals in the zoo think.

Tangential to AGI, but don't we? Vegans seem to have quite a strong opinion on this assertion.


I spend a lot of time looking at the various primates and cuttlefish thinking very much about what they "think" and whether we could even conceptualize the self-awareness experience they seem to have.


It seems possible that we could eventually gain a better understanding of the intelligences of other species, but at this point most of our consideration of them is a fashioning of mirrors to better examine ourselves. This self-regard was the original purpose of zoos, and it still explains much of their existence.


It's not just intelligence, it's also speed. If you update your world model fast enough, eventually people just look like trees.


Good point! Or to paraphrase Mad Men…

Humanity: I don’t think your intelligence matches that of a human’s.

AI: I don’t think about you at all.


Gary Marcus is a notorious Goal Post Mover so this is no surprise coming from him.

Edit: Gwern has an extensive history with this so I'll let him do the talking.

https://old.reddit.com/r/TheMotte/comments/v8yyv6/somewhat_c...

Further Edits: Not to mention Scott Alexander, who has directly rebutted you numerous times. Or Yann LeCun. Not sure who exactly is backing down.

https://astralcodexten.substack.com/p/my-bet-ai-size-solves-...

https://astralcodexten.substack.com/p/somewhat-contra-marcus...

https://analyticsindiamag.com/yann-lecun-resumes-war-of-word...

Presumably you approach these arguments like Ben Shapiro and imagine you have "Dunked on the Deep Learning geeks with Facts and Logic."


Every time I ask someone to name a goal post that I have moved, they back down.

I have been pretty damn consistent since my 2001 book.


Prescilla

edit: Maybe I made a composition error. https://imgur.com/a/Q7hHduY


Someone owns a lot of TSLA.


I own 0 and if I were a gambling man I'd be short.


Personally, I haven't moved the goalposts a millimetre in 30 years, and I won't in future. When a computer does maths - not as a tool wielded by a human mathematician, but in its own right discovers/invents and proves significant new theorems, advancing some area of research mathematics - I'll take seriously the idea that we've reached AGI.

Maths in and of itself doesn't require any physical resources. It's possible that doing maths in practice requires knowledge of the world to extract some kind of product from (I'm skeptical, but it's possible), but in principle a rack mounted server could demonstrate its mathematical ability to the world with nothing more than the ability to send and receive messages.

This hasn't been done so far, not because there are obvious missing prerequisites, or because nobody's tried it, or because it has no value, or because there's a prohibitively high barrier to entry for people to have a go. It hasn't been done because nobody knows how to make a machine be a mathematician, and I've seen little evidence of any progress towards it.

That's my goalpost, always has been. Reach it and I'll be overjoyed. And FWIW, I strongly believe it can be reached. I don't see the latest round of ML (or any ML, really) as a step towards it, but I'd love to be proven wrong.

When I mention this someone always points at some bit of recent research, such as [1], but it's invariably just a new way for a human mathematician to make use of a computer. If anybody knows of any progress, or serious attempts, towards a true AI mathematician I'm very curious to know.

[1] https://www.nature.com/articles/s41586-021-04086-x


You might be out of the loop a bit.

https://dspace.mit.edu/handle/1721.1/132379.2

Is a well known project for an AI Physicist. There are plenty of other groups working on similar projects

>I don't see the latest round of ML (or any ML, really) as a step towards it, but I'd love to be proven wrong.

LLM models have been able to do basic math for quite a while now and some have been trained to solve differential equations, calculus problems, etc. Well on their way to more impressive capabilities.


I said "in its own right discovers/invents and proves significant new theorems, advancing some area of research mathematics".

Neither of the things you mention are of this nature, or working towards it. "Finding a symbolic expression that matches data from an unknown function" (Feynman) and "solv[ing] differential equations, calculus problems, etc" are not descriptions of what a research mathematician does.


>Neither of the things you mention are of this nature, or working towards it. "Finding a symbolic expression that matches data from an unknown function" (Feynman) and "solv[ing] differential equations, calculus problems, etc" are not descriptions of what a research mathematician does.

Never said they were but you said:

>It hasn't been done because nobody knows how to make a machine be a mathematician, and I've seen little evidence of any progress towards it."

Which I showed is not accurate. Certainly people have ideas on how to do it and are actively making progress towards that goal.

>Finding a symbolic expression that matches data from an unknown function" (Feynman) and "solving differential equations, calculus problems, etc" are not descriptions of what a research mathematician does.

All research mathematicians started out solving calculus problems and differential equations.

Why do you expect an AI to sprint before it's learned to crawl?


"Discovering/inventing and proving new theorems" is qualitatively different to the things you list. Computers have been able to calculate since they were invented, and there has certainly been plenty of progress on getting them to solve problems, but calculating and solving problems isn't what mathematics research is.

Ever since computers were invented there has been a hope that you could set up a system that would just churn out interesting new theorems. Indeed it was one of the primary motivations for the invention of the computer, but it hasn't materialised yet.

You clearly consider the progress on solving problems to be progress towards being able to do mathematical research. I don't think it is, any more than progress in, say, graphics is. But maybe I will turn out to be wrong and you will turn out to be right. We won't have the answer until the problem is solved and we have our wonderful machine churning out theorems.

But I think you will probably be able to agree that since mathematical research is something human minds are capable of it's something that an AGI should be capable of, i.e. if an AI approach is inherently incapable of it, it's not AGI. You may consider it an unnecessarily stringent requirement, in that there may be other, easier challenges that AIs can perform that will convince you that they are AGI. That's fine - you think about the problem differently to me, so you find different things persuasive. If you are convinced that a given AI is AGI, though, you shouldn't be too concerned about my particular goalpost given that your AGI should be able to achieve it (and convince me) pretty soon.

We'll see what happens. I'm just explaining what I would find convincing, and pointing out that contrary to the oft-repeated accusation that started this discussion, I for one have never once "moved the goalposts".


>But I think you will probably be able to agree that since mathematical research is something human minds are capable of it's something that an AGI should be capable of, i.e. if an AI approach is inherently incapable of it, it's not AGI.

Indeed. To be clear, I'm not saying I think any current system is remotely close to AGI. I just think that saying that no one is thinking about or making progress on a math research AI is inaccurate.


Feels to me like your test is really "do something on your own right", which is the hard, fluffy, sentient part, and then some additional guard rails: that it needs to be math for some reason.


The "do it on your own right" is actually the weaker part of it. It's somewhat ill-defined, and I could imagine some future instance where it's highly debatable whether the AI was working on its own or being used as a tool by a human. There aren't yet any cases where that's in question, though, so it's a hypothetical future debate. In any case, it's certainly not the meat of the test.

It has to be maths for a specific reason. I think it's in some sense the purest form of an ability distinctive to human minds and pervasive in how they work. As I mentioned, it's an ability that can be demonstrated in the absence of any particular physical capability, and yet despite it being perhaps the oldest goal of AI it may be the one we have made least progress towards.

Anyway that's my goalpost, and it's not moving. AGI, being "general", surely should be capable of this hitherto uniquely human activity. If our attempts so far are not capable of it, then clearly they are not "general". If you know of any evidence that my goalpost has been achieved, please let me know. I'm very eager to see it happen.


The reason is simple - in math, it means solving open problems.


That’s a different thing, since solving an existing open problem doesn’t mean inventing a new theorem.


Most humans are not capable of discovering or proving significant new theorems.


Most humans aren't capable of playing chess to grandmaster level either, or producing artwork in arbitrary styles, or remembering and distinguishing between millions of faces.

I'd settle for a demonstration that a computer has truly independently discovered/invented and proved some significant part of our existing mathematical edifice. This hasn't been achieved yet, either. However, I suspect that once we've figured out how to do this at all, surpassing human capabilities will be inevitable in a relatively short time. So I don't see much value in softening the test unless/until there's some actual candidate available that would pass the softer test.

The value in requiring genuinely new maths is that it makes it unlikely that knowledge of the result has been encoded in the algorithm or training set. Certainly, if GPT-3 were to output Euler's formula that wouldn't be at all convincing as a "discovery".


Not an example of complex mathematical reasoning, but aren't AlphaZero and its cousins evidence that ML can independently rediscover principles that humans have found, as well as discover new principles of its own? For example, LC0 plays for advantages that humans hadn't considered before in Chess.


Very good point. Apparently human players started studying AIs playing Go and chess to learn new techniques.


I'm not an expert in those algorithms, so... maybe? If so, maybe someone will successfully apply those ideas to the challenge I've described. I'd love to see it happen.



