
I think the implications go much further than just the image/video considerations.

This model shows a very good (albeit not perfect) understanding of the physics of objects and relationships between them. The announcement mentions this several times.

The OpenAI blog post lists "Archeologists discover a generic plastic chair in the desert, excavating and dusting it with great care." as one of the "failed" cases. But this (and "Reflections in the window of a train traveling through the Tokyo suburbs.") seem to me to be 2 of the most important examples.

- In the Tokyo one, the model is smart enough to figure out that on a train, the reflection would be of a passenger, and the passenger has Asian traits since this is Tokyo.

- In the chair one, OpenAI says the model failed to model the physics of the object (which hints that it did try to, which is not how the early diffusion models worked; they just tried to generate "plausible" images). And we can see one of the archeologists basically chasing the chair down to grab it, which does correctly model the interaction with a floating object.

I think we shouldn't underestimate how crucial that is to building a general model that has a strong model of the world. Not just a "theory of mind", but a literal understanding of "what will happen next", independently of "what would a human say would happen next" (which is what the usual text-based models seem to do).

This is going to be much more important, IMO, than the video aspect.




Wouldn't having a good understanding of physics mean you know that a woman doesn't slide down the road when she walks? Wouldn't it know that a woolly mammoth doesn't emit profuse amounts of steam when walking on frozen snow? Wouldn't the model know that legs are solid objects that other objects cannot pass through?

Maybe I'm missing the big picture here, but the above, and all the weird spatial errors like the miniaturization of people, make me think you're wrong.

Clearly the model is an achievement and doing something interesting to produce these videos, and they are pretty cool, but understanding physics seems like quite a stretch?

I also don't really get the excitement about the girl on the train in Tokyo:

In the Tokyo one, the model is smart enough to figure out that on a train, the reflection would be of a passenger, and the passenger has Asian traits since this is Tokyo

I don't know a lot about how this model works personally, but I'm guessing that in the training data the vast majority of footage of people riding trains in Tokyo featured Asian people. Assuming this model works on statistics like all of the other models I've seen recently from OpenAI, why is it interesting that the girl in the reflection was Asian? Did you not expect that?


> Wouldn't having a good understanding of physics mean you know that a woman doesn't slide down the road when she walks? Wouldn't it know that a woolly mammoth doesn't emit profuse amounts of steam when walking on frozen snow? Wouldn't the model know that legs are solid objects that other objects cannot pass through?

This just hit me, but humans do not have a good understanding of physics; or maybe most humans have no understanding of physics. We just observe and recognize whether it's familiar or not.

That being the case, AI will need to be way more powerful than a human mind. Maybe orders of magnitude more "neural networks" than a human brain has.


Well, we feel the world; it's pretty wild when you think about how much data the body must be receiving and processing constantly.

I was watching my child in the bath the other day; they were having the most incredible time splashing, feeling the water, throwing balls up and down. And yes, they have absolutely no knowledge of "physics", yet they were navigating and interacting with it as if it were the best thing they've ever done. Not even 12 months old yet.

It was all just happening on feel and yeah, I doubt they could describe how to generate a movie.


Operating a human body takes an incredible intuition for physics; just because you can't write or explain the math doesn't mean your mind doesn't understand it. Further to that, we are able to apply our patterns of physics to novel external situations on the fly, sometimes within milliseconds of encountering the situation.

You only need to see a ball bounce once and your brain has done some rough approximations of its properties and will calculate both where it's going and how to get your gangly menagerie of pivots, levers, meat servos and sockets to intercept it at just the right time.

Think also about how well people can come to understand the physics of cars and bikes in motorsport and the like. The internal model of a car's suspension in operation is non-trivial, but people can hold it in their heads.


Humans have an intuitive understanding of physics, not a mathy science one.

I know I can't put my hand through solid objects. I know that if I drop my laptop from chest height, it will likely break: the display will crack or shatter, the case will get a dent. If it hits my foot it will hurt. Depending on the angle it may break a bone. It may even draw blood. All of that is from my intuitive knowledge of physics. No book smarts needed.


I agree; to me the clearest example is how the rocks in the sea vanish/transform after the wave: the generated frames are hyperreal for sure, but the represented space looks as consistent as a dream.


They could test this by trying to generate the same image but set in New York, etc. I bet it would still be Asian.


Give it a year


Ok bro


The answer could be in between. Who said delusion models are limited to 2d pixel generations?


Did you mean diffusion ?


> very good... understanding of the physics of objects and relationships between them

I am always torn here. A real physics engine has a better "understanding" but I suspect that word applies to neither Sora nor a physics engine: https://www.wikipedia.org/wiki/Chinese_room

An understanding of physics would entail asking this generative network to invert gravity, change the density or energy output of something, or atypically reduce a coefficient of friction partway through a video. Perhaps Sora can handle these, but I suspect it is mimicking the usual world rather than understanding physics in any strong sense.

None of which is to say their accomplishment isn't impressive. Only that "understand" merits particularly careful use these days.


Question is - how much do you need to understand something in order to mimic it?

The Chinese Room, however, seems to point to some sort of prewritten if-else type of situation: someone following scripted algorithmic procedures might not understand the content. But obviously this simplification is not the case with LLMs or this video generation, since those aren't driven by pre-written scripts.

The Chinese Room seems to refer more to cases like "if someone tells me 'xyz', then respond with 'abc'" - of course then you don't understand what xyz or abc mean - but it's not referring to neural networks training on tons of material to build this model representation of things.


Good points.

Perhaps building the representation is building understanding. But humans did that for Sora and for all the other architectures too (if you'll allow a little meta-building).

But evaluation alone is not understanding. Evaluation is merely following a rote sequence of operations, just like the physics engine or the Chinese room.

People recognize this distinction all the time when kids memorize mathematical steps in elementary school but do not yet know which specific steps to apply to a particular problem. Such a kid does not yet understand, because the kid guesses. Sora just happens to guess with an incredibly complicated set of steps.

(I guess.)


I think this is a good insight. But if the kid gets sufficiently good at guessing, does it matter anymore..?

I mean, at this point the question is so vague… maybe it’s kinda silly. But I do think that there’s some point of “good-at-guessing” that makes an LLM just as valuable as humans for most things, honestly.


Agreed.

For low-stakes interpolation, give me the guesser.

For high-stakes interpolation or any extrapolation, I want someone who does not guess (any more than is inherent to extrapolating).


That matches how philosophers typically talk about the Chinese room. However, the Chinese room is supposed to "behave as if it understands Chinese" and can engage in a conversation (let us assume via text). To do this the room must "remember" previously mentioned facts, people, etc. Furthermore it must line up ambiguous references correctly (both in reading and writing).

As we now know from more than 60 years of good old-fashioned AI efforts, plus recent learning-based AI, this CAN be done using computers but CANNOT be done using just ordinary if-then-else type rules, no matter how complicated. Searle wrote before we had any systems that could actually (behave as if they) understand language and converse like humans, so he can be forgiven for failing to understand this.

Now that we do know how to build these systems, we can still imagine a Chinese room. The little guy in the room will still be "following pre-written scripted algorithmic procedures." He'll have archives of billions of weights for his "dictionary". He will have to translate each character he "reads" into one or more vectors of hundreds or thousands of numbers, perform billions of matrix multiplies on the results, and translate the output of the calculations -- more vectors -- into characters to reply. (We may come up with something better, but the brain can clearly do something very much like this.)
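
Just to make the rote nature of that concrete, here is a toy sketch (my own illustration, not Searle's and not any real system) of the kind of mechanical procedure the man would be grinding through, with a hypothetical three-character vocabulary and a single weight matrix standing in for the billions of real weights:

    import numpy as np

    vocab = {"你": 0, "好": 1, "吗": 2}            # hypothetical tiny "dictionary"
    inv_vocab = {v: k for k, v in vocab.items()}

    embeddings = np.random.randn(len(vocab), 8)    # each character becomes a vector
    W = np.random.randn(8, 8)                      # stand-in for billions of weights

    def room_step(chars):
        x = embeddings[[vocab[c] for c in chars]]   # "read" characters -> vectors
        h = np.tanh(x @ W)                          # mechanical matrix arithmetic
        scores = h @ embeddings.T                   # vectors -> scores per character
        return inv_vocab[int(scores[-1].argmax())]  # "write" a reply character

    print(room_step("你好"))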

Of course this will take the guy hundreds or thousands of years from "reading" some Chinese to "writing" a reply. Realistically if we use error correcting codes to handle his inevitable mistakes that will increase the time greatly.

Implication: Once we expand our image of the Chinese room enough to actually fulfill Searle's requirements, I can no longer imagine the actual system concretely, and I'm not convinced that the ROOM ITSELF "doesn't have a mind" that somehow emerges from the interaction of all these vectors and weights.

Too bad Searle is dead, I'd love to have his reply to this.


Facebook released something in that direction today: https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-jo...


Wow this is a huge announcement too, I can't believe this hasn't made the front page yet.


This seems to be completely in line with the previous "AI is good when it's not news" type of work:

Non-news: Dog bites a man.

News: Man bites a dog.

Non-news: "People riding Tokyo train" - completely ordinary, tons of similar content.

News: "Archaeologists dust off a plastic chair" - bizarre, (virtually) no similar content exists.


I found the one about the people in Lagos pretty funny. The camera does about a 360-degree spin in total; in the beginning there are markets, then suddenly there are skyscrapers in the background. So there's only very limited object permanence.

> A beautiful homemade video showing the people of Lagos, Nigeria in the year 2056. Shot with a mobile phone camera.

> https://cdn.openai.com/sora/videos/lagos.mp4


Also, the woman in red next to the people is very tiny, the market stall is a mini market stall, and the table is made out of a bike.

For everyone who's carrying on about this thing understanding physics and having a model of the world... it's an odd world.


The thing is -- over time I'm not sure people will care. People will adapt to these kinds of strange things and normalize them -- as long as they are compelling visually. The thing about that scene is it looks weird only if you think about it. Otherwise it seems like the sort of pan you would see in some 30 second commercial for coffee or something.

If anything it tells a story: going from market, to people talking as friends, to the giant world (of Lagos).


I'm not so sure.

My Instagram feed is full of AI people. I can tell with pretty good accuracy whether an image is "AI" or real: the lighting, the framing, the scene itself - just something is off.

I think a similar thing will happen here, over the next few months we'll adapt to these videos and the problems will become very obvious.

When I first looked at the videos I was quite impressed, but I looked again and I saw a bunch of weird stuff going on. I think our brains are just wired to save energy, and accepting whatever we see in a video or an image as being good enough is a pretty efficient / low-risk thing.


Agreed. At first glance at the woman walking, I was so focused on how well they were animating her that the surreal scene went unnoticed. Once I'd stopped noticing the surreal scene, I started picking up on weird motion in the walk too.

Where I think this will get used a lot is in advertising. Short videos, lots going on, see it once and it's gone, no time to inspect. Lady laughing with salad pans to a beach scene, here's a product, buy and be as happy as salad lady.


This will be classified unconsciously as cheap and uninteresting by the brain real quick. It'll have its place in the tides of cheap content, but if overall quality were overlooked that easily, producers would never have increased production budgets that much, ever, just for the sake of it.


In the video of the girl walking down the Tokyo city street, she's wearing a leather jacket. After the closeup on her face they pull back and the leather jacket has hilariously large lapels that weren't there before.


Object permanence (just from images/video) seems like a particularly hard problem for a super-smart prediction engine. Is it the old thing, or a new thing?


There are also perspective issues: the relative sizes of the foreground (the people sitting at the café) and the background (the market) are incoherent. Same with the "snowy Tokyo with cherry blossoms" video.


Though I'm not sure what your point is here: outside of America, in Asia and Africa, these sorts of markets mixed in with skyscrapers are perfectly normal. There is nothing unusual about it.


Yeah, some of the continuity errors in that one feel horrifying.


> then suddenly there are skyscrapers in the background. So there's only very limited object permanence.

Ah but you see that is artistic liberty. The director wanted it shot that way.


It doesn't understand physics.

It just computes the next frame based on the current one and what it learned before; it's a plausible continuation.

In the same way that ChatGPT struggles with math without a code interpreter, Sora won't have accurate physics without a physics engine and 3D object rendering.

Now it's just a "what is the next frame of this 2D image" model plus some textual context.


> It just computes the next frame based on the current one and what it learned before; it's a plausible continuation.

...

> Now it's just a "what is the next frame of this 2D image" model plus some textual context.

This is incorrect. Sora is not an autoregressive model like GPT, but a diffusion transformer. From the technical report[1], it is clear that it predicts the entire sequence of spatiotemporal patches at once.

[1]: https://openai.com/research/video-generation-models-as-world...
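
To make the distinction concrete, here's a rough toy sketch in Python (my own illustration, not anything from the report; `next_frame_model` and `denoiser` are placeholder callables): autoregressive generation rolls frames out one at a time, while a diffusion model starts from noise over the whole clip and refines every frame jointly.

    import numpy as np

    # Toy contrast, not OpenAI's code. Autoregressive generation predicts
    # one frame at a time, each conditioned on the frames generated so far.
    def generate_autoregressive(next_frame_model, first_frame, n_frames):
        frames = [first_frame]
        for _ in range(n_frames - 1):
            frames.append(next_frame_model(np.stack(frames)))
        return np.stack(frames)

    # A diffusion transformer instead starts from noise over all the
    # spatiotemporal patches and denoises the entire clip jointly, step by step.
    def generate_diffusion(denoiser, n_frames, height, width, channels, steps=50):
        clip = np.random.randn(n_frames, height, width, channels)
        for t in reversed(range(steps)):
            clip = denoiser(clip, t)  # every frame is refined at every step
        return clip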


Good link.

But, even there it says:

> Sora currently exhibits numerous limitations as a simulator. For example, it does not accurately model the physics of many basic interactions, like glass shattering. Other interactions, like eating food, do not always yield correct changes in object states

Regardless of whether all the frames are generated at once or one by one, you can see in their examples that it's still just pixel-based. See the first example with the dog with the blue hat: the woman has a blue thing suddenly spawn in her hand because her hand went over another blue area of the image.


I'm not denying that there are obvious limitations. However, attributing them to being "pixel-based" seems misguided. First off, the model acts in latent space, not directly on pixels. Secondly, there is no fundamental limitation here. The model has already acquired limited-yet-impressive ability to understand movement, texture, social behavior, etc., just from watching videos.

I learned to understand reality by interpreting photons and various sensory inputs. Does that make my model of reality fundamentally flawed? In the sense that I only have a partial intuitive understanding of it, yes. But I don't need to know Maxwell's equations to get a sense of what happens when I open the blinds or turn on my phone.

I think many of the limitations we are seeing here - poor glass physics, flawed object permanence - will be overcome given enough training data and compute.

We will most likely need to incorporate exploration, but we can get really far with astute observation.


Actually your comment gives me hope that we will never have an AI singularity, since how the brain works is flawed, and we're trying to copy it.

Heck, a super AI might not even be possible; what if we're peak intelligence, with our millions of years of evolution?

Just adding compute speed will not help much -- say the goal of an intelligence is to win a war. If you're tasked with it, then it doesn't matter if you have a month or a decade (assume that time is frozen while you do your research); it's too complex a problem and simply cannot be solved, and the same goes for an AI.

Or it will be like with chess solvers: machines will be more intelligent than us simply because they can load much more context into their "working memory" to solve a problem than we can.


> Actually your comment gives me hope that we will never have an AI singularity, since how the brain works is flawed, and we're trying to copy it.

As someone working in the field, the vast majority of AI research isn't concerned with copying the brain, simply with building solutions that work better than what came before. Biomimetism is actually quite limited in practice.

The idea of observing the world in motion in order to internalize some of its properties is a very general one. There are countless ways to concretize it; child development is but one of them.

> If you're tasked with it, then it doesn't matter if you have a month or a decade (assume that time is frozen while you do your research); it's too complex a problem and simply cannot be solved, and the same goes for an AI.

I highly disagree.

Let's assume a superintelligent AI can break down a problem into subproblems recursively, find patterns and loopholes in absurd amounts of data, run simulations of the potential consequences of its actions while estimating the likelihood of various scenarios, and do so much faster than humans ever could.

To take your example of winning a war, the task is clearly not unsolvable. In some capacity, military commanders are tasked with it on a regular basis (with varying degrees of success).

With the capabilities described above, why couldn't the AI find and exploit weaknesses in the enemy's key infrastructure (digital and real-world) and people? Why couldn't it strategically sow dissent, confuse, corrupt, and efficiently acquire intelligence to update its model of the situation minute-by-minute?

I don't think it's reasonable to think of a would-be superintelligence as an oracle that gives you perfect solutions. It will still be bound by the constraints of reality, but it might be able to work within them with incredible efficiency.


This is an excellent comparison and I agree with you.

Unfortunately we are flawed. We do know how physics works intuitively and can somewhat predict it, but not perfectly. We can imagine how a ball will move, but the image is blurry and the trajectory only partially correct. This is why we invented math and physics studies, to be able to accurately calculate, predict and reproduce those events.

We are far off from creating something as efficient as the human brain. It will take insane amounts of compute power to simply match our basic inaccurate brains; imagine how much will be needed to create something that is factually accurate.


Indeed. But a point that is often omitted from comparisons with organic brains is how much "compute equivalent" we spent through evolution. The brain is not a blank slate; it has clear prior structure that is genetically encoded. You can see this as a form of pretraining through a RL process wherein reward ~= surviving and procreating. If you see things this way, data-efficiency comparisons are more appropriate in the context of learning a new task or piece of information, and foundation models tend to do this quite well.

Additionally, most of the energy cost comes from pretraining, but once we have the resulting weights, downstream fine-tuning or inference are comparatively quite cheap. So even if the energy cost is high, it may be worth it if we get powerful generalist models that we can specialize in many different ways.

> This is why we invented math and physics studies, to be able to accurately calculate, predict and reproduce those events.

We won't do away with those, but an intuitive understanding of the world can go a long way towards knowing when and how to use precise quantitative methods.


GPT-4 doesn't "struggle with math". It does fine. Most humans aren't any better.

Sora is not autoregressive anyway, but there's nothing "just" about next frame/token prediction.


It absolutely struggles with math. It's not solving anything. It sometimes gets the answer right only because it's seen the question before. It's rote memorization at best.


No it doesn't. I know because I've actually used the thing and you clearly haven't.

And if Terence Tao finds some use for GPT-4, and Khan Academy employs it as a Math tutor, then I don't think I have some wild opinion either.

Now, Math isn't just Arithmetic, but do you know how easy it is to go outside the training data for, say, Arithmetic?


Yesterday, it failed to give me the correct answer to 4 + 2 / 2. It said 3...


Just tried in ChatGPT-4. It gives the correct output (5), along with a short explanation of the order of operations (which you probably need to know, if you're asking the question).
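
For reference, under the usual precedence 4 + 2 / 2 = 4 + (2 / 2) = 5; an answer of 3 presumably comes from reading it as (4 + 2) / 2.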


Correct based upon whom? If someone of authority asks the question and receives a detailed response back that is plausible but not necessarily correct, and that version of authority says the answer is actually three, how would you disagree?

In order to combat Authority you need to appeal to a higher authority, and that has been lost. One follows AI. Another follows Old Men from long ago whose words populated the AI.


The TV show American Gods becoming reality...


We shouldn't necessarily regard 5 as the correct output. Sure, almost all of us choose to make division higher precedence than addition, but there's no reason that has to be the case. I think a truly intelligent system would reply with 5 (which follows the usual convention, and would therefore mimic the standard human response), but immediately ask if perhaps you had intended a different order of operations (or even other meanings for the symbols), and suggest other possibilities and mention the fact that your question could be considered not well-defined...which is basically what it did.


I guess you might think 'math' means arithmetic. It definitely does struggle with mathematical reasoning, and I can tell you that because I and many others have tried it.

Mind you, it's not brilliant at arithmetic either...


I'm not talking about Arithmetic


> In the Tokyo one, the model is smart enough to figure out that on a train, the reflection would be of a passenger, and the passenger has Asian traits since this is Tokyo.

How is this any more accurate than saying that the model has mostly seen Asian people in footage of Tokyo, and thus it is most likely to generate Asian features for a video labelled "Tokyo"? Similarly, how many videos looking out a train window do you think it's seen where there was not a reflection of a person in the window when it's dark?


I'm hoping to see progress towards consistent characters, objects, scenes etc. So much of what I'd want to do creatively hinges on needing persisting characters who don't change appearance/clothing/accessories from usage to usage. Or creating a "set" for a scene to take place in repeatedly.

I know with Stable Diffusion there are things like LoRA and ControlNet, but they are clunky. We still seem to have a long way to go towards scene and story composition.

Once we do, it will be a game changer for redefining how we think about things like movies and television when you can effectively have them created on demand.



