Once you try the demos, the animated image at the top feels misleading. Each segment cuts at just the right point to make you think you’d be able to continue exploring these vast worlds, but in practice you can only walk a couple of steps before hitting an invisible wall, which becomes more frustrating than not being able to move at all. It feels like being trapped in a box. My reaction went from impressed to disappointed fast.
I get these are early steps, but they oversold it.
You can bypass the "Out of bounds" message by setting a JavaScript breakpoint after
`let t = JSON.parse(d[e].config_str)`
and then run
`Object.values(t.camera.presets).forEach(o => { o.max_distance = 50; })`
in the console.
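For anyone who wants to see what that console line does without opening the demo, here is a minimal sketch on a made-up config. The only thing taken from the snippet above is the `camera.presets[*].max_distance` path; the preset names, the values, and the `raiseBounds` helper are invented for illustration, and the real payload may look different.

```javascript
// Hypothetical stand-in for d[e].config_str; only the
// camera.presets[*].max_distance path comes from the thread.
const configStr = JSON.stringify({
  camera: {
    presets: {
      front: { max_distance: 2 },
      side: { max_distance: 3 },
    },
  },
});

const t = JSON.parse(configStr);

// Raise the movement limit on every preset, mutating the parsed object in place.
function raiseBounds(config, limit) {
  for (const preset of Object.values(config.camera.presets)) {
    preset.max_distance = limit;
  }
  return config;
}

raiseBounds(t, 50);

// Pitfall: `&&` binds tighter than `=`, so this assigns the object itself
// to max_distance rather than the number 50.
const o = { max_distance: 2 };
o.max_distance = 50 && o;
const assignedSelf = o.max_distance === o; // true
```

The explicit loop sidesteps that precedence trap. A compact form such as `o.max_distance=50&&o` may still appear to lift the limit, presumably because comparing a distance against an object never comes out true, but that is a guess about the demo's bounds check.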
It breaks down pretty quickly once you get outside the default bounds, as expected, though.
I wonder how much of the remaining work boils down to generating a new scene based on the camera's POV when the player hits one of the bounds, and keeping these generated scenes in a tree structure, joining scenes at boundaries.
Yes, and you wouldn't even need to do it in realtime as a user walks around.
Generate incrementally using a pathfinding system for a bot to move around and "create the world" as it goes, as if a Google street view car followed the philosophy of George Berkeley.
Like a bizarro cousin of loop closure in SLAM, which is recognizing when you've found a different path to a place you've been before.
Except this time there is no underlying consistent world, so it would be up to the algorithm to use dead reckoning or some kind of coordinate system to recognize that you're approaching a place you've "been" before, and incorporate whatever you found there into the new scenes it produces.
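A rough sketch of that coordinate-system idea, under heavy assumptions: scenes sit on a square grid, the player's position comes from summing movement steps (dead reckoning), and a `generate` callback stands in for the model. `SceneGraph`, `cellKey`, `CELL_SIZE`, and the callback signature are all hypothetical names, not anything World Labs has described.

```javascript
// Hypothetical sketch: generated scenes are cached by a quantized position,
// so reaching the same spot by a different path reuses the earlier scene.
const CELL_SIZE = 10; // assumed side length of one scene, arbitrary units

function cellKey(x, z) {
  return `${Math.round(x / CELL_SIZE)},${Math.round(z / CELL_SIZE)}`;
}

class SceneGraph {
  constructor(generate) {
    this.generate = generate; // (key, neighbours) => scene; stands in for the model
    this.scenes = new Map();
  }

  // Already-generated scenes in the four adjacent cells.
  neighboursOf(x, z) {
    const offsets = [[CELL_SIZE, 0], [-CELL_SIZE, 0], [0, CELL_SIZE], [0, -CELL_SIZE]];
    return offsets
      .map(([dx, dz]) => this.scenes.get(cellKey(x + dx, z + dz)))
      .filter((scene) => scene !== undefined);
  }

  // Return the scene at (x, z), generating it on first visit with its
  // existing neighbours passed in as context.
  sceneAt(x, z) {
    const key = cellKey(x, z);
    if (!this.scenes.has(key)) {
      this.scenes.set(key, this.generate(key, this.neighboursOf(x, z)));
    }
    return this.scenes.get(key);
  }
}

// Dead reckoning: the position is just the running sum of movement steps.
let generated = 0;
const graph = new SceneGraph((key, neighbours) => {
  generated += 1;
  return { key, context: neighbours.map((n) => n.key) };
});

const pos = { x: 0, z: 0 };
const visited = [graph.sceneAt(pos.x, pos.z)];
for (const [dx, dz] of [[10, 0], [0, 10], [-10, 0], [0, -10]]) {
  pos.x += dx;
  pos.z += dz;
  visited.push(graph.sceneAt(pos.x, pos.z));
}
```

Walking a square loop generates four scenes and lands back on the first one instead of inventing a fifth, and the last new scene is created with two existing neighbours as context. A real system would also have to tolerate drift in the position estimate, which this sketch ignores.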
I was imagining a few limitations to help with consistency: all scenes have the same number of edges (say, 10) ensuring there's a limited set of scenes you can navigate to from the current one and previously generated scenes can get reused, and no flying, that way we can only worry about generating prism-shaped rooms with a single ceiling and floor edge.
I suppose this is the easy part, actually; for me the real trouble might be collision based on the non-deterministic thing that was generated, i.e. how to decide which scene edges the player should be able to travel through, interact with, be stopped by, burned by, etc.
I know you didn’t mean it like this, but this is kind of an insult to the insane amounts of work that go into crafting just the RNG systems behind roguelikes.
Or pair something like this with SLAM to track the motion and constrain its generation - feed it the localisation/particle/Kalman filter (or whatever map representation) as additional context, and it should be able to form consensus fairly quickly?
(Half-baked thoughts)
I first got irritated a bit by this as well, but then the game Myst came to mind.
So I'm willing to accept the limitation, and at this point we know that this can only get better. Next I thought about the likelihood of Nvidia releasing an AI game engine, or more of a renderer, fully AI based. It should be happening within the next 10 years.
Imagine creating a game by describing scenes, like the ones in the article, with a good morphing technology between scenes, so that the transitions between them are like auto-generated scenes which are just as playable.
The effects shown in the article were very interesting, like the ripple, sonar or wave. The wave made me think about how trippy games could get in the future, more extreme versions of the Subnautica video [0] which was released last month.
We could generate video games which would periodically slip into hallucinations, a thing that is barely doable today, akin to shader effects in Far Cry or other games when the player gets poisoned.
It's "old news" I guess at this point, but the AI Minecraft demo (every frame generated from the previous frame, no traditional "engine") is still the most impressive thing to me in this space https://oasis.us.decart.ai/welcome There are some interesting "speed runs" people have been doing like https://www.youtube.com/watch?v=3UaVQ5_euw8
We might all be dead in 10 years, but with big tech companies making their plays, all the VC money flowing in to new startups, and nuclear plants being brought online to power the next base model training runs, there's room for a little mild entertainment like these sorts of gimmicks in the next 3 years or so. I doubt anything that comes of it will top even my top 15 video games list though.
> We might all be dead in 10 years, but with big tech companies making their plays, all the VC money flowing in to new startups, and nuclear plants being brought online to power the next base model training runs, there's room for a little mild entertainment like these sorts of gimmicks in the next 3 years or so. I doubt anything that comes of it will top even my top 15 video games list though.
That’s a contender for the most depressing tradeoff ever. “Yeah, we’ll all die in agony way before our time, but at least we got to play with a neat but ultimately underwhelming tool for a bit”.
You’re describing a pie in the sky. A vision. Not reality. We have been burned many times already, nothing in this field is a given.
> at this point we know that this can only get better.
We don’t know that. It will probably get better, but will it be better enough? No one knows.
> It should be happening within the next 10 years.
Every revolution in tech is always ten years away. By now that’s a meme. Saying something is ten years away is about as valuable as saying one has no idea how doable it is.
> Imagine
Yes, I understand the goal. Everyone does, it’s not complicated. We can all imagine Star Trek technology, we all know where the compass is pointed, that doesn’t make it a given.
In fact, the one thing we can say for sure about imagining how everything will be great in ten years is that we routinely fail to predict the bad parts. We don’t live in fantasy land, advancements in tech are routinely used for detrimental reasons.
> Imagine creating a game by describing scenes, like the ones in the article, with a good morphing technology between scenes, so that the transitions between them are like auto-generated scenes which are just as playable.
Why do you think this game would be good? I'm not a game maker but the visual layer is not the reason people like or enjoy a game (ex: nintendo). There are teams of professionals making games today that range from awful to great. I get that there are indie games made by a single person that will benefit from generated graphics, but asset creation seems to be a really small part of it.
“We are hard at work improving the size and fidelity of our generated worlds”
I imagine the further you move from the input image, the more the model has to make up information and the harder to keep it consistent. Similar problem with video generation.
> I imagine the further you move from the input image, the more the model has to make up information and the harder to keep it consistent. Similar problem with video generation.
Which is the same thing as saying this may turn out to be a dud, like so many other things in tech and the current crop of what we’re calling AI.
Like I said, I get this is an early demo, but don’t oversell it. They could’ve started by being honest and clarifying they’re generating scenes (or whatever you want to call them, but they’re definitely not “worlds”), letting you play a bit, then explain the potential and progress. As it is, it just sounds like they want to immediately wow people with a fantasy and it detracts from what they do have.
Maybe they think it's a good deal, producing some oversold tech demos in exchange for a decade's worth of funding and not having to produce anything more than an "Our Incredible Journey" letter at the end. The prospect of replacing all human labor has made it easier than ever to run the grift on investors in this time of peak FOMO.
Fair criticism. I’m also not a fan of hyperbole. I still find the World Labs stuff super intriguing, and I’m optimistic that they’ll be able to fulfill the vision.
In general, it depends on how much the model ends up "understanding" the input. (I use "understand" here in the sense some would claim SOTA LLMs do.)
You can imagine this as a spectrum. On the one end you have models that, at each output pixel, try to predict pixels that are locally similar to ones in the previous frame; on the other end, you could imagine models that "parse" the initial input image to understand the scene - objects (buildings, doors, people, etc.) and their relationships, and separately, the style with which they're painted, and use that to extrapolate further frames[0]. The latter would obviously fare better, remaining stylistically consistent for longer.
(This model claims to be of the second kind.)
The way I see it: a human could do it[1], so there's no reason an ML model wouldn't be able to.
--
[0] - Brute-force approach: 1) "style-untransfer" the input, i.e. style-transfer to some common style, e.g. photorealistic or sketch, 2) extrapolate the style-untransfered image, and 3) style-transfer result back using original input as style reference. Feels like it should work somewhat okay-ish; wonder if anyone tried that.
[1] - And the hard part wouldn't be extrapolating the scene, but rather keeping the style.
This indeed looks more like photogrammetry than a diffusion model predicting the next frame. There's 3D information extracted from the input image and likely additional generated poses that allow reconstructing the scene with gaussian splats. Not sure how much segmentation (understanding of each part of the scene) is going on. Probably not much if I have to guess.
Models are really great at making stuff up though. And video models already have very good consistency over thousands of frames. It seems like larger worlds shouldn't be a huge hurdle. I wonder why they launched without that, as this doesn't seem much better than previous work.
As someone completely not involved in this project, I would predict that increasing the scene size while remaining halfway consistent isn't that difficult.
Let me elaborate by using cat-4d.github.io, one of their competitors in this field of research: If you look at the "How it works" section you can see that the first step is to take an input video and then create artificial viewpoints of the same action being observed by other cameras. And then in the 2nd step, those viewpoints are merged into one 4D gaussian splatting scene. That 2nd step is pretty similar to 4D NeRF training, BTW, just with a different data format.
Now if you need a small scene, you generate a few camera locations that are nearby. But there's nothing stopping you from generating other camera locations or even from using previously generated camera locations and moving the camera again, thereby propagating details that the AI invented outwards. So you could imagine this as you start with something "real" at the center of the map and then you create AI fakes with different camera positions in a circle around the real stuff, and then the next circle around the 1st-gen fakes, and the next circle, and so on. This process is mostly limited by 2 things: The ability of your AI model to capture a global theme. World Labs has demonstrated that they can invent details consistent with a theme in these demos, so I would assume they solved this already. And the other limit is computing time. A world box 2x in each direction is 8x the voxel data and I wouldn't be surprised if you need something like 16x to 32x the number of input images to fit the GSplats/NeRF.
So most likely, the box limit is purely because the AI model is slow and execution is expensive and they didn't want to spend 10,000x the resources for making the box 10x larger.
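The arithmetic behind that estimate is easy to check. This sketch assumes a plain power law, i.e. that whatever a cost multiplies by per doubling of the box keeps compounding at the same rate; that extrapolation from the 2x figures above is an assumption, and `costMultiplier` is just an illustrative helper.

```javascript
// If a cost multiplies by `perDoubling` each time the box doubles in size,
// it grows as sizeMultiplier ^ log2(perDoubling).
function costMultiplier(perDoubling, sizeMultiplier) {
  return Math.pow(sizeMultiplier, Math.log2(perDoubling));
}

// A box 10x larger in each direction:
const voxels10x = costMultiplier(8, 10);      // voxel data: about 1,000x
const imagesLow10x = costMultiplier(16, 10);  // input images at 16x per doubling: about 10,000x
const imagesHigh10x = costMultiplier(32, 10); // input images at 32x per doubling: about 100,000x
```

So the 10,000x figure matches the low end of the 16x to 32x guess for input images; at the high end it would be closer to 100,000x.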
I mean, it's marketing hype for their product. It's a pretty good starting step though, assuming they can build on it and expand that world space as opposed to just converting an image to 3D.
Certainly has some value to it: marketing, hiring, fundraising (assuming it's a private company).
My take is that it's a good start, and 3-4 years from now it will have a lot of potential value in world creation if they can make the next steps.
It's definitely a balancing act. World Labs was in stealth for a bit. Without a brand, a stated mission, and examples / demos of what you are capable of, it is harder to hire, fundraise, or get the attention and mind-share you need once you are ready to ship product.
The risk is setting expectations that can't be fulfilled.
I'm in the 3D space and I'm optimistic about World Labs.
Obviously, the generation has to stop at some point and obviously from any key image you could continue generating if you had unlimited GPU, which I’m sorry they didn’t provide for you.
I am not sure it's obvious that you could continue generating from any key image and it wouldn't deteriorate into mush. If you take that museum scene and look at the vase-like display piece while walking around it as much as you can it already becomes fuzzy and has the beginnings of weird artifacts growing out of it.
I was also disappointed by a still image showing a vast sky: in motion you see it's just a painting on a low ceiling. The model interpreted the vast sky as a painting overhead.
When watching 3D movies with a VR headset you have to keep your head perfectly still or the lack of parallax destroys the 3D illusion. Compare to a 3D game where moving your head actually lets you move through space and actually look around objects.
Something like this applied to every frame of the movie would allow you to move around a little and preserve the perspective shifts. The limitation that you can only move about 4 feet in any direction would not matter for this use case.
Of course this comes at the expense of the director and cinematographer's intention, which is no small thing.
Have you ever seen the Google Lightfields demo? They have a rig they concocted to essentially capture a "volume" of video to allow for the stereoscopic effect in VR AND which then cleverly presents a different combination of the footage it captured based on your precise head position, so it makes up for these distortions. I found it absolutely breathtaking... first time seeing VR for a space that actually made me feel like I was in it. This was A LONG time ago and I suspected I'd be seeing a lot more of that content, but I was... very wrong, it seems.
Your point is completely correct. Even Apple's awesome new stereoscopic 3D short film for the AVP immediately loses what could be its total awesomeness because of this basic fact. With the perspective perfectly fixed, it will never quite fool our brains, which are so used to dealing with these micro-movements.
I have seen that, and I came close to buying one of those Lytro light field cameras so many times (but thankfully restrained myself). Light field seemed like a huge obvious "way of the future" thing in the 2010s but with the benefit of hindsight it did not exactly seem to have changed the world.
Yeah, parallax, reflections, and shadows are as important as stereo. We’ve always been sold that stereo = 3D, but it’s just one among many cues that the brain relies on.
I suspect this would be feasible if filming was accompanied with a depth sensing camera like the Microsoft Kinect. In post production, you could tell roughly how far each pixel is from the camera which could aid in the reconstruction of a 3d scene.
Maybe this could be done with just the aperture and focal distance, which most modern cinema cameras record as they film.
Definitely, if there’s a future for 3D and immersive video, it depends on adding more cues than just stereo. Lack of parallax is one of the main causes of discomfort for many.
People complaining that it's a small area, lol. My man, this is fucking insane. I know AI is starting to get normalized, but they converted an image into a 3D world! Even if it's 1 ft by 1 ft, it's still amazing.
We recently added an environment creator for our VR game which does kinda a similar thing but for even less freedom of movement, so I think I have a little bit of insight into it.
After seeing what they are showing on their page I am majorly impressed and don't feel they are misleading anyone at all
The prompting is done from the browser and it's then automatically added in the game so the next time you do your workout it is already there to select.
You are the maintainer of A-Frame? That's awesome.
We used GodotEngine in the past when it was still called VRWorkout but had to switch to Unity due to business reasons.
The environment creator uses several off the shelf models under the hood with custom loras and blender at the end to create the exportable meshes.
Users usually need to work out in the game to earn coins to generate environments because we have no actual monetization behind it, so we can't have people generate endless amounts of environments, but if you want to try it out send me a message at michael -at- xrworkout.io and I'll set you up so you can try it.
Not my project, but another recently published approach used Depth Anywhere to create a virtual depth map for a given 360º equirectangular image, then applied it to a point cloud and rendered it using three.js / A-Frame.
It appears to be a similar capability to OP's for creating scene depth from 2D, but it uses a point cloud instead of gaussian splatting for rendering, so it looks more pixelated:
https://github.com/akbartus/360-Depth-in-WebXR
Also, unlike the World Labs example, you have the ability to go further outside the bounds of the point cloud to inspect the deficiencies of the approach. It's getting there but still needs work.
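The depth-map-to-point-cloud step can be sketched in a few lines. This is not code from the linked repo; the axis conventions, the row-major depth layout, and both function names are assumptions chosen to match three.js defaults (y up, looking down -z).

```javascript
// Convert one equirectangular pixel plus a depth value into a 3D point.
// Assumed conventions (not taken from the linked repo): u in [0, width) maps to
// longitude -PI..PI, v in [0, height) maps to latitude +PI/2..-PI/2, y is up,
// and the image centre looks down -z.
function equirectToPoint(u, v, depth, width, height) {
  const lon = (u / width) * 2 * Math.PI - Math.PI;
  const lat = Math.PI / 2 - (v / height) * Math.PI;
  return {
    x: depth * Math.cos(lat) * Math.sin(lon),
    y: depth * Math.sin(lat),
    z: -depth * Math.cos(lat) * Math.cos(lon),
  };
}

// Build a point cloud from a row-major depth map (depthMap[v * width + u]).
function depthMapToPoints(depthMap, width, height) {
  const points = [];
  for (let v = 0; v < height; v++) {
    for (let u = 0; u < width; u++) {
      points.push(equirectToPoint(u, v, depthMap[v * width + u], width, height));
    }
  }
  return points;
}
```

Every point is placed along a ray from the single capture position, so nothing behind the first visible surface exists in the cloud. That is one plausible reason the result falls apart once you step away from the center.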
Yes and this is a great example of how open A-Frame is compared to OP example. You can inspect every part of the experience from the code to the actual runtime inspector to see how Akbartus achieved the effect -- and then help to make it even better! :)
I do think there is the possibility to use something like this eventually to do all the processing in the browser for Depth Anywhere + Splat reconstruction to fill in the holes of the current point cloud approach: https://github.com/ArthurBrussee/brush
This is neat I guess. Maybe I'm just blasé about seeing yet another AI demo where I'm supposed to fill in the blanks in coming up with ways to make the tech actually useful.
The "Step into Paintings" section cracked me up. As soon as you pan away from the source material, the craziness of the model is on full display. So sure, I can experience iconic pieces of art in a new way, it's just not a good experience.
Their bet is that XYZ can generalize from Unreal and NVIDIA Isaac recordings.
Is XYZ diffusion-transformers? Or is XYZ Chameleon? Or some novel architecture?
It takes the absolute fastest teams, it seems, 7 months to develop a first version of a model. And it also seems that models are like babies, 9 moms do not produce a model in 1 month.
The tough thing is that it may be possible to develop a great video model with DiTs for $220m; or it may be possible to develop a great video model with Chameleon for $1b; but if it's 3D + time, will it be too expensive for them to do?
The craziest thing to me is that these guys are super talented, but they might not have enough money!
Then they need to sell _billions_ of dollars worth of these worlds to ... game studios? In order for this valuation to make sense they need to convince the majority of major game studios to spend all their world creation budget solely on this company. Seems unrealistic but I guess only time can tell.
Good point, that's fair, there are more use cases. IDK about architecture; typically you want more structure/determinism instead of probabilistic generation. Overall this is very cool. I like the consistency it has, and it does generally amaze me nonetheless. But you gotta admit selling a billion dollars of anything is really hard. That is nearly three times the budget of the highest-budget movie ever made (Avengers: Endgame at $356 million)! It is almost the ENTIRE budget of the biggest game ever, Grand Theft Auto VI.
The overly ambitious claims are what led them to raise $230M+ without a product.
Fei-Fei Li is a luminary in the field, and she's assembled a stellar team of some of the best researchers in the space.
Their gamble is that they'll be able to move faster than the open research and other companies looking to productionize that research.
Time will tell if this can become an ElevenLabs or if it'll fizzle out like Character.ai.
My worry is that without a product, they'll malinvest their research in cool problems that don't satisfy market demand. There's nothing like the north star of customers. They'll also have a tough time with hiring going forward with that valuation.
The market of open research and models is producing a lot of neat stuff in 3D. But there's no open pool of data yet, despite HuggingFace and others trying.
I don't know enough about venture funding. Did $230M really get transferred into World Labs bank account, or is this a "commitment" of $230M which is trickled out a few million at a time?
Before: You and your cofounders share 100% of the stock in your company that's valued at $X.
After: Your company now has everything it had before plus $Y worth of "something" from the VCs. Your company is now valued at $X plus $Y. The VCs now hold stock in your company worth $Y. You and your cofounders still hold stock in your company worth $X.
"Something" might be anything. Cash, stocks, commitments to resources, whatever.
I believe Fei-Fei is focused on physical world interaction, https://behavior.stanford.edu/ is a project she works on for more physical interaction with AI
First reaction after trying it was a bit of a surprise when I got an “Out of bounds” message – not what I expected for 3D worlds.
Scrolling down to the "Looking Ahead" section, they are working on improving both size and fidelity.
Is a 2D image really the best input primitive for 3D world construction? As a user, I'd prefer to have 3D primitives (plane, sphere, mesh) as tools when building my worlds.
I couldn't get the "tap to interact" panels to work. No mouse events had any effect. I had to take it very literally, but first, I had to drag my browser to my laptop screen, which did enable me to literally tap the screen.
baseline images seem to be rendered, because there is shading, lighting, shadows etc.
when I tried another tool (image -> 3d model) it tended to work only for their example images, and when I used anything else it produced some black and flat shape.
so the headline should be: Generate 3D worlds from a single image rendered by us that we used to train our model.
I was interested to see that a co-founder is Stanford CS prof Fei-Fei Li. I'm reading her nonfiction book now, "The Worlds I See," about her experience with AI; she testified before Congress about it.
The boundaries make it pretty obvious how flat this world still is and even on blurring the edges it's obvious there really isn't anything to the models. This is cool for sure, and I can see it being useful for better photogrammetry and assisting in building out worlds, but it isn't going to suddenly be used to make entire game worlds on its own.
Yeah I thought the same and was immediately disappointed that you could only step a tiny bit forwards.
BUT, you can turn around and see something that I presume was entirely generated. So I don't think it is just doing some clever tricks to make the photo look 3D, but also "infilling" what is behind the camera too. That is kinda cool.
I'd love to see this improved so I can walk around some more though, to see what is down those alleys etc.
I’m sure it’ll improve. I imagine the further you get from the input image, the more the model has to make up. Gen-AI video models are limited to a few seconds; in 3D you’re constrained to a volume.
From some angles you can see there’s some gaussian splat / point cloud representation underneath. There’s definitely a 3D representation. But yeah, the navigable volume is limited at the moment. It will improve.
Amazing! It looks like this is one step closer to the singularity, and this startup showcases what future startups should aspire to be. Although the technology for world generation is still in its infancy, it’s impressive, and the atmosphere is great, with a stunning impact. Everything you need to wow an investor and secure funding while the technology itself can be hotfixed for years to come. I think at this stage the showcased technology seems more aligned with cinema than the metaverse. Great work! Looking forward to the updates.
I think it does look much nicer than the other examples of this nature that I've seen.
I would guess they are pre-generating the world from the image, not generating it on the go as it runs reasonably well, but doesn't this really limit world size?
I noticed some solid geometry that is accidentally transparent.
The stuff behind the camera looks pretty good, which is presumably fully generated, so if they can make it so you can actually move around more, and with similar quality it could be interesting.
I do wish the examples had a full screen button, the view is tiny..
In 20 years a fiction writer will be able to upload their work to something and generate a movie.
It won't be Hollywood quality. It'll either look like animation or early mixed CGI/live action stuff with "wooden" performances, etc., but it will let them see their work acted out which will be super cool.
Obviously the pro version of this with detailed editing and incorporating real human actors for the starring roles will be what's used to make a lot of real film and TV serious content.
The width of modern mobile phones isn't far off from the average pupillary distance of adults. It feels like there is an opportunity to create a glut of useful 3D data by simply placing one camera on each side of the phone rather than trying to infer it from a single view of the scene.
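For a sense of what such a two-camera phone could measure, here is the standard depth-from-disparity relation for two parallel, rectified cameras. The focal length and baseline below are illustrative numbers only, not the specs of any real phone.

```javascript
// Depth from stereo disparity for two parallel, rectified cameras:
// depth = focalLengthPx * baselineM / disparityPx
function depthFromDisparity(focalLengthPx, baselineM, disparityPx) {
  if (disparityPx <= 0) return Infinity; // no measurable shift: effectively at infinity
  return (focalLengthPx * baselineM) / disparityPx;
}

// Illustrative numbers: 1500 px focal length, 7 cm between the cameras.
const near = depthFromDisparity(1500, 0.07, 105); // about 1 m
const far = depthFromDisparity(1500, 0.07, 10.5); // about 10 m
```

Because depth is inversely proportional to disparity, a one-pixel matching error matters far more at 10 m than at 1 m, so a phone-width baseline would mostly help with nearby geometry.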
Maybe when there is better technology for viewing 3D content.
This is some cool progress, but from what I gather it's not actually generating a "world". What I'd be really interested in is the capability of fully generating the geometrical world description in something like USD (Universal Scene Description).
I have quite a few in mind, but the biggest would probably be of creating a digital version of my home (or even my neighborhood) and seeing what it would look like with arbitrary changes
I’ve been trying to get into this sort of 3D Gaussian Splatting stuff, particularly with this focus on environments as opposed to just individual objects or characters. Does anyone know of a model that’s good at doing that and is openly distributed/locally runnable?
Ugh. The AI-generated rear views are nowhere near the level of detail of even the uncanny-valley foreground images. It's not really generating a "3D 'world'" so much as extrapolating a 360º view from a single scene. There's no sense to the architecture or flora. Staircases that lead nowhere.
It's more hype-cycle nonsense. But they'll pour billions into this rather than pay human artists what they're worth.
wasd isn't accessible for those of us who have the unfortunate disability of not using a qwerty keyboard. If your project isn't a competitive FPS, arrows are fine.
It's generating gaussian splats, so not quite 3D worlds. In short, they're pseudo 3D in that there is in fact a 3D point cloud, but the points are elliptical angular-dependent colored splats that get projected into 2D space[1]. They're better for reconstructing source-realistic renderings from a constrained view box, but break down outside of that.
They're a really cool way to capture spatial memories though! Friends and I occasionally use Polycam or Luma Labs apps. But there's not _too much_ you can do with them due to the above limitations.
From a brief look at the OP link, World Labs seems to be generating a 360º gaussian splat (for a limited view box) from a still photo, which is cool as hell! But we still have the same problem of "what do we do with gaussian splats".
[1] This description is hand-wavey as I'm a relative layman when it comes to how these work. I'm sure someone can reply with a more precise answer if this one is bad.
From some angles you can spot artifacts that resemble those of gaussian splats. I’d bet it’s a 3D representation, but not your traditional mesh. Very cool.
We have a gaussian splat component (the same format World Labs appears to use) for A-Frame that should work in VR on Quest and Vision Pro in their respective web browsers. Quest FPS might not be ideal yet.
it looks like they're basing the infilling on 360 photos / videos. that's why you can't walk around freely: the inpainting must be done from the center of the sphere
To build an LLM that can reason about the 3d world. I suspect they will add the ability to reason about the physics of the world next. It's just another attempt to get closer to AGI.
They most likely will have to pivot a few times but once they show their LLM solving problems that others can't, the others will quickly add these features too. Right now, it is cheaper to wait for World Labs to go first. The others are not that far behind: https://cat-4d.github.io/
Don’t know the exact details, but I imagine the further you get from the original input image, the more the system needs to make up. Same reason generative video models are limited to a few seconds. It will improve.
Can you point at some data that would indicate it will improve? There are lots of statements today about GenAI akin to "that will get fixed later" but we don't actually seem to know what will actually improve and what will just get incrementally prettier without fixing the underlying issue.
AI can outpaint more images in a similar style, then map them to 3D. IMHO, AI should generate a story from the image, then use image + story + location & direction to generate a consistent 3D world.