
> Has anyone come across a solution where model can iterate (eg, with prompts like "move the bicycle to the left side of the photo")? It feels like we're close.

I feel like we're close too, but for another reason.

Although I love SD and these video examples are great... it's a flawed method: they never get the lighting right and there are incoherent details just about everywhere. Any 3D artist or photographer can spot that immediately.

However I'm willing to bet that we'll soon have something much better: you'll describe something and you'll get a full 3D scene, with 3D models, light sources set up, etc.

That scene will be sent to Blender, you'll click a button, and you'll get an actual render made by Blender, with correct lighting.

Wanna move that bicycle? Move it in the 3D scene exactly where you want.

That is coming.

And it's the same for audio: why generate a finished audio file when models will soon be able to generate the individual tracks, with all the instruments and whatnot, letting you assemble the audio file yourself?

That is coming too.




> you'll describe something and you'll get a full 3D scene, with 3D models, light sources set up, etc.

I'm always confused why I don't hear more about projects going in this direction. ControlNets are great, but there's still quite a lot of hallucination and other tiny mistakes that a skilled human would never make.


Blender files are dramatically more complex than image formats, which are basically all just 2D arrays of 3-value vectors. The Blender file type uses a weird DNA/RNA struct system that would probably require its own training run.

More on the Blender file format: https://fossies.org/linux/blender/doc/blender_file_format/my...


But surely you wouldn't try to emit that format directly, but rather some higher level scene description? Or even just a set of instructions for how to manipulate the UI to create the imagined scene?
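For illustration, a "higher level scene description" could be as simple as a structured list of assets, transforms, and lights that some other tool compiles into Blender. This particular schema is entirely made up:

    # Purely illustrative schema, not any real interchange format.
    scene = {
        "objects": [
            {"asset": "bicycle",    "position": [-2.0, 0.0, 0.0], "rotation_deg": [0, 0, 90]},
            {"asset": "park_bench", "position": [1.5, 0.0, 0.0]},
        ],
        "lights": [
            {"type": "sun", "direction": [-0.3, -0.5, -1.0], "strength": 3.0},
        ],
        "camera": {"position": [0.0, -8.0, 1.6], "look_at": [0.0, 0.0, 1.0]},
    }

Moving the bicycle then becomes an edit to a single field, which is much easier to get right than regenerating pixels.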


It sure feels weird to me as well that GenAI is always supposed to be end-to-end, with everything done inside an NN black box. No one seems to be doing image output as SVG or .ai.


Imo the thinking is that whenever humans have tried to pre-process or feature-engineer a solution, or to find clever priors, massive self-supervised, coarsely architected, data-crunching NNs have gotten better results in the end. So, many researchers / industry data scientists may just be disinclined to put effort into something that is doomed to be irrelevant in a few years. (And, of course, with every abstraction you lose some information that may turn out to matter more than initially thought.)


The way that website builders using GenAI work is that they have an LLM generate the copy, then find a template that matches it and fill it out. This basically means the "visual creativity" part is done by a human, as the templates are made and reviewed by a human.

LLMs are good at writing copy that sounds accurate and creative enough, and there are known techniques to improve that (such as generating an outline first, then generating each section separately). If you then give them a list of templates, and written examples of what they are used for, the LLM is able to pick one that's a suitable match. But this is all just probability, there's no real creativity here.

Earlier this year I played around with trying to have GPT-3 directly output an SVG given a prompt for a simple design task (a poster for a school sports day), and the results were pretty bad. It was able to generate a syntactically correct SVG, but the design was terrible. Think using #F00 and #0F0 as colours, placing elements outside the screen boundaries, and layering elements so they overlap.

This was before GPT-4, so it would be interesting to repeat that now. Given the success people are having with GPT-4V, I feel that it could just be a matter of needing to train a model to do this specific task.


There is a fundamental disconnect between industry and academia here.


Over the last 10 years of industry work, I'd say about 20% of my time has been format shifting, or parsing half-baked, undocumented formats that change when I'm not paying attention.

That pretty much matches my experience working with NNs and LLMs.


I've seen this but producing Python scripts that you run in Blender, e.g. https://www.youtube.com/watch?v=x60zHw_z4NM (but I saw something marginally more impressive, not sure where though!)


My god that is an irritating video style, "AI woweee!"


Yeah I'd imagine that's the best way. Lots of LLMs can generate workable Python code too, so code that jives with Blender's Python API doesn't seem like too much of a leap.

The only trick is that there has to be enough Blender Python code to train the LLM on.
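For a sense of what that might look like, here's a minimal sketch of the kind of script an LLM could emit against Blender's bpy API (this runs inside Blender; the object name "Bicycle" and the numbers are made up for illustration):

    import bpy

    # Assumes the scene already contains an object named "Bicycle".
    bike = bpy.data.objects["Bicycle"]

    # "Move the bicycle to the left": shift it along -X.
    bike.location.x -= 2.0

    # Add a simple key light so the render isn't flat.
    light_data = bpy.data.lights.new(name="KeyLight", type='AREA')
    light_data.energy = 500
    light_obj = bpy.data.objects.new(name="KeyLight", object_data=light_data)
    bpy.context.collection.objects.link(light_obj)
    light_obj.location = (4.0, -4.0, 6.0)

    # Render the result to disk.
    bpy.context.scene.render.filepath = "/tmp/render.png"
    bpy.ops.render.render(write_still=True)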


Maybe something like OpenSCAD is a good middle ground. Procedural code-like format for specifying 3D objects that can then be converted and imported in Blender.


I tried all the AI stuff that I could on OpenSCAD.

While it generates a lot of code that initially makes sense, when you use the code, you get a jumbled block.


This. I think the problem is that LLMs really struggle with 3D scene understanding, so what you would need to do is generate code that generates code.

But I also suspect there just isn't that much OpenSCAD code in the training data, and its semantics are different enough from Python or the other well-represented languages that it struggles.


Scene layouts, models and their attributes are the result of user input (ok, and sometimes program output). One avenue would be to train on inputs paired with expected outputs, like teaching a model to draw instead of generating images... which in a sense we already do by broadly painting out silhouettes and then rendering details.


Voxel files could be a simpler step for 3D images.


> I'm always confused why I don't hear more about projects going in this direction.

Probably because they aren't as advanced and the demos aren't as impressive to nontechnical audiences who don't understand the implications: there's lots of work on text-to-3D-model generation, and even plugins for some Stable Diffusion UIs (e.g., MotionDiff for ComfyUI).


I think the bottleneck is data.

For single 3D objects, the biggest dataset is Objaverse-XL, with 10M samples.

For full 3D scenes you could at best get ~1000 scenes with datasets like ScanNet, I guess.

Text2Image models are trained on datasets with 5 billion samples.


Oh, I don't know about that. Working in feature film animation, studios have gargantuan model libraries from current and past projects, with a good number (over half) never used by a production but created as part of some production's world building. Plus, generative modeling has been very popular for quite a few years. I don't think getting more 3D models than they could use is a real issue for anyone serious.


Where can you find those? I'm in the same situation as him: I've never heard of a 3D dataset better than Objaverse-XL.

Got a public dataset?


These are not public datasets, but with some social engineering I bet one could get access.

I've not worked in VFX for a while, but when I did the modeling departments at multiple studios had giant libraries of completed geometries for every project they ever did, plus even larger libraries of all the pieces and parts they use as generic lego geometry whenever they need something new.

Every 3D modeler I know has their own personal libraries of things they'd made as well as their own "lego sets" of pieces and parts and generative geometry tools they use when making new things.

Now this is just a guess, but do you know anyone going through one of those video game schools? I wager the schools have big model libraries for the students as well. Hell, I bet Ringling and Sheridan (the two Harvards of Animation) have colossally sized model libraries for use by their students. Contact them.


There are a lot of issues with it, but perhaps the biggest is that there just aren't troves of easily scrapable and digestible 3D models lying around on the internet to train on, like we have with text, images, and video.

Almost all of the generative 3D models you see are actually generative image models that essentially (very crude simplification) perform something like photogrammetry to generate a 3D model - 'does this 3D object, rendered from 25 different views, match the text prompt as evaluated by this model trained on text-image pairs'?

This is a shitty way to generate 3D models, and it's why they almost all look kind of malformed.
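Roughly, the loop looks like this; the differentiable renderer and the text-image scorer are hypothetical stand-ins here (real systems plug in a NeRF or Gaussian renderer and a pretrained 2D model):

    import torch

    # Hypothetical stand-ins, not real library calls.
    def render_views(scene_params, n_views):       # -> (n_views, 3, H, W) renders
        raise NotImplementedError
    def text_image_score(images, prompt):          # -> scalar, higher = better match
        raise NotImplementedError

    scene_params = torch.randn(100_000, requires_grad=True)  # whatever parameterizes the 3D asset
    opt = torch.optim.Adam([scene_params], lr=1e-2)

    for step in range(2_000):
        images = render_views(scene_params, n_views=25)    # render from many cameras
        loss = -text_image_score(images, "a red bicycle")  # ask a 2D model "does this look right?"
        opt.zero_grad()
        loss.backward()                                    # gradients flow back into the 3D params
        opt.step()

The 3D asset only ever gets judged through its 2D renders, which is a big part of why the results come out looking melted.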


If reinforcement learning were farther along, you could have it learn to reproduce scenes as 3D models. Each episode's task is to mimic an image, each step is a command mutating the scene (adding a polygon, or rotating the camera, etc.), and the reward signal is image similarity. You can even start by training it with synthetic data: generate small random scenes and make them increasingly sophisticated, then later switch over to trying to mimic images.

You wouldn't need any models to learn from. But my intuition is that RL is still quite weak, and that the model would flounder after learning to mimic background color and placing a few spheres.



From my very clueless perspective, it seems very possible to train an AI to use Blender to create images in a mostly unsupervised way.

So we could have something to convert AI-generated image output into 3D scenes without having to explicitly train the "creative" AI for that.

Probably much more viable, because the quantity of 3D models out in the wild is far far lower than that of bitmap images.


I think this recent Gaussian Splatting technique could end up working really well for generative models, at least once there is a big corpus of high quality scenes to train on. Seems almost ideal for the task because it gets photorealistic results from any angle, but in a sparse, data efficient way, and it doesn’t require a separate rendering pipeline.


One was on the front page the other day, I’ll search for a link


I assume because it's still extremely early.


> However I'm willing to bet that we'll soon have something much better: you'll describe something and you'll get a full 3D scene, with 3D models, light sources set up, etc.

I agree with this philosophy - Teach the AI to work with the same tools the human does. We already have a lot of human experts to refer to. Training material is everywhere.

There isn't a "text-to-video" expert we can query to help us refine the capabilities around SD. It's a one-shot, Jupiter-scale model with incomprehensible inertia. Contrast this with an expert-tuned model (i.e. natural-language instructions) that can be nuanced precisely, to the point of imperceptibility, with a single sentence.

The other cool thing about the "use existing tools" path is that if the AI fails part way through, it's actually possible for a human operator to step in and attempt recovery.


Nah I disagree, this feels like a glorification of the process, not the end result. Having the 3D model in the scene with all the lighting just makes the end result feel more solid to you because you can see the work that's going into it.

In the end diffusion technology can make a more realistic image faster than a rendering engine can.

I feel pretty strongly that this pipeline will be the foundation for most of the next decade of graphics, and that making things by hand in 3D will become extremely niche. Let's face it, as anyone who has worked in 3D knows: it's tedious, it's time consuming, it takes large teams, and it's not even well paid.

The future is just tools that give us better controls, and every frame will come from latent space, not simulated photons.

I say this as someone who has done 3D professionally in the past.


Nah, I agree with GP, who didn't suggest making 3D scenes by hand, but the opposite: create those 3D scenes with the generative method, then use ray tracing or the like to render the image. Maybe have another pass through a model to apply touch-ups that make it grittier and less artificial. This way things can stay consistent and sane, avoiding all those flaws that are so easy to spot today.


I know exactly what OP suggested, but why are you both glorifying the fact that there's a 3D scene graph made in the middle and then slower rendering at the end, when the tech can just go from the first thing to a better finished thing?


Because it just can't. And it won't. It can't even reliably produce consistent shadows in a still image, so when we talk video with a moving camera, all bets are off. Creating flawless movie simulations through a dynamic and rich 3D world requires an ability to internally represent that scene with a level of accuracy beyond what we can hope generative models will achieve, even with the gargantuan amount of GPU power behind ChatGPT, for example. ChatGPT, may I remind you, can't even properly simulate large-ish multiplications. I think you may need to slightly recalibrate your expectations for generative tech here.


I find that very unlikely. LLMs seem capable of simulating human intuition, but not great at simulating real complex physics. Human intuition of how a scene “should” look isn't always the effect you want to create, and is rarely accurate, I'm guessing.


> LLMs seem capable of simulating human intuition, but not great at simulating real complex physics.

Diffusion models aren't LLMs (though they may use something similar as their text encoder layer), and they simulate their training corpus, which usually isn't selected solely for physical fidelity, because that's not actually the sole criterion for visual imagery outside of what is created by diffusion models.


Huh, fair enough. I mean, they are large models based on language, but I see your point. Even though everything you said is true, I still believe there's a place for human-constructed, logically explicit simulations and functions. In general, and in visual arts.


>Although I love SD and these video examples are great... it's a flawed method: they never get the lighting right and there are incoherent details just about everywhere. Any 3D artist or photographer can spot that immediately.

The question is whether the 99% of the audience would even care...


Of course they would. The internet spent a solid month laughing at the Sonic the Hedgehog movie because Sonic had weird-looking teeth.


Since that movie did well and spawned 2 sequels, the real conclusion is that the viewers didn't really care.

As for "the internet", there will always some small part of it which will obsess and/or laught over anything, doesn't mean they represent anything significant - not even when they're vocal.


Viewers did care: the teeth got changed before the movie was released. And, I don't know if you missed it, but it wasn't just one niche of the internet commenting on his teeth. The "outrage" went mainstream; even dentists were making hit-pieces on Sonic's teeth. I'm not gonna lie, it was amazing marketing for the movie, intentional or not.


No, they laughed at it because it looked awful in every single way.


What's your reasoning for feeling that we're close?


We do it for text, audio and bitmapped images. A 3D scene file format is no different; you could train a model to output the Blender file format instead of a bitmap.

It can learn anything you have data for.

Heck, we do it with geospatial data already, generating segmentation vectors. Why not 3D?


>3D scene file format is no different

Not in theory, but the level of complexity is way higher and the amount of data available is much smaller.

Compare bitmaps to this: https://fossies.org/linux/blender/doc/blender_file_format/my...


Also the level of fault tolerance... if your pixels are a bit blurry, chances are no one notices at a high enough resolution. If your JSON is a bit blurry, you have problems.


You can do "constrained decoding" on a code model which keeps it grammatically correct.

But we haven't gotten diffusion working well for text/code, so generating long files is a problem.
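A toy illustration of constrained decoding: before choosing each token, mask out every candidate that a validity checker rejects, so the output can never leave the grammar. Real implementations drive this from a proper grammar/FSM over the model's logits; the "model" below is just random scores and the checker is a made-up rule:

    import math, random

    vocab = ['{', '}', '"key"', ':', '"value"', ',', '<eos>']

    def is_valid_prefix(tokens):
        # Crude stand-in for a real grammar check: must start with '{',
        # and never close more braces than were opened.
        if tokens and tokens[0] != '{':
            return False
        return tokens.count('}') <= tokens.count('{')

    def constrained_decode(max_len=10):
        out = []
        for _ in range(max_len):
            logits = {tok: random.gauss(0, 1) for tok in vocab}  # stand-in for model logits
            for tok in vocab:
                if not is_valid_prefix(out + [tok]):
                    logits[tok] = -math.inf                      # forbid grammar-breaking tokens
            tok = max(logits, key=logits.get)                    # greedy pick among legal tokens
            if tok == '<eos>':
                break
            out.append(tok)
        return ' '.join(out)

    print(constrained_decode())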


Recent results for code diffusion here: https://www.microsoft.com/en-us/research/publication/codefus...

I'm not experienced enough to validate their claims, but I love the choice of languages to evaluate on:

> Python, Bash and Excel conditional formatting rules.



Text, audio, and bitmapped images are data. Numbers and tokens.

A 3D scene is vastly more complex, and the way you consume it is tangential to the rendering of it we use to interpret it. It is a collection of arbitrary data structures.

We’ll need a new approach for this kind of problem


> Text, audio, and bitmapped images are data. Numbers and tokens.

> A 3D scene is vastly more complex

3D scenes, in fact, are also data, numbers and tokens. (Well, numbers, but so are tokens.)


As I stated and you selectively omitted, 3D scenes are collections of many arbitrary data structures.

Not at all the same as fixed sized arrays representing images.


Text gen, one of the things you contrast 3D to, similarly isn't fixed size (capped in most models, but not fixed).

In fact, the data structures of a 3D scene can be serialized as text, and a properly trained text gen system could generate such a representation directly, though that's probably not the best route to decent text-to-3d.


Text is a sequence of standard-sized embedding vectors that get passed one at a time to an LLM. All tokens have the same shape. Each token is processed one at a time. All tokens also have a predefined order. It is very different and vastly simpler.

Serializing 3D models as text is not going to work for anything beyond trivial cases.


That indeed sounds like a very plausible solution -- working with AI on the level of scene definitions, model geometries etc.

However, 3D is just one approach to rendering visuals. There are so many other styles and methods by which people create images, and if I understand correctly, we can do image-to-text to analyze image content, as well as text-to-image to generate it, regardless of the original method (3D render or paintbrush or camera lens). There are some "fuzzy primitives" in the layers there that translate to the visual elements.

I'm hoping we see "editors" that let us manipulate / edit / iterate over generated images in terms of those.


Not that I’m against the described 3d way, but personally I don’t care about light and shadows until it’s so bad that I do. This obsession with realism is irrational in video games. In real life people don’t understand why light works like this or like that. We just accept it. And if you ask someone to paint how it should work, the result is rarely physical but acceptable. It literally doesn’t matter until it’s very bad.


This isn't coming, it's already here: https://github.com/gsgen3d/gsgen Yes, it's just 3D models for now, but it can do whole-scene generation, it's just not great at it yet. The tech is there, it just needs to improve.


Are you working on all that?


Probably not. But there does seem to be a clear path to it.

The main issue is going to be having the right dataset. You basically need to record user actions in something like Blender (e.g. moving a model of a bike to the left of a scene), match them to a text description of the action (e.g. "move bike to the left"), and match those to before/after snapshots of the resulting file format.

You need a whole metric fuckton of these.

After that, you train your model to produce those 3D scene files instead of image bitmaps.
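For concreteness, a single training example in that dataset might look something like this (the field names are made up for illustration):

    from dataclasses import dataclass

    @dataclass
    class EditExample:
        instruction: str        # "move the bike to the left"
        scene_before: bytes     # serialized scene file before the edit (.blend, USD, etc.)
        scene_after: bytes      # the same scene after the user performed the edit
        actions: list[str]      # optionally, the recorded UI/API operations in between

    example = EditExample(
        instruction="move the bicycle to the left side of the scene",
        scene_before=b"...serialized scene before...",
        scene_after=b"...serialized scene after...",
        actions=["select Bicycle", "translate x by -2.0"],
    )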

You can do this for a lot of other tasks. These general purpose models can learn anything that you can usefully represent in data.

I can imagine AGI being, at least in part, a large set of these purpose trained models. Heck, maybe our brains work this way. When we learn to throw a ball, we train a model in a subset of our brain to do just this and then this model is called on by our general consciousness when needed.

Sorry, I'm just rambling here but its very exciting stuff.


The hard part of AGI is the self-training and few examples. Your parents didn't attach strings to your body and puppeteer you through a few hundred thousand games of baseball. And the humans that invented baseball had zero training data to go on.


Your body is a result of a billion year old evolutionary optimization process. GPT-4 was trained from scratch in a few months.


I have for some time been planning to do a 'Wikipedia for AI' (even bought a domain), where people could contribute all sorts of these skills (not only 3D video, but also manual skills, or anything). Given the current climate of 'AI will save/doom us', and that users would in some sense be training their own replacements, I don't know how much love such a site would get, though.


Excellent point.

Perhaps a more computationally expensive but better looking method will be to pull all objects in the scene from a 3D model library, then programmatically set the scene and render it.


I am guessing it will be similar to inpainting in normal Stable Diffusion, which is easy when using the workflow feature in the InvokeAI UI.


Thanks! This is exactly what I have been thinking, only you've expressed it much more eloquently than I would be able to.


Where is the training data coming from?


we're working on this if you want to give it a try - dream3d.com


You should put a demo on the landing page


Just redid the UX and am making a new one, but here's a quick example: https://www.loom.com/share/fa84ba92d7144179ac17ece9bf7fbd99



