
Text-to-image models feel inefficient to me. I wonder if it would be possible, and better, to do it in separate steps: text to scene graph, scene graph to semantically segmented image, segmented image to final image. That way each step could be trained separately and be modular, and the image would be easier to edit instead of being completely replaced by the output of a new prompt. It should then be much easier to generate stuff like "object x next to object y, with the text foo on it", and the art style or level of realism would depend on the final rendering model, which would be separate from prompt adherence.

Kind of like those video2video (or img2img on each frame I guess) models where they enhance the image outputs from video games:

https://www.theverge.com/2021/5/12/22432945/intel-gta-v-real... https://www.reddit.com/r/aivideo/comments/1fx6zdr/gta_iv_wit...
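Here's a rough sketch of that staged pipeline; all of the stage functions and data types are hypothetical placeholders, not an existing library:

    # Hypothetical staged pipeline: every function here is a placeholder.
    from dataclasses import dataclass, field

    @dataclass
    class SceneObject:
        label: str                                            # e.g. "mug"
        attributes: list[str] = field(default_factory=list)   # e.g. ["red", "text: foo"]

    @dataclass
    class SceneGraph:
        objects: list[SceneObject]
        relations: list[tuple[str, str, str]]  # (subject, relation, object)

    def text_to_scene_graph(prompt: str) -> SceneGraph:
        """Stage 1 (hypothetical): parse the prompt into objects and spatial relations."""
        ...

    def scene_graph_to_segmentation(graph: SceneGraph):
        """Stage 2 (hypothetical): lay the graph out as a semantically segmented image,
        one labelled region per object."""
        ...

    def segmentation_to_image(segmentation, style: str = "photorealistic"):
        """Stage 3 (hypothetical): render the segmentation; swapping this model changes
        the art style or realism without touching prompt adherence."""
        ...

    # Editing then means changing the intermediate representation, not the prompt:
    # graph = text_to_scene_graph("object x next to object y, with the text foo on it")
    # graph.objects[0].attributes.append("yellow hat")
    # image = segmentation_to_image(scene_graph_to_segmentation(graph))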



In general, it has been shown time and time again that this approach fails for neural network based models.

If you can train a neural network that goes from a to b and a network that goes from b to c, you can usually replace that combination with a simpler network that goes from a to c directly.

This makes sense, as there might be information in a that we lose by a conversion to b. A single neural network will ensure that all relevant information from a that we need to generate c will be passed to the upper layers.
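A toy PyTorch sketch of that point (the dimensions are made up): if b is a narrow, human-defined representation, everything c needs has to squeeze through it, whereas an end-to-end network picks its own intermediate representation.

    import torch
    import torch.nn as nn

    dim_a, dim_b, dim_c = 512, 16, 512   # b is a deliberately narrow bottleneck

    a_to_b = nn.Sequential(nn.Linear(dim_a, 128), nn.ReLU(), nn.Linear(128, dim_b))
    b_to_c = nn.Sequential(nn.Linear(dim_b, 128), nn.ReLU(), nn.Linear(128, dim_c))

    # Staged: if the two parts are trained separately, gradients from the c objective
    # never shape a_to_b, and all information must pass through the 16-dim b.
    staged = nn.Sequential(a_to_b, b_to_c)

    # End to end: the network keeps whatever information from a is useful for c.
    end_to_end = nn.Sequential(nn.Linear(dim_a, 128), nn.ReLU(), nn.Linear(128, dim_c))

    x = torch.randn(4, dim_a)
    print(staged(x).shape, end_to_end(x).shape)  # both torch.Size([4, 512])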


Yes, this is true: you do lose some information between the steps, and that increased expressivity is the big benefit of using end-to-end ML instead of classic feature engineering. However, I think the trade-off would be worth it for some use cases. You could, for instance, take an existing image, run it through a semantic segmentation model, and then edit the underlying image description. You could add a yellow hat to a person without regenerating any other part of the image, edit existing text, change a person's pose, probably more easily convert images to 3D, etc.

It's probably not a viable idea; I just wish for more composable modules that let us understand the models' representations better and change certain aspects of them, instead of these massive black boxes that mix all of these tasks into one.

I would also like to add that the text2image models already have multiple interfaces between different parts. There's the text encoder, the latent-to-pixel-space VAE decoder, ControlNets, and sometimes a separate img2img style transfer at the end. Transformers already process images patchwise, but why do those patches have to be uniform square patches instead of semantically coherent areas?
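For example, with Hugging Face diffusers you can already get at those interfaces individually (the checkpoint name below is only illustrative):

    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

    # The pieces mentioned above are separate modules you can inspect or swap:
    print(type(pipe.text_encoder))  # text encoder: prompt -> text embeddings
    print(type(pipe.unet))          # denoiser operating in latent space
    print(type(pipe.vae))           # VAE: latent space <-> pixel space
    print(type(pipe.scheduler))     # noise schedule used during sampling

    # ControlNet pipelines add yet another interface (conditioning on e.g. a
    # segmentation map), which is already close in spirit to the scene-graph idea.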


It's my understanding that an a-to-c model will usually be bigger parameter-wise and more costly to train.


Isn't this essentially the approach to image recognition etc. that failed for ages until we brute-forced it with bigger and deeper matrices?

It seems sensible to extract features and reason about things the way a human would, but it turns out it's easier to scale pattern matching done purely by the computer.



From the PDF - "One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are "search" and "learning".

The second general point to be learned from the bitter lesson is that the actual contents of minds are tremendously, irredeemably complex; we should stop trying to find simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries. All these are part of the arbitrary, intrinsically-complex, outside world. They are not what should be built in, as their complexity is endless; instead we should build in only the meta-methods that can find and capture this arbitrary complexity. Essential to these methods is that they can find good approximations, but the search for them should be by our methods, not by us. We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done."


If I took the Lesson literally, we should not even study text-to-image. We should study how a machine with limitless CPU cycles would make our eyes see whatever we are currently thinking of.

My point being: optimization, or splitting the problem up into sub-problems before handing it over to the machine, makes sense.


I think the bitter lesson implies that if we could study/implement "how a machine with limitless cpu cycles would make our eyes see something we are currently thinking of" then it would likely lead to a better result than us using hominid heuristics to split things into sub-problems that we hand over to the machine.


The technology to probe brains and vision-related neurons exists today. With limitless CPU cycles we would for sure be able to make ourselves see whatever we think about.


I'm not really familiar with that technology space, but if you take that as true, is your argument something like:

- We don't have limitless CPU cycles

- Thus we need to split things into sub-problems

If so, that might still run into the bitter lesson, where Sutton is saying human heuristics will always lose out to computational methods at scale.

Meaning something like:

- We split up the thought to vision problem into N sub-problems based on some heuristic.

- We develop a method which works with our CPU cycle constraint (it isn't some probe -> CPU interface). Perhaps it uses our voice or something as a proxy for our thoughts, and some composition of models.

Sutton would say:

Yeah that's fine, but if we had the limitless CPU cycles/adequate technology, the solution of probe -> CPU would be better than what we develop.


I think Sutton is right that if we had limitless CPU, any human-made split would be inferior. So indeed, since we are far away from limitless CPU, we divide and compose.

But I think we're onto something!

Voice-to-image indeed might give better results than text-to-image, since voice has some vibe to it (intonation, tone, color, stress on certain words, speed, and probably even traits we don't know about yet) that will color or even drastically influence the image output.


A problem with image recognition I can think of is that any crude categorization of the image, which is millions of pixels, will make it less accurate.

With image generation, on the other hand, which starts from a handful of words, we can first do some text processing into categories, such as objects vs. people, color vs. brightness, environment vs. main object, etc.
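A toy sketch of that first step; the category names and the keyword matching are invented purely for illustration:

    PROMPT = "a bright red vintage car parked in a foggy forest"

    CATEGORIES = {
        "objects":     {"car", "tree", "person"},
        "appearance":  {"red", "bright", "vintage"},
        "environment": {"foggy", "forest", "parked"},
    }

    def categorise(prompt: str) -> dict[str, list[str]]:
        words = prompt.lower().split()
        return {cat: [w for w in words if w in vocab] for cat, vocab in CATEGORIES.items()}

    print(categorise(PROMPT))
    # {'objects': ['car'], 'appearance': ['bright', 'red', 'vintage'],
    #  'environment': ['parked', 'foggy', 'forest']}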


You could imagine doing it with two specialized NNs, but then you have to put together a huge labeled dataset of scene graphs. The problem, fundamentally, is that any "manual" feature engineering is not going to be supervised and fitted on a huge corpus the way the self-learned features are.


I am hoping that AI art tends towards a modular approach, where generating a character, setting, style, and camera movement each happens in its own step. It doesn’t make sense to describe everything at once and hope you like what you get.


Definitely, that would make much more sense given how content is produced by people. Adjust the technology to how people want to use it instead of forcing artists to become prompt engineers and settle for something close enough to what they want.

At the very least, image generators should output layers. I think the style component is already possible with the img2img models.


You can already do that with comfyui - it’s just not easy to set up


That's essentially what diffusion does, except it doesn't have clear boundaries between "scene graph" and "full image". It starts out noisy and adds more detail gradually.


That's true; the inefficiency is from using pixel-to-pixel attention at each stage. In the beginning, low resolution would be enough; even at the end, high resolution is only needed within each pixel's neighborhood.
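Back-of-the-envelope numbers for why full pairwise attention gets expensive as resolution grows: quadrupling the token count multiplies the attention pairs by sixteen.

    for side in (16, 32, 64, 128):
        n_tokens = side * side
        pairs = n_tokens ** 2   # entries in the attention matrix per head, per layer
        print(f"{side}x{side}: {n_tokens} tokens, {pairs:,} attention pairs")

    # 16x16: 256 tokens, 65,536 attention pairs
    # 32x32: 1024 tokens, 1,048,576 attention pairs
    # 64x64: 4096 tokens, 16,777,216 attention pairs
    # 128x128: 16384 tokens, 268,435,456 attention pairs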


The issue with this is there's a false assumption that an image is a collection of objects. It's not (necessarily).

I want a picture of frozen cyan peach fuzz.


https://imgur.com/ayAWSKr

Prompt: frozen cyan peach fuzz, with default settings on a first generation SD model.

People _seriously_ do not understand how good these tools have been for nearly two years already.


If by "people" you mean me, then I wasn't clear enough in my comment. The example was meant to imply an image without any of the objects the GP was talking about, just a uniform texture.



You can do this with any image generation model.

Disclaimer: I'm not behind any of them.


Running that image through Segment Anything you get this: https://imgur.com/a/XzCanxx

Imagine if, instead of generating the RGB image directly, the model generated something like that, but with richer descriptive embeddings on each segment, and a separate model then generated the final RGB image. It would be easy to change the background, rotate the peach, change colors, add other fruits, etc., by editing this semantic representation of the image instead of wrestling with the prompt to make small changes without regenerating the entire image from scratch.
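A hypothetical sketch of what that intermediate representation could look like; none of this is an existing API:

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class Segment:
        label: str              # e.g. "peach"
        mask: np.ndarray        # boolean HxW mask, e.g. from Segment Anything
        embedding: np.ndarray   # rich description of appearance, pose, style, ...

    @dataclass
    class SemanticImage:
        height: int
        width: int
        segments: list[Segment]

    def edit_segment(img: SemanticImage, label: str, new_embedding: np.ndarray) -> None:
        """Change one object's description; everything else is left untouched."""
        for seg in img.segments:
            if seg.label == label:
                seg.embedding = new_embedding

    # A separate, fixed renderer would then map a SemanticImage to RGB, so recolouring
    # the peach or swapping the background never regenerates unrelated regions.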


I guess the inefficiency is obvious to many; it's just a matter of time until something like this comes out. And yeah, as others said, you might lose info from a-to-b that's needed for b-to-c, but you gain more in predictability/customization.


You seem to be describing ComfyUI to me. You can definitely do this kind of workflow with ComfyUI.


disney’s multiplane camera but for ai

compositing.

you can do this with ai today: each layer you want has just the artifact on top of a green background.

layer them in the order you want, then chroma key them out like you’re a 70s public broadcasting station producing reading rainbow.

the ai workflow becomes a singular, recursive step until your disney frame is complete. animate each layer over time and you have a film.
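roughly, the chroma-key compositing could look like this (thresholds and file names are made up):

    import numpy as np
    from PIL import Image

    def key_out_green(layer: np.ndarray, threshold: int = 80) -> np.ndarray:
        """alpha mask that is 0 wherever the pixel looks like the green screen."""
        r, g, b = (layer[..., i].astype(int) for i in range(3))
        is_green = (g - np.maximum(r, b)) > threshold
        return (~is_green).astype(np.uint8)

    def composite(layers: list[np.ndarray]) -> np.ndarray:
        """stack layers back to front, multiplane-camera style."""
        out = layers[0].copy()
        for layer in layers[1:]:
            alpha = key_out_green(layer)[..., None]
            out = layer * alpha + out * (1 - alpha)
        return out.astype(np.uint8)

    # layers = [np.array(Image.open(p).convert("RGB")) for p in ("bg.png", "mid.png", "fg.png")]
    # Image.fromarray(composite(layers)).save("frame.png")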


Neural networks will gradually be compressed to their minimum optimal size (once we know how to do that)



