LLMs are autoregressive, so they can't be natively integrated with diffusion image models in a single multimodal model, only with autoregressive image models (which generate an image as a sequence of discrete image tokens). Historically those had lower image fidelity than diffusion models. OpenAI now seems to have solved this problem somehow. More than that, they appear far ahead of any available diffusion model, including Midjourney and Imagen 3.
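For concreteness, here's a minimal sketch of what "generating an image as a sequence of image tokens" means: a causal transformer samples a grid of codebook indices one at a time, exactly like next-token prediction over text. Everything here (the tiny model, the codebook size, the 16x16 grid, the BOS token) is a toy assumption, not OpenAI's or Google's actual architecture.

```python
import torch
import torch.nn as nn

VOCAB = 1024   # size of the VQ codebook (assumption)
GRID = 16      # image = 16x16 grid of tokens -> 256 tokens total (assumption)

class TinyImageLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, 64)
        layer = nn.TransformerEncoderLayer(64, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(64, VOCAB)

    def forward(self, tokens):  # tokens: (B, T) codebook indices
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.shape[1])
        h = self.backbone(self.embed(tokens), mask=mask)
        return self.head(h)     # (B, T, VOCAB) next-token logits

model = TinyImageLM().eval()
tokens = torch.zeros(1, 1, dtype=torch.long)   # BOS token (assumption)
with torch.no_grad():
    for _ in range(GRID * GRID):               # one token per grid cell
        logits = model(tokens)[:, -1, :]
        nxt = torch.multinomial(logits.softmax(-1), 1)
        tokens = torch.cat([tokens, nxt], dim=1)

# In a real system the sampled indices would go through a VQ decoder to
# produce pixels; here we just print the grid of codebook indices.
print(tokens[:, 1:].reshape(GRID, GRID))
```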
Gemini "integrates" Imagen 3 (a diffusion model) only via a tool that Gemini calls internally with the relevant prompt. So it's not a true multimodal integration, as it doesn't benefit from the advanced prompt understanding of the LLM.
Edit: Apparently Gemini also has an experimental native image generation ability.
Google added their multimodal Flash model to Google AI Studio some time ago. It does not use Imagen via a tool; it uses native capabilities to generate and manipulate images, and it's free to try.
No, that does indeed seem to be a native part of the multimodal Gemini model. I didn't know this existed; it's not available in the normal Gemini interface.
This is a pretty good example of the current state of Google LLMs:
The (no longer, I guess) industry-leading features people actually want are hidden away in some obscure “AI studio” with horrible usability, while the headline Gemini app still often refuses to do anything useful for me. (Disclaimer: I last checked a couple of months ago, after several more months of mild amusement/great frustration.)
That's pretty disappointing: it has been out for a while, and we still get top comments like this one (https://news.ycombinator.com/item?id=43475043) where people clearly think native image generation is a new capability. Where do you usually get your updates from for this kind of thing?
Meta has experimented with a hybrid mode, where the LLM uses autoregressive mode for text, but within a set of delimiters will switch to diffusion mode to generate images. In principle it's the best of both worlds.
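A minimal sketch of that decoding loop, assuming a pair of delimiter tokens and stubbing out both models. (Meta's published variant of this idea is the Transfusion paper; the token names and functions below are illustrative assumptions, not their actual interface.)

```python
import random

IMG_START, IMG_END = "<img>", "</img>"  # delimiter tokens (assumption)

def next_text_token(context):
    # Stub for the LLM's autoregressive text head; in a real system this
    # is one forward pass returning a sampled token.
    return random.choice(["The", "cat", "sat", IMG_START, "."])

def run_diffusion(context):
    # Stub for the diffusion branch: denoise latents conditioned on the
    # transformer's hidden states for the context so far.
    return "[image conditioned on: " + " ".join(context[-5:]) + "]"

def generate(prompt, max_tokens=50):
    context = prompt.split()
    for _ in range(max_tokens):
        tok = next_text_token(context)
        if tok == IMG_START:
            # Mode switch: everything between the delimiters comes from the
            # diffusion branch, then text decoding resumes autoregressively.
            context += [IMG_START, run_diffusion(context), IMG_END]
        else:
            context.append(tok)
    return " ".join(context)

print(generate("Draw a cat:"))
```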
ByteDance has been working on autoregressive image generation for a while (see VAR, NeurIPS 2024 best paper). Traditionally they weren't in the open-source gang though.
The VAR paper is very impressive. I wonder if OpenAI did something similar. But the main contribution of the new GPT-4o feature doesn't seem to be just image quality (which is VAR's focus) but also the massively enhanced prompt understanding.
That's overly pessimistic. Diffusion models take an input and produce an output. It's perfectly possible to auto-regressively analyze everything up to the image, use that context to produce a diffusion image, and incorporate the image into subsequent auto-regressive shenanigans. You'll preserve all the conditional probability factorizations the LLM needs while dropping a diffusion model in the middle.
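A sketch of that wiring, with all three components stubbed out (the function names are hypothetical). The chain rule p(text_after, image, text_before) = p(text_before) * p(image | text_before) * p(text_after | text_before, image) is preserved, with the diffusion model supplying the middle factor.

```python
def llm_encode(tokens):
    return f"h({' '.join(tokens)})"          # stub: hidden state of the prefix

def diffusion_sample(conditioning):
    return f"img~p(img|{conditioning})"      # stub: conditional diffusion sample

def vision_encode(image):
    return [f"tok({image})"]                 # stub: image -> tokens/embeddings

prefix = ["a", "photo", "of", "a", "fox"]
h = llm_encode(prefix)                       # auto-regress up to the image
image = diffusion_sample(h)                  # diffusion model in the middle
context = prefix + vision_encode(image)      # image rejoins the token stream
# ...the LLM now continues auto-regressively over `context`.
print(context)
```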