Diffusion Forcing: Next-Token Prediction Meets Full-Sequence Diffusion (boyuan.space)
211 points by magoghm 2 days ago | 12 comments





A number of ideas seem notable to me here. First, they merge the idea of sequence masking (the key training idea for LLMs) with diffusion models; they do this by keeping track of an ‘uncertainty’ level per token. This ‘uncertainty’ level is treated as the noise level for the diffusion model (a model that denoises, conditioned on an embedding of that level).
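
Very roughly, as I read it, training looks something like the sketch below (my own illustration, not the authors' code; the denoiser, the cosine schedule, and the tensor shapes are all assumptions):

    import torch

    def diffusion_forcing_train_step(denoiser, tokens, num_levels=1000):
        # tokens: (batch, seq_len, dim) clean sequence, e.g. video frames
        b, t, d = tokens.shape

        # Sample an independent noise level for every token, instead of one
        # shared level for the whole sample (toy cosine schedule, assumed).
        k = torch.randint(0, num_levels, (b, t))
        alpha_bar = torch.cos(0.5 * torch.pi * k / num_levels) ** 2
        alpha_bar = alpha_bar.unsqueeze(-1)              # (b, t, 1)

        noise = torch.randn_like(tokens)
        noisy = alpha_bar.sqrt() * tokens + (1 - alpha_bar).sqrt() * noise

        # The (hypothetical) denoiser is conditioned on each token's own
        # noise level, so a fully noised token acts like a masked one and a
        # clean token acts like an unmasked one.
        pred_noise = denoiser(noisy, k)
        return torch.nn.functional.mse_loss(pred_noise, noise)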

There are a bunch of neat things you can do with this: in particular, you can firm up some parts of the sequence earlier than others, and thus use it for, say, maze solving. They even show it controlling a robot arm moving fruit around, which is pretty wild.
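
For what it's worth, here is how I imagine the sampling side of that (again just my own hedged sketch, not the released code: the pyramid schedule, the step size, and the simple update rule are placeholders for a real reverse-diffusion step):

    import torch

    @torch.no_grad()
    def sample_pyramid(denoiser, seq_len, dim, steps=50, num_levels=1000):
        x = torch.randn(1, seq_len, dim)   # start the whole sequence as noise
        for s in range(steps):
            # Token i lags i steps behind token 0, so noise levels form a
            # "pyramid": near-term tokens firm up before far-future ones.
            levels = [num_levels - 1 - (s - i) * (num_levels // steps)
                      for i in range(seq_len)]
            k = torch.clamp(torch.tensor(levels), 0, num_levels - 1).unsqueeze(0)
            pred_noise = denoiser(x, k)
            # Placeholder update; a real sampler would apply the DDPM/DDIM
            # posterior for each token's current noise level.
            x = x - 0.1 * pred_noise
        return x

Something like that seems to be what makes uses like maze solving (commit to the next few moves while the far end of the plan stays fuzzy) or receding-horizon robot control possible.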

In a way the title undersells the idea - this is a way to do fractional masking, since the masking level is a float - and I think it is really a pretty profound and interesting idea.

However, there’s a lot not talked about in this paper; I’d be very curious to see their codebase. How exactly do you set up a maze-following task vs a video extension task? How do you hook up a robot arm to this model, and tell the model what you want done? The architecture itself deserves a significant amount of further explication, probably across several papers.


Thank you for this. It appears to be an exceedingly elegant take on modeling uncertainty in planning and search. There's something quite potent about changing the task to be variable-length, but also forcing the agent to account for its current situation instead of taking it for granted. This allows the agent to react and generalize way better along its path, even in the face of unforeseen challenges.

I assume this is set up so that all tasks are treated as variable horizon, and the current state is treated as a consequence of preceding actions. I agree it would be nice to see the code.


> However, there’s a lot not talked about in this paper; I’d be very curious to see their codebase.

Is the linked codebase enough? I'd be interested to understand what's missing here.

https://github.com/buoyancy99/diffusion-forcing


Thanks - I somehow missed this in the site / paper. I think this should answer my questions :)

I work in the field and the work is presented in an extremely obtuse manner.

What is the problem you're trying to solve? Are you proposing a new generative model?


I don't have any theoretical background, but I don't understand the videos either... ok, "Teacher Forcing" looks bad I guess. But are the others good or bad? Which one is even the baseline?

Anyone know of research or tools for using an existing text-generating LLM with diffusion-like techniques, with no new pre-training (or at most a bit of fine-tuning), such that it works with a small GPT / Phi 3 / Qwen model, for example? I know about Tree of Thoughts with MCTS etc., which are somewhat similar (though often with a different learned reward / goal), but I'm interested in something closer to token-level generation. Is this possible?

Russ is doing diffusion now? Must be very applicable to robotics.

Diffusion policies have indeed started to be used in robotics recently. See https://diffusion-policy.cs.columbia.edu/ and related research.

Am I missing something about training time? Does adding per-token noise cause training to slow down significantly? Cool paper though!

Very cool, but why is it called diffusion forcing?

Second paragraph:

> The name "Diffusion Forcing" comes from "teacher forcing" and "diffusion models".



