Hacker News
Align Your Latents: High-Resolution Video Synthesis with Latent Diffusion Models (nvidia.com)
239 points by lnyan on April 19, 2023 | 82 comments



Whenever I see these text-to-video diffusion models, I always think that a much better result would come from some sort of multi-modal, vision-aware AI agent that was able to use 3D modeling tools like Blender, continuously iterate on the scene, work with physics and particle simulation tools, and then refine the less realistic details with a diffusion-model-based filter. It would also allow all the assets created along the way to be reused and refined.

I imagine it will be the de facto way AI generates video in the next few years, once huge models are smart enough to use tools and work on jobs with the breadth and depth that humans do.


The same wasn't true for images - it has been easier (so far) to directly compute the images than it has been to train models to use Photoshop or Illustrator.

Similarly, neural networks have already been shown to be able to directly approximate physics simulations (smoke, particles) with surprising accuracy.

I agree that logically your statement makes sense; however, I'm not sure it matches what has been successful in the space so far. Tools may make sense for these NNs to use; however, they probably won't look anything like the tools we use today.


>> it has been easier (so far) to directly compute the images than it has been to train models to use Photoshop or Illustrator.

Possibly because the models are trained on an abundance of existing human-generated content, which is ever-increasing in quality, dirt cheap, and easily accessible; in that respect AI-generated content isn't disruptive (my view).

Not sure what you mean by "easier", since computers are better at following rules and people are better at reasoning.

>> I agree that logically your statement makes sense ...

Isn't logical sense the best case for AI-generated content? Or are you suggesting more (e.g. a 'source of meaning')?


> Isn't logical sense the best case for AI-generated content? Or are you suggesting more (e.g. a 'source of meaning')?

I just mean that the approach sounds logical, although the best approach isn't always the one that appears the most logical. This seems particularly true in AI - for instance, diffusion models feel more like an incredibly surprising discovery than an approach that's immediately obvious by applying logic.

> Not sure what you mean by "easier", since computers are better at following rules and people are better at reasoning.

By easier I just mean we have had better results so far. Maybe easier wasn't the right word. On your second point - computers will be better at most reasoning tasks shortly.


Well, all the problems obviously caused by the lack of an underlying physical model (like the hands thing) disagree. Your comment just says that up to now it has been easier to ignore those problems than to fix them.

But video has much stricter physical constraints than images, so it's not clear we can ignore the problem at all.


Not sure whether that approach has been ineffective, or whether approaches other than the current one just didn't align with user goals. The premise of diffusion-based image generators was that the user only has to verbally describe an image, requiring absolutely no skill, while the trend since their inception seems to be moving away from that fast.


The future is stacking models in feedback loops. One model could detect bad anatomy and tell the other models to re-render the frame. AutoGPT is an example of this.


I've definitely been seeing things trending in this direction.


This actually exists, already... sort-of.

A major topic of research is to make differentiable renderers, which can then have their parameters tuned via gradient descent, much like how AIs are trained.

The problem is that most renderers are very discrete, and hence aren't differentiable. For the same reason that inverse rendering is difficult on such systems, it's also hard to train an AI to use them.

Having said that, I can imagine it happening. Look at DeepMind's AlphaStar and related AIs, which can play games with discrete events at grandmaster level. You could train an AI to operate a 3D design program similarly. The training input could be recordings of actions taken by human designers, or you could simply train the AI by making it match target images. Start out simple, and then crank up the complexity over time.
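
A minimal sketch of the differentiable-rendering idea in PyTorch: a toy "renderer" built only from smooth ops, whose scene parameters are recovered by gradient descent on a pixel loss. The Gaussian-blob renderer here is made up for illustration; real systems (e.g. Mitsuba 3, PyTorch3D, nvdiffrast) apply the same principle at much larger scale.

    import torch

    # Toy differentiable "renderer": a soft 2D Gaussian blob whose position and
    # size are the scene parameters.
    H = W = 64
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij"
    )

    def render(cx, cy, sigma):
        # Built entirely from smooth ops, so gradients flow back to cx, cy, sigma.
        return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

    # Target image produced by "unknown" scene parameters.
    target = render(torch.tensor(0.7), torch.tensor(0.3), torch.tensor(0.1)).detach()

    # Initial guess, refined by gradient descent on a pixel-wise loss.
    params = torch.tensor([0.5, 0.5, 0.2], requires_grad=True)
    opt = torch.optim.Adam([params], lr=0.05)

    for step in range(300):
        opt.zero_grad()
        loss = ((render(*params) - target) ** 2).mean()
        loss.backward()  # gradients flow through the renderer itself
        opt.step()

    print(params.detach())  # approaches (0.7, 0.3, 0.1)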


Differentiable renderers have started to use volumetric representations and distance fields to represent geometry, which kind of solves the discretization of the scene.


I concur. Throwing more and more compute at making digital dreams more and more realistic doesn't seem to be an efficient strategy for video generation, where you need to keep lots of out-of-frame context in memory.


It's a bit light on details beyond being trained on video and interpolating between keyframes in latent space.

The stable diffusion community is getting impressive results generating keyframes in a batch and using existing interpolation mechanisms on the decoded frames.

https://www.reddit.com/r/StableDiffusion/comments/12o8qm3/fi...

That example uses ControlNets to induce the animation they want, but it would be quite easy to train a model on a sequence of video frames in the same layout.
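
A minimal sketch of keyframe interpolation in latent space, assuming the standard diffusers Stable Diffusion pipeline; the checkpoint name, prompt, and frame count are arbitrary example choices, and workflows like the one linked above layer ControlNet and frame interpolation on top of this.

    import torch
    from diffusers import StableDiffusionPipeline

    def slerp(t, a, b):
        # Spherical interpolation between two flattened noise latents.
        a_n, b_n = a / a.norm(), b / b.norm()
        omega = torch.acos((a_n * b_n).sum().clamp(-1, 1))
        return (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    shape = (1, pipe.unet.config.in_channels, 64, 64)  # 64x64 latent -> 512x512 image
    z0 = torch.randn(shape, device="cuda")  # noise latent for keyframe A
    z1 = torch.randn(shape, device="cuda")  # noise latent for keyframe B

    frames = []
    for i in range(8):
        z = slerp(i / 7, z0.flatten(), z1.flatten()).reshape(shape)
        img = pipe(
            "a teddy bear playing an electric guitar, cinematic lighting",
            latents=z.to(torch.float16),
            num_inference_steps=30,
        ).images[0]
        frames.append(img)
    # `frames` can then be smoothed with an off-the-shelf frame interpolator (RIFE, FILM, ...)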


Hm? The paper is actually fairly detailed for an industry paper. They are able to freeze much of the model and only train the temporal layers in between.
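
A minimal sketch of what "freeze the pretrained model, train only the temporal layers" could look like in PyTorch; `video_unet` and the "temporal" parameter-naming convention are assumptions for illustration, not the paper's actual code.

    import torch

    # `video_unet` stands in for a UNet in which temporal attention / temporal
    # convolution blocks were inserted between pretrained spatial layers; the
    # "temporal" substring in parameter names is an assumed naming convention.
    def temporal_parameters(video_unet: torch.nn.Module):
        trainable = []
        for name, param in video_unet.named_parameters():
            is_temporal = "temporal" in name
            param.requires_grad = is_temporal  # freeze every spatial parameter
            if is_temporal:
                trainable.append(param)
        return trainable

    # optimizer = torch.optim.AdamW(temporal_parameters(video_unet), lr=1e-4)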


The results are impressive, but also serious nightmare fuel. I suppose that's a good example of the "uncanny valley": https://en.wikipedia.org/wiki/Uncanny_valley


I think they are not close to the uncanny valley yet. Most of the results are still clearly CGI. For the uncanny valley, they have to look really, really accurate, BUT something feels "off" without knowing what. In most of the generated imagery nowadays, if you look closely you can still immediately point out what is off.


I think people are abusing this term when it comes to AI. These examples do not trigger any uncanny valley effect for me. The same way cartoons do not. It's nightmarish but that's different.


It looks like the pictures you would get out of Midjourney V1 or the early DALL-E, but I assume this can quickly improve in the coming months.


Not really. DALL-E 1 generated low-resolution, blurry images without CFG (so to get something relevant you would need to generate ~512 of them and sort them using CLIP). The first Midjourney generative model wasn't even conditioned on text; it was an unconditional diffusion model, probably guided by CLIP (so it wasn't particularly coherent). People don't seem to remember how bad text-to-image was only a few months ago.
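
A minimal sketch of the CLIP re-ranking trick described above, using the Hugging Face transformers CLIP model; `candidates` is assumed to be a list of PIL images from whatever generator is being used.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def rerank(prompt: str, candidates: list[Image.Image], top_k: int = 8):
        # Score every candidate image against the prompt, keep the best top_k.
        inputs = processor(
            text=[prompt], images=candidates, return_tensors="pt", padding=True
        )
        with torch.no_grad():
            out = model(**inputs)
        scores = out.logits_per_image.squeeze(1)  # one image-text similarity per candidate
        best = scores.argsort(descending=True)[:top_k]
        return [candidates[i] for i in best]

    # candidates = [generate(prompt) for _ in range(512)]  # hypothetical generator
    # best_images = rerank(prompt, candidates)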


I've seen the phrase "uncanny valley" thrown around a bunch of times. Who are these people who have emotional reactions to CGI that doesn't look quite real?


Emotional reaction doesn't just mean crying or laughing; the feeling of "that looks unreal" is an emotional reaction too, and that's what's being referred to here.


Almost-real-looking CGI is one thing. I find some of these AI images to be far creepier because they look so plausible, but then the longer you look the worse it gets. Pretty girl on a beach at sunset! But there's a hand resting on her waist, and she's alone. And she has six fingers on one hand. And one foot is the wrong way around. And then you realise that she has too many teeth and they go past the end of her lips... and one is poking out from one of her eyes. And then it gets weirder.


For decades, CGI has had difficulty creating anything that looked real.

But each advance has been an opportunity to unnerve people as they react to how much more real something looks, but still isn’t real.

I think the uncanny valley is now getting trained out of us.

We are all getting used to a continuum of real to stylistically unreal, with no more unexplored valley.

Except for the weird mistakes. Those will likely remain weird to us since most of us don’t want to immerse ourselves in worlds of uncountable fingers and third arms for long enough for that to start feeling normal.


It's not a "phrase" as much as a well researched phenomenon [1].

[1] https://en.wikipedia.org/wiki/Uncanny_valley


Realism is just a part of the valley phenomenon, not all super-realistic things cause it. There is sort of a sense of betrayal, which gives a heavy feeling in the gut. To me, it's a similar feeling to motion sickness.


No model, no code.

About as exciting as Imagen-video, which was released eons ago.


The approach is different: it can use pretrained models, e.g. Stable Diffusion, which is a pretty exciting research development. It means that only 'fine-tuning' existing models is required to get this result.


I agree with that, but it's hard for me to get excited with the knowledge that it'll almost certainly be discarded and forgotten. I've seen too many papers that looked interesting from a theoretical perspective, but were simply never brought to the public because of the barrier of dev+training.

In this case, you need someone who can implement the method as described (hard!), and then you need someone with a setup better than a rented 8xA100 (expensive and not available on many cloud providers) to actually reproduce the model.


To put it in context, in almost all areas of research (physics, biology, chemistry, electronics, etc.), running experiments is expensive. ML is unusual in that advances can still be made by amateurs at home. I don't think it's worth writing off everything that requires more resources than a hobbyist has.


Yeah. You can rent 8xA100s; it's a steep price, but you can. It's much harder for a hobbyist to rent electron microscopes or NMR machines.



Ugh, that tweet is so confusing. The researchers did work for Stability - this work was done for NVIDIA. It's entirely unclear if the researchers are even still associated with stability.ai but Emad sure does imply that's the case.

> (Team working on our variant, will be done when it's done and they are happy with it).

I _think_ he's saying that _his_ team is working on a similar model - and that they will release _that_ model "when it's done" (and not to expect that to happen any time soon).

Just super vague, bordering on taking credit for work that NVIDIA did. Seems like he typed it out on his phone and/or is Elon-levels of lazy about tweets.


Three of the authors of this paper work for Stability AI currently.

One of them even said: "Unfortunately we cannot release the weights. That's why I joined @StabilityAI to work on OS video models"

- https://twitter.com/andi_blatt/status/1648598423526932483


Thanks for the clarification!


I doubt you'll ever again see a model being publicly released for something as capable as this.

I can't say I fully understand the mechanisms by which they achieve that, but it's clear that the powers that be have decided that the public cannot be trusted with powerful AI models. Stable Diffusion and LLaMA were mistakes that won't be repeated anytime soon.


AI models with obvious, broad, real-world applications always seem to get reproduced in public. NVIDIA's result is obviously great, but it's still a long way from being useful. It reminds me of image generation models maybe 4 years ago.

We need a killer application for video generation models first; then I'm sure someone will throw $100k at training an open-source version.


I think deep down, everyone already knows what the "killer application" for open source AI video generation is going to be.


I am going to guess: making "dog-natured" humanoids that aim to please lonely people, that are nicer to be around than real people, and easier than real relationships.


Reducing human trafficking by generating artificial content?


Not a thing.


Yeah, just visit civitai and see what's popular there.


Don't we have enough of that already?


Capitalism doesn't ask that question.


> AI models with obvious, broad, real-world applications always seem to get reproduced in public.

Really? GPT-3 was released almost 3 years ago. Where is the public reproduction?

And don't say LLaMA. I've used it and it isn't even close to GPT-3, never mind GPT-4.


The current generation of GPT-3, which started with text-davinci-003, was actually released in November 2022, not almost 3 years ago. I'm not even sure the model that was released 3 years ago is still available to test, but it was much less impressive than more recent models - I wouldn't be surprised if LLaMA were actually better.


The model trained 3 years ago was only trained on 300B tokens, heavily undertrained in terms of Chinchilla scaling; that's why LLaMA models can easily beat it on most benchmarks (they were trained on 1T/1.4T tokens). As for the current GPT-3.5 models, who knows; OpenAI is not very open about them.
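
A rough back-of-the-envelope check of the undertraining claim, using the approximate Chinchilla rule of thumb of ~20 training tokens per parameter (Hoffmann et al., 2022); the exact ratio is a simplification.

    # ~20 training tokens per parameter is roughly compute-optimal (Chinchilla).
    def tokens_vs_chinchilla_optimal(params, tokens):
        return tokens / (20 * params)

    print(tokens_vs_chinchilla_optimal(175e9, 300e9))   # GPT-3 (2020): ~0.09, i.e. ~10x undertrained
    print(tokens_vs_chinchilla_optimal(65e9, 1.4e12))   # LLaMA 65B: ~1.08, roughly Chinchilla-optimal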


Which LLaMA model have you played with? Unless it's LLaMA 65B at fp16 precision, it's not a fair comparison to GPT-3.


The tragedy of the commons is at play here. We could get amazing ML models rivaling the best if interested people could pool money together for a $100k or $1 million training run. But there doesn't seem to be the willingness, so we humbly rely on the benevolence of companies like Stability and Meta to release models to the public.


As someone not very familiar with the field, what is wrong with this comment? Is the scale off?

Kickstarter is into the tens of millions these days [1]. I would assume some number of millions might be possible, if the right names were behind it.

[1] https://www.kickstarter.com/projects/dragonsteel/surprise-fo...


Kickstarters like the one you linked don't suffer from the tragedy of the commons because people are essentially just pre-paying for a product. With funding an open source ML model, there's little incentive to not be a free-rider.


I doubt there would be much funding for a kickstarter that didn't promise to be open source.

> people are essentially just pre-paying for a product

I think this is what 99.99% of users (the four nines) want: a pre-trained, open-source model to work with.


> but it's clear that the powers that be have decided that the public cannot be trusted with powerful AI models.

The most dangerous AI model today (in a practical sense, as people are actually using it for shady stuff) is ChatGPT, which is closed source but open to the public, so anyone can cheat on their exams, write convincing fake product reviews, generate SEO spam, etc.

The fact that a model is closed source doesn't change anything as long as it's available for use. Bad actors don't care about running the code on their own machine…


But they're still showing us that the results exist. They're trying to have it both ways, by showing the results are tangible progress while implicitly admitting that that progress is too powerful in the hands of the public.

Is there anything that incentivizes NVIDIA to publish these results? Is it just the need to get papers out in public for academic clout? Something tells me that all this accomplishes is setting the expectations of everyone who sees the possibilities ("this will be the future"), and a third party without NVIDIA's moral framework will become motivated to develop and openly release their own version at some point.


That's just good marketing, isn't it? "Our product is amazing! In fact it's too good, no you can't have it. Unless just maybe, we might let you buy access to it." Oh wow if it's so good that they won't let me have it then I definitely want it!


A lot of those examples have a Shutterstock watermark. I doubt that Shutterstock allows the use of unlicensed videos for this use case.


I’m also now mildly disturbed/worried by the idea of going about my business in AI-generated VR a few years from now (maybe with some 2D-to-3D-ifying compatibility layer) and being haunted by ghosts of stock image watermarks.


I noticed the same thing. I'm not surprised about the current discussion over copyright because this training data pretty clearly has a basis in unlicensed Shutterstock images.



> A lot of those examples have a Shutterstock watermark. I doubt that Shutterstock allows the use of unlicensed videos for this use case.

That almost everyone (other than, maybe, people specifically and very loudly selling the fact that they aren’t, e.g., Adobe) training base AI models is relying on the idea that doing so is fair use and doesn’t require a license is hardly news.


If they posted these videos on publicly accessible websites, it could be fair use. We need some more court cases to really know.



> † Andreas, Robin and Tim did the work during internships at NVIDIA.

Some pretty experienced/expensive interns you got there, NVIDIA.


A summary of the training of the text-to-video model: keyframe model, 48 GPUs for 128k steps; interpolation model, 24 GPUs for 14k steps; upscaler model, 32 GPUs for 10k steps; autoencoder, 40 GPUs for 35k steps. They used A100-80G in all cases except the autoencoder (A100-40G).

This is not particularly expensive considering the scale these companies usually work at, thanks to the fact that they only trained the temporal layers in the LDM and the autoencoder.


Some things to note:
- This was finished last year: impressive now, but even more so back then.
- No code or models released, BUT several authors have moved to StabilityAI and are working on their own improved open video models, which is hopeful as the field continues to move forward.
- The paper uses existing image models as a base, and so a better base model (the new XL stable diffusion variant, or Midjourney's underlying model) will give even better results.


The generated 'driving scene' videos eerily resemble a lot of crash dashcam videos. I even have the idea that the bottom-right video is composited from videos recorded right before a crash. The white SUV seems to make a hard steer to the right, but then just continues straight on. It even looks as if some wreckage is visible flying over the road.


That's probably what they were trained on. Gotta be a lot of them that have been posted online.


I understand the technical marvel, but there are obvious inconsistencies in all the images, and I would call them animations rather than video sequences.


These are improving at a remarkable clip.

I remember when image generation was similarly derided.

I wouldn't count anything out at this point.

(Also, this was an intern project. There are a couple of senior staff researchers on the paper, but it's mostly interns.)


The fox dressed in a suit dancing in the park is obviously Roald Dahl's Fantastic Mr Fox, and it seems his tail was just shot off by the farmers.


That guitar teddy bear sample clearly played Stevie Ray Vaughan's "Scuttle Buttin'". Good taste in music!


I was curious about the use of "sks" in the examples. It is explained in the paper:

> Using DreamBooth [66], we fine-tune our Stable Diffusion spatial backbone on small sets of images of certain objects, tying their identity to a rare text token (“sks”).

I wonder how long that token will stay "rare".


DreamBooth replaces the old meaning of the token in a model, so the only importance of "rareness" is that you weren't already using it.
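
A minimal sketch of how such a DreamBooth-tuned checkpoint would typically be used at inference time with diffusers; the checkpoint path and prompt are hypothetical.

    import torch
    from diffusers import StableDiffusionPipeline

    # Hypothetical path to a Stable Diffusion checkpoint that was DreamBooth-
    # fine-tuned so that the rare token "sks" now denotes one specific subject.
    pipe = StableDiffusionPipeline.from_pretrained(
        "path/to/dreambooth-finetuned-sd", torch_dtype=torch.float16
    ).to("cuda")

    # Prompts combine the rare token with a class noun, as in the DreamBooth paper.
    image = pipe("a photo of sks dog playing guitar in a park").images[0]
    image.save("sks_dog.png")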


This seemed like it might be an interesting article to read, but it turns out that clicking on this link initiated a huge many-megabyte download of some kind of media experience that totally locked up my machine. I had a bunch of apps going and was hoping for a smooth start to the day. Had this been an actual article to read, it could have loaded and been put in the background.

If you are going to link to some hypermedia experience that requires a high-end machine, a good net connection, and well-configured ad blockers, then there should be a warning. This is not some taste issue about me not liking the custom font or whatever. Much of the population does their work on lower-end machines with modest net connections, and this site is not compatible with that.

Do you really want HN to be a site only for the ascended? If you can't communicate without an avalanche of crud coming along, then maybe you need to consider what you are saying and maybe consult an editor or something.


Wat. It's a demo for a text-to-video model; of course the authors would show videos. Expecting anything else is wild. It's on your machine to handle this gracefully…


An interesting take on the "diffusion can't do hands" trope: take a teddy bear (the most finger-less humanoid shape known to man, and now also to machine I guess) and put it in that one scene where all attention is on fingers (guitar solo)


I think their main focus is generating more real-life simulations of driving and creating many scenarios to train on. It's good for autonomous cars, mostly targeting Cruise and Tesla.


It's interesting that the generative video attempts aren't starting with something simpler and growing from there with the audience.

No model, no code makes it a challenge to explore.


A bit ridiculous seeing the much-maligned "trending on artstation" keywords on NVIDIA's own tech demo page.


What's the problem with "trending on artstation"? Does it not do anything to the output?


It implies that the researchers are okay with generating outputs based on data scraped from ArtStation portfolios without consent.

There was also the time, months ago, when the art of a deceased artist (Qinni) was featured prominently on the front cover of an img2img/style-transfer paper, until the artist's sister requested that it be taken down.

https://twitter.com/ZirocketArt/status/1629557296563888128


Come on, it’s like slapping a doge sticker on your research paper.

I think prompting is a bit past “trending on Artstation in the style of Greg Rutkowski” at this point..?


>Come on, it’s like slapping a doge sticker on your research paper.

Like in this case? https://arxiv.org/pdf/2106.08254.pdf, figure 1


I feel bad for this Greg Rutkowski.

He's got to be the most cucked man on the planet.



