Whenever I see these text-to-video diffusion models, I always think that a much better result would come from some sort of multi-modal, vision-aware AI agent that could use 3D modeling tools like Blender, iterate on the scene continuously, work with physics and particle simulation tools, and then refine the less realistic details with a diffusion-model-based filter. It would also allow all the assets created along the way to be reused and refined.
I imagine this will be the de facto way AI generates video within the next few years, once huge models are smart enough to use tools and to work in jobs with the breadth and depth that humans do.
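To make the idea concrete, here is a rough sketch of the kind of "tool call" such an agent might emit against Blender's bpy API: build a simple asset, hand its motion to the physics engine, and render a frame that a diffusion filter could then polish. The scene itself is made up for illustration, and the snippet assumes it runs inside Blender's bundled Python.

```python
# Minimal sketch (assumed to run inside Blender's Python environment): an
# agent could emit snippets like this, render the result, inspect the frame,
# and iterate. The specific scene here is purely illustrative.
import bpy

# Remove the default cube but keep the default camera and light.
if "Cube" in bpy.data.objects:
    bpy.data.objects.remove(bpy.data.objects["Cube"], do_unlink=True)

# Add a reusable asset and let the rigid-body simulator drive its motion.
bpy.ops.mesh.primitive_uv_sphere_add(radius=1.0, location=(0.0, 0.0, 5.0))
ball = bpy.context.active_object
bpy.ops.rigidbody.object_add()
ball.rigid_body.type = 'ACTIVE'

# Ground plane as a passive collider.
bpy.ops.mesh.primitive_plane_add(size=10.0, location=(0.0, 0.0, 0.0))
bpy.ops.rigidbody.object_add()
bpy.context.active_object.rigid_body.type = 'PASSIVE'

# Step the simulation forward and render one frame; a diffusion-based filter
# could then add photorealistic detail to the output image.
for frame in range(1, 31):
    bpy.context.scene.frame_set(frame)
bpy.context.scene.render.filepath = "/tmp/frame_0030.png"
bpy.ops.render.render(write_still=True)
```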
The same hasn't been true for images: it has been easier (so far) to compute the images directly than to train models to use Photoshop or Illustrator.
Similarly, neural networks have already been shown to directly learn physics simulations (smoke, particles) with surprising accuracy.
I agree that your statement makes sense logically; however, I'm not sure it matches what has been successful in the space so far. Tools may make sense for these NNs to use, but they probably won't look anything like the tools we use today.
>> it has been easier (so far) to directly compute the images than it has been to train them to use photoshop or illustrator.
Possibly because training models are fed an abundance of existing human-generated content, which is ever increasing in quality, dirt cheap, and easily accessible; that doesn't make AI-generated content disruptive in this respect (my view).
Not sure what you mean by "easier", since computers are better at following rules and people are better at reasoning.
>> I agree that logically your statement makes sense ...
Isn't that the best case for AI generated content, logical-sense? or are you suggesting more (e.g 'source of meaning')?
> Isn't that the best case for AI generated content, logical-sense? or are you suggesting more (e.g 'source of meaning')?
I just mean that the approach sounds logical, although the best approach isn’t always the one that appears the most logical. This appears particularly true in AI - for instance diffusion models feel more like an incredibly surprising discovery rather than an approach that’s immediately obvious by applying logic.
> Not sure what you mean by "easier", since computers are better at following rules and people better at reasoning.
By easier I just mean we have had better results so far. Maybe easier wasn't the right word. On your second point - computers will be better at most reasoning tasks shortly.
Well, all the problems obviously caused by the lack of an underlying physical model (like the hands thing) disagree. Your comment just says that up to now it has been easier to ignore those problems than to fix them.
But video has much stricter physical constraints than image, so it's not clear we can ignore the problem at all.
I'm not sure whether that has actually been ineffective, or whether approaches other than the current one simply didn't align with user goals. The premise of diffusion-based image generators was that the user only has to verbally describe an image, requiring absolutely no skill, yet the trend since their inception seems to be moving away from that fast.
The future is stacking models in feedback loops. One model could detect bad anatomy and tell the other models to re-render the frame. AutoGPT is an example of this.
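To show the shape of such a loop (nothing here is a real API; generate_frame and detect_bad_anatomy are hypothetical stand-ins for a video diffusion pipeline and an anatomy critic such as a hand/pose detector):

```python
# Hedged sketch of a generate/critique/regenerate loop. Both callables are
# hypothetical placeholders, not real library APIs.
from typing import Any, Callable

def refine_frame(prompt: str,
                 generate_frame: Callable[[str, int], Any],
                 detect_bad_anatomy: Callable[[Any], list],
                 max_rounds: int = 4) -> Any:
    frame = generate_frame(prompt, 0)
    for round_idx in range(1, max_rounds + 1):
        problems = detect_bad_anatomy(frame)      # e.g. ["left hand has six fingers"]
        if not problems:
            return frame                          # the critic model is satisfied
        # Feed the critic's findings back as extra conditioning for the next pass.
        patched_prompt = prompt + " | fix: " + "; ".join(problems)
        frame = generate_frame(patched_prompt, round_idx)
    return frame                                  # give up after max_rounds
```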
A major topic of research is to make differentiable renderers, which can then have their parameters tuned via gradient descent, much like how AIs are trained.
The problem is that most renderers are very discrete, and hence aren't differentiable. For the same reason that reverse rendering is difficult on such systems, it's hard to train an AI to use them also.
Having said that, I can imagine it happening. Look at Google's Alpha Star and related AIs, which can play games with discrete events at grandmaster level. You could train an AI to operate a 3D design program similarly. The input training could be recordings of actions taken by human designers, or simply train the AI by making it match target images. Start out simple, and then crank up the complexity over time.
Differentiable renderers have started to use volumetric representations and distance fields to represent geometry, which kinda solves the discretization of the scene.
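A toy illustration of the idea (not any published method): "render" a circle from its signed distance field with a soft sigmoid edge, then recover its position from a target image by plain gradient descent, exactly the tune-the-renderer-like-a-network workflow described above.

```python
import torch

def render_circle(center: torch.Tensor, radius: float = 0.25, res: int = 64,
                  sharpness: float = 30.0) -> torch.Tensor:
    """Soft-rasterize a circle from its signed distance field.

    The sigmoid replaces the hard inside/outside test with a smooth function,
    which is what makes the renderer differentiable w.r.t. `center`.
    """
    ys, xs = torch.meshgrid(torch.linspace(0, 1, res),
                            torch.linspace(0, 1, res), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1)                  # (res, res, 2)
    sdf = torch.linalg.norm(grid - center, dim=-1) - radius
    return torch.sigmoid(-sharpness * sdf)                # ~1 inside, ~0 outside

# Target image: a circle at (0.7, 0.3). Pretend only the pixels are known.
target = render_circle(torch.tensor([0.7, 0.3])).detach()

# Start from a wrong guess and fit the renderer's parameter by gradient descent.
center = torch.tensor([0.3, 0.6], requires_grad=True)
opt = torch.optim.Adam([center], lr=0.05)
for _ in range(200):
    loss = torch.nn.functional.mse_loss(render_circle(center), target)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(center.detach())   # should end up close to (0.7, 0.3)
```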
I concur. Throwing in more and more compute to make digital dreams more and more realistic doesn't seem to be an efficient strategy for video generation where you need to keep lots of out-of-the-picture context in memory.
It's a bit light on details beyond trained-on-video, and interpolating between keyframes in latent space.
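One common way to do that latent interpolation (a generic trick, not necessarily what this paper does) is spherical linear interpolation between keyframe latents before decoding:

```python
import torch

def slerp(z0: torch.Tensor, z1: torch.Tensor, t: float) -> torch.Tensor:
    """Spherical linear interpolation between two latent codes.

    Diffusion latents are roughly Gaussian, so walking along the hypersphere
    tends to give more natural in-betweens than a straight lerp.
    """
    a, b = z0.flatten(), z1.flatten()
    omega = torch.acos(torch.clamp(torch.dot(a, b) / (a.norm() * b.norm()),
                                   -1.0, 1.0))
    so = torch.sin(omega)
    if so.abs() < 1e-6:                       # nearly parallel: fall back to lerp
        return (1.0 - t) * z0 + t * z1
    return (torch.sin((1.0 - t) * omega) / so) * z0 + (torch.sin(t * omega) / so) * z1

# Two keyframe latents (random stand-ins shaped like a 4x64x64 SD latent).
key_a, key_b = torch.randn(4, 64, 64), torch.randn(4, 64, 64)
in_betweens = [slerp(key_a, key_b, t) for t in torch.linspace(0, 1, 9).tolist()]
# Each interpolated latent would then be decoded into an intermediate frame.
```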
The Stable Diffusion community is getting impressive results by generating keyframes in a batch and using existing interpolation mechanisms on the decoded frames.
That example uses ControlNets to induce the animation they want, but it would be quite easy to train a model on a sequence of video frames in the same layout.
I think they are not close to the uncanny valley yet. Most of the results are still clearly CGI. For the uncanny valley, they have to look really, really accurate, BUT something feels "off" without knowing what. In most of the generated imagery nowadays, if you look closely you can still immediately point out what is off.
I think people are abusing this term when it comes to AI. These examples do not trigger any uncanny valley effect for me. The same way cartoons do not. It's nightmarish but that's different.
Not really. DALL-E 1 generated low-resolution, blurry images without CFG (so to generate something relevant you would need to generate ~512 of them and sort them using CLIP), and the first Midjourney generative model wasn't even conditioned on text; it was an unconditional diffusion model, probably guided by CLIP, so it wasn't particularly coherent. People don't seem to remember how bad text-to-image was only a few months ago.
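For reference, the generate-many-then-sort-with-CLIP trick is simple to reproduce with the Hugging Face CLIP checkpoint; candidate_images below is assumed to be a list of PIL images from whatever generator is being used:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_by_clip(prompt: str, candidate_images: list) -> list:
    """Return the candidate images sorted by CLIP similarity to the prompt."""
    inputs = processor(text=[prompt], images=candidate_images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(1)  # (num_images,)
    order = scores.argsort(descending=True)
    return [candidate_images[i] for i in order.tolist()]
```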
I've seen the phrase "uncanny valley" thrown around a bunch of times. Who are these people who have emotional reactions to CGI which is not quite real looking?
Emotional reaction doesn't just mean crying or laughing; the feeling of "that looks unreal" is an emotional reaction too, and that's what's being referred to here.
Almost-real-looking CGI is one thing. I find some of these AI images to be far creepier because they look so plausible, but then the longer you look the worse it gets. Pretty girl on a beach at sunset! But there's a hand resting on her waist, and she's alone. And she has six fingers on one hand. And one foot is the wrong way around. And then you realise that she has too many teeth and they go past the end of her lips... and one is poking out from one of her eyes. And then it gets weirder.
For decades CGI has had difficulty creating anything that looked real.
But each advance has been an opportunity to unnerve people as they react to how much more real something looks, but still isn’t real.
I think the uncanny valley is now getting trained out of us.
We are all getting used to a continuum of real to stylistically unreal, with no more unexplored valley.
Except for the weird mistakes. Those will likely remain weird to us since most of us don’t want to immerse ourselves in worlds of uncountable fingers and third arms for long enough for that to start feeling normal.
Realism is just a part of the valley phenomenon, not all super-realistic things cause it. There is sort of a sense of betrayal, which gives a heavy feeling in the gut. To me, it's a similar feeling to motion sickness.
The approach is different: it can use pretrained models, i.e. Stable Diffusion, which is a pretty exciting research development. This means that it only requires 'fine-tuning' existing models to get this result.
I agree with that, but it's hard for me to get excited with the knowledge that it'll almost certainly be discarded and forgotten. I've seen too many papers that looked interesting from a theoretical perspective, but were simply never brought to the public because of the barrier of dev+training.
In this case, you need someone that can implement the method as described (hard!), and then you need someone with a setup better than a rented 8xA100 (expensive and not available on many cloud providers) to actually reproduce the model.
To put it in context, in almost all areas of research (physics, biology, chemistry, electronics, etc), running experiments is expensive. ML is in the category that there can still be advances done by amateurs at home. I don't think it's worth writing off everything that requires more resources than a hobbyist.
Ugh, that tweet is so confusing. The researchers did work for Stability - this work was done for NVIDIA. It's entirely unclear if the researchers are even still associated with stability.ai but Emad sure does imply that's the case.
> (Team working on our variant, will be done when it's done and they are happy with it).
I _think_ he's saying that _his_ team is working on a similar model, and that they will release _that_ model "when it's done" (and not to expect that to happen any time soon).
Just super vague, bordering on taking credit for work that NVIDIA did. Seems like he typed it out on his phone and/or is Elon-levels of lazy about tweets.
I doubt you'll ever again see a model being publicly released for something as capable as this.
I can't say I fully understand the mechanisms by which they achieve that, but it's clear that the powers that be have decided that the public cannot be trusted with powerful AI models. Stable Diffusion and LLaMA were mistakes that won't be repeated anytime soon.
AI models with obvious, broad, real-world applications always seem to get reproduced in public. Nvidia's result is obviously great, but it's still a long way from being useful. It reminds me of image generation models maybe 4 years ago.
We need a killer application for video generation models first; then I'm sure someone will throw $100k at training an open-source version.
I am going to guess: making dog-like, eager-to-please humanoid companions for lonely people; companions that are nicer to be around than people and easier than real relationships.
The current generation of GPT-3, which started with text-davinci-003, was actually released in November 2022, not quite 3 years ago. I'm not even sure the model that was released 3 years ago is still available to test, but it was much less impressive than more recent models; I wouldn't be surprised if LLaMA were actually better.
The model trained 3 years ago was only trained on 300B tokens, which is heavily undertrained by Chinchilla scaling standards; that's why the LLaMA models can easily beat it on most benchmarks (they were trained on 1T/1.4T tokens). As for the current GPT-3.5 models, who knows; OpenAI is not very open about it.
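A quick back-of-the-envelope check, using the rough Chinchilla heuristic of ~20 training tokens per parameter (the exact ratio depends on the compute budget, so treat these numbers as ballpark only):

```python
# Rough Chinchilla-style arithmetic; ~20 tokens per parameter is a commonly
# cited rule of thumb, not an exact law.
TOKENS_PER_PARAM = 20

gpt3_params, gpt3_tokens = 175e9, 300e9
gpt3_optimal = gpt3_params * TOKENS_PER_PARAM            # ~3.5e12 tokens
print(f"GPT-3 saw only {gpt3_tokens / gpt3_optimal:.0%} of its "
      f"Chinchilla-optimal token budget")                 # roughly 9%

llama65_params, llama65_tokens = 65e9, 1.4e12
print(f"LLaMA-65B saw {llama65_tokens / (llama65_params * TOKENS_PER_PARAM):.1f}x "
      f"its Chinchilla-optimal budget")                    # roughly 1.1x
```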
The tragedy of the commons is at play here. We could get amazing ML models rivaling the best if people interested could pool together money for a $100k or $1 million training run. But there doesn't seem to be the willingness, so we humbly rely on the benevolence of companies like Stability and Meta to release models to the public.
Kickstarters like the one you linked don't suffer from the tragedy of the commons because people are essentially just pre-paying for a product. With funding an open source ML model, there's little incentive to not be a free-rider.
> but it's clear that the powers that be have decided that the public cannot be trusted with powerful AI models.
The most dangerous AI model today (in a practical sense, as people are actually using it for shady stuff) is ChatGPT, which is closed source but open to the public, so anyone can cheat on their exams, write convincing fake product reviews, generate SEO spam, etc.
The fact that a model is closed source doesn't change anything as long as it's available for use. Bad actors don't care about running the code on their own machine…
But they're still showing us that the results exist. They're trying to have it both ways, by showing the results are tangible progress while implicitly admitting that that progress is too powerful in the hands of the public.
Is there anything that incentivizes Nvidia to publish these results? Is it just needing to get papers out in public for the academic clout? Something tells me that all this accomplishes is setting the expectation, for everyone who sees the possibilities, that "this will be the future", and a third party without Nvidia's moral framework will become motivated to develop and openly release their own version at some point.
That's just good marketing, isn't it? "Our product is amazing! In fact it's too good, no you can't have it. Unless just maybe, we might let you buy access to it." Oh wow if it's so good that they won't let me have it then I definitely want it!
I’m also now mildly disturbed/worried by the idea of going about my business in AI-generated VR a few years from now (maybe with some 2D-to-3D-ifying compatibility layer) and being haunted by ghosts of stock image watermarks.
I noticed the same thing. I'm not surprised about the current discussion over copyright because this training data pretty clearly has a basis in unlicensed Shutterstock images.
> A lot of those examples have a Shutterstock watermark. I doubt that Shutterstock allows the use of unlicensed videos for this use case.
That almost everyone (other than, maybe, people specifically and very loudly selling the fact that they aren’t, e.g., Adobe) training base AI models is relying on the idea that doing so is fair use and doesn’t require a license is hardly news.
Here's a summary of the training of the text-to-video model: they used 48 GPUs for 128k steps for the keyframe model, 24 GPUs for 14k steps for the interpolation model, 32 GPUs for 10k steps for the upscaler model, and 40 GPUs for 35k steps for the autoencoder. They used A100-80GB GPUs in all cases except the autoencoder (A100-40GB).
This is not particularly expensive considering the numbers these companies usually use, thanks to the fact that they only trained the temporal layers in the LDM and the autoencoder.
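The freeze-the-image-model, train-only-the-temporal-layers idea maps onto a very generic PyTorch pattern; the name check below is an assumption about how such modules might be labelled, not the paper's actual code:

```python
import torch

def temporal_parameters(unet: torch.nn.Module) -> list:
    """Freeze pretrained spatial weights; return only the temporal params."""
    trainable = []
    for name, param in unet.named_parameters():
        if "temporal" in name:          # newly inserted temporal attention/conv blocks
            param.requires_grad = True
            trainable.append(param)
        else:                           # pretrained image-model weights stay frozen
            param.requires_grad = False
    return trainable

# The optimizer then only ever sees the (much smaller) temporal parameter set:
# optimizer = torch.optim.AdamW(temporal_parameters(unet), lr=1e-4)
```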
Some things to note:
- This was finished last year, impressive now but even more so back then
- No code or models released BUT several authors have moved to StabilityAI and are working on their own improved open video models, which is hopeful as the field continues to move forward
- The paper uses existing image models as a base, and so a better base model (the new XL stable diffusion variant, or Midjourney's underlying model) will give even better results.
The generated 'driving scene' videos eerily resemble a lot of dashcam crash videos. I even have the idea that the bottom-right video is composited from videos recorded right before a crash. The white SUV seems to make a hard steer to the right, but then just continues straight on. It even looks as if some wreckage is visible flying over the road.
I was curious about the use of "sks" in the examples. It is explained in the paper:
> Using DreamBooth [66], we fine-tune our Stable Diffusion spatial backbone on small sets of images of certain objects, tying their identity to a rare text token (“sks”).
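For anyone curious how that rare token is actually used: once the fine-tuned weights are loaded, it simply appears in the prompt. The checkpoint path below is a hypothetical local DreamBooth output; the rest is the standard diffusers API.

```python
import torch
from diffusers import StableDiffusionPipeline

# Hypothetical local directory containing DreamBooth-fine-tuned SD weights
# where "sks" has been tied to a particular teddy bear.
pipe = StableDiffusionPipeline.from_pretrained(
    "./dreambooth-sks-teddy",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("a sks teddy bear playing an electric guitar solo on stage").images[0]
image.save("sks_teddy_guitar.png")
```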
This seemed like it might be an interesting article to read, but it turns out that clicking on this link initiated a huge, many-megabyte download of some kind of media experience that totally locked up my machine. I had a bunch of apps going and was hoping for a smooth start to the day. Had this been an actual article to read, it could have loaded and been put in the background.

If you are going to link to some hypermedia experience that requires a high-end machine, a good net connection, and well-configured ad blockers, then there should be a warning. This is not some taste issue about me not liking the custom font or whatever. Much of the population does their work on lower-end machines with modest net connections, and this site is not compatible with that. Do you really want HN to be a site only for the ascended? If you can't communicate without an avalanche of crud coming along, then maybe you need to reconsider what you are saying, and maybe consult an editor or something.
Wat. It's a demo for a text-to-video model; of course the authors would show videos. Expecting anything else is wild. It's on your machine to handle this gracefully…
An interesting take on the "diffusion can't do hands" trope: take a teddy bear (the most fingerless humanoid shape known to man, and now also to machine, I guess) and put it in the one scene where all attention is on the fingers (a guitar solo).
I think their main focus is generating more realistic simulations of driving and creating many scenarios to train on. It's good for autonomous cars, mostly targeting Cruise and Tesla.
It implies that the researchers are okay with generating outputs based on data scraped from ArtStation portfolios without consent.
There was also the time, months ago, when the art of a deceased artist (Qinni) was featured prominently on the front cover of an img2img/style-transfer paper, until the artist's sister requested that it be taken down.