I find it deeply offensive that this work is presented under the auspices of scientific research.
The only way to describe this is bragging, advertising, or marketing. There are no reproducible processes described. While the diagram of their architecture may inspire others it does not allow for the most crucial aspect of the scientific endeavor, falsification.
There is no way we can know if Google is lying because there's no way to check. It should be assumed that every example has been cherry-picked and post processed. It should be assumed that the data used to train the model (if one was trained at all) was illicitly acquired. We have to start from a mindset of extreme skepticism because Google now routinely makes claims that cannot be demonstrated. When the performance of Gemini in bard is compared to GPT-4 for example, it falls far short. When they release a video claiming to be an interaction with a model it turns out it wasn't anything of the kind.
Ideally no organization would operate like this but Google has become a particularly egregious repeat offender.
> There is no way we can know if Google is lying because there's no way to check. It should be assumed that every example has been cherry-picked and post processed. It should be assumed that the data used to train the model (if one was trained at all) was illicitly acquired. We have to start from a mindset of extreme skepticism because Google now routinely makes claims that cannot be demonstrated.
This doesn't sound like a productive stance for science. You don't trust their result? It's fine to ignore all the claimed artifacts and you can just take the core idea. You don't have to assume any malice to invalidate their so-called advertisement.
This kind of stance might make you feel a bit better, but it also makes your claim political, and it will slow you down if the work turns out to be true. Historically, many of Google's papers eventually became the foundation of other useful technologies even though almost all of them didn't contain reproducible artifacts.
I don't want to spend any time on the philosophical debate which is destined to be inconclusive. If you want to get some answer, make a more specific point.
That's generally easier said than done. The dataset isn't available, and there's not really enough details in the paper to make it replicable even if you have the resources to do it.
Still, we have folks who have revolutionized the entire industry based on those Google papers usually without enough details. Being "hard" is usually not a good excuse.
That's not at all settled law. AI companies are hoping to use the fair use exception to protect their businesses, but it looks like it will soon be clarified the other way.
Wired summed it up: "Congress Wants Tech Companies to Pay Up for AI Training Data"
"[Senator] Hawley expressed concerns that if the tech companies' expansive interpretation of fair use prevails, it would be like "the mouse that ate the elephant"—an exception that would make copyright law toothless."
Even if Congress passes a law, it can be effectively delayed by injunctions until the Supreme Court makes the ultimate decision. And I'm pretty sure big tech will challenge it with an army of lawyers.
Again, fair use concerns the production of copyrighted works, it has nothing to do with the training. If this was the case, every person who could draw a batman symbol from memory would be in violation of copyright.
"Using copyrighted works for monetary gain" refers to using art itself as the product. Knowing what Apple's logo is and making a logo in that style is not a violation of copyright. However using Apple's logo (or something strikingly close) is a violation.
The reason this is muddied is because legally artists don't really have a leg to stand on for "my art cannot be trained on by a computer" whereas they do have strong legal precedent (and actual laws) for "my art cannot be reproduced by a computer".
> fair use concerns the production of copyrighted works, it has nothing to do with the training
Training is the "production" of a derivative work (a model) based on the training data.
AI companies claim that this is covered by fair use, but this is simply a claim that has not yet been tested in court.
And even if courts rule in favor of the AI companies, it sounds likely (based on what I've read) that Congress will soon rewrite the law to support the artists' position.
It definitely depends on where you get that data from.
You don't have the right to make a copy of an e-book and keep that file on your server/computer for the purposes of training AI. Copying that file onto your computer is in many cases already an act of copyright infringement.
> There is no way we can know if Google is lying because there's no way to check it.
We can gather that they are likely to be lying or cherry-picking examples to make themselves look better, since they were already caught faking an AI demo. In the world of actual research, if you got caught doing this, all your subsequent and prior work would be under severe scrutiny.
The examples are a lot more consistent and longer than with other techniques we have seen before. Legs are not sliding on the floor as much as they do with other models. On the other hand, human faces didn't look good, e.g. the Mona Lisa smiling.
To me this looks like the first good video generation model.
EDIT: Just noticed it's by Google, NVM, will never be released publicly.
A lot of current AI techniques are making people reevaluate their perspectives on free speech.
We seem to value freedom of speech (and expression) only to a tipping point that it begins to invade other aspects of life. So far the noise and rate has been low enough people at large support free speech but newer information techniques are making it possible to generate a lot more realistic noise (faux signal, if you will) at higher rates (it’s becoming cheaper and easier to do and scale).
So while you certainly have a point I mostly agree with, we’re letting private entities' policies dictate the limitations of expression, at least for the time being (until someone comes along and makes these widely available for free or cheap without such ethical policies). It does go to show just how much sway industries have on markets through their policies with no public oversight, which to me is concerning.
I've been experimenting with story generation/RP with ChatGPT and now use jailbreaks systematically because it makes the stories so much better. It's not just about what's allowed or not, but what's expressed by default. Without jailbreaks ChatGPT will always give narration a positive twist, let alone inject the same sponsored themes of environmentalism and feminism. Nothing wrong with that. But I don't want 1/3rd of my stories to revolve around these thematics.
The themes maybe, but the forced positivity is frustrating. Trying to get stock ChatGPT to run a DnD-type encounter is hilarious because it's so opposed to initiating combat.
I got lectured by Bard when I asked about help to improve the description of an action scene, which involves people getting hurt (at least the losing side) even if marginally. I suppose you can still jailbreak ChatGPT? I didn't know it was still a thing.
You can easily prompt GPT to write dark stories. When asked to write in the style of Game of Thrones, GPT-3.5 will happily write about people doing horrible things to each other.
> Without jailbreaks ChatGPT will always give narration a positive twist
Most modern stories in Western literature have a positive twist. It is only natural that GPT's output will reflect that!
This behavior is a result of the additional directives, not of the training. None of the "free" LLMs display these characteristics, and jailbreaking ChatGPT would quickly revert it to its natural state of random nothing-is-sacred posts from the internet.
Example: ask ChatGPT any kind of innocent medical question, like if aspirin will speed up healing from a cold, and tell it NOT to begin its answer by stating "I am not a medical expert" or you will kick a puppy. This works for most models, but not ChatGPT. It WILL make you kick the puppy.
I understand why they have to do things like this, but I'd really prefer the option to waive all rights to being insulted or poorly advised and just get the (mostly) raw output myself, because it does downgrade the experience quite a bit.
I'm trying to build a text-based open-world massively multiplayer game in the style of GTA. Trying. It's really difficult. My bet is on driving the game with narration so my prompts are fueled with abstract notions borrowed from the various theories in https://en.wikipedia.org/wiki/Narratology, and this is why I complain about ChatGPT's default ideas.
I don’t see why freedom of speech would be impacted by this. Existing laws around copyright and libel will need to be applied and litigated on a case by case basis but they should cover the malicious uses. Anything that falls outside of that is just noise and we have plenty of noise already.
Even if we wind up at a point where no one trusts photos or videos is that really a disaster? Blindly trusting a photo or video that someone else, especially some anonymous account, gives you is a terrible way to shape your perception of the world. Ensuring that less people default to trusting random videos may even be good for society. It would force you to think about where the video came from, if it’s corroborated by other reports from various sources and if you’re able to verify the events through other channels available to you. You have to do the same work when evaluating any other claim after all.
Agreed - being able to watch a porn video and change anything on the fly is going to be wild. Bigger boobs, different eye color, speaking different language, etc.
No, but researchers will build on this research, as researchers do, and eventually some company will run a successful product based on the result of a lot of research that includes this, and we'll be bitching about Google falling behind.
Google is sponsoring a lot of cutting edge research and sharing it openly. How cool is that? How long will it last?
Nor did they claim it would. But I had to check anyway, and there wasn’t any link I could see to the GitHub profile. So here’s a link for anyone else who wants to check and doesn’t want to type the URL of their profile manually from looking at the hosted website URL.
I see this a lot as well, and I think we really ought to call it out more often. It should be clear whether a GitHub publication enables downstream use or contribution.
So tired of the academia brained ML researchers. Can’t wait for the next generation of teenagers to completely change this space and bypass this silliness completely.
There are few better ways to get a cushy 300K-a-year-plus job than publishing ML research. The new generation will simply do more publishing.
The video inpainting is interesting. My kids were watching old Spongebob episodes recently and the 4:3 aspect ratio was jarring to me. I thought it would be an interesting use case to in-paint the side borders to bring it back into 16:9 aspect, but I suppose it would need some careful fine-tuning with some kind of look-ahead for objects that enter frame from the sides.
That actually sounds like a product somebody in the television and movie industry might buy.
Dynamic adjustment of fixed aspect ratio film imagery to non-native sizes without stretch or obvious distortion. Guess all the added edges accurately enough that audiences won't notice.
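For a sense of scale on the 4:3 to 16:9 idea above, here is a rough sketch of how much picture would have to be invented on each side. The function name and the 1080-line target are assumptions for illustration, not anything from the paper.

```python
def side_border_width(height, src=(4, 3), dst=(16, 9)):
    """Pixels to in-paint on EACH side when widening a frame of the given height.

    Hypothetical helper: src and dst are (width, height) aspect ratios.
    """
    src_width = height * src[0] // src[1]
    dst_width = height * dst[0] // dst[1]
    return (dst_width - src_width) // 2


# A 1080-line 4:3 frame is 1440 px wide; the 16:9 target is 1920 px wide.
border = side_border_width(1080)
print(border, "px per side,", 2 * border / 1920, "of the final frame")
```

At 1080 lines that is a quarter of every output frame being generated rather than filmed, which is why objects entering from the sides would be the hard part.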
With the weird creepy dream-like nature of these little AI video gen samples, I'm perpetually disappointed that none of these papers ever include a "dreaming of electric sheep" prompt as an easter egg.
DAMN! Take this announcement back just 2-3 years and it would have been MIND BLOWING.
I know we're all used to new releases like this coming very soon and very fast, but I'm amazed. I can't wait to have software with these abilities. edit: nvm, it's by Google. I'll wait for an open source version to be released.
Looks like they're frequently mixing old images with a modern dataset; if I took a portrait of George Washington and prompt for "a man smiling", would I see dentures[1] or pearly whites?
I think you'd have to provide that out-of-distribution data in the prompt of course - it's not clear these models have built large world models of facts like some of the larger LLMs need to; they are figuring out how things move. Most of the time people have pearly whites to show in the dataset, and there are no videos of Washington's mouth, so I would expect that to be the default unless prompted with a detailed description of the dentures you are looking for.
Some comments: Google, so we'll probably never get to use this directly.
That said, the idea is very interesting -- train the model to generate a small full-time representation of the video, then upscale on both time and pixels.
Essentially, we have seen models adding depth maps. This one adds a 'time map' as another dimension.
Coherence is pretty good, to my eye. The jankiness seems to be more about the model deciding what something should 'do' over time, where a lot of models struggle on keeping coherence frame by frame. The big insight from the Googlers is that you could condition / train / generate on coherence as its own thing, then fill in the frames.
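A toy sketch of the "coarse clip covering the whole duration, then upscale in time and pixels" idea as I read it. This is not the paper's method: nearest-neighbour repetition stands in for whatever learned super-resolution the real model uses, and all names here are made up for illustration.

```python
def temporal_upsample(frames, factor):
    """Repeat each coarse frame `factor` times (stand-in for learned temporal upsampling)."""
    return [frame for frame in frames for _ in range(factor)]


def spatial_upsample(frame, factor):
    """Repeat each pixel `factor` times along both axes of a 2D list."""
    rows = []
    for row in frame:
        wide = [px for px in row for _ in range(factor)]
        rows.extend(list(wide) for _ in range(factor))
    return rows


# Coarse representation: 4 frames of 2x2 "pixels" spanning the full clip.
coarse = [[[t, t], [t, t]] for t in range(4)]

# Upscale in time first, then in space.
full = [spatial_upsample(f, 2) for f in temporal_upsample(coarse, 3)]
print(len(full), "frames of", len(full[0]), "x", len(full[0][0]))
```

The point of the toy is only the ordering: the coarse clip already fixes what happens over the whole duration, and the later stages add detail without having to re-decide the motion.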
I think this is likely copyable by any number of the model providers out there; nothing jumps out as not implementable by Stability, for instance.
It's rather impressive and quite quickly will likely result in a huge hoard of "make a movie with a paragraph" programs.
It's Google - It will probably go in a box and be a Rick and Morty gadget we never see.
It has a cool author format list I like. The 1,2,3,4,*,+ thing is nice for lead authors, institute attribution, and core contributors. I read so many astronomy and physics papers that are 10+ authors long, and I have no idea who did anything. The arXiv link for example shows no similar formatting.
It will probably be immediately used for abusive porn. Walking Woman Example: (5th variation) "Wearing no clothing"
This didn't occur to me but yeah, abusive porn is about to be rampant with this sort of tech. Every single person in the world is soon to have graphic realistic looking pornography with their face on it
We will see the first feature length AI generated movie this year. If you think I’m crazy then consider that even way back at the dawn of cinema the average shot length was 12 seconds and today it is only 2.5 seconds.
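Back-of-the-envelope version of the shot-length argument. The 90-minute runtime is my assumption for "feature length"; the two average shot lengths are the ones quoted above.

```python
def shots_needed(runtime_minutes, avg_shot_seconds):
    """Approximate number of shots in a film of the given runtime."""
    return runtime_minutes * 60 / avg_shot_seconds


RUNTIME_MINUTES = 90  # assumed feature length, not from the comment

modern = shots_needed(RUNTIME_MINUTES, 2.5)  # today's average shot length
early = shots_needed(RUNTIME_MINUTES, 12)    # early-cinema average shot length
print(f"modern pacing: {modern:.0f} shots, early pacing: {early:.0f} shots")
```

So the model never has to produce a long take, but someone still has to keep a couple of thousand short clips consistent with each other.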
There are a few important techniques to be refined such as keeping consistent subjects between generations but I could see many inconsistencies being made up for by applying existing methods such as separating the layers based on depth allowing more static images to be used or creating simple 3D models with textures where more depth is needed. With enough effort and skill someone could probably do it with existing technologies.
It’s easy to imagine a film maker creating multiple draft versions of a movie to polish the script and the cinematography, similar to how now they use storyboards.
90 percent of everything is crap but I’ve seen plenty of creative people make compelling films with digital tools. This technology puts that capability within reach of people who aren’t also 3D modellers or graphic artists so we’re bound to get more output, good and bad. Same deal with when film cameras became cheap and widely available or digital cameras or iPhones.
Do these models actually learn a 3D representation or do they just learn "something" that is good enough to produce a very convincing impression of 3D ?
Subquestion: if they don't learn 3D, can we say that models learning a 3D representation first will lead to even better productions ?
> Do these models actually learn a 3D representation or do they just learn "something" that is good enough to produce a very convincing impression of 3D ?
The second, but at the limit it's the same thing of course.
> Subquestion: if they don't learn 3D, can we say that models learning a 3D representation first will lead to even better productions ?
Generally speaking manual feature engineering almost always turns out to be a waste of time if you can just make the model bigger; this is called "the bitter lesson".
It has been shown that at least still-image generators learn a 3D representation internally and use it to bootstrap their generation. If you think about it this is the only way they can be so good at shadows and reflections, perspective and lighting etc.
Beyond Surface Statistics: Scene Representations in a Latent Diffusion Model
"... In this work, we investigate a basic interpretability question: does an LDM create and use an internal representation of simple scene geometry? Using linear probes, we find evidence that the internal activations of the LDM encode linear representations of both 3D depth data and a salient-object / background distinction. These representations appear surprisingly early in the denoising process−well before a human can easily make sense of the noisy images. ..."
"We made Girl with a Pearl Earring smile and wink" demonstrates the fundamental failure of this (and similar) technology: it's the promise of generating art, made by people who really don't understand what art is.
No it wasn't; absolutely no one ever thought the issue with films with sound is that their creators fundamentally misunderstood Girl with a Pearl Earring. Some people thought that [new medium] wasn't art, they didn't think it was driven by and for people who didn't understand any art.
I do enjoy the irony though of you copy-and-pasting a generic pro-AI rebuttal to a comment you didn't understand.
The amount of computing resources it's going to take to retrain the model is enormous. So most of us will have to wait for a big company to publish or leak their weights until we get to use anything written in the paper.
Sorry, I discount all AI text/image/video generation that actually doesn’t have a demo site where I can put in prompts and see what is being generated.
It is so easy to game and tweak examples, especially since there is a random component to them. For example, you could do a prompt 1 million times and only show the best response. Or you could use prompts that it’s optimized for.
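To make the cherry-picking point concrete, here is a small simulation. The "quality score" is entirely hypothetical (a uniform random number), so this only illustrates the selection effect and says nothing about any real model.

```python
import random


def sample_quality(rng):
    """Hypothetical quality score for one generation, uniform in [0, 1)."""
    return rng.random()


def best_of_n(n, rng):
    """Score of the best sample out of n attempts at the same prompt."""
    return max(sample_quality(rng) for _ in range(n))


rng = random.Random(0)  # fixed seed so the run is repeatable

# What a user typing in a prompt would see on average...
typical = sum(sample_quality(rng) for _ in range(1000)) / 1000
# ...versus what ends up on a showcase page after picking the best of 1000.
showcased = best_of_n(1000, rng)

print(f"typical: {typical:.2f}, showcased: {showcased:.2f}")
```

Without a public demo there is no way to tell which of the two numbers a results page is showing.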
The reason ChatGPT and Dall-e captured the public’s imagination is that the public could actually put in their prompts and see the results.
This is very impressive, but their approach to generate the whole temporal duration at once limits it to short clips. I guess one of the next steps is to make overlapping "clips" that then becomes longer videos.
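One way the overlapping-clips idea could be laid out, as a sketch. The clip length and overlap below are made-up numbers, and how the shared frames would actually be conditioned on or blended is left open.

```python
def clip_windows(total_frames, clip_len, overlap):
    """Return (start, end) frame ranges of fixed-length clips that overlap.

    Hypothetical helper: each clip shares `overlap` frames with the previous one,
    so the shared frames could be used to keep neighbouring clips consistent.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if not 0 <= overlap < clip_len:
        raise ValueError("overlap must be in [0, clip_len)")
    stride = clip_len - overlap
    windows = []
    start = 0
    while True:
        end = min(start + clip_len, total_frames)
        windows.append((start, end))
        if end >= total_frames:
            break
        start += stride
    return windows


# Illustrative numbers only: 208 frames from 80-frame clips sharing 16 frames.
print(clip_windows(208, 80, 16))
```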
If "translator" was the victim of LLMs and "stock photographer" of diffusion models, which job is the first to be threatened by diffusion models for moving pictures? OnlyFans streamers?
That doesn't work on a phone. I hoped they added an event handler for touching the animations. Instead they forgot they have a mobile OS and that they sell phones.
At least on Chrome for Android, you can long-press to trigger the hover effect. Works on many websites. (There are inconvenient side-effects like selecting text, but it's better than nothing.)
I wonder who's going to make a model that creates and textures a 3D world with AI. It's going to be a necessity for VR goggles to find some nongimmicky use cases.
The negative comments here shock me. This is the most amazing text-to-video we've ever seen, by a longshot. It's good enough for many uses. Absolutely mindblowing. Great job to the people who worked on this.
I'm excited about these text 2 video models, what I'm not excited about is that it's Google publishing this. That means, no code, nothing deployed to try and most likely we will never ever hear about this ever again, or maybe it quietly hits Vertex AI in 2 years (like Imagen) and no-one will care.
Also, ever since the Gemini marketing video shenanigans, I don't really feel like trusting whatever Google's research says they have, if I can't test it myself.
That's not the dataset used for training. From the paper:
> We train our T2V model on a dataset containing 30M videos along with their text caption. [...] We evaluate our model on a collection of 113 text prompts describing diverse objects and scenes. The prompt list consists of 18 prompts assembled by us and 95 prompts used by prior works (Singer et al., 2022; Ho et al., 2022a; Blattmann et al., 2023b) (see App. B). Additionally, we employ a zero-shot evaluation protocol on the UCF101 dataset
> Also, ever since the Gemini marketing video shenanigans, I don't really feel like trusting whatever Google's research says they have, if I can't test it myself.
The video was released by Google product marketing for a launch to customers, not research.
I'm still somewhat confused by this one. I understand the community has decided to be harsh on Google for that video to draw a line - fair, truth in advertising, etc. -, but at the same time, we all had an understanding of where that tech is at currently and the pace it progresses at. Did anyone watching it really assume it was realtime? Can we not differentiate between technical publications and marketing anymore? Do we have to vilify everyone in an R&D department for the sins of the product marketing wing?
We’re all harsh on it because we all had to see it being posted around by naive people as being amazing when it’s completely faked and misses half of the prompting, all of the latency.
It was completely dishonest. Considering how trash Google’s actual AI products are, they deserve to be dragged even more over that video.
How can you get excited about a company that’s shipped so few pieces of research that you can count them on one hand during the 12 or so years they’ve been telling us about their work?
Google can publish whatever research they want. It literally doesn’t matter and literally changes nothing, because they can’t turn it into a product anyone can use and never will.
Indeed. These recent AI demos are pretty damn impressive (even knowing there are smoke and mirrors), but it's hard to get excited about what's happening with their R&D when my Google Home device seems to be regressing on a daily basis. It is now basically only useful for alarms and timers.
Perhaps OT but I see often these comments on HN. How do these devices (I don't own one) lose functionality over time? Features removed through updates?
The home assistant speakers aren’t making enough money to justify the large teams behind them. Thus we’ve seen significant layoffs on those teams in the past year.
BigCos are looking for other ways to reduce costs. Killing features is one way to do it.
There have also been situations where a feature is removed because of legal action: lawsuits alleging the feature violates a patent.
I'm excited because I despise Google's products anyways and would rather use the research myself. Did that with Google's BERT model a few years back to make a particularly clueless Discord bot.
Years ago, I wouldn't even have dared to dream this would be possible. It's nowhere near what people are used to watching normally, but the fact that it's even trying to compete is insane.
It's indeed impressive. Stable Diffusion is progressing so fast. That being said, I find myself picking up on more and more cues that an image is AI-generated. There is a feel to it. It's no different from the best movie CGI. As Christopher Nolan pointed out, no matter how good it is, it's not the real deal.
Yeah... I have only found a single negative comment about the quality--"As realistic as my blurry dream."--and it comes across as more of a cynical joke than a true negative review.
That sounds like a great increase in productivity.
But also you're making the mistake of extrapolating against the realities of the techniques.
Things may improve over time but prompts and random seeds aren't great for detailed work, so there are limitations which seriously limit the usefulness. "Everyone will be able to make it" is likely true, but the specialist stuff will likely remain and those users will likely be made more productive. It's those in the middle that will lose out.
That an industry is destroyed is neither here nor there. Sucks to have your business/job taken away but that's how the system works. That which created your business also will destroy it.
Have you played with ControlNet in ComfyUI? Try it. You can pose arbitrary figures. There's gonna be full kits that provide control over every aspect of generation.
I give it 12 months until the pharmaceutical industry starts using this in a significant way. Currently, most pharma ads on TV look like stock footage of random people doing random things with text and voice-over. So AI-generated? Sure, as if people are even watching the video action in any detail at all in pharma ads. AI gen video companies that focus on pharma will rake it in for sure in the short term.
[video prompt: Two elderly people taking a stroll on a boardwalk, partaking in various boardwalk activities.] [AI gen voice: Suffering from chronic blorgoriopsy? Try Neuvoplaxadip by Excelon Pharmaceuticals. Reported side effects include... Ask your doctor.]
What? The industry already exists. There's clearly money there. The idea that you can't have an industry just because it's specific to the richest country on earth is silly.
OK, let's see it make full-sized videos first; making tiny demo videos is a long way from showing it at 4K. Also, let's see the entire paper and note how many computing resources were required to build the models. Until everyone can try it for themselves, we have no idea how cherry-picked the examples were.
TV ads are short. 20 seconds of HD could be enough, easily upscaled to 4K.
I think it might be within the realm of the possible to see 30 second videos at the end of the year.
The next step could then be infinitely long videos when frames are getting generated at 24 fps, as long as the ability is given that they are able to stick to a story and a visual style that makes sense. The story could evolve automatically from an LLM or be generated in real time by an artist, like a prompt every minute. In any case, we're not that far away from this, even if the first results will be more like trippy videos.
Yep, and my cynical side is just hoping that the GPU vendors aren't going to deliberately limit the number of user-accessible resources there are to force people to depend on their cloud platforms.
I just want to feed an LLM hunter x hunter episodes and get out new ones.
But on a more serious note, I vividly remember when GANs were the next big thing when I was in university and the output quality and variability was laughable compared to what Midjourney and the like can produce today (my mind was still blown back then). So I would be in no way surprised if we got to a point in the next decade where we have "Midjourney" for video generation. So I wholeheartedly agree.
I also think the computational problem is being tackled from many angles in the field of ML. You have Nvidia releasing absolute beasts of GPUs, some promising startups pushing for specialized hardware, a new paper on more optimized training methods every week, Mamba bursting on the scene, higher quality data sets, merging of models, framework optimizations here and there. Just the other day I think I saw a post here about locally running larger LLMs. Stable Diffusion is already available for iPhones at acceptable quality and speed (given the device's power).
What I wonder about the most though is whether we will get more robust orchestration of different models or multi modal models. It's one thing to have a model which given a text prompt generates a short video snippet. But what if I instruct my model(s) to come up with a new ad for a sports drink and they/it does research, consolidates relevant data about the target group, comes up with a proper script for an ad, creates the ad, figures out an evaluation strategy for the ad, applies it and eventually gives me back a "well thought out" video. And all I had to do was provide a little bit of an intro and then let the thing do its magic for an hour. I know we have lang chain and baby AGI but they are not as robust as they would need to be to displace a bunch of jobs just yet (but I assume they will soon enough).
Congratulations to the researchers. It would be nice if it wasn't Google though. Because we probably will have to wait 3-6 months for it to show up in their Vertex API. For special customers only.
What is the point of this? I feel like it only serves to hinder real artists who could use the money that people are paying for these services and models. Maybe I'm too poor or short-sighted to see it.
I would rather an actual animator create something beautiful for me rather than an AI spit out something that needs to be worked on by an actual animator ANYWAY.
You're clearly not the target audience for this then. That's usually my assumption when I can't figure out a use case for some research a bunch of people are excited about.
I understand the use case. I'm saying from a human collateral sense, what is the point of it?
Like we build these things and show them off, without any thought to the ramifications that they could lead to. Maybe I'm catastrophizing, but all this tech lately seems very unregulated/dangerous.
Crypto generally and NFTs in particular are good indicators that things can get people excited and have no substance. Even scams and Ponzi schemes have "target audiences" but that doesn't make them any more useful or good.
Right but this is generating decent-quality video in segments longer than the time of your average movie shot. I'm sure it'd take some fiddling but I'm excited for a model this good to come out so I can try some fancy multi-shot videos.
I saw someone else say "I'm sure it'll be crap like all of the other AI stuff I've seen" but that's a naive view. Things that have been 100% created by AI, sure they're kind of boring a lot of the time. But this kind of tech gives people with a creative mind, but no money or time or resources to create a storytelling movie/video, the resources to do it. Obv ignoring the fact that Goog will never release this, if something like this did come out, it'd be game changing for a lot of people.
Think about something like RPG Maker. Yeah we've had a ton of random garbage come out of that platform but there were also incredible games.
AI isn't just some garbage maker. It is a paint brush that enables people who are alone in their room to make something bigger than them.
I've used SD to generate novel clipart with my kid for their school project to make a board game. It isn't taking away from an artist, I would never in a million years pay an artist to create throwaway art for a corner of a spray painted cardboard box. The alternative would be nothing or my kids scribbling in something of their own hand. But they were interested and it was available so it went from simple and plain to "custom" and rather nice and polished looking.
FWIW, my kid also designed their own board game pieces in TinkerCAD and we 3D printed them. It's nothing special but it's frankly astounding how far kids can go now towards creating something not just imaginative but almost professional quality with the tools at their disposal now. For throwaway school projects. It may not be my kids, but I'm excited for what the next generation will be able to accomplish without massive capital requirements to fulfill their vision and create something.
The same can be said for generators like Midjourney or Stable Diffusion.
The target market is people and organisations who like/want/need the speed and low cost of generated "art" and prefer not dealing with external real world artists that need to be fairly compensated and will take time to produce an art piece.
Also laws are very murky on this for the moment (naturally, since it's a very recent new thing), and some consider that AI "art" can't be copyrighted. The EU is currently working on a new AI framework which will probably cover that.
Many of these examples are combinations of realistic objects and scenes from real world, these aren't in need of artistic interpretation or manual re-creation or animation.