Stable Video Diffusion

btbuildem · 2023-11-21T19:23:01.000000Z

In the video towards the bottom of the page, there are two birds (blue jays), but in the background there are two identical buildings (which look a lot like the CN Tower). CN Tower is the main landmark of Toronto, whose baseball team happens to be the Blue Jays. It's located near the main sportsball stadium downtown.

I vaguely understand how text-to-image works, and so it makes sense that the vector space for "blue jays" would be near "toronto" or "cn tower". The improvements in scale and speed (image -> now video) are impressive, but given how incredibly able the image generation models are, they simultaneously feel crippled and limited by their lack of editing / iteration ability.

Has anyone come across a solution where model can iterate (eg, with prompts like "move the bicycle to the left side of the photo")? It feels like we're close.

TacticalCoder · 2023-11-21T19:45:04.000000Z

> Has anyone come across a solution where model can iterate (eg, with prompts like "move the bicycle to the left side of the photo")? It feels like we're close.

I feel like we're close too, but for another reason.

For although I love SD and these video examples are great... It's a flawed method: they never get lighting correctly and there are many incoherent things just about everywhere. Any 3D artist or photographer can immediately spot that.

However I'm willing to bet that we'll soon have something much better: you'll describe something and you'll get a full 3D scene, with 3D models, source of lights set up, etc.

And the scene shall be sent into Blender and you'll click on a button and have an actual rendering made by Blender, with correct lighting.

Wanna move that bicycle? Move it in the 3D scene exactly where you want.

That is coming.

And for audio it's the same: why generate an audio file when soon models shall be able to generate the various tracks, with all the instruments and whatnots, allowing to create the audio file?

That is coming too.

epr · 2023-11-21T20:26:04.000000Z

> you'll describe something and you'll get a full 3D scene, with 3D models, source of lights set up, etc.

I'm always confused why I don't hear more about projects going in this direction. Controlnets are great, but there's still quite a lot of hallucination and other tiny mistakes that a skilled human would never make.

boppo1 · 2023-11-21T20:42:28.000000Z

Blender files are dramatically more complex than any image format, which are basically all just 2D arrays of 3-value vectors. The blender filetype uses a weird DNA/RNA struct system that would probably require its own training run.

More on the Blender file format: https://fossies.org/linux/blender/doc/blender_file_format/my...

mikepurvis · 2023-11-21T20:51:07.000000Z

But surely you wouldn't try to emit that format directly, but rather some higher level scene description? Or even just a set of instructions for how to manipulate the UI to create the imagined scene?

numpad0 · 2023-11-22T04:28:09.000000Z

It sure feels weird to me as well, that GenAI is always supposed to be end-to-end with everything done inside NN blackbox. No one seems to be doing image output as SVG or .ai.

metanonsense · 2023-11-22T09:11:23.000000Z

Imo the thinking is that whenever humans have tried to pre-process or feature-engineer a solution or tried to find clever priors in the past, massive self-supervised-learning enabled, coarsely architected, data-crunching NNs got better results in the end. So, many researchers / industry data scientists may just be disinclined to put effort into something that is doomed to be irrelevant in a few years. (And, of course, with every abstraction you will lose some information that may bear more importance than initially thought)

fy20 · 2023-11-22T23:35:16.000000Z

The way that website builders using GenAI work is they have a LLM generate the copy, then find a template that matches that and fill it out. This basically means the "visual creativity" part is done by a human, as the templates are made and reviewed by a human.

LLMs are good at writing copy that sounds accurate and creative enough, and there are known techniques to improve that (such as generating an outline first, then generating each section separately). If you then give them a list of templates, and written examples of what they are used for, the LLM is able to pick one that's a suitable match. But this is all just probability, there's no real creativity here.

Earlier this year I played around with trying to have GPT-3 directly output an SVG given a prompt for a simple design task (a poster for a school sports day), and the results were pretty bad. It was able to generate a syntantically coreect SVG, but the design was terrible. Think using #F00 and #0F0 as colours, placing elements outside the screen boundaries, layering elements so they are overlapping.

This was before GPT-4, so it would be interesting to repeat that now. Given the success people are having with GPT-4V, I feel that it could just be a matter of needing to train a model to do this specific task.

HammadB · 2023-11-22T08:30:03.000000Z

There is a fundamental disconnect between industry and academia here.

maccard · 2023-11-22T11:02:55.000000Z

Over the last 10 years of industry work, I'd say about 20% of my time has been format shifting, or parsing half baked undocumented formats that change when I'm not paying attention.

That pretty much matches my experience working with NN's and LLM's

BirdieNZ · 2023-11-21T21:19:03.000000Z

I've seen this but producing Python scripts that you run in Blender, e.g. https://www.youtube.com/watch?v=x60zHw_z4NM (but I saw something marginally more impressive, not sure where though!)

bsenftner · 2023-11-22T12:27:03.000000Z

My god that is an irritating video style, "AI woweee!"

mikebelanger · 2023-11-22T02:58:27.000000Z

Yeah I'd imagine that's the best way. Lots of LLMs can generate workable Python code too, so code that jives with Blender's Python API doesn't seem like too much of a leap.

The only trick is that there has to be enough Blender Python code to train the LLM on.

arcticbull · 2023-11-22T03:42:49.000000Z

Maybe something like OpenSCAD is a good middle ground. Procedural code-like format for specifying 3D objects that can then be converted and imported in Blender.

lightedman · 2023-11-22T05:56:17.000000Z

I tried all the AI stuff that I could on OpenSCAD.

While it generates a lot of code that initially makes sense, when you use the code, you get a jumbled block.

regularfry · 2023-11-22T08:32:11.000000Z

This. I think problem is that the LLMs really struggle with 3d scene understanding, so what you would need to do is generate code that generates code.

But also I suspect there just isn't that much openscad code in the training data, and the semantics are different enough to python or any of the other languages that are well-represented that it struggles.

Keyframe · 2023-11-21T20:54:28.000000Z

Scene layouts, models and their attributes are a result of user input (ok and sometimes program output). One avenue to take there would be to train on input expecting an output. Like teaching a model to draw instead of generate images.. which in a sense we already did by broadly painting out silhouettes and then rendering details.

guyomes · 2023-11-21T22:00:56.000000Z

Voxel files could be a simpler step for 3D images.

dragonwriter · 2023-11-21T22:57:11.000000Z

> I'm always confused why I don't hear more about projects going in this direction.

Probably because they aren't as advanced and the demos aren't as impressive to nontechnical audiences who don't understand the implications: there’s lots of work on text-to-3d-model generation, and even plugins for some stable diffusion UIs (e.g., MotionDiff for ComyUI.)

lairv · 2023-11-21T23:23:36.000000Z

I think the bottleneck is data

For single 3D object the biggest dataset is ObjaverseXL with 10M samples

For full 3D scenes you could at best get ~1000 scenes with datasets like ScanNet I guess

Text2Image models are trained on datasets with 5 billion samples

bsenftner · 2023-11-22T12:33:37.000000Z

Oh, I don't know about that. Working in feature film animation, studios have gargantuan model libraries from current and past projects, with a good number (over half) never used by a production but created as part of some production's world building. Plus, generative modeling has been very popular for quite a few years. I don't think getting more 3D models then they could use is a real issue for anyone serious.

senseiV · 2023-11-22T16:43:15.000000Z

Where can you find those? I'm in the same situation as him, I've never heard of a 3d dataset better than objaverse XL.

Got a public dataset?

bsenftner · 2023-11-22T17:29:00.000000Z

These are not public datasets, but with some social engineering I bet one could get access.

I've not worked in VFX for a while, but when I did the modeling departments at multiple studios had giant libraries of completed geometries for every project they ever did, plus even larger libraries of all the pieces and parts they use as generic lego geometry whenever they need something new.

Every 3D modeler I know has their own personal libraries of things they'd made as well as their own "lego sets" of pieces and parts and generative geometry tools they use when making new things.

Now this is just a guess, but do you know anyone going through one of those video game schools? I wager the schools have big model libraries for the students as well. Hell, I bet Ringling and Sheridan (the two Harvards of Animation) have colossally sized model libraries for use by their students. Contact them.

jowday · 2023-11-21T21:47:07.000000Z

There's a lot of issues with it, but perhaps the biggest is that there aren't just troves of easily scrapable and digestible 3D models lying around on the internet to train on top of like we have with text, images, and video.

Almost all of the generative 3D models you see are actually generative image models that essentially (very crude simplification) perform something like photogrammetry to generate a 3D model - 'does this 3D object, rendered from 25 different views, match the text prompt as evaluated by this model trained on text-image pairs'?

This is a shitty way to generate 3D models, and it's why they almost all look kind of malformed.

sterlind · 2023-11-21T22:12:44.000000Z

If reinforcement learning were farther along, you could have it learn to reproduce scenes as 3D models. Each episode's task is to mimic an image, each step is a command mutating the scene (adding a polygon, or rotating the camera, etc.), and the reward signal is image similarity. You can even start by training it with synthetic data: generate small random scenes and make them increasingly sophisticated, then later switch over to trying to mimic images.

You wouldn't need any models to learn from. But my intuition is that RL is still quite weak, and that the model would flounder after learning to mimic background color and placing a few spheres.

skdotdan · 2023-11-21T22:18:43.000000Z

Deepmind tried something similar in 2018 https://deepmind.google/discover/blog/learning-to-write-prog...

sanitycheck · 2023-11-22T12:50:22.000000Z

From my very clueless perspective, it seems very possible to train an AI to use Blender to create images in a mostly unsupervised way.

So we could have something to convert AI-generated image output into 3D scenes without having to explicitly train the "creative" AI for that.

Probably much more viable, because the quantity of 3D models out in the wild is far far lower than that of bitmap images.

eigenvalue · 2023-11-22T02:28:26.000000Z

I think this recent Gaussian Splatting technique could end up working really well for generative models, at least once there is a big corpus of high quality scenes to train on. Seems almost ideal for the task because it gets photorealistic results from any angle, but in a sparse, data efficient way, and it doesn’t require a separate rendering pipeline.

bozhark · 2023-11-21T21:13:18.000000Z

One was on the front page the other day, I’ll search for a link

insanitybit · 2023-11-22T00:05:33.000000Z

I assume because it's still extremely early.

bob1029 · 2023-11-21T20:21:29.000000Z

> However I'm willing to bet that we'll soon have something much better: you'll describe something and you'll get a full 3D scene, with 3D models, source of lights set up, etc.

I agree with this philosophy - Teach the AI to work with the same tools the human does. We already have a lot of human experts to refer to. Training material is everywhere.

There isn't a "text-to-video" expert we can query to help us refine the capabilities around SD. It's a one-shot, Jupiter-scale model with incomprehensible inertia. Contrast this with an expert-tuned model (i.e. natural language instructions) that can be nuanced precisely and to the the point of imperceptibility with a single sentence.

The other cool thing about the "use existing tools" path is that if the AI fails part way through, it's actually possible for a human operator to step in and attempt recovery.

whywhywhywhy · 2023-11-22T14:03:48.000000Z

Nah I disagree, this feels like a glorification of the process not the end result. Just because having the 3D model in the scene with all the lighting makes the end result feel more solid to you because you feel you can see the work that's going into it.

In the end diffusion technology can make a more realistic image faster than a rendering engine can.

I feel pretty strongly that this pipeline will be the foundation for most of the next decade of graphics and making things by hand in 3D will become extremely niche because lets face it anyone who has worked in 3D it's tedious, it's time consuming, takes large teams and it's not even well paid.

The future is just tools that give us better controls and every frame will be coming from latent space not simulated photons.

I say this as someone who had done 3D professionally in the past.

pegasus · 2023-11-22T16:06:05.000000Z

Nah, I agree with GP. Who didn't suggest making 3D scenes by hand, but the opposite: create those 3D scenes using the generative method, use ray-tracing or the like to render the image. Maybe have another pass through a model to apply any touch-ups to make it more gritty and less artificial. This way things can stay consistent and sane, avoiding all those flaws which are so easy to spot today.

whywhywhywhy · 2023-11-23T00:01:27.000000Z

I know exactly what OP suggested but why are you both glorifying the fact there is a 3D scene graph made in the middle and then slower rendering at the end when the tech can just go from the first thing to a better finished thing?

pegasus · 2023-11-23T07:56:13.000000Z

Because it just can't. And it won't. It can't even reliably produce consistent shadows in a still image, so when we talk video with a moving camera, all bets are off. To create flawless movie simulations through a dynamic and rich 3D world, requires an ability of internally represent that scene with a level of accuracy which is beyond what we can hope generative models to achieve, even with the gargantuan amount of GPU-power behind ChatGPT, for example. ChatGPT, may I remind you, can't even properly simulate large-ish multiplications. I think you may need to slightly recalibrate your expectations for generative tech here.

bbor · 2023-11-22T16:30:08.000000Z

I find that very unlikely. LLMs seem capable of simulating human intuition, but not great at simulating real complex physics. Human intuition of how a scene “should” look isn’t always the effect you want to create, and is rarely accurate im guessing

dragonwriter · 2023-11-22T16:43:30.000000Z

> LLMs seem capable of simulating human intuition, but not great at simulating real complex physics.

Diffusion models aren't LLMs (they may use something similar as their text encoder layer) and they simulate their training corpus, which usually isn't selected solely for physical fidelity, because that's not actually the single criteria for visual imagery outside of what is created by diffusion models.

bbor · 2023-11-23T23:35:39.000000Z

Huh fair enough. I mean they are large models based on language but I see your point. Even though everything you said is true, I still believe there’s a place for human-constructed logically-explicit simulations and functions. In general, and in visual arts.

coldtea · 2023-11-21T21:34:48.000000Z

>For although I love SD and these video examples are great... It's a flawed method: they never get lighting correctly and there are many incoherent things just about everywhere. Any 3D artist or photographer can immediately spot that.

The question is whether the 99% of the audience would even care...

COAGULOPATH · 2023-11-22T04:47:32.000000Z

Of course they would. The internet spent a solid month laughing at the Sonic the Hedgehog movie because Sonic had weird-looking teeth.

coldtea · 2023-11-22T11:47:01.000000Z

Since that movie did well and spawned 2 sequels, the real conclusion is that the viewers didn't really care.

As for "the internet", there will always some small part of it which will obsess and/or laught over anything, doesn't mean they represent anything significant - not even when they're vocal.

PawgerZ · 2023-11-22T14:11:30.000000Z

Viewers did care: the teeth got changed before the movie was released. And, I don't know if you missed it, but it wasn't just one niche of the internet commenting on his teeth. The "outrage" went mainstream; even dentists were making hit-pieces on Sonic's teeth. I'm not gonna lie, it was amazing marketing for the movie, intentional or not.

ekianjo · 2023-11-22T10:01:23.000000Z

No they laughed at it because it looked awful in every single way

atentaten · 2023-11-21T20:05:36.000000Z

Whats your reasoning for feeling that we're close?

cptaj · 2023-11-21T20:27:09.000000Z

We do it for text, audio and bitmapped images. A 3D scene file format is no different, you could train a model to output a blender file format instead of a bitmap.

It can learn anything you have data for.

Heck, we do it with geospatial data already, generating segmentation vectors. Why not 3D?

boppo1 · 2023-11-21T20:44:46.000000Z

>3D scene file format is no different

Not in theory, but the level of complexity is way higher and the amount of data available is much smaller.

Compare bitmaps to this: https://fossies.org/linux/blender/doc/blender_file_format/my...

kaibee · 2023-11-21T21:20:59.000000Z

Also the level of fault tolerance... if your pixels are a bit blurry, chances are no one notices at a high enough resolution. If your json is a bit blurry you have problems.

astrange · 2023-11-22T01:41:38.000000Z

You can do "constrained decoding" on a code model which keeps it grammatically correct.

But we haven't gotten diffusion working well for text/code, so generating long files is a problem.

DougBTX · 2023-11-22T12:30:27.000000Z

Recent results for code diffusion here: https://www.microsoft.com/en-us/research/publication/codefus...

I'm not experienced enough to validate their claims, but I love the choice of languages to evaluate on:

> Python, Bash and Excel conditional formatting rules.

dragonwriter · 2023-11-21T22:52:19.000000Z

We do it for 3D, too.

https://guytevet.github.io/mdm-page/

jncfhnb · 2023-11-21T22:48:27.000000Z

Text, audio, and bitmapped images are data. Numbers and tokens.

A 3D scene is vastly more complex, and the way you consume it is tangential to the rendering of it we use to interpret. It is a collection of arbitrary data structures.

We’ll need a new approach for this kind of problem

dragonwriter · 2023-11-21T22:53:38.000000Z

> Text, audio, and bitmapped images are data. Numbers and tokens.

> A 3D scene is vastly more complex

3D scenes, in fact, are also data, numbers and tokens. (Well, numbers, but so are tokens.)

jncfhnb · 2023-11-22T00:34:57.000000Z

As I stated and you selectively omitted, 3D scenes are collections of many arbitrary data structures.

Not at all the same as fixed sized arrays representing images.

dragonwriter · 2023-11-22T00:38:13.000000Z

Text gen, one of the things you contrast 3d to, similarly isn't fixed size (capped in most models, but not fixed.)

In fact, the data structures of a 3D scene can be serialized as text, and a properly trained text gen system could generate such a representation directly, though that's probably not the best route to decent text-to-3d.

jncfhnb · 2023-11-22T01:49:24.000000Z

Text is a standard sized embedding vector that gets passed one at a time to an LLM. All tokens have the same shape. Each token is processed one at a time. All tokens also have a pre defined order. It is very different and vastly simpler.

Serializing 3D models as text is not going to work for negligibly non trivial circumstances.

btbuildem · 2023-11-22T13:53:57.000000Z

That indeed sounds like a very plausible solution -- working with AI on the level of scene definitions, model geometries etc.

However, 3D is just one approach to rendering visuals. There are so many other styles and methods how people create images, and if I understand correctly, we can do image-to-text to analyze image content, as well as text-to-image to generate it - regardless of the orginal method (3d render or paintbrush or camera lens). There are some "fuzzy primitives" in the layers there that translate to the visual elements.

I'm hoping we see "editors" that let us manipulate / edit / iterate over generated images in terms of those.

wruza · 2023-11-22T09:26:34.000000Z

Not that I’m against the described 3d way, but personally I don’t care about light and shadows until it’s so bad that I do. This obsession with realism is irrational in video games. In real life people don’t understand why light works like this or like that. We just accept it. And if you ask someone to paint how it should work, the result is rarely physical but acceptable. It literally doesn’t matter until it’s very bad.

Kuinox · 2023-11-21T22:46:27.000000Z

This isn't coming, it's already here. https://github.com/gsgen3d/gsgen Yes, it's just 3D models for now, but it can do whole scenes generations, it's just not great yet at it. The tech is there but just need to improve.

p1esk · 2023-11-21T19:56:12.000000Z

Are you working on all that?

cptaj · 2023-11-21T20:24:41.000000Z

Probably not. But there does seem to be a clear path to it.

The main issue is going to be having the right dataset. You basically need to record user actions in something like blender (ie: moving a model of a bike to the left of a scene), match it to a text description of the action (ie; "move bike to the left") and match those to before/after snapshots of the resulting file format.

You need a whole metric fuckton of these.

After that, you train your model to produce those 3d scene files instead of image bitmaps.

You can do this for a lot of other tasks. These general purpose models can learn anything that you can usefully represent in data.

I can imagine AGI being, at least in part, a large set of these purpose trained models. Heck, maybe our brains work this way. When we learn to throw a ball, we train a model in a subset of our brain to do just this and then this model is called on by our general consciousness when needed.

Sorry, I'm just rambling here but its very exciting stuff.

sterlind · 2023-11-21T22:18:32.000000Z

The hard part of AGI is the self-training and few examples. Your parents didn't attach strings to your body and puppeteer you through a few hundred thousand games of baseball. And the humans that invented baseball had zero training data to go on.

p1esk · 2023-11-22T00:55:01.000000Z

Your body is a result of a billion year old evolutionary optimization process. GPT-4 was trained from scratch in a few months.

filipezf · 2023-11-21T23:40:34.000000Z

I have for some time planning to do a 'Wikipedia for AI' (even bought a domain), where people could contribute all sorts of these skills ( not only 3d video, but also manual skills, or anything). Given the current climate of 'AI will save/doom us', and that users would in some sense be training their own replacements, I don't know how much love such site would have, though.

sheepscreek · 2023-11-21T22:36:19.000000Z

Excellent point.

Perhaps a more computationally expensive but better looking method will be to pull all objects in the scene from a 3D model library, then programmatically set the scene and render it.

internet101010 · 2023-11-21T21:19:41.000000Z

I am guessing it will be similar to inpainting in normal stable diffusion, which is easy when using the workflow feature InvokeAI ui.

a_bouncing_bean · 2023-11-21T20:54:31.000000Z

Thanks! this is exactly what I have been thinking, only you've expressed it much more eloquently than I would be able.

solarkraft · 2023-11-22T00:38:28.000000Z

Where is the training data coming from?

jwoodbridge · 2023-11-22T04:31:10.000000Z

we're working on this if you want to give it a try - dream3d.com

hackerlight · 2023-11-22T06:36:21.000000Z

You should put a demo on the landing page

jwoodbridge · 2023-11-22T12:52:33.000000Z

just redid the ux and making a new one, but here's a quick example: https://www.loom.com/share/fa84ba92d7144179ac17ece9bf7fbd99

xianshou · 2023-11-21T19:34:52.000000Z

Emu edit should be exactly what you're looking for: https://ai.meta.com/blog/emu-text-to-video-generation-image-...

smcleod · 2023-11-21T21:13:39.000000Z

It doesn’t look like the code for that is available anywhere though?

01100011 · 2023-11-22T00:42:32.000000Z

I recently tried to generate clip art for a presentation using GPT-4/DALL-E 3. I found it could handle some updates but the output generally varied wildly as I tried to refine the image. For instance, I'd have a cartoon character checking its watch and also wearing a pocket watch. Trying to remove the pocket watch resulted in an entirely new cartoon with little stylistic continuity to the first.

Also, I originally tried to get the 3 characters in the image to be generated simultaneously, but eventually gave up as DALL-E had a hard time understanding how I wanted them positioned relative to each other. I just generated 3 separate characters and positioned them in the same image using Gimp.

btbuildem · 2023-11-22T13:57:10.000000Z

Yes that's exactly what I'm referring to! It feels as if there is no context continuity between the attempts.

filterfiber · 2023-11-21T19:58:01.000000Z

> Has anyone come across a solution where model can iterate (eg, with prompts like "move the bicycle to the left side of the photo")? It feels like we're close.

Emu can do that.

The bluejay/toronto thing may be addressable later (I suspect via more detailed annotations a la dalle3) - these current video models are highly focused on figuring out temporal coherence

amoshebb · 2023-11-21T20:18:13.000000Z

I wonder what other odd connections are made due to city-name almost certainly being the most common word next to sportsball-name.

Do the parameters think that Jazz musicians are mormon? Padres often surf? Wizards like the Lincoln Memorial?

dsmmcken · 2023-11-22T02:36:43.000000Z

Adobe is doing some great work here in my opinion in terms of building AI tools that make sense for artist workflows. This "sneak peak" demo from the recent Adobe Max conference is pretty much exactly what you described, actually better because you can just click on an object in the image and drag it.

See video: https://www.adobe.com/max/2023/sessions/project-stardust-gs6...

btbuildem · 2023-11-22T14:10:59.000000Z

Right, that's embedded directly into the existing workflow. Looks like a very powerful feature indeed.

thatoneguy · 2023-11-22T17:07:32.000000Z

Makes me wonder if they train their data on everything anyone has ever uploaded to Creative Cloud.

achileas · 2023-11-22T01:38:59.000000Z

> Has anyone come across a solution where model can iterate (eg, with prompts like "move the bicycle to the left side of the photo")? It feels like we're close.

Nearly all of the available models have this, even the highly commercialized ones like in Adobe Firefly and Canva, it’s called inpainting in most tools.

btbuildem · 2023-11-22T14:05:34.000000Z

I think that's more "inpainting" where the existing software solution uses AI to accelerate certain image editing tasks. I was looking for whole-image manipulation at the "conceptual" level.

achileas · 2023-11-27T21:02:54.000000Z

They have this. Inpainting is just a subset of the image-to-image workflow and you don't have to provide a region if you want to do whole-image manipulation.

omneity · 2023-11-22T14:57:10.000000Z

Nice eye!

As for your last question yes that exists. There are two models from Meta that do exactly this, instruction based iteration on photos, Emu Edit[0], and videos, Emu Video[1].

There's also LLaVa-interactive[2] for photos where you can even chat with the model about the current image.

[0]: https://emu-edit.metademolab.com/

[1]: https://emu-video.metademolab.com/

[2]: https://llava-vl.github.io/llava-interactive/

COAGULOPATH · 2023-11-22T04:42:08.000000Z

> they simultaneously feel crippled and limited by their lack of editing / iteration ability.

Yeah. They're not "videos" so much as images that move around a bit.

This doesn't really look any better than those Midjourney + RunwayML videos we had half a year ago.

>Has anyone come across a solution where model can iterate (eg, with prompts like "move the bicycle to the left side of the photo")? It feels like we're close.

Google has a model called Phenaki that supposedly allows for that kind of stuff. But the public can't use it so it's hard to say how good it actually is.

treesciencebot · 2023-11-21T19:57:08.000000Z

Have you seen fal.ai/dynamic where you can perform image to image synthesis (basically editing an existing image with the help of diffusion process) using LCMs to provide a real time UI?

appplication · 2023-11-21T19:29:16.000000Z

I don’t spend a lot of time keeping up with the space, but I could have sworn I’ve seen a demo that allowed you to iterate in the way you’re suggesting. Maybe someone else can link it.

ssalka · 2023-11-21T19:48:32.000000Z

My guess is you're thinking of InstructPix2Pix[1], with prompts like "make the sky green" or "replace the fruits with cake"

[1] https://github.com/timothybrooks/instruct-pix2pix

appplication · 2023-11-21T22:33:26.000000Z

This is exactly it!

tjoff · 2023-11-21T20:22:46.000000Z

Emu-Edit is the closest I've seen.

https://emu-edit.metademolab.com/

https://ai.meta.com/blog/emu-text-to-video-generation-image-...

accrual · 2023-11-21T19:45:43.000000Z

It's not exactly like GP described (e.g. move bike to the left) but there is a more advanced SD technique called inpainting [0] that allows you to manually recompose parts of the image, e.g. to fix bad eyes and hands.

[0] https://stable-diffusion-art.com/inpainting_basics/

JoshTriplett · 2023-11-21T20:05:36.000000Z

I also wonder if the model takes capitalization into account. Capitalized "Blue Jays" seems more likely to reference the sports team; the birds would be lowercase.

zeckalpha · 2023-11-22T06:02:08.000000Z

I see that as a reference to the AI generated Toronto Blue Jays advertisement gone wrong that went viral earlier this year. https://www.blogto.com/sports_play/2023/06/ai-generated-toro...

stevage · 2023-11-22T01:21:29.000000Z

I wondered similarly whether the astronaut's weird gait was because it was kind of "moonwalking" on the moon.

kshacker · 2023-11-21T19:29:27.000000Z

Assuming we can post links, you mean this video: https://youtu.be/G7mihAy691g?si=o2KCmR2Uh_97UQ0N

Also, maybe you can't edit post facto, but when you give prompts, would you not be able to say : two blue jays but no CN tower

FrozenTuna · 2023-11-21T19:31:23.000000Z

Yes, its called a negative prompt. Idk if txt2video has it, but both llms and stable-diffusion have it so I'd assume its good to go.

nottheengineer · 2023-11-21T19:38:25.000000Z

Haven't implemented negative prompts yet, but from what I can tell it's as simple as substracting from the prompt in embedding space.

FrozenTuna · 2023-11-21T19:30:48.000000Z

Not exactly what you're asking for, but AnimateDiff has introduced creating gifs to SD. Still takes quite a bit of tweaking IME.

ProfessorZoom · 2023-11-21T20:21:11.000000Z

that sounds like v0 by vercel, you can iterate just like you asked, to combine that type of iteration with video would be really awesome

psunavy03 · 2023-11-21T20:07:15.000000Z

> sportsball

This is not the flex you think it is. You don't have to like sports, but snarking on people who do doesn't make you intellectual, it just makes you come across as a douchebag, no different than a sports fan making fun of "D&D nerds" or something.

Zetaphor · 2023-11-21T20:24:27.000000Z

This has become a colloquial term for describing all sports, not the insult you're perceiving it to be.

Rather than projecting your own hangups and calling people names, try instead assuming that they're not trying to offend you personally and are just using common vernacular.

achileas · 2023-11-22T01:36:57.000000Z

If only there was an existing way to refer to sports generally! And OP was referring to a specific sport (baseball), not sports generally.

btbuildem · 2023-11-22T14:03:06.000000Z

The Rogers Centre hosts baseball, football, and basketball games - so in this case "sportsball" was just a shorthand for all these ball sports.

jojobas · 2023-11-22T07:16:22.000000Z

Would you get incensed by "petrolhead", "greenfingers" or "trekkie"? Is that what you choose to be emotional about?

callalex · 2023-11-22T17:31:05.000000Z

You’re really not helping the “sports fans are combative thugs” stereotype by going off on an insult tirade over an innocent word.

chaps · 2023-11-21T20:15:08.000000Z

Ah, Mr. Kettle, I see you've met my friend, Mr. Pot!

valine · 2023-11-21T19:20:02.000000Z

The rate of progress in ML this past year has been breath taking.

I can’t wait to see what people do with this once controlnet is properly adapted to video. Generating videos from scratch is cool, but the real utility of this will be the temporal consistency. Getting stable video out of stable diffusion typically involves lots of manual post processing to remove flicker.

alberth · 2023-11-21T20:33:30.000000Z

What was the big “unlock” that allowed so much progress this past year?

I ask as a noob in this area.

4death4 · 2023-11-21T20:45:48.000000Z

I think these are the main drivers behind the progress:

- Unsupervised learning techniques, e.g. transformers and diffusion models. You need unsupervised techniques in order to utilize enough data. There have been other unsupervised techniques in the past, e.g. GANs, but they don't work as well.

- Massive amounts of training data.

- The belief that training these models will produce something valuable. It costs between hundreds of thousands to millions of dollars to train these models. The people doing the training need to believe they're going to get something interesting out at the end. More and more people and teams are starting to see training a large model as something worth pursuing.

- Better GPUs, which enables training larger models.

- Honestly the fall of crypto probably also contributed, because miners were eating a lot of GPU time.

mkaic · 2023-11-21T21:11:12.000000Z

I don't think transformers or diffusion models are inherently "unsupervised", especially not the way they're used in Stable Diffusion and related models (which are very much trained in a supervised fashion). I agree with the rest of your points though.

ebalit · 2023-11-21T21:27:46.000000Z

Generative methods have usually been considered unsupervised.

You're right that conditional generation start to blur the lines though.

n2d4 · 2023-11-21T23:45:27.000000Z

"Generative AI" is a misnomer; it's not the same kind of "generative" as the G in GAN.

While you're right about GANs, diffusion models as transformers as transformers are most commonly trained with supervised learning.

ebalit · 2023-11-21T23:54:31.000000Z

I disagree. Diffusion models are trained to generate the probability distribution of their training dataset, like other generative models (GAN, VAE, etc). The fact that the architecture is a Transformer (or a CNN with attention like in Stable Diffusion) is orthogonal to the generative vs discriminative divide.

Unsupervised is a confusing term as there is always an underlying loss being optimized and working as a supervision signal, even for good old kmeans. But generative models are generally considered to be part of unsupervised methods.

valec · 2023-11-22T05:11:16.000000Z

self-supervised is a better term

JCharante · 2023-11-22T13:35:38.000000Z

> The belief that training these models will produce something valuable

Exactly. The growth in the next decade is going to be unimaginable because now governments and MNCs believe that there realistically be progress made in this field.

Cyphase · 2023-11-21T20:53:36.000000Z

One factor is that Stable Diffusion and ChatGPT were released within 3 months of each other – August 22, 2022 and November 3, 2022, respectively. That brought a lot of attention and excitement to the field. More excitement, more people, more work being done, more progress.

Of course those two releases didn't fall out of the sky.

JCharante · 2023-11-22T13:36:20.000000Z

Dalle 2 also went viral around the same time

mlboss · 2023-11-21T20:37:21.000000Z

Stable diffusion open source release and llama release

alberth · 2023-11-21T20:45:44.000000Z

But what technically allowed for so much progress?

There’s been open source AI/ML for 20+ years.

Nothing comes close to the massive milestones over the past year.

kmeisthax · 2023-11-21T21:14:54.000000Z

Attention, transformers, diffusion. Prior image synthesis techniques - i.e. GANs - had problems that made it difficult to scale them up, whereas the current techniques seem to have no limit other than the amount of RAM in your GPU.

mschuster91 · 2023-11-22T01:04:52.000000Z

> But what technically allowed for so much progress?

The availability of GPU compute time. Up until the Russian invasion into Ukraine, interest rates were low AF so everyone and their dog thought it would be a cool idea to mine one or another sort of shitcoin. Once rising interest rates killed that business model for good, miners dumped their GPUs on the open market, and an awful lot of cloud computing capacity suddenly went free.

fragmede · 2023-11-22T05:29:15.000000Z

the Transformers are all you need paper from Google, which may end up being a larger contribution to society than Google search, is foundational.

Emad Mostaque and his investment in stable diffusion, and his decision to release it to the world.

I'm sure there are others, but those are the two that stick out to me.

Chabsff · 2023-11-21T21:06:58.000000Z

Public availability of large transformer-based foundation models trained at great expense, which is what OP is referring to, is definitely unprecedented.

jasonjmcghee · 2023-11-21T21:24:08.000000Z

People figuring out how to train and scale newer architectures (like transfomers) effectively, to be wildly larger than ever before.

Take AlexNet - the major "oh shit" moment in image classification.

It had an absolutely mind-blowing number of parameters at a whopping 62 million.

Holy shit, what a large network, right?

Absolutely unprecedented.

Now, for language models, anything under 1B parameters is a toy that barely works.

Stable diffusion has around 1B or so - or the early models did, I'm sure they're larger now.

A whole lot of smart people had to do a bunch of cool stuff to be able to keep networks working at all at that size.

Many, many times over the years, people have tried to make larger networks, which fail to converge (read: learn to do something useful) in all sorts of crazy ways.

At this size, it's also expensive to train these things from scratch, and takes a shit-ton of data, so research/discovery of new things is slow and difficult.

But, we kind of climbed over a cliff, and now things are absolutely taking off in all the fields around this kind of stuff.

Take a look at XTTSv2 for example, a leading open source text-to-speech model. It uses multiple models in its architecture, but one of them is GPT.

There are a few key models that are still being used in a bunch of different modalities like CLIP, U-Net, GPT, etc. or similar variants. When they were released / made available, people jumped on them and started experimenting.

dragonwriter · 2023-11-21T21:29:06.000000Z

> Stable diffusion has around 1B or so - or the early models did, I'm sure they're larger now.

SDXL is 6.6 billion.

password54321 · 2023-11-22T12:49:22.000000Z

There has been massive progress in ML every year since 2013, partly due to better GPUs and lots of training data. Many are only taking notice now that it is in products but it wasn't that long ago there was skepticism on HN even when software like Codex existed in 2021.

moritonal · 2023-11-22T12:32:45.000000Z

Where do you want to start? The Internet collection and structuring the world's knowledge into a few key repositories? The focus on GPUs in gaming and then the crypto market creating a suite of libraries dedicated to hard scaling math. Or then the miniaturization and focus on energy efficiency due to phones making scaled training cost-effective. Finally the papers released by Google and co which didn't seem to recognise quite how easy it would be to build and replicate upon. Nothing was unlocked apart from a lot of people suddenly noticed how doable all this already was.

marricks · 2023-11-21T23:28:26.000000Z

I mean, you probably didn't pay much attention to battery capacity before phones, laptops, and electric cars, right? Battery capacity has probably increased though at some rate before you paid attention. It's just when something actually becomes relevant that we notice.

Not that more advances don't happen with sustained hype, just there's some sort of tipping point involving usefulness based either on improvement of the thing in question or it's utility elsewhere.

throwaway290 · 2023-11-22T00:16:08.000000Z

MS subsidizing it with 10 billions USD and (un)healthy contempt towards copyright.

Der_Einzige · 2023-11-21T19:36:14.000000Z

Controlnet is adapted to video today, the issues are that it's very slow. Haven't you seen the insane quality of videos on civitai?

valine · 2023-11-21T19:38:19.000000Z

I have seen them, the workflows to create those videos are extremely labor intensive. Control net lets you maintain poses between frames, it doesn’t solve the temporal consistency of small details.

mattnewton · 2023-11-21T19:43:12.000000Z

People use animatediff’s motion module (or other models that have cross frame attention layers). Consistency is close to being solved.

dragonwriter · 2023-11-21T19:52:46.000000Z

Temporal consistency is improving, but “close to being solved” is very optimistic.

mattnewton · 2023-11-21T20:13:52.000000Z

No I think we’re actually close. My source is I’m working on this problem and the incredible progress of our tiny 3 person team at drip.art (http://api.drip.art) - we can generate a lot of frames that are consistent, and with interpolation between them, smoothly restyle even long videos. Cross-frame attention works for most cases, it just needs to be scaled up.

And that’s just for diffusion focused approaches like ours. There are probably other techniques from the token flow or nerf family of approaches close to breakout levels of quality, tons of talented researchers working on that too.

ryukoposting · 2023-11-22T06:18:53.000000Z

The demo clips on the site are cool, but when you call it a "solved problem," I'd expect to see panning, rotating, and zooming within a cohesive scene with multiple subjects.

mattnewton · 2023-11-22T17:52:44.000000Z

Thanks for checking it out! We’re certainly not done yet, but much of what you ask is possible or will be soon on the modeling side and we need tools to expose that to a sane workflow in traditional video editors.

Hard_Space · 2023-11-22T07:23:58.000000Z

Once a video can show a person twisting round, and their belt buckle is the same at the end as it was at the start of the turn, it's solved. VFX pipelines need consistency. TC is a long, long way from being solved, except by hitching it to 3DMMs and SMPL models (and even then, the results are not fabulous yet).

valine · 2023-11-21T19:50:59.000000Z

Hopefully this new model will be a step beyond what you can do with animatediff

capableweb · 2023-11-21T20:15:54.000000Z

> Haven't you seen the insane quality of videos on civitai?

I have not, so I went to https://civitai.com/ which I guess is what you're talking about? But I cannot find a single video there, just images and models.

Kevin09210 · 2023-11-21T22:57:40.000000Z

https://www.youtube.com/shorts/ZN-NbdFwfNQ

https://www.youtube.com/watch?v=3WWy98ylLT4

https://www.youtube.com/shorts/1vqOjYWEF84

https://www.youtube.com/shorts/jOIb9QbrhZ8

https://www.youtube.com/shorts/C3F_YI84TXA

https://www.youtube.com/shorts/4IqJHozY4F0

https://www.youtube.com/shorts/h3OmBLlm5-g

https://www.youtube.com/shorts/ZT7tuIgSDRk

https://www.youtube.com/shorts/WnUYbsOMyvs

https://www.youtube.com/shorts/BKKqX2aMlSg

The inconsistencies are what's most interesting in these videos in fact

capableweb · 2023-11-22T16:08:17.000000Z

Not sure I'd call that "insane quality", more like neat prototypes. I'm excited where things will be in the future, but clearly it has a long way to go.

adventured · 2023-11-21T23:40:42.000000Z

https://civitai.com/images

Go there, in the top right of the content area it has two drop-downs: Most Reactions | Filters

Under filters, change the media setting to video.

Civitai has a notoriously poor layout for finding/browsing things unfortunately.

dragonwriter · 2023-11-21T23:02:35.000000Z

A small percentage of the images are animations. This id (for obvious reasons) particularly common for images used on the catalog pages for animation-related tools and models, but also its not uncommon for (AnimateDiff-based, mostly) animations to be used to demo the output of other models.

kornesh · 2023-11-22T00:15:57.000000Z

Yeah, solving the flickering problem and achieving temporal consistency will be the key to realize the full potential of generative video models.

Right now, AnimateDiff is leading the way in consistency but I'm really excited to see what people will do with this new model.

hanniabu · 2023-11-21T20:45:33.000000Z

> but the real utility of this will be the temporal consistency

The main utility will me misinformation

firefoxd · 2023-11-22T01:17:09.000000Z

I understand the magnitude of innovation that's going on here. But still feel like we are generating these videos with both hands tied behind our backs. In other words, it's nearly impossible to edit the videos in this constraints. (Imagine trying to edit the blue Jays to get the perfect view).

Since videos are rarely consumed raw, what if this becomes a pipeline in Blender instead? (Blender the 3d software). Now the video becomes a complete scene with all the key elements of the text input animated. You have your textures, you have your animation, you have your camera, you have all the objects in place. We can even have the render engine in the pipeline to increase the speed of video generation.

It may sound like I'm complaining, but I'm just ask making a feature request...

huytersd · 2023-11-22T01:30:43.000000Z

What would solve all these issues is full generation of 3D models that we hopefully get a chance to see over the next decade. I’ve been advocating for a solid LiDAR camera on the iPhone so there is a lot of training data for these LLMs.

ricardobeat · 2023-11-22T02:11:38.000000Z

> I’ve been advocating for a solid LiDAR camera on the iPhone

What do you mean by “advocating”? The iPhone has had a LiDAR camera since 2020.

xvector · 2023-11-22T02:24:44.000000Z

That's probably why they qualified with "solid", the iPhone's LiDAR camera is quite terrible.

huytersd · 2023-11-22T04:01:54.000000Z

Yes, exactly.

jwoodbridge · 2023-11-22T04:29:27.000000Z

we're working on this - dream3d.com

ericpauley · 2023-11-21T19:20:06.000000Z

I'm still puzzled as to how these "non-commercial" model licenses are supposed to be enforceable. Software licenses govern the redistribution of the software, not products produced with it. An image isn't GPL'd because it was produced with GIMP.

yorwba · 2023-11-21T19:37:51.000000Z

The license is a contract that allows you to use the software provided you fulfill some conditions. If you do not fulfill the conditions, you have no right to a copy of the software and can be sued. This enforcement mechanism is the same whether the conditions are that you include source code with copies you redistribute, or that you may only use it for evil, or that you must pay a monthly fee. Of course this enforcement mechanism may turn out to be ineffective if it's hard to discover that you're violating the conditions.

comex · 2023-11-21T19:57:19.000000Z

It also somewhat depends on open legal questions like whether models are copyrightable and, if so, whether model outputs are derivative works of the model. Suppose that models are not copyrightable, due to their not being the product of human creativity (this is debatable). Then the creator can still require people to agree to contractual terms before downloading the model from them, presumably including the usage limitations as well as an agreement not to redistribute the model to anyone else who does not also agree. Agreement can happen explicitly by pressing a button, or potentially implicitly just by downloading the model from them, if the terms are clearly disclosed beforehand. But if someone decides on their own (not induced by you in any way) to violate the contract by uploading it somewhere else, and you passively download it from there, then you may be in the clear.

ronsor · 2023-11-21T21:45:36.000000Z

> Then the creator can still require people to agree to contractual terms before downloading the model from them, presumably including the usage limitations as well as an agreement not to redistribute the model to anyone else who does not also agree.

I don't think it's possible to invent copyright-like rights.

yorwba · 2023-11-21T23:05:53.000000Z

Why not? Two willing parties can agree to bind themselves to all kinds of obligations in a contract as long as they're not explicitly illegal.

Copyleft is an example of someone successfully inventing a copyright-like right by bootstrapping off existing copyright with a specially engineered contract.

frognumber · 2023-11-22T02:37:54.000000Z

There are a few problems:

1) You and I invent our own private "copyright" for data (which is not copyrightable)

2) Everything is fine until my wife walks up to my computer and makes a copy of the data. She's not bound by our private "copyright." She doesn't even know it exists, and shares the data with her bestie.

And... our private pseudo-copyright is dead.

Also: Licenses are not the same as contracts. There are times when something can be both, one, or the other. But there are a lot of limits on how far they reach. The output of a program is rarely copyrightable by the author (as opposed to the user).

yorwba · 2023-11-22T09:30:19.000000Z

> my wife walks up to my computer and makes a copy of the data

As you agreed to in our contract, you now need to compensate me for the damage caused by your failure to prevent unauthorized third-party access. Of course you're free to attempt to recover the sum you have to pay me from your wife.

> The output of a program is rarely copyrightable by the author (as opposed to the user).

The author of the program can make it a condition of letting the user use the program that the user has to assign all copyright to the author of the program, kind of like "By uploading any User Content you hereby grant and will grant Y Combinator and its affiliated companies a nonexclusive, worldwide, royalty free, fully paid up, transferable, sublicensable, perpetual, irrevocable license to copy, display, upload, perform, distribute, store, modify and otherwise use your User Content for any Y Combinator-related purpose in any form, medium or technology now known or later developed." https://www.ycombinator.com/legal/

frognumber · 2023-11-22T14:55:41.000000Z

Okay. Now put yourself in the position of Microsoft, using this scheme for Windows. We'll pretend real copyright doesn't exist, and we've got your hair-brained scheme. This is how it plays out:

1) You have a $1T product.

2) My wife leaks it, or a burglar does. I am a typical consumer, with say, a $20k net worth.

You have two choices:

1) Sue me, recover $20k, and be down $1T (minus $20k, plus litigation fees), and get the press of ruining the life of some innocent random person

2) Not sue me. Be down $1T (including the $20k) .

And yes, the author of a program can put whatever conditions they want into the license: "By using this program, you agree to transfer $1M into my bank account in bit coin, to give me your first-born baby, to swear fealty to me, and to give me your wife it servitude." A court can then read those conditions, have a good laugh, and not enforce them. There are very clear limits on what a court will enforce in licenses (and contracts), and owning the output of a program, and barring exceptional circumstance, courts will not enforce them:

https://www.lexology.com/library/detail.aspx?g=eb52567a-2104...

This is why programmers should learn basic law, not treat is as computer code, and consult lawyers when issues come up. Read by a lawyer, a license or contract with an unenforceable clause is as good as having no such clause.

yorwba · 2023-11-22T17:20:27.000000Z

> There are very clear limits on what a court will enforce in licenses (and contracts), and owning the output of a program, and barring exceptional circumstance, courts will not enforce them:

It seems to me that the cases in the article you linked involved the author of the program arguing that their copyright automatically extended to the output without any extra contractual provisions concerning copyright assignment, so I don't think they can be used as precedent regarding the enforceability of such clauses.

ronsor · 2023-11-23T01:20:37.000000Z

> The author of the program can make it a condition of letting the user use the program that the user has to assign all copyright to the author of the program

I think it is quite likely a court would find that unconscionable.

SXX · 2023-11-21T21:56:40.000000Z

It doesn't have to be enforceable. This licensing model works exactly the same as Microsoft Windows licensing or WinRAR licensing. Lots and lots of people have pirated Windows or just buy some cheap keys off Ebay, but no one of them in their sane mind would use anything like that at their company.

The same way you can easily violate any "non-commercial" clauses of models like this one as private person or as some tiny startup, but company that decide to use them for their business will more likely just go and pay.

So it's possible to ignore license, but legal and financial risks are not worth it for businesses.

taberiand · 2023-11-22T00:14:24.000000Z

I've heard companies also intentionally do not go after individuals pirating software e.g., Adobe Photoshop - it benefits them to have students pirate and skill up on their software and then enter companies that buy Photoshop because their employees know it, over locking down and having those students, and then the businesses, switch to open source.

Duanemclemore · 2023-11-22T19:31:39.000000Z

I'm sure there are plenty of other examples, but in my personal experience this was Autodesk's strategy with AutoCAD. Get market saturation by being extremely light on piracy. Then, once you're the only one standing lower the boom. I remember, it was almost like flipping a switch on a single DAY in the mid-00's when they went from totally lax on unpaid users to suing the bejeezus out of anyone who they had good enough documentation on.

One smart thing they did was they'd check the online job listings and if a firm advertised for needing AutoCAD experience they'd check their licenses. I knew firms who got calls from Autodesk legal the DAY AFTER posting an opening.

dist-epoch · 2023-11-21T19:40:35.000000Z

Visual Studio Community (and many other products) only allows "non-commercial" usage. Sounds like it limits what you can do with what you produce with it.

At the end of the day, a license is a legal contract. If you agree that an image which you produce with some software will be GPL'ed, it's enforceable.

As an example, see the Creative Commons license, ShareAlike clause:

> If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

blibble · 2023-11-21T20:07:33.000000Z

> At the end of the day, a license is a legal contract. If you agree that an image which you produce with some software will be GPL'ed, it's enforceable.

you can put whatever you want in a contract, doesn't mean it's enforceable

antonyt · 2023-11-21T22:42:02.000000Z

Do you have link for the VS Community terms you're describing? What I've found is directly contradictory: "Any individual developer can use Visual Studio Community to create their own free or paid apps." From https://visualstudio.microsoft.com/vs/community/

dist-epoch · 2023-11-21T22:52:34.000000Z

Enterprise organizations are not allowed to use VS Community for commercial purposes:

> In enterprise organizations (meaning those with >250 PCs or >$1 Million US Dollars in annual revenue), no use is permitted beyond the open source, academic research, and classroom learning environment scenarios described above.

antonyt · 2023-11-22T16:58:41.000000Z

I see, thanks!

kmeisthax · 2023-11-21T21:03:51.000000Z

So, there's a few different things interacting here that are a little confusing.

First off, you have copyright law, which grants monopolies on the act of copying to the creators of the original. In order to legally make use of that work you need to either have permission to do so (a license), or you need to own a copy of the work that was made by someone with permission to make and sell copies (a sale). For the purposes of computer software, you will almost always get rights to the software through a license and not a sale. In fact, there is an argument that usage of computer software requires a license and that a sale wouldn't be enough because you wouldn't have permission to load it into RAM[0].

Licenses are, at least under US law, contracts. These are Turing-complete priestly rites written in a special register of English that legally bind people to do or not do certain things. A license can grant rights, or, confusingly, take them away. For example, you could write a license that takes away your fair use rights[1], and courts will actually respect that. So you can also have a license that says you're only allowed to use software for specific listed purposes but not others.

In copyright you also have the notion of a derivative work. This was invented whole-cloth by the US Supreme Court, who needed a reason to prosecute someone for making a SSSniperWolf-tier abridgement[2] of someone else's George Washington biography. Normal copyright infringement is evidenced by substantial similarity and access: i.e. you saw the original, then you made something that's nearly identical, ergo infringement. The law regarding derivative works goes a step further and counts hypothetical works that an author might make - like sequels, translations, remakes, abridgements, and so on - as requiring permission in order to make. Without that permission, you don't own anything and your work has no right to exist.

The GPL is the anticopyright "judo move", invented by a really ornery computer programmer that was angry about not being able to fix their printer drivers. It disclaims almost the entire copyright monopoly, but it leaves behind one license restriction, called a "copyleft": any derivative work must be licensed under the GPL. So if you modify the software and distribute it, you have to distribute your changes under GPL terms, thus locking the software in the commons.

Images made with software are not derivative works of the software, nor do they contain a substantially similar copy of the software in them. Ergo, the GPL copyleft does not trip. In fact, even if it did trip, your image is still not a derivative work of the software, so you don't lose ownership over the image because you didn't get permission. This also applies to model licenses on AI software, insamuch as the AI companies don't own their training data[3].

However, there's still something that licenses can take away: your right to use the software. If you use the model for "commercial" purposes - whatever those would be - you'd be in breach of the license. What happens next is also determined by the license. It could be written to take away your noncommercial rights if you breach the license, or it could preserve them. In either case, however, the primary enforcement mechanism would be a court of law, and courts usually award money damages. If particularly justified, they could demand you destroy all copies of the software.

If it went to SCOTUS (unlikely), they might even decide that images made by software are derivative works of the software after all, just to spite you. The Betamax case said that advertising a copying device with potentially infringing scenarios was fine as long as that device could be used in a non-infringing manner, but then the Grokster case said it was "inducement" and overturned it. Static, unchanging rules are ultimately a polite fiction, and the law can change behind your back if the people in power want or need it to. This is why you don't talk about the law in terms of something being legal or illegal, you talk about it in terms of risk.

[0] Yes, this is a real argument that courts have actually made. Or at least the Ninth Circuit.

The actual facts of the case are even more insane - basically a company trying to sue former employees for fixing it's customers computers. Imagine if Apple sued Louis Rossman for pirating macOS every time he turned on a customer laptop. The only reason why they can't is because Congress actually created a special exemption for computer repair and made it part of the DMCA.

[1] For example, one of the things you agree to when you buy Oracle database software is to give up your right to benchmark the software. I'm serious! The tech industry is evil and needs to burn down to the ground!

[2] They took 300 pages worth of material from 12 books and copied it into a separate, 2 volume work.

[3] Whether or not copyright on the training data images flows through to make generated images a derivative work is a separate legal question in active litigation.

dragonwriter · 2023-11-21T21:14:52.000000Z

> Licenses are, at least under US law, contracts

Not necessarily; gratuitous licenses are not contracts. Licenses which happen to also meet the requirements for contracts (or be embedded in agreements that do) are contracts or components of contracts, but that's not all licenses.

rperez333 · 2023-11-22T01:41:42.000000Z

If a company train the model from scratch, on its own dataset, could the resulting model be used commercially?

cubefox · 2023-11-21T19:23:42.000000Z

Nobody claimed otherwise?

not2b · 2023-11-21T19:34:20.000000Z

There are sites that make Stable Diffusion-derived models available, along with GPU resources, and they sell the service of generating images from the models. The company isn't permitting that use, and it seems that they could find violators and shut them down.

littlethoughts · 2023-11-21T19:38:25.000000Z

Fantasy.ai was subject to controversy for attempting to license models.

Der_Einzige · 2023-11-21T19:35:36.000000Z

They're not enforceable.

stevage · 2023-11-22T01:22:56.000000Z

A software licence can definitely govern who can use it and what they can do with it.

> An image isn't GPL'd because it was produced with GIMP.

That's because of how the GPL is written, not because of some limitation of software licences.

accrual · 2023-11-21T19:28:17.000000Z

Fascinating leap forward.

It makes me think of the difference between ancestral and non-ancestral samplers, e.g. Euler vs Euler Ancestral. With Euler, the output is somewhat deterministic and doesn't vary with increasing sampling steps, but with Ancestral, noise is added to each step which creates more variety but is more random/stochastic.

I assume to create video, the sampler needs to lean heavily on the previous frame while injecting some kind of sub-prompt, like rotate <object> to the left by 5 degrees, etc. I like the phrase another commenter used, "temporal consistency".

Edit: Indeed the special sauce is "temporal layers". [0]

> Recently, latent diffusion models trained for 2D image synthesis have been turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets

[0] https://stability.ai/research/stable-video-diffusion-scaling...

adventured · 2023-11-21T19:51:06.000000Z

The hardest problem the Stable Diffusion community has dealt with in terms of quality has been in the video space, largely in relation to the consistency between frames. It's probably the most commonly discussed problem for example on r/stablediffusion. Temporal consistency is the popular term for that.

So this example was posted an hour ago, and it's jumping all over the place frame to frame (somewhat weak temporal consistency). The author appears to have used pretty straight-forward text2img + Animatediff:

https://www.reddit.com/r/StableDiffusion/comments/180no09/on...

Fixing that frame to frame jitter related to animation is probably the most in-demand thing around Stable Diffusion right now.

Animatediff motion painting made a splash the other day:

https://www.reddit.com/r/StableDiffusion/comments/17xnqn7/ro...

It's definitely an exciting time around SD + animation. You can see how close it is to reaching the next level of generation.

shaileshm · 2023-11-21T23:34:10.000000Z

This field moves so fast. Blink an eye and there is another new paper. This is really cool and the learning speed of us humans is insane! Really excited on using it for downstream tasks! I wonder how easy it is to integrate animatediff with this model?

Also, can someone benchmark it on m3 devices? It would be cool to see if it is worth getting on to run these diffusion inferences and development. If m3 pro can allow finetuning it would be amazing to use it on downstream tasks!

awongh · 2023-11-21T19:29:35.000000Z

It makes sense that they had to take out all of the cuts and fades from the training data to improve results.

I’m the background section of the research paper they mention “temporal convolution layers”, can anyone explain what that is? What sort of training data is the input to represent temporal states between images that make up a video? Or does that mean something else?

flaghacker · 2023-11-21T23:01:01.000000Z

It means that instead of (only) doing convolution in spatial dimensions, it also(/instead) happens in the temporal dimension.

A good resource for the "instead" case: https://unit8.com/resources/temporal-convolutional-networks-...

The "also" case is an example of 3D convolution, an example of a paper that uses it: https://www.cv-foundation.org/openaccess/content_iccv_2015/p...

machinekob · 2023-11-21T21:09:09.000000Z

I would assume is something similar to joining multiple frames/attentions? in channel dimension and then moving values inside so convolution will have access to some channels from other video frames.

I was working on similar idea few years ago using this paper as reference and it was working extremely well for consistency also helping with flicker. https://arxiv.org/abs/1811.08383

epiccoleman · 2023-11-21T19:35:29.000000Z

This is really, really cool. A few months ago I was playing with some of the "video" generation models on Replicate, and I got some really neat results[1], but it was very clear that the resulting videos were made from prompting each "frame" with the previous one. This looks like it can actually figure out how to make something that has a higher level context to it.

It's crazy to see this level of progress in just a bit over half a year.

[1]: https://epiccoleman.com/posts/2023-03-05-deforum-stable-diff...

christkv · 2023-11-21T19:22:12.000000Z

Looks like I'm still good for my bet with some friends that before 2028 a team of 5-10 people will create a blockbuster style movie that today costs 100+ million USD on a shoestring budget and we won't be able to tell.

ben_w · 2023-11-21T20:07:01.000000Z

I wouldn't bet either way.

Back in the mid 90s to 2010 or so, graphical improvements were hailed as photorealistic only to be improved upon with each subsequent blockbuster game.

I think we're in a similar phase with AI[0]: every new release in $category is better, gets hailed as super fantastic world changing, is improved upon in the subsequent Two Minute Papers video on $category, and the cycle repeats.

[0] all of them: LLMs, image generators, cars, robots, voice recognition and synthesis, scientific research, …

Keyframe · 2023-11-21T21:04:57.000000Z

Your comment reminded me of this: https://www.reddit.com/r/gaming/comments/ktyr1/unreal_yes_th...

Many more examples, of course.

ben_w · 2023-11-21T21:34:58.000000Z

Yup, that castle flyby, those reflections. I remember being mesmerised by the sequence as a teenager.

Big quality improvement over Marathon 2 on a mid-90s Mac, which itself was a substantial boost over the Commodore 64 and NES I'd been playing on before that.

Sohcahtoa82 · 2023-11-21T23:07:00.000000Z

> Back in the mid 90s to 2010 or so, graphical improvements were hailed as photorealistic

Whenever I saw anybody calling those graphics "photorealistic", I always had to roll my eyes and question if those people were legally blind.

Like, c'mon. Yeah, they could be large leaps ahead of the previous generation, but photorealistic? Get real.

Even today, I'm not sure there's a single game that I would say has photo-realistic graphics.

ben_w · 2023-11-22T10:53:17.000000Z

> Even today, I'm not sure there's a single game that I would say has photo-realistic graphics.

Looking just at the videos (because I don't have time to play the latest games any more and even if I did it's unreleased), I think that "Unrecord" is also something I can't distinguish from a filmed cinematic experience[0]: https://store.steampowered.com/app/2381520/Unrecord/

Though there are still caveats even there, as the pixelated faces are almost certainly necessary given the state of the art; and because cinematic experiences are themselves fake, I can't tell if the guns are "really-real" or "Hollywood".

Buuuuut… I thought much the same about Myst back in the day, and even the bits that stayed impressive for years (the fancy bedroom in the Stoneship age), don't stand out any more. Riven was better, but even that's not really realistic now. I think I did manage to fool my GCSE art teacher at the time with a printed screenshot from Riven, but that might just have been because printers were bad at everything.

Sohcahtoa82 · 2023-11-22T17:44:28.000000Z

Unrecord looks amazing, I forgot about that one.

IMO, though, the lighting in the indoor scenes is just not quite right. There's something uncanny valley about it to me. When the flashlight shines, it's clearly still a computer render to my eyes.

The outdoor shots, though, definitely look flawless.

deckard1 · 2023-11-21T20:27:32.000000Z

I'm imagining more of an AI that takes a standard movie screenplay and a sidecar file, similar to a CSS file for the web and generates the movie. This sidecar file would contain the "director" of the movie, with camera angles, shot length and speed, color grading, etc. Don't like how the new Dune movie looks? Edit the stylesheet and make it your own. Personalized remixed blockbusters.

On a more serious note, I don't think Roger Deakins has anything to worry about right now. Or maybe ever. We've been here before. DAWs opened up an entire world of audio production to people that could afford a laptop and some basic gear. But we certainly do not have a thousand Beatles out there. It still requires talent and effort.

timeon · 2023-11-21T20:54:15.000000Z

> thousand Beatles out there. It still requires talent and effort

As well as marketing.

CamperBob2 · 2023-11-21T19:28:28.000000Z

It'll happen, but I think you're early. 2038 for sure, unless something drastic happens to stop it (or is forced to happen.)

marcusverus · 2023-11-21T20:23:51.000000Z

I'm pumped for this future, but I'm not sure that I buy your optimistic timeline. If the history of AI has taught us anything, it is that the last 1% of of progress is the hardest half. And given the unforgiving nature of the uncanny valley, the video produced by such a system will be worthless until it is damn-near perfect. That's a tall order!

accrual · 2023-11-21T19:34:50.000000Z

The first full-length AI generated movie will be an important milestone for sure, and will probably become a "required watch" for future AI history classes. I wonder what the Rotten Tomatoes page will look like.

jjkaczor · 2023-11-21T19:46:08.000000Z

As per the reviews - it will be hard to say, as both positive and negative takes will be uploaded by ChatGPT bots (or it's myriad of descendents).

qiine · 2023-11-21T20:04:01.000000Z

"I wonder what the Rotten Tomatoes page will look like"

Surely it will be written using machine vision and llms !

throwaway743 · 2023-11-21T19:52:47.000000Z

Definitely a big first for benchmarks. After that hyper personalized content/media generated on-demand

henriquecm8 · 2023-11-22T13:17:53.000000Z

What I am really looking forward is some Star Trek style holodeck, but I guess we will start with it in VR headsets first.

Geordi: "Computer, in the Holmesian style, create a mystery to confound Data with an opponent who has the ability to defeat him"

rbhuta · 2023-11-22T02:36:06.000000Z

VRAM requirements are big for this launch. We're hosting this for free at https://app.decoherence.co/stablevideo. Disclaimer: Google log-in required to help us reduce spam.