
The paper that comes with this is nearly as crazy as the videos themselves. At a cool 92 pages it's closer to a small book than a normal scientific publication. There are nearly 10 pages of citations alone. I'll have to work through this in the coming days, but here are a few interesting points from the first few sections.

For a long time people have speculated about The Singularity. What happens when AI is used to improve AI in a virtuous circle of productivity? Well, that day has come. To generate videos from text you need video+text pairs to train on. They get that text from more AI. They trained a special Llama3 model that knows how to write detailed captions from images/video and used it to consistently annotate their database of approx 100M videos and 1B images. This is only one of many ways in which they deployed AI to help them train this new AI.
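The captioning loop they describe can be sketched roughly like this (names are illustrative stand-ins; the actual pipeline uses a fine-tuned Llama3 captioner over ~100M videos):

```python
# Sketch of the synthetic-captioning step: an AI model writes the
# detailed caption paired with each video/image. `caption_model` is a
# stand-in for the paper's fine-tuned Llama3 captioner.
def caption_dataset(media_items, caption_model):
    """Return (media, text) training pairs with AI-written captions."""
    pairs = []
    for item in media_items:
        caption = caption_model(item)  # the AI describes the media
        pairs.append({"media": item, "text": caption})
    return pairs

# Toy usage with a fake "model":
fake_model = lambda item: f"a detailed caption for {item}"
pairs = caption_dataset(["clip_001.mp4"], fake_model)
print(pairs[0]["text"])  # -> "a detailed caption for clip_001.mp4"
```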

They do a lot of pre-filtering on the videos to ensure training on high quality inputs only. This is a big recent trend in model training: scaling up data works, but you can do even better by training on less data after dumping the noise. Things they filter out: portrait videos (landscape videos tend to be higher quality, presumably because it gets rid of most low effort phone cam vids), videos without motion, videos with too much jittery motion, videos with letterbox bars, videos with too much text, videos with special motion effects like slideshows, perceptual duplicates, etc. Then they work out the "concepts" in the videos and re-balance the training set to ensure there are no dominant concepts.
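The filtering stage amounts to a chain of keep/drop predicates; a minimal sketch of the criteria listed above (all field names and thresholds here are invented for illustration, not from the paper):

```python
# Illustrative filter chain mirroring the pre-filtering criteria.
# Each predicate returns True when the clip should be KEPT.
FILTERS = [
    lambda v: v["width"] >= v["height"],    # drop portrait videos
    lambda v: v["motion_score"] > 0.1,      # drop clips without motion
    lambda v: v["jitter_score"] < 0.8,      # drop overly jittery footage
    lambda v: not v["has_bars"],            # drop letterboxed clips
    lambda v: v["text_area_ratio"] < 0.2,   # drop text-heavy clips
]

def keep(video):
    """A clip survives only if every filter passes."""
    return all(f(video) for f in FILTERS)

clip = {"width": 1920, "height": 1080, "motion_score": 0.5,
        "jitter_score": 0.2, "has_bars": False, "text_area_ratio": 0.0}
print(keep(clip))  # True
```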

You can control the camera because they trained a dedicated camera motion classifier and ran that over all the inputs, the outputs are then added to the text captions.
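In other words, the classifier's output just gets folded into the training caption text; something like this sketch (the classifier is a stub here, and the caption phrasing is my assumption):

```python
# Rough sketch: a camera-motion classifier labels each clip and the
# label is appended to its text caption, so the generator learns
# text-controllable camera moves.
def augment_caption(caption, classify_motion, video):
    motion = classify_motion(video)  # e.g. "zoom in", "pan left"
    return f"{caption} Camera motion: {motion}."

stub_classifier = lambda v: "pan left"  # stand-in for the real classifier
print(augment_caption("A dog runs on a beach.", stub_classifier, None))
# -> "A dog runs on a beach. Camera motion: pan left."
```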

The text embeddings they mix in are actually a concatenation of several models. There's MetaCLIP providing the usual understanding of what's in the request, but they also mix in a model trained on character-level text so you can request specific spellings of words too.
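Mixing encoders by concatenation is straightforward; a minimal sketch, assuming each encoder maps a prompt to a fixed-size vector (the encoders and dimensions below are stand-ins):

```python
# Minimal sketch of combining several text encoders by concatenating
# their embeddings along the feature dimension.
import numpy as np

def combined_embedding(prompt, encoders):
    """Concatenate per-encoder embeddings into one conditioning vector."""
    return np.concatenate([enc(prompt) for enc in encoders])

semantic_enc = lambda p: np.ones(4)    # stand-in for e.g. MetaCLIP
charlevel_enc = lambda p: np.zeros(3)  # stand-in for a character-level encoder

emb = combined_embedding("the word 'HELLO' on a sign",
                         [semantic_enc, charlevel_enc])
print(emb.shape)  # (7,)
```

The character-level encoder is what lets the model see individual letters rather than whole-word tokens, which is why specific spellings become controllable.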

The AI sheen mentioned in other comments mostly isn't to do with it being AI, but rather because they fine-tune the model on videos selected for being "cinematic" or "aesthetic" in some way. It looks how they want it to look. For instance they select for natural lighting, absence of too many small objects (clutter), vivid colors, interesting motion and absence of overlay text. What remains of the sheen is probably due to the AI upsampling they do, which lets them render videos at a smaller scale followed by a regular bilinear upsample + a "computer, enhance!" step.
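The cheap first half of that two-stage upscaling (the bilinear resize, before the learned "enhance" pass) can be written out directly; a toy 2x version on a 2D array, purely for illustration:

```python
# Toy 2x bilinear upsample of a 2D image (edges handled by clamping).
# The real pipeline follows this cheap resize with a neural "enhance"
# model; this sketch only shows the bilinear half.
import numpy as np

def upsample_2x_bilinear(img):
    h, w = img.shape
    ys = np.clip(np.arange(2 * h) / 2.0, 0, h - 1)
    xs = np.clip(np.arange(2 * w) / 2.0, 0, w - 1)
    y0, x0 = ys.astype(int), xs.astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy, wx = (ys - y0)[:, None], (xs - x0)[None, :]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

small = np.array([[0.0, 1.0], [1.0, 0.0]])
big = upsample_2x_bilinear(small)
print(big.shape)  # (4, 4)
```

Rendering at half resolution roughly quarters the pixel count the diffusion model has to generate, which is where the compute saving comes from.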

They just casually toss in some GPU cluster management improvements along the way for training.

Because MovieGen was trained on Llama3-generated captions, it expects much more detailed, high effort captions than users normally provide. To bridge the gap they use a modified Llama3 to rewrite people's prompts to be higher detail and more consistent with the training set. They dedicate only a few paragraphs to this step, but it nonetheless involved a ton of effort, with distillation for efficiency, human evals to ensure rewrite quality, etc.
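Conceptually that rewrite step is a single LLM call wrapped around the user prompt; a hedged sketch (the template wording and the stub LLM are mine, not the paper's):

```python
# Sketch of the prompt-rewrite bridge: an LLM (stubbed here) expands a
# terse user prompt toward the detailed caption style the video model
# was trained on.
TEMPLATE = ("Rewrite this video prompt with detailed descriptions of "
            "subject, setting, lighting and camera motion: {p}")

def rewrite_prompt(user_prompt, llm):
    return llm(TEMPLATE.format(p=user_prompt))

# Stand-in for the distilled, fine-tuned Llama3 rewriter:
stub_llm = lambda text: text.upper()
expanded = rewrite_prompt("a cat", stub_llm)
print(expanded.endswith("A CAT"))  # True
```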

I can't even begin to imagine how big of a project this must have been.



Having read the paper, I agree that this is an enormous effort, but I didn't see anything that was particularly surprising from a technical point of view - and nothing of Singularity-level significance. The use of AI to train AI - as a source of synthetic data, or as an evaluation tool - is absolutely widespread. You will find similar examples in almost any AI paper dealing with a system of comparable scale.


Yeah I know, but you sometimes see posts on HN that talk as if AI isn't already being used for self-improvement. I guess the subtlety is that people tend to imagine some sort of generic recursive self-improvement, and overlook the more tightly focused ways it's being used.



