Considering how evasive they've been, it might also be YouTube.
> When pressed on what data OpenAI used to train Sora, Murati didn’t get too specific and seemed to dodge the question. “I’m not going to go into the details of the data that was used, but it was publicly available or licensed data,” she says. Murati also says she isn’t sure whether it used videos from YouTube, Facebook, and Instagram. She only confirmed to the Journal that Sora uses content from Shutterstock, with which OpenAI has a partnership.
Train for what? Making videos? Training on people's comments? There's a lot of garbage AI slop on YouTube; how would that be sifted out? I think there's more value here on HN in terms of training data, but even then, to what end?
From what I read, OpenAI is having trouble because there isn't enough data.
If you think about it, any YouTube video of the real world contributes to a model's understanding of physics, at minimum. From what I gather, they do pre-training on tons of unstructured content first, and that contributes to overall capability.
YouTube is such a great multimodal dataset—videos, auto-generated captions, and real engagement data all in one place. That’s a strong starting point for training, even before you filter for quality. Microsoft’s Phi-series models already show how focusing on smaller, high-quality datasets, like textbooks, can produce great results. You could totally imagine doing the same thing with YouTube by filtering for high-quality educational videos.
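To make the filtering idea concrete, here's a minimal sketch of how you might score videos for training-set inclusion. The field names (`views`, `likes`, `has_captions`, `category`) and the weights are purely illustrative assumptions, not any real API schema or anything OpenAI or Microsoft has disclosed:

```python
# Hypothetical quality filter for a YouTube-style dataset.
# All field names and weights are illustrative assumptions.

def quality_score(video: dict) -> float:
    """Crude heuristic: reward engagement, captions, and educational content."""
    if video["views"] == 0:
        return 0.0
    score = video["likes"] / video["views"]  # engagement ratio
    if video["has_captions"]:
        score += 0.05  # captions give aligned text for multimodal training
    if video["category"] == "Education":
        score += 0.10  # Phi-style bias toward textbook-like content
    return score

def filter_videos(videos: list[dict], threshold: float = 0.08) -> list[dict]:
    """Keep only videos whose heuristic score clears the threshold."""
    return [v for v in videos if quality_score(v) >= threshold]

videos = [
    {"views": 10_000, "likes": 900, "has_captions": True, "category": "Education"},
    {"views": 50_000, "likes": 500, "has_captions": False, "category": "Entertainment"},
]
kept = filter_videos(videos)  # only the educational, captioned video survives
```

In practice you'd presumably use learned quality classifiers rather than hand-tuned weights, but the shape of the pipeline (score, then threshold) would be similar.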
Down the line, I think models will start using video generation as part of how they “think.” Picture a version of GPT that works frame by frame—ask it to solve a geometry problem, and it generates a sequence of images to visualize the solution before responding. YouTube’s massive library of visual content could make something like that possible.