No I think we’re actually close. My source is I’m working on this problem and the incredible progress of our tiny 3 person team at drip.art (http://api.drip.art) - we can generate a lot of frames that are consistent, and with interpolation between them, smoothly restyle even long videos. Cross-frame attention works for most cases, it just needs to be scaled up.
And that’s just for diffusion focused approaches like ours. There are probably other techniques from the token flow or nerf family of approaches close to breakout levels of quality, tons of talented researchers working on that too.
The demo clips on the site are cool, but when you call it a "solved problem," I'd expect to see panning, rotating, and zooming within a cohesive scene with multiple subjects.
Thanks for checking it out! We’re certainly not done yet, but much of what you ask is possible or will be soon on the modeling side and we need tools to expose that to a sane workflow in traditional video editors.
Once a video can show a person twisting round, and their belt buckle is the same at the end as it was at the start of the turn, it's solved. VFX pipelines need consistency. TC is a long, long way from being solved, except by hitching it to 3DMMs and SMPL models (and even then, the results are not fabulous yet).