Right now the visuals for these fakes seem to fall apart at higher resolutions or when there is any significant head rotation. This is actually better than what we have in the realm of images, where almost anyone can create a "realistic" Photoshop of the vast majority of scenes.
OTOH, when time is a factor (as in movies, which are a temporal succession of images), perhaps simple curve fitting will never be quite perfect? Perhaps you need anticipation, counterfactual thinking, cause and effect, and all that?
Also, like in video games, maybe even some understanding of real world physics may be required for a perfect fake.
Most of early physics was done by curve fitting, and on very small datasets. Simple Newtonian dynamics shouldn't be much of a hurdle for a neural network. That's about as much anticipation of cause and effect as you'd need for faking most kinds of videos.
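To make the curve-fitting point concrete, here is a rough sketch, not a claim about how any actual deepfake system works. It fits a plain quadratic to eight noisy height measurements of a thrown object and reads the gravitational acceleration off the fitted coefficient. The data is synthetic, and the names `simulate_drop` and `fit_gravity`, the starting height, initial velocity, and noise level are all assumptions made up for the illustration.

```python
import numpy as np

G_TRUE = 9.81  # m/s^2, used only to synthesize the toy observations


def simulate_drop(times, y0=20.0, v0=3.0, noise=0.01, seed=0):
    """Hypothetical measurements: height of a thrown object plus sensor noise."""
    rng = np.random.default_rng(seed)
    clean = y0 + v0 * times - 0.5 * G_TRUE * times**2
    return clean + rng.normal(0.0, noise, size=times.shape)


def fit_gravity(times, heights):
    """Fit y = a*t^2 + b*t + c by least squares; under Newtonian motion a = -g/2."""
    a, b, c = np.polyfit(times, heights, deg=2)
    return -2.0 * a, b, c


# A very small dataset: eight samples over 1.4 seconds.
times = np.linspace(0.0, 1.4, 8)
heights = simulate_drop(times)

g_est, v0_est, y0_est = fit_gravity(times, heights)
print(f"estimated g = {g_est:.2f} m/s^2, v0 = {v0_est:.2f} m/s, y0 = {y0_est:.2f} m")
```

The fit only works this easily because the model family (a quadratic in time) already matches the physics. A neural network would have to discover that structure from pixels, which is a much harder version of the same idea.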
It's already possible, but nobody has written a public-domain implementation of it. Probably because it can sell for a lot of money. I bet someone has this software right now.