Using Deep Learning models trained on static images is usually not a good idea. But one can simply extend the model to use 3D convolutions instead of 2D, use time as 3rd axis and then feed a sequence of images for training, getting much better results out of it without the wavy effects.