In my anecdotal experience, large language models struggle with spatial structure (which makes sense given their modality and training data). Diffusion models, on the other hand, create great images, but that doesn't translate well to vector data.
Animation is not a well-researched modality for AI, so it could go either way. It's definitely an interesting direction to consider, as it could democratize motion design even further.
It's a JSON structure full of numbers which describe colors and spatial relationships. LLMs have no problem with the JSON syntax, but the numbers they put in it will be nonsense.
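For a sense of what that means, here's a heavily stripped-down sketch in Python, loosely modeled on Lottie's abbreviated field names ("ks" for transform, "p" for position, "c" for color); it's illustrative only, not a complete or valid file:

```python
import json

# Simplified, Lottie-flavored layer: the structure is easy, the numbers carry all the meaning.
layer = {
    "ty": 4,                       # shape layer
    "ks": {                        # transform
        "p": {"a": 1, "k": [       # animated position keyframes
            {"t": 0,  "s": [120.0, 340.5]},   # frame 0: x, y in pixels
            {"t": 60, "s": [480.0, 340.5]},   # frame 60: moved to the right
        ]},
        "s": {"a": 0, "k": [100, 100]},       # static scale, in percent
    },
    "shapes": [
        {"ty": "fl", "c": {"a": 0, "k": [0.91, 0.25, 0.21, 1]}},  # fill color, RGBA in 0-1
    ],
}

print(json.dumps(layer, indent=2))
```

Syntactically it's trivial, but every meaningful decision (where things sit, what color they are, when they move) lives in those raw floats.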
Animations require consistency to work, and generative models are still very bad at consistency. You can see this in action in any AI-generated video: that "flickering" is tiny inconsistencies between frames throwing the entire thing off.
Once that issue is fixed, it's a green light, as everything else is more or less ready.
I think you're talking about generating bitmaps of video, one frame at a time, which is a pretty different task from generating vector animation. If the LLM approaches the task the way an animator would (i.e. "I want this shape to move here slowly for a long time, then grow rapidly for a short duration, then...") and expresses the result in some kind of keyframe animation format (Lottie, AE, Unity, etc.), then you don't have to deal with the kinds of artifacts you described.
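To make that concrete, here's a toy sketch of that animator-style description expressed as keyframes, with linear sampling in Python (hypothetical names, no easing; real formats like Lottie store bezier easing per keyframe):

```python
from dataclasses import dataclass

@dataclass
class Keyframe:
    t: float        # time in seconds
    value: tuple    # property value at that time

def sample(track: list[Keyframe], t: float) -> tuple:
    """Linearly interpolate a property track at time t."""
    if t <= track[0].t:
        return track[0].value
    for a, b in zip(track, track[1:]):
        if a.t <= t <= b.t:
            u = (t - a.t) / (b.t - a.t)
            return tuple(x + (y - x) * u for x, y in zip(a.value, b.value))
    return track[-1].value

# "Move here slowly for a long time, then grow rapidly for a short duration":
position = [Keyframe(0.0, (0.0, 0.0)), Keyframe(4.0, (300.0, 120.0))]   # slow 4 s move
scale    = [Keyframe(4.0, (1.0, 1.0)), Keyframe(4.3, (2.5, 2.5))]       # fast 0.3 s grow

print(sample(position, 2.0))   # (150.0, 60.0)  - halfway through the move
print(sample(scale, 4.15))     # (1.75, 1.75)   - halfway through the grow
```

The model only has to pick a handful of times and values; the player interpolates every frame from those, so there's nothing to flicker between frames.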
What are some roadblocks to making that a reality?