> Is it something that can be fixed with better prompting, or is it a limitation of the model/architecture?
Both. You can get better results through better prompting, but the root cause is a limitation of the architecture and the training methods (which are coupled).