Both of them massively reduce the complexity by working on position-aligned images of just the head. I'd estimate that even a picture of a cat from the side would already be too much for these networks to handle.
These kinds of networks learn their feature-space mapping by treating the input data set as continuous. So to learn a good mapping from cat to dog, they would also need to see photos of an animal that is half cat, half dog (see the sketch below). If I had to train this case for work, I guess I'd try to go through baby pictures: dog -> baby dog -> baby cat -> cat. That might work if baby cats and baby dogs look similar enough.
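To make the "continuous" point concrete, here's a toy sketch (nothing from either repo below; the random vectors are just stand-ins for real class embeddings): translating cat -> dog amounts to decoding every point on the line between the two embeddings, and alpha = 0.5 is exactly the "half cat, half dog" region the training data never covers.

```python
import numpy as np

rng = np.random.default_rng(0)
z_cat = rng.normal(size=64)  # stand-in embedding for "cat"
z_dog = rng.normal(size=64)  # stand-in embedding for "dog"

# Linear interpolation between the two classes. The generator has to produce
# a plausible image at every alpha; alpha=0.5 is the half-cat-half-dog point
# it never saw. The baby-picture trick tries to route this path through a
# region where the two classes actually overlap in the training data.
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    z = (1 - alpha) * z_cat + alpha * z_dog
    print(f"alpha={alpha:.2f}, ||z||={np.linalg.norm(z):.3f}")
```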
Try this: https://github.com/NVLabs/FUNIT (2019)
Or this: https://github.com/clovaai/stargan-v2 (2020)
Either of those should probably work for cat-to-rabbit translation.
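For the curious, here's a rough PyTorch schematic of how both of those repos translate images. This is not their actual code; every module name and shape here is made up for illustration. The shared idea: a content encoder keeps the pose/shape from the source image, a style encoder extracts the target class's appearance from a reference image, and a decoder combines the two (the real repos inject style via AdaIN-style layers; a plain linear projection stands in for that here).

```python
import torch
import torch.nn as nn

class ToyTranslator(nn.Module):
    """Illustrative only: encode content from the source image, encode style
    from a reference image of the target class, decode the combination."""

    def __init__(self, style_dim=64):
        super().__init__()
        self.content_enc = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.style_enc = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, style_dim),
        )
        # Stand-in for the AdaIN-style modulation used by the real models.
        self.style_proj = nn.Linear(style_dim, 64)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, src, ref):
        content = self.content_enc(src)               # pose/shape of the cat
        style = self.style_proj(self.style_enc(ref))  # "rabbit-ness" vector
        content = content + style[:, :, None, None]   # inject target style
        return self.decoder(content)

cat = torch.randn(1, 3, 128, 128)     # source image (e.g. a cat)
rabbit = torch.randn(1, 3, 128, 128)  # reference image of the target class
out = ToyTranslator()(cat, rabbit)
print(out.shape)  # torch.Size([1, 3, 128, 128])
```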