Autoregressive transformer models are usually memory-bandwidth bound, whereas SD is compute bound, so perhaps the difference lies there. That would also explain why SD runs so much faster on the GPU than on the CPU.
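To make the memory bound vs. compute bound distinction concrete, here is a rough roofline-style sketch. It compares the FLOPs-per-byte of a single-token matmul (roughly what autoregressive decoding does) with a large batched matmul (closer to what a diffusion model does over many pixels at once). The matrix shapes and the `PEAK_FLOPS` / `PEAK_BW` figures are made-up placeholders for a hypothetical device, not measurements of the M1 or any real chip, and the byte count assumes each operand and the result are moved exactly once.

```python
def matmul_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte for an (m x k) @ (k x n) matmul.

    Assumes each operand and the result cross the memory bus once
    (an idealization; real kernels may re-read data).
    """
    flops = 2 * m * k * n  # one multiply + one add per term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def bound(intensity, peak_flops, peak_bandwidth):
    """Classify a workload against the device's ridge point (FLOPs/byte)."""
    ridge = peak_flops / peak_bandwidth
    return "compute bound" if intensity >= ridge else "memory bound"


# Hypothetical device numbers, chosen only for illustration.
PEAK_FLOPS = 10e12  # 10 TFLOP/s
PEAK_BW = 200e9     # 200 GB/s  -> ridge point of 50 FLOPs/byte

# One token at a time: a vector times a weight matrix.
decode = matmul_intensity(1, 4096, 4096)
# Many rows at once: a large matrix times the same weight matrix.
batched = matmul_intensity(4096, 4096, 4096)

print("decode :", round(decode, 2), "FLOPs/byte ->", bound(decode, PEAK_FLOPS, PEAK_BW))
print("batched:", round(batched, 2), "FLOPs/byte ->", bound(batched, PEAK_FLOPS, PEAK_BW))
```

The single-token case does about one FLOP per byte of weights read, so under these assumptions it spends its time waiting on memory. The batched case reuses every weight thousands of times, so raw arithmetic throughput decides the speed instead.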
The M1 has (fast) unified memory shared between the GPU and CPU, so whether a workload is memory bound ought not to have much bearing on whether it belongs on the CPU or the GPU… at least in theory. I’m a total noob here, though, so I may be wrong.