Yeah, Stable Diffusion's PyTorch code is not optimized for inference memory usage out of the box. I'm looking at the code now, and it seems that converting it to a static graph would open up a few more optimization opportunities (so far I've only looked at the CLIP and UNet models it uses, not the autoencoder yet).
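For anyone unfamiliar, here's a minimal sketch of what static-graph capture looks like in PyTorch, using `torch.jit.trace` on a toy module. The `TinyBlock` module and tensor shapes below are placeholders for illustration, not the actual UNet or CLIP code:

```python
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    # placeholder stand-in for a diffusion sub-module, not the real UNet
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(4, 4, kernel_size=3, padding=1)

    def forward(self, x):
        return torch.relu(self.conv(x))

model = TinyBlock().eval()
example = torch.randn(1, 4, 8, 8)

# trace the module into a static graph; once the graph is fixed,
# the runtime can apply optimizations (operator fusion, buffer
# reuse) that eager mode cannot plan ahead for
with torch.no_grad():
    traced = torch.jit.trace(model, example)

# the traced module computes the same result as the eager one
out_eager = model(example)
out_traced = traced(example)
print(torch.allclose(out_eager, out_traced))
```

Once the whole forward pass is a fixed graph, intermediate tensor lifetimes are known ahead of time, which is where the extra memory-planning opportunities come from.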