That's essentially what diffusion does, except it doesn't have clear boundaries between "scene graph" and "full image". It starts out noisy and adds more detail gradually
That's true, the inefficiency is from using pixel-to-pixel attention at each stage. It the beginning low resolution would be enough, even at the end high resolution is only needed at the pixel's neighborhood