In general, it has been shown time and time again that this approach fails for neural network based models.
If you can train a neural network that goes from a to b and a network that goes from b to c, you can usually replace that combination with a simpler network that goes from a to c directly.
This makes sense, as there might be information in a that we lose by a conversion to b. A single neural network will ensure that all relevant information from a that we need to generate c will be passed to the upper layers.
Yes, this is true: you do lose some information between the layers, and this increased expressivity is the big benefit of using ML instead of classic feature engineering. However, I think the trade-off would be worth it for some use cases. You could, for instance, run an existing image through a semantic segmentation model and then edit the underlying image description: add a yellow hat to a person without regenerating any other part of the image, edit existing text, change a person's pose, or probably more easily convert images to 3D.
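To make the idea concrete, here is a minimal sketch of what such an editable intermediate representation could look like. Everything here (the `Region` structure, `edit_region`) is hypothetical, just to illustrate the point that a local edit leaves every other region untouched, so a renderer would only need to regenerate the changed area:

```python
from dataclasses import dataclass, field

@dataclass
class Region:
    """One semantically coherent area of the image, as a segmentation
    model might produce it (hypothetical structure for illustration)."""
    label: str                      # e.g. "person", "sky"
    mask_id: int                    # which pixels this region covers
    attributes: dict = field(default_factory=dict)

def edit_region(scene, label, **changes):
    """Return a new scene where only the matching region is edited;
    all other regions are passed through unchanged."""
    return [
        Region(r.label, r.mask_id, {**r.attributes, **changes})
        if r.label == label else r
        for r in scene
    ]

scene = [
    Region("person", 0, {"pose": "standing"}),
    Region("sky", 1, {"color": "blue"}),
]

# "Add a yellow hat to a person" becomes a one-line edit on the
# description, not a full regeneration of the image.
edited = edit_region(scene, "person", hat="yellow")
```

The point is not this particular data structure, but that an explicit, human-readable intermediate makes edits local and inspectable in a way an end-to-end latent does not.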
It's probably not a viable idea; I just wish for more composable modules that let us understand the models' representations better and change specific aspects of them, instead of these massive black boxes that mix all of these tasks into one.
I would also add that text2image models already have multiple interfaces between their parts: the text encoder, the latent-to-pixel-space VAE decoder, ControlNets, and sometimes a separate img2img style transfer at the end. Transformers already process images patchwise, but why do those patches have to be uniform square patches instead of semantically coherent areas?