I think Sutton is right that if we had limitless cpu, any human split up would b...

I think Sutton is right that if we had limitless cpu, any human split up would be inferior. So indeed since we are far away from limitless cpu, we divide and compose.

But i think we're onto something!

Voice to image indeed might give better results than text to image, since voice has some vibe to it (intonation, tone, color, stress on certain words, speed and probably even traits we don't know yet) that will color or even drastically influence the image output.