If we want to get pedantic, in context of practical transformer models it has never been text to text. It has always been tokens to tokens. The tokens can represent anything. "multimodal" is a marketing term.
That I use regularly to actually drive from no-too-far from LHR to Paris and back. It's a thing I actually do.
And I can tell you, it might be theoretically in Google Maps land to do the journey in 6 hours, but IRL in this scenario it won't happen. Actual empirical evidence.
Fair enough, but only ~5% of Microsoft revenue comes from ad dollars. It's not clear to me why I should switch to a browser which is funded primarily with ad money vs. just turning off the Edge features I don't like. (FWIW, I used Firefox for many years, and my daily driver is currently Chrome.)