I suspect they're going to need some local offload capability for Computer Use. The repeated screen reading can definitely be done locally on modern machines; otherwise the cost may be impractical.
Maybe we need some agent running on the PC to offload some of these tasks. It could scrape the display at 30 or 60 Hz and produce a textual version of what's going on for the model to consume.
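
A rough sketch of that idea, under some loud assumptions: the screen capture and text extraction (OCR or an accessibility API) are not shown, and each frame is assumed to arrive already converted to a list of text lines. The names `frame_digest` and `summarize_changes` are made up for illustration. The point is only that a local agent can drop unchanged frames and send the model a short text delta instead of every capture.

```python
import hashlib
from typing import Iterable, Iterator, List, Optional


def frame_digest(lines: List[str]) -> str:
    """Cheap fingerprint used to detect that a frame's text is unchanged."""
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def summarize_changes(frames: Iterable[List[str]]) -> Iterator[str]:
    """Yield a textual update only when the screen text actually changes.

    `frames` is a hypothetical stream of already-extracted screen text,
    one list of lines per captured frame (e.g. sampled at 30 or 60 Hz).
    """
    prev_lines: Optional[List[str]] = None
    prev_digest: Optional[str] = None
    for index, lines in enumerate(frames):
        digest = frame_digest(lines)
        if digest == prev_digest:
            continue  # identical frame: nothing to send to the model
        if prev_lines is None:
            # First frame: send the whole screen once.
            yield f"frame {index}: initial screen: " + " | ".join(lines)
        else:
            old, new = set(prev_lines), set(lines)
            added = [line for line in lines if line not in old]
            removed = [line for line in prev_lines if line not in new]
            yield f"frame {index}: added={added} removed={removed}"
        prev_lines, prev_digest = lines, digest


if __name__ == "__main__":
    demo_frames = [
        ["File Edit", "Untitled"],
        ["File Edit", "Untitled"],  # unchanged, so it is skipped
        ["File Edit", "Saved"],
    ]
    for update in summarize_changes(demo_frames):
        print(update)
```

At 30 or 60 Hz most frames would be identical to the previous one, so the model would only see the first screen plus the occasional delta.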