Just inspect the memory contents of the process. It's all just numbers at the end of the day, & algorithms have no understanding of what the numbers mean; they only generate other numbers in response to the input numbers. For the record I agree w/ OP: screenshots are not a good interface, for the same reasons that trains, subways, & dedicated lanes for mass transit are obviously superior to cars & their attendant headaches.
Maybe some day, sure. We may eventually live in a utopia where everyone has quick, efficient, accessible mass transit available that allows them to move between any two points on the globe with unfettered grace.
That'd be neat.
But for now: The web exists, and is universal. We have programs that can render websites to an image in memory (solved for ~30 years), and other programs that can parse images of fully-rendered websites (solved for at least a few years), along with bots that can click on links (solved much more recently).
Point was that process memory is the source of truth; everything else is derived from it & only throws away information that a neural network could use to make better decisions. Presentation of data is irrelevant to a neural network; it's all just numbers & arithmetic at the end of the day.
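To make the "derived views throw away information" claim concrete, here's a toy sketch. Everything in it is made up for illustration: `Button` stands in for whatever the process actually holds in memory, and `render` stands in for a screenshot that keeps only what is drawn on screen. It isn't how any real browser or agent works, just the shape of the argument.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Button:
    """Hypothetical in-memory state for one UI element."""
    label: str      # visible text
    href: str       # link target, never drawn on screen
    disabled: bool  # assumed invisible here (no styling difference)


def render(buttons):
    """Toy 'screenshot': keep only what a viewer would see (the labels)."""
    return " | ".join(f"[{b.label}]" for b in buttons)


state_a = [Button("Next", "/page/2", False)]
state_b = [Button("Next", "/logout", True)]

# Two different in-memory states collapse to the same rendered output,
# so the rendering cannot be inverted to recover the original state.
print(render(state_a) == render(state_b))  # True
print(state_a == state_b)                  # False
```

Whether the dropped fields matter for a given task is a separate question; the sketch only shows that the rendered view is a lossy function of the state.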