I would assume that the test runner translates the natural language instruction into a deterministic selector and only re-does that translation when the selector fails. At least that's how I would try to implement it..
This is the right idea and how we do it at TestDriver.ai. The deterministic selector still has about 20% fuzz matching rate, and if it fails it trys to recover.