This fundamental issue seems to be totally lost on the LLM-heads.
I do not want additional uncertainty deep in the development cycle.
I can tolerate the uncertainty while I'm writing. That's where there is a good fit for these fuzzy LLMs. Anything past the cutting room floor and you are injecting uncertainty where it isn't tolerable.
I definitely do not want additional uncertainty in production. That's where the "large action model" and "computer use" and "autonomous agent" cases totally fall apart.
It's a mindless extension, something like: "this product good for writing... let's let it write to prod!"
Same goes for real people: we all make mistakes. AI agents will get better over time and will be ahead of many specialists pretty soon, but probably not perfect before AGI, just as we aren't.
Ideally it does. Users, super users, admins, etc. Though one might point out exactly how much effort we put into locking down what they can do. I think one might be able to expand this to build up a persona for how LLMs should interface with software in production, but too many applications give them about the same level of access as a developer coding straight into production. Then again, how many company leaders would approve of exactly that if they thought it would get things done faster and at lower cost?
It’s only deterministic for a given version of the app. Versions change: UI elements move, titles shift slightly, irrelevant promo popups appear, etc. For a deterministic solution, someone has to go and update the tests to handle all of that. Good ‘accessibility hygiene’ can help, but many apps lack it.
And then there are truly dynamic apps like games or simulators. There may be no accessibility info to deterministically code to.
There is a great approach based on a test-id strategy: basically, it's a requirement for frontend teams to cover all interactive elements with test IDs.
It makes tests far less flaky and dramatically speeds up writing them, and it works on mobile as well. Elements on the main flows usually don't change that often, though you'll still need to update the IDs occasionally.
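A minimal sketch of what that looks like in practice, assuming Playwright on the web side (the test ids and URL here are made up):

```ts
import { test, expect } from '@playwright/test';

// The frontend contract: every interactive element carries a stable id, e.g.
//   <button data-testid="checkout-submit">Buy now</button>
// Playwright's getByTestId() targets data-testid by default, so the test
// survives copy, styling, and layout changes.
test('user can submit checkout', async ({ page }) => {
  await page.goto('https://example.com/checkout'); // URL is illustrative
  await page.getByTestId('checkout-submit').click();
  await expect(page.getByTestId('order-confirmation')).toBeVisible();
});
```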
I built stable mobile UI tests with this approach as well; it worked well.
I agree that it can seem counterintuitive at first to apply LLM solutions to testing. However, in end-to-end testing, we’ve found that introducing a level of flexibility can actually be beneficial.
Take, for example, scenarios involving social logins or payments where external webviews are opened. These often trigger cookie consent forms or other unexpected elements, which the app developer has limited control over. The complexity increases when these elements have unstable identifiers or frequently changing attributes. In such cases, even though the core functionality (e.g., logging in) works as expected, traditional test automation often fails, requiring constant maintenance.
The key, as other comments have noted, is ensuring the solution is good at distinguishing between meaningful test failures and non-issues.
I would assume that the test runner translates the natural-language instruction into a deterministic selector and only redoes that translation when the selector fails. At least that's how I would try to implement it.
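That would look roughly like this; a sketch assuming Playwright and a hypothetical translate() function that calls the model (not any vendor's actual implementation):

```ts
import type { Page } from '@playwright/test';

// Hypothetical translator: asks a model to turn an instruction like
// "click the checkout button" into a concrete CSS selector.
type Translate = (page: Page, instruction: string) => Promise<string>;

// Cache instruction -> selector so runs stay deterministic until the UI drifts.
const selectorCache = new Map<string, string>();

async function act(page: Page, instruction: string, translate: Translate): Promise<void> {
  const cached = selectorCache.get(instruction);
  if (cached) {
    try {
      await page.click(cached, { timeout: 5_000 });
      return; // deterministic fast path
    } catch {
      selectorCache.delete(instruction); // selector went stale; re-translate
    }
  }
  // Slow path: one model call, then cache the result for future runs.
  const selector = await translate(page, instruction);
  await page.click(selector, { timeout: 5_000 });
  selectorCache.set(instruction, selector);
}
```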
This is the right idea and how we do it at TestDriver.ai. The deterministic selector still has about a 20% fuzzy-matching rate, and if it fails it tries to recover.
I think it’s less of an issue for e2e testing because e2e testing sucks. If teams did it well in general you would be completely correct, but in many places an LLM will be better even if it hallucinates. As such, I think there will be a decent market for products like this, even if they may not really be testing what you think they are testing, simply because that may well be way better than the e2e testing many places already do.
In many cases you’re correct, though. We have a few libraries where we won’t use TypeScript because even though it might transpile 99% correctly, the fact that we have to check is too much work for it to be worth our time in those cases. I think LLMs are similar: once in a while you’re not going to want them because checking their work takes too many resources, but for a lot of stuff you can use them. Especially if your e2e testing is really just pseudo-jobbing because some middle manager wanted it, which it unfortunately is far too often. If you work in such a place, you’re going to recommend the path of least resistance, and if that’s LLM-powered then it’s LLM-powered.
On the less bleak and pessimistic side, if the LLM e2e output is good enough to be less resource-consuming, even if you have to go over it, then it’s still a good business case.
I work in the field and built a tool that has way less flakiness than deterministic solutions.
The issue is that testing environments are always imperfect because (a) they are stateful and (b) there's always some randomness in actual production software. Some teams have a very clean testing environment, but most don't.
So being non-deterministic is actually an advantage, in practice.
I think that the hope/dream here is to make end-to-end tests less flaky. It would be great to have navigation and assertion commands that are robust against simple changes in the app that aren't relevant to the test case.
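As a small illustration of what "robust against irrelevant changes" could mean even without an LLM, here's a hypothetical helper (Playwright assumed; expectVisibleish is made up):

```ts
import { expect, type Page } from '@playwright/test';

// Hypothetical helper: prefer the stable test id, but fall back to fuzzy
// text matching so a trivial copy change ("Buy now" -> "Buy Now!") doesn't
// fail a test that doesn't care about the button's exact label.
async function expectVisibleish(page: Page, testId: string, textPattern: RegExp): Promise<void> {
  const strict = page.getByTestId(testId);
  if ((await strict.count()) > 0) {
    await expect(strict).toBeVisible();
    return;
  }
  await expect(page.getByText(textPattern).first()).toBeVisible();
}

// e.g. await expectVisibleish(page, 'checkout-submit', /buy now/i);
```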
No. Both of the requirements "to interact" and "based on what it looks like" require unshakable foundations in reality - which current models clearly do not have.
They will inevitably hallucinate interactions and observations and therefore decrease reliability. Worse, they will inject a pervasive sense of doubt into the reliability of any tests they interact with.
Yes, you are correct that it rests entirely on the reputation of the AI.
This discussion leads to an interesting question, which is "what is quality?"
Quality is determined by perception. If we can agree that an AI acts like a user and that it can use your website, we can assume that a user can use your website, and therefore it is "quality".
For more, read "Zen and the Art of Motorcycle Maintenance"