I'm working on Alumnium (https://alumnium.ai). It's an open-source library to simplify web application testing with Selenium/Playwright.
I aim to create a stable and affordable tool that allows me to eliminate most of the support code I write for web tests (page objects, locators, etc.) and replace it with human-readable actions and assertions. These actions and assertions are then translated by an LLM into browser instructions. The tool, however, should still leverage all existing infrastructure (test runner, CI/CD, Selenium infrastructure).
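To give a feel for it, a test written this way looks roughly like the sketch below (the class and method names are illustrative, not necessarily the exact API):

```python
# Rough sketch of a human-readable test; Alumni/do/check are illustrative
# names for the idea, not necessarily the library's exact API.
from selenium import webdriver
from alumnium import Alumni

def test_addition():
    driver = webdriver.Chrome()
    driver.get("https://example.com/calculator")  # placeholder URL
    al = Alumni(driver)                    # reuses the existing Selenium session
    al.do("press 2, plus, 2, equals")      # LLM translates this into browser actions
    al.check("the result shown is 4")      # LLM-backed assertion, fails early
    driver.quit()
```

The point is that the surrounding infrastructure (pytest, CI, the Selenium grid) stays exactly as it is; only the page objects and locators go away.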
So far, it's working well on simple websites (e.g., a calculator, TodoMVC), and I'm currently working on scaling it to large web applications.
I experimented with Computer Use and even though it's pretty cool, I ended up not using it for 2 main reasons:
1. It's unreasonably expensive. A single test "2+2=4" for a web calculator costs around $0.15. I run roughly 1k tests per month on CI and I don't want to spend $150 on those. The approach I took with Alumnium costs me $4 per month for the same amount of tests.
2. It tries too hard to make the test pass even when that's not possible. When I intentionally introduced bugs into applications, Computer Use sometimes pretended everything was fine and marked the test as passed. Alumnium, on the other hand, attempts to fail as early as possible.
For the 1st point, I generate a script with hashed checkpoints so the next run is automated, and the AI is only invoked when something changes in the UI. I make this possible by proxy-wrapping the Playwright library so I can take over every method. Users use Playwright like they always have, but with one extra method called act.
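Roughly, the wrapper works like the sketch below (the replay and LLM helpers are made-up names, just to show the shape of it):

```python
# Simplified sketch of proxy-wrapping a Playwright page with hashed checkpoints.
# llm_generate_steps() and replay() are hypothetical helpers for illustration.
import hashlib

class PageProxy:
    def __init__(self, page, cache):
        self._page = page      # the real playwright.sync_api.Page
        self._cache = cache    # {instruction: (ui_hash, recorded_steps)}

    def __getattr__(self, name):
        # every other method falls through to Playwright unchanged
        return getattr(self._page, name)

    def act(self, instruction):
        ui_hash = hashlib.sha256(self._page.screenshot()).hexdigest()
        cached = self._cache.get(instruction)
        if cached and cached[0] == ui_hash:
            # UI hasn't changed since the recorded run: replay without the LLM
            replay(self._page, cached[1])
        else:
            # UI changed (or first run): ask the LLM, then record a new checkpoint
            steps = llm_generate_steps(self._page, instruction)
            replay(self._page, steps)
            self._cache[instruction] = (ui_hash, steps)
```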
OmniParser lets you split sections of the UI to hash and watch for changes that are relevant.
> For the 1st point, I generate a script with hashed checkpoints so the next run is automated, and the AI is only invoked when something changes in the UI. I make this possible by proxy-wrapping the Playwright library so I can take over every method. Users use Playwright like they always have, but with one extra method called act.
How would you determine that something changed in the UI by just looking at a screenshot? Would you additionally compare the HTML/DOM, or approximately compare the two screenshots?
> OmniParser lets you split sections of the UI to hash and watch for changes that are relevant.
I wasn't aware, thanks for sharing!
> For 2, can you give some examples?
Specifically, if you take the Shortest tool (https://shortest.com), a test runner powered by the Computer Use API, write a test "Validate the task can be pinned" for https://todomvc.com/examples/vue/dist/, and run it, it passes. It should have failed because there is no way to "pin" tasks in the app, yet it pretends that completing the task is the same as pinning it.
Compare parts of the screenshot and see if they changed. I didn't want to use the DOM at all. My hypothesis was that multimodal AI agents will get cheaper over time (Gemini Flash is crazy cheap) and that people would start putting attacks in the DOM to confuse AI.
Additionally, existing tools that I used struggled to interact with sites like Reddit. So I set out to skip the DOM and focus on a generalized approach.
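In other words, the check boils down to something like the sketch below (simplified; in practice the region boxes would come from something like OmniParser rather than being hard-coded):

```python
# Simplified sketch: hash cropped regions of a screenshot and flag the ones
# that changed since the last run. Region boxes are hard-coded for illustration.
import hashlib
import io
from PIL import Image

def region_hashes(png_bytes, regions):
    image = Image.open(io.BytesIO(png_bytes))
    hashes = {}
    for name, box in regions.items():       # box = (left, top, right, bottom)
        crop = image.crop(box)
        hashes[name] = hashlib.sha256(crop.tobytes()).hexdigest()
    return hashes

def changed_regions(old_hashes, new_hashes):
    return [name for name, digest in new_hashes.items()
            if old_hashes.get(name) != digest]

# usage sketch:
# regions = {"result_display": (10, 10, 300, 60), "keypad": (10, 70, 300, 400)}
# new = region_hashes(page.screenshot(), regions)
# if changed_regions(previous, new):
#     ...fall back to the AI agent for this step...
```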
I tried to go cheaper by using UI-TARS, an open-source model by ByteDance, to run tests locally without needing Anthropic, but it wasn't reliable enough.
That Shortest link is interesting. I didn't know tools like that existed. Wow, the field is moving fast.