I'm working on Alumnium (https://alumnium.ai). It's an open-source library to simplify web application testing with Selenium/Playwright.
I aim to create a stable and affordable tool that allows me to eliminate most of the support code I write for web tests (page objects, locators, etc.) and replace it with human-readable actions and assertions. These actions and assertions are then translated by an LLM into browser instructions. The tool, however, should still leverage all existing infrastructure (test runner, CI/CD, Selenium infrastructure).
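To give a feel for it, a test written this way looks roughly like the sketch below (the class and method names are illustrative, not necessarily the exact API):

```python
# Rough sketch of a human-readable test; Alumni/do/check are illustrative
# names for the idea, not necessarily the library's exact API.
from selenium import webdriver
from alumnium import Alumni

def test_addition():
    driver = webdriver.Chrome()
    driver.get("https://example.com/calculator")  # placeholder URL
    al = Alumni(driver)                    # reuses the existing Selenium session
    al.do("press 2, plus, 2, equals")      # LLM translates this into browser actions
    al.check("the result shown is 4")      # LLM-backed assertion, fails early
    driver.quit()
```

The point is that the surrounding infrastructure (pytest, CI, the Selenium grid) stays exactly as it is; only the page objects and locators go away.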
So far, it's working well on simple websites (e.g., a calculator, TodoMVC), and I'm currently working on scaling it to large web applications.
I experimented with Computer Use and even though it's pretty cool, I ended up not using it for 2 main reasons:
1. It's unreasonably expensive. A single test "2+2=4" for a web calculator costs around $0.15. I run roughly 1k tests per month on CI and I don't want to spend $150 on those. The approach I took with Alumnium costs me $4 per month for the same amount of tests.
2. It tries too hard to make the test pass even when that's not possible. When I intentionally introduced bugs into applications, Computer Use sometimes pretended everything was fine and marked the test as passed. Alumnium, on the other hand, attempts to fail as early as possible.
For the 1st point, I generate a script with hashed checkpoints so the next run is automated, and the AI is only invoked when something changes in the UI. I make this possible by proxy-wrapping the Playwright library so I can take over every method. Users use Playwright like they always have, but with one extra method called act.
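Roughly, the wrapper works like the sketch below (the replay and LLM helpers are made-up names, just to show the shape of it):

```python
# Simplified sketch of proxy-wrapping a Playwright page with hashed checkpoints.
# llm_generate_steps() and replay() are hypothetical helpers for illustration.
import hashlib

class PageProxy:
    def __init__(self, page, cache):
        self._page = page      # the real playwright.sync_api.Page
        self._cache = cache    # {instruction: (ui_hash, recorded_steps)}

    def __getattr__(self, name):
        # every other method falls through to Playwright unchanged
        return getattr(self._page, name)

    def act(self, instruction):
        ui_hash = hashlib.sha256(self._page.screenshot()).hexdigest()
        cached = self._cache.get(instruction)
        if cached and cached[0] == ui_hash:
            # UI hasn't changed since the recorded run: replay without the LLM
            replay(self._page, cached[1])
        else:
            # UI changed (or first run): ask the LLM, then record a new checkpoint
            steps = llm_generate_steps(self._page, instruction)
            replay(self._page, steps)
            self._cache[instruction] = (ui_hash, steps)
```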
OmniParser lets you split sections of the UI to hash and watch for changes that are relevant.
> For the 1st point, I generate a script with hashed checkpoints so the next run is automated, and the AI is only invoked when something changes in the UI. I make this possible by proxy-wrapping the Playwright library so I can take over every method. Users use Playwright like they always have, but with one extra method called act.
How would you determine that something changed in the UI by just looking at a screenshot? Would you additionally compare the HTML/DOM, or approximately compare the two screenshots?
> OmniParser lets you split sections of the UI to hash and watch for changes that are relevant.
I wasn't aware, thanks for sharing!
> For 2, can you give some examples?
Specifically, if you take the Shortest tool (https://shortest.com), a test runner powered by the Computer Use API, write a test "Validate the task can be pinned" for https://todomvc.com/examples/vue/dist/, and run it, it passes. It should have failed because there is no way to "pin" tasks in the app, yet it pretends that completing the task is the same as pinning it.
Compare parts of the screenshot and see if they changed. I didn't want to use the DOM at all. My hypothesis was that multimodal AI agents will get cheaper over time (Gemini Flash is crazy cheap) and that people would start putting attacks in the DOM to confuse AI.
Additionally, existing tools that I used struggled to interact with sites like Reddit. So I set out to skip the DOM and focus on a generalized approach.
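In other words, the check boils down to something like the sketch below (simplified; in practice the region boxes would come from something like OmniParser rather than being hard-coded):

```python
# Simplified sketch: hash cropped regions of a screenshot and flag the ones
# that changed since the last run. Region boxes are hard-coded for illustration.
import hashlib
import io
from PIL import Image

def region_hashes(png_bytes, regions):
    image = Image.open(io.BytesIO(png_bytes))
    hashes = {}
    for name, box in regions.items():       # box = (left, top, right, bottom)
        crop = image.crop(box)
        hashes[name] = hashlib.sha256(crop.tobytes()).hexdigest()
    return hashes

def changed_regions(old_hashes, new_hashes):
    return [name for name, digest in new_hashes.items()
            if old_hashes.get(name) != digest]

# usage sketch:
# regions = {"result_display": (10, 10, 300, 60), "keypad": (10, 70, 300, 400)}
# new = region_hashes(page.screenshot(), regions)
# if changed_regions(previous, new):
#     ...fall back to the AI agent for this step...
```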
I tried to go cheaper by using UI-TARS, an open-source model by ByteDance, to run tests locally without needing Anthropic, but it wasn't reliable enough.
That Shortest link is interesting. I didn't know tools like that existed. Wow, the field is moving fast.