
Have you experimented with using text-only models and the DOM/accessibility tree for interaction with a page? I'm currently working on an open-source test automation tool (https://alumnium.ai), and the accessibility tree without screenshots works pretty well as long as the website provides decent support for ARIA attributes or at least has proper HTML5 structure.


On most pages, we don't need vision, and the DOM alone is sufficient. We have not worked with the accessibility tree yet, but it's a great idea to include that. Do you have any great resources on where to get started?


> On most pages, we don't need vision, and the DOM alone is sufficient.

I misunderstood: looking at the demo videos, it seemed like you constantly annotate elements with borders/IDs, so I assumed that's what is then passed to a vision model.

> Do you have any great resources on where to get started?

A great place to start is https://chromium.googlesource.com/chromium/src/+/main/docs/a....
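
If you want something hands-on before digging into those docs, Playwright also exposes a snapshot of the tree directly. Here's a minimal sketch with the Python bindings (note the accessibility API is marked deprecated in newer Playwright releases, so treat it as a starting point rather than the recommended way):

    # Dump the accessibility tree of a page (sketch; accessibility.snapshot()
    # is deprecated in recent Playwright versions but still available).
    import json
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://todomvc.com/examples/vue/dist/")
        tree = page.accessibility.snapshot()  # nested dict of roles/names
        print(json.dumps(tree, indent=2))
        browser.close()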


I'm working on Alumnium (https://alumnium.ai). It's an open-source library to simplify web application testing with Selenium/Playwright.

I aim to create a stable and affordable tool that lets me eliminate most of the support code I write for web tests (page objects, locators, etc.) and replace it with human-readable actions and assertions. These actions and assertions are then translated by an LLM into browser instructions. The tool, however, should still leverage all the existing infrastructure (test runner, CI/CD, Selenium setup).
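
To give a feel for what that looks like (a simplified sketch; the exact imports and method names may differ from what's currently in the repo), a test is just plain-language steps on top of a regular Selenium driver:

    # Simplified sketch of a test; names are illustrative, not necessarily
    # the final API.
    from selenium import webdriver
    from alumnium import Alumni

    driver = webdriver.Chrome()
    driver.get("https://todomvc.com/examples/vue/dist/")

    al = Alumni(driver)
    al.do("add a task 'buy milk'")               # LLM turns this into clicks/typing
    al.check("'buy milk' is shown in the list")  # and this into an assertion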

So far, it's working well on simple websites (e.g., a calculator, TodoMVC), and I'm currently working on scaling it to large web applications.


Pretty cool. I built my own framework to do something similar very recently.

Microsoft's OmniParser and Claude Computer Use alone can take you very far in testing almost anything.


I experimented with Computer Use and even though it's pretty cool, I ended up not using it for 2 main reasons:

1. It's unreasonably expensive. A single test "2+2=4" for a web calculator costs around $0.15. I run roughly 1k tests per month on CI and I don't want to spend $150 on those. The approach I took with Alumnium costs me $4 per month for the same amount of tests.

2. It tries too hard to make the test pass even when it's not possible. When I intentionally introduced bugs into applications, Computer Use sometimes pretended everything was fine and marked the test as passed. Alumnium, on the other hand, tries to fail as early as possible.


For the 1st point, I generate a script with hashed checkpoints, so the next run is automated unless something changes in the UI and the AI needs to be invoked again. I make this possible by proxy-wrapping the Playwright library so I can take over every method. Users use Playwright like they always have, but with one extra method called act.

OmniParser lets you split sections of the UI to hash and watch for changes that are relevant.
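
Roughly like this, if it helps (a simplified sketch rather than my actual code, with the helper names made up): wrap the Playwright page in a proxy, hash the relevant part of the screenshot at each act step, and only call the model when the hash differs from the previous run.

    # Sketch: proxy-wrap a Playwright page and add hashed checkpoints.
    import hashlib

    class CheckpointedPage:
        def __init__(self, page, cache):
            self._page = page
            self._cache = cache  # maps a step description to (ui_hash, recorded_action)

        def __getattr__(self, name):
            # Every normal Playwright call passes straight through to the real page.
            return getattr(self._page, name)

        def act(self, step: str):
            ui_hash = hashlib.sha256(self._page.screenshot()).hexdigest()
            cached = self._cache.get(step)
            if cached and cached[0] == ui_hash:
                replay(self._page, cached[1])         # UI unchanged: replay recorded action
            else:
                action = ask_model(self._page, step)  # UI changed or first run: ask the AI
                self._cache[step] = (ui_hash, action)

    # Placeholders for the recording/LLM pieces, which are out of scope here.
    def replay(page, action): ...
    def ask_model(page, step): ...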

For 2, can you give some examples?


> For the 1st point, I generate a script with hashed checkpoints, so the next run is automated unless something changes in the UI and the AI needs to be invoked again. I make this possible by proxy-wrapping the Playwright library so I can take over every method. Users use Playwright like they always have, but with one extra method called act.

How would you determine that something changed in the UI by just looking at a screenshot? Would you additionally compare the HTML/DOM, or do an approximate comparison of the two screenshots?

> OmniParser lets you split sections of the UI to hash and watch for changes that are relevant.

I wasn't aware, thanks for sharing!

> For 2, can you give some examples?

Specifically, if you take the Shortest tool (https://shortest.com), a test runner powered by the Computer Use API, write a test "Validate the task can be pinned" for https://todomvc.com/examples/vue/dist/, and run it, it passes. It should have failed because there is no way to "pin" tasks in the app, yet it pretends that completing the task is the same as pinning it.


Compare parts of the screenshot and see if they changed. I didn't want to use the DOM at all. My hypothesis was that multimodal AI agents will get cheaper over time (Gemini Flash is crazy cheap) and that people would start putting attacks in the DOM to confuse AI.
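
Something along these lines (a rough sketch assuming Pillow; the region and threshold numbers are made up): crop the same region out of two screenshots and measure how many pixels differ.

    # Sketch: approximate comparison of one UI region across two screenshots.
    from PIL import Image, ImageChops

    def region_changed(before_png, after_png, box=(0, 0, 400, 300), threshold=0.02):
        before = Image.open(before_png).convert("RGB").crop(box)
        after = Image.open(after_png).convert("RGB").crop(box)
        diff = ImageChops.difference(before, after)
        # Fraction of pixels in this region that changed at all.
        changed = sum(1 for px in diff.getdata() if px != (0, 0, 0)) / (diff.width * diff.height)
        return changed > threshold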

Additionally, the existing tools I used struggled to interact with sites like Reddit. So I set out to skip the DOM and focus on a generalized approach.

I tried to go cheaper by using UI-TARS, an open-source model by ByteDance, to run tests locally without needing Anthropic, but it wasn't reliable enough.

That Shortest link is interesting; I didn't know it existed. Wow, the field is moving fast.


Happy to hear it works well for you! Let me know if there are any issues or features missing.


Security Kit for Drupal: https://www.drupal.org/project/seckit. I built it when I was a junior QA engineer, both learning how to program in PHP and taking my first steps in security. I open-sourced it, pretty much moved on to Ruby, and forgot about it, only to learn several years later that it's used on 50k websites around the world.


I think even Heroku wasn't the first company to offer this to the public. There was an app called Teatro.io that did the same thing back in 2014. It's down now, but I found https://web.archive.org/web/20140614011544/http://teatro.io/.

