
It's just bizarre to force a computer to go through a GUI to use another computer. Of course it's going to be expensive.


Not at all! Programs and websites are built for humans and very rarely offer non-GUI access, so this is the only feasible way to make something useful right now. I think it's also the reason robots will look like humans, have the same proportions as humans, and have roughly the same feet and hands as humans: everything in the world was designed for humans. That foundation is going to shape what gets built on top of it.

For program access, one could argue this is even how Linux tools usually work: you parse some meant-for-humans text to try to extract what you want. Sometimes, if you're lucky, you can find an argument that spits out something meant for machines. Funnily enough, Microsoft is the only one that has made any real headway toward this seemingly impossible goal: PowerShell objects [1].

[1] https://learn.microsoft.com/en-us/powershell/scripting/learn...
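
(Illustrative sketch, not from the thread, of the difference between scraping meant-for-human text and getting machine-readable output. The `ls -l` column handling below is an assumption about typical output; `ip -json` is a real iproute2 flag.)

    import json
    import subprocess

    # Fragile: scrape the human-oriented text that `ls -l` prints.
    # Column positions and date formats can shift and break this.
    out = subprocess.run(["ls", "-l", "/tmp"], capture_output=True, text=True).stdout
    sizes = {}
    for line in out.splitlines()[1:]:        # skip the "total ..." header line
        parts = line.split(None, 8)
        if len(parts) == 9:
            sizes[parts[8]] = int(parts[4])  # file name -> size in bytes

    # Sturdier: some tools expose a machine-readable mode, e.g. `ip -json`
    # emits JSON that can be loaded directly instead of parsed by eye.
    addrs = json.loads(
        subprocess.run(["ip", "-json", "addr"], capture_output=True, text=True).stdout
    )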


And to take a historical analogy, cars today are as wide as they are because that's about how wide a single-lane roadway is. And a single-lane roadway is as wide as it is because that's about the width of two horses drawing a carriage.


The story goes that this two-horse width also limited the size of the space shuttle's solid rocket boosters (SRBs), so we ended up carrying this sort of path dependence all the way into space.


With UiPath, Appian, etc., the whole field of RPA (robotic process automation) is a $XX billion industry built on that exact premise: that it's more feasible to do automation via GUIs than via badly built or nonexistent APIs.

Depending on how many GUI actions correspond to one equivalent AI-orchestrated API call, this might also not be too bad in terms of efficiency.


Most of the GUIs are Web pages, though, so you could just interact directly with an HTTP server and not actually render the screen.

Or you could teach it to hack into the backend and add an API...

Oh, and on edit: "bizarre" and "multi-billion-dollar industry" are well known not to be mutually exclusive.
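
(A minimal, made-up sketch of what "talk to the HTTP server directly" could look like: submit the same request the page's form would have sent, instead of rendering and clicking through it. The URL, fields, and token are invented for illustration.)

    import requests

    # Hypothetical endpoint and payload, standing in for whatever request the
    # GUI form would have produced after rendering and clicking through it.
    resp = requests.post(
        "https://example.com/api/settings",
        json={"notifications": "weekly"},
        headers={"Authorization": "Bearer <token>"},
        timeout=10,
    )
    resp.raise_for_status()
    print(resp.json())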


>Most of the GUIs are Web pages, though, so you could just interact directly with an HTTP server and not actually render the screen.

The end goal isn't just web pages (and I wouldn't say most GUIs are web pages). Ideally, you'd also want this to be able to navigate, say, Photoshop or any other application. And the easier your method can switch between platforms and operating systems, the better.

We've already built computer use around GUIs, so it's just much easier to center LLMs around them too. Text is an option for the command line or the web, but it isn't an easy option for the vast majority of desktop applications, never mind mobile.

It's the same reason general-purpose robots are being built in a human form factor. The human form isn't particularly special, and forcing a machine into it has its own challenges, but our world and environment have been built around it, and trying to build a hundred different specialized form factors is a lot more daunting.


You are not familiar with this market. The goal of a tool like UiPath is to replicate what a human does and to get it into production without the help of any IT/engineering teams.

Most GUIs are in fact not web pages; that's a relatively recent development on the enterprise side. So while some of them may be web pages, the goal is to be able to touch everything a user does in the workflow, which very likely includes local apps.

This iteration from Anthropic is still engineering-focused, but you can see the future of this kind of tooling bypassing engineering/IT teams entirely.


Building an entirely new world for agents to compute in is far more difficult than building an agent that can operate in the human world. However, I'm sure that over time people will start building bridges to make it easier and cheaper for agents to operate in their own native environment.

It's like another digital transformation. Paper lasted for years before everything was digitized. Human interfaces will last for years before the conversational transformation is complete.


I am just a dilettante, but I imagine that eventually agents will make API calls directly via a browser extension or a headless browser.

I assume everyone making these UI agents will build a library of each URL's API specification, trained by users.

Does that seem workable?
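
(Speculative sketch of the "library of each URL's API specification" idea: a lookup table mapping a site's pages to the API that backs them, so an agent can call the API instead of driving the page. The table structure and function are invented; the Hacker News Firebase endpoint shown is a real public API.)

    import requests

    # Invented structure: map a page pattern to the API call that backs it.
    API_SPECS = {
        "news.ycombinator.com/item": {
            "method": "GET",
            "endpoint": "https://hacker-news.firebaseio.com/v0/item/{id}.json",
        },
    }

    def fetch_item(item_id: int) -> dict:
        # An agent that knows the spec skips the GUI and hits the API directly.
        spec = API_SPECS["news.ycombinator.com/item"]
        url = spec["endpoint"].format(id=item_id)
        return requests.get(url, timeout=10).json()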


Maybe fixing this for AI will finally force good accessibility support on major platforms/frameworks/apps (we can dream).


I really hope so. Even macOS Voice Control, which has gotten pretty good, is buggy with Messages, which is a core Apple app.


Agentic workflows built on top of Electron apps running JavaScript. It's software evolution in action!


Yeah super weird that we didn't design our GUIs anticipating AI bots. Can't fuckin believe what we've done.



