Is creating a giant wobbly world of unreliable AIs, all talking to each other in an effort to get their own tasks accomplished, really leaving us much better off than humans doing everything themselves?
Better yet, if you have an application that does exactly what you want, why would you (or an AI representing you) want to do anything other than use that application? Sure, you could execute this binary and get what you want, OR you could reach out to the AI-net and make some other AI do it from scratch every time, with inherently less reliable results.
Sorry, I should have specified: this is assuming a world with perfect AIs.
The world right now is pretty strict just because of how software has to be, which has a lot of upsides and downsides. But there's some wobbliness because of bugs, which break contracts.
But I think in the future you have AI that doesn't make mistakes, and you also have contracts.
Like the airline agent booking your flight (human or AI) has a contract - they can only do certain things. They can't sell you a ticket for one dollar. Before applications we just wrote these contracts as processes, human processes. Humans often break processes. Perfect AI won't.
And to us, humans, this might even be completely transparent.
Like in the future I go to a website because I want to see fancy flight plans or whatever and choose something.
Okay, my AI goes to the airline and gets the data, then it arranges it into a UI on the fly. Maybe I can give it rules for how I typically like those UIs presented.
So there's no application. It works like an executive assistant at a job. Like if I want market research, I don't use an application for that. I ask my executive assistant. And then, one week later, I have a presentation and report with that research.
That takes a week though, perfect AI can do it instantly.
And for companies, they don't make software or applications anymore. They make business processes, and they might have a formal way of specifying them, which is similar to programming in a way, but much higher level. I identify the business flow and what my people (or AI) are allowed to do, and when, and why.
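Just to make the daydream concrete: a purely hypothetical sketch of what such a spec could look like. Every name here is invented; the point is the shape - declared permissions and invariants, not implementation:

```typescript
// Hypothetical "business process" spec - all names made up for illustration.
// The idea: declare what an actor (human or AI) may and may not do,
// and let a perfect AI carry out anything within those bounds.
interface ProcessSpec {
  actor: string;
  can: Record<string, unknown>; // permitted actions, with constraints
  cannot: string[];             // hard prohibitions
}

const ticketSale: ProcessSpec = {
  actor: "booking-agent", // human or AI - same contract either way
  can: {
    quoteFare: { minUSD: 49, maxUSD: 5000 }, // no one-dollar tickets
    issueTicket: { requires: ["payment-cleared", "identity-verified"] },
  },
  cannot: ["overrideFarePricing", "skipSafetyHolds"],
};
```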
Oh, I have no expectation of this ever coming true, I'm just daydreaming or thinking out loud. Like, how could our systems evolve in such a world and what would that look like? I think it's a fun thought experiment.
Hah! Especially seeing the same articles and pro/con discussions feels like a form of the preceding lines:
_"There's only one subject on HN for Christ's sake. _Tips For Agentic AI Use_,_Gemini CLI_, _Building X with Claude_, they're the same thing!"_
Not to be rude, but what about understanding the "transcendental nature" of LLMs allows people to build more, faster, or with better maintainability than a "hardened industry titan"?
New generations are always leapfrogging those that came before them, so I don't find it too hard to believe even under more pessimistic opinions of LLM usefulness.
They are young and inexperienced today, but won't stay that way for long. Learning new paradigms while your brain is still plastic is an advantage, and none of us can go back in time.
> They are young and inexperienced today, but won't stay that way for long.
I doubt that. For me this is the real dilemma with a generation of LLM-native developers. Does a worker in a fully automated watch factory become better at the craft of watchmaking with time?
I think the idea that LLMs are just good at "automating" is the old curmudgeon idea that young people won't have.
I think the fundamental shift is something like having only ancillary awareness of code, but high capability to architect and drill down into product details. In other words, fresh-faced LLM programmers will come out of the gate looking like really good product managers.
Similar to how C++ programmers looked down on web developers for not knowing all about malloc and pointers. Why dirty your mind with details that are abstracted away? Someone needs to know the underlying code at some point, but that may be reserved for the wizards making "core libraries" or something.
But the real advancement will be no longer being restricted by what used to be impossible. Why not a UI that is generated on the fly on every page load? Or why even have a webform that people have to fill out - just have the website ask users for the info it needs?
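A minimal sketch of the on-the-fly UI idea, assuming some model call exists; `llmComplete` below is a made-up stand-in, not a real API:

```typescript
// Sketch: regenerate the page markup on every request instead of
// shipping a fixed application. llmComplete is hypothetical.
declare function llmComplete(prompt: string): Promise<string>;

async function handleRequest(user: { name: string; uiPrefs: string }): Promise<Response> {
  // Hand the model the data plus the user's presentation rules,
  // and let it emit fresh HTML each time - no stored templates.
  const html = await llmComplete(
    `Render a flight-search page for ${user.name}. ` +
      `Presentation rules: ${user.uiPrefs}. Return only HTML.`
  );
  // Response here is the web-standard class (workers/Deno/modern Node).
  return new Response(html, { headers: { "content-type": "text/html" } });
}
```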
But do those watches tell time better? Or harder? Or louder? Once you had the quartz crystal and digital watches, mechanical movements became obsolete. Rolex and Patek Philippe are still around, but they're more of a curiosity than anything.
we’ve been taught to think of programs as sculptures, shaped by the creator to a fixed purpose. with LLMs, the greatest advance isn’t in being able to make larger and more detailed sculptures more quickly; it’s that you can make the sculptures alive.
But who _wants_ a program to be alive? To be super clear, I love the tech behind LLMs and other transformers. But when discussing regular, run of the mill software projects that don't require AI capabilities - do you really need to have the understanding of the transcendental nature of LLMs to do that job well?
That's all well and good to ask as a what if, but in terms of practical applications, the vast majority of the time you want to trade the other way around whenever possible - you want your system to work reliably. That's the cornerstone of any project that wants to be useful for others.
Interesting. Would you mind elaborating a bit on your workflow? In my work I go back and forth between the "stock" GUIs, and copy-pasting into a separated terminal for model prompts. I hate the vibe code-y agent menu in things like Cursor, I'm always afraid integrated models will make changes that I miss because it really only works with checking "allow all changes" fairly quickly.
Ah, yeah. Some agentic coding systems try to force you really heavily into clicking "allow all". I don't think it's intentional, but I don't think they're really thinking through the workflow of someone who's picky and wants to be involved as much as I am. So they make it so that canceling things is really disruptive to the agent, or difficult or annoying to do. And so it kind of railroads you into letting the agent do whatever it wants, and then trying to clean up after, which is a mess.
Typically, I just use something like QwenCode. One of the things I like about it (and I assume this is true of Gemini CLI as well) is that it's explicitly designed to make it as easy as possible to interrupt the agent in the middle of its thought or execution process and redirect it, or to reject its code changes and then directly iterate on them without having to recapitulate everything from the start. It's as easy as hitting Escape at any time. So I tell it what I want by giving it a little markdown-formatted paragraph or so - some bullet points, some numbers, maybe a heading or two - explaining the exact architecture and logic I want for a feature, not just the general feature. Then I let it get started and see where it's going. If I generally agree with the approach it's taking, I let it turn out a diff. If I like the diff after reading through it fully, I accept it. And if there's anything I don't like about it at all, I hit Escape and tell it what to change about the diff before it even gets merged in.
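For concreteness, here's an invented example of the kind of prompt I mean (not from a real session):

```
## Add rate limiting to the upload endpoint

1. Token-bucket limiter, per user id, in a new middleware module
2. Bucket state lives in Redis, keyed by user id
3. Refill 10 tokens/minute, with a burst allowance of 20
4. Return 429 with a Retry-After header when the bucket is empty

- Do NOT touch the existing auth middleware
- Match the error-response shape used elsewhere in the codebase
```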
There are three advantages to this workflow over the ChatGPT copy-and-paste workflow.
One is that the agent can automatically use grep and find, and read source files, which makes it much easier and more convenient to load it up with all of the context it needs to understand the existing style, architecture, and purpose of your codebase. Thus, it typically generates code that I'm willing to accept more often without me doing a ton of legwork.
The second is that it allows the agent, of its own accord, to run things like linters, type checkers, compilers, and tests, and automatically try to fix any warnings or errors that result, so that it's more likely to produce correct code that adheres to whatever style guide I've provided. Of course, I could run those tools manually and copy and paste the output into a chat window, but that's just enough extra effort and friction after I've gotten something ostensibly working that I know I'd be likely to get lazy and skip it at some point. This sort of ensures that it's always done. Some tools like OpenCode even automatically run LSPs and linters and feed that back into the model after the diff is applied, thus allowing it to correct things automatically.
Third, this has the benefit of forcing the AI to use small and localized diffs to generate code, instead of regenerating whole files or just autoregressively completing or filling in the middle, which makes it way easier to keep up with what it's doing and make sure you know everything that's going on. It can't slip subtle modifications past you, and it doesn't tend to generate 400 lines of nonsense.
Jon Gjengset (jonhoo), who is famously fastidious, did a live-coding stream where he did something similar in terms of control. Worth a watch if that is a style you want to explore.
I don't have the energy to do that for most things I'm writing these days, which are small PoCs where the vibe is fine.
I suspect as you do more, you will create dev guides and testing guides that can encapsulate more of that direction so you won't need to micromanage it.
If you used Gemini CLI, you picked the coding agent with the worst output. So if you got something that worked to your liking, you should try Claude.
> I suspect as you do more, you will create dev guides and testing guides that can encapsulate more of that direction so you won't need to micromanage it.
Definitely. Prompt adherence to stuff that's in an AGENTS/QWEN/CLAUDE/GEMINI.md is not perfect ime though.
>If you used Gemini CLI, you picked the coding agent with the worst output. So if you got something that worked to your liking, you should try Claude.
I'm aware actually lol! I started with OpenCode + GLM 4.5 (via OpenRouter), but I started burning through cash extremely quickly, and I can't remotely afford Claude Code, so I was using qwen-code mostly just for the 2000 free requests a day and prompt caching abilities, and because I prefer Qwen 3 Coder to Gemini... anything, for agentic coding.
You can use Claude Code against Kimi K2, DeepSeek, Qwen, etc. The $20 a month plan gets you access to a token amount of Sonnet for coding, but that wouldn't be indicative of how people are using it.
We gave Gemini CLI a spin; it is kinda unhinged, and I am impressed you were able to get your results. After reading through the Gemini CLI codebase, it appears to be a shallow photocopy knockoff of Claude Code, but with no built-in feedback loops or development guides other than "you are an excellent senior programmer ..."; the built-in prompts are embarrassingly naive.
> You can use Claude Code against Kimi K2, DeepSeek, Qwen, etc.
Yeah but I wouldn't get a generous free tier, and I am Poor lmao.
> I am impressed you were able to get your results
Compared to my brief stint with OpenCode, and with Claude Code via claude-code-router, qwen-code (which is basically a carbon copy of Gemini CLI) is indeed unhinged, and worse than the other options, but if you baby it just right you can get stuff done lol
Counterpoint: a cabinet has always been a cabinet and nobody expects it to be anything but a cabinet. Rarely are software projects as repeatable and alike to each other as cabinets are.
Software is codified rules and complexity, which is entirely arbitrary and builds off of itself in an infinite number of ways. That makes it much more difficult to turn into factory-output cabinetry.
I think more people should read "No Silver Bullet" because I hear this argument a lot and I'm not sure it holds. There _are_ niches in software that are artisanal craft, that have been majorly replaced (like custom website designers and stock WordPress templates), but the vast majority of the industry relies on cases where turning software into templates isn't possible, or isn't as efficient, or conflicts with business logic.
Counterpoint: I forget where I originally read this thought but consider compilers. At one point coding was writing assembly and now it’s generally not, sometimes some people still do it but it is far from the norm. Now, usually, you “write code” in an abstraction (possibly of an abstraction) and magic takes care of the rest.
While I imagine “make an app that does X” won’t be as useful as “if … else” there is a middle ground where you’re relinquishing much of the control you currently are trying to retain.
As complexity in a program increases, getting to the level of detail of defining the if...else becomes important. Using plain English to define the business logic, and allowing AI to fill in the gaps, will likely lead to a lot of logic errors that go uncaught until there is a big problem.
For the AI to avoid this, I'd imagine it would need to be directed not to assume anything, and instead ask for clarification on each and every thing, until there is no more ambiguity about what is required. This would be a very long and tedious back-and-forth, where someone will want to delegate the task to someone else, and at that point, the person might as well write their own logic in certain areas. I've found myself effectively giving pseudocode to the LLM to try to properly explain the logic that is needed.
I mean that's basically all high level programming languages are, right?
I would argue that as an industry we love high-level programming languages because they allow you to understand what you are writing much more easily than looking at assembly code. Excellent for the vast majority of needs.
But then people go right on and build complicated frameworks and libraries with those languages, and very quickly the complexity (albeit presented much better for reading) comes back into a project.
Sometimes you need the complexity because it makes the problem simpler to solve, especially if you have a bunch of it. Take something like a task runner, or a crude framework, or numpy… just be aware of the lower abstraction level to detect when it conflicts with the main problem.
There will be niches in research, high performance computing & graphics, security, etc. But we’re in the last generation or two that’s going to hand write their own CRUD apps. That’s the livelihood of a lot of software developers around the world.
Do people handwrite those? If you take something like Laravel or Rails, you get like 90% of the basics done by executing a few commands. The rest of it is the actual business logic and integration with obscure platforms.
I hear this denigration of CRUD apps all the time, but people forget that CRUD apps can be as complex or simple as they need to be. A CRUD app is identified as such by its purpose, not the sophistication of it.
Right now I'm writing a web app that basically manages data in a db, but guess the kinds of things I have to deal with. Here are a few (there are many, many more), in no particular order:
- Caching and load balancing infrastructure.
- Crafting an ORM that handles caching, transactions, and, well, CRUD, but in a consistent, well-typed, and IO-provider-agnostic manner (IO providers could be: DBs like Postgres, S3-compatible object stores, Redis, SQLite, localStorage, filesystems, etc. Yes, I need all those).
- Concurrent user access in a manner that is performant and avoids conflicts.
- Db performance for huge datasets (so consideration of indexes, execution plans, performant queries, performant app architecture, etc, etc)
- Defining fluent types for specifying the abstract API methods that form the core of the system
- Defining types to provide strong typing for routes that fulfill each abstract API method.
- Defining types to provide strongly-typed client wrappers for each abstract API method
- How to choose the best option for application and API security (cookies? JWTs? API keys? OAuth?)
- Choosing the best UI framework for the requirements of the app. I actually had to write a custom React-like library for this.
- Media storage strategy (what variants of each media item to generate on the server, how to generate good storage keys, etc.)
- Writing tooling scripts that are sophisticated enough to help me type-check, build, test, and deploy the app just the way I want
- Figuring out effective UI designs for CRUD pages, with sorting, filtering, and paging built in. This is far from simple. For just one example, naive paging is not performant; I need to use keyset pagination (see the sketch after this list).
- Doing all the above with robust, maintainable, and performant code
- Writing effective specs and docs for all my decisions and designs for the above
- And many many more! I've been working on this "CRUD" app for years as a greenfield project that will be the flagship product of my startup.
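On that pagination point, here's a minimal sketch of keyset pagination, assuming a hypothetical `items` table with a composite `(created_at, id)` index and node-postgres for queries. Instead of `OFFSET n` (which scans and discards n rows), you filter on the last key seen, so every page is an index seek:

```typescript
import { Pool } from "pg"; // node-postgres

const pool = new Pool();

interface Cursor {
  createdAt: string;
  id: number;
}

// Fetch one page, newest first. Pass null for the first page,
// then the returned `next` cursor for each subsequent page.
async function fetchPage(cursor: Cursor | null, pageSize = 50) {
  const { rows } = cursor
    ? await pool.query(
        `SELECT id, created_at, title
           FROM items
          WHERE (created_at, id) < ($1, $2)  -- seek past the last row seen
          ORDER BY created_at DESC, id DESC
          LIMIT $3`,
        [cursor.createdAt, cursor.id, pageSize]
      )
    : await pool.query(
        `SELECT id, created_at, title
           FROM items
          ORDER BY created_at DESC, id DESC
          LIMIT $1`,
        [pageSize]
      );
  const last = rows[rows.length - 1];
  return {
    rows,
    next: last ? { createdAt: last.created_at, id: last.id } : null,
  };
}
```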
I don't really disagree with you about handwriting CRUD apps. But I'm not sure that having an off-the-shelf solution, from AI output or not, that would spin up CRUD interfaces would _actually_ erase software as an industry.
To me it's similar to saying that there's no need for lawmakers after we get the basics covered. Clearly it's absurd, because humans will always want to expand on (or subtract from) what's already there.
Hard disagree on every point. Just because implementations aren't always perfect does not mean you should not have public services.
I know a librarian who spends an inordinate amount of time helping the elderly and tech-illiterate members of the public with creating emails, because they're necessary. However, you can't create an email account anywhere without a phone number these days - a post office option would fix that.
Email already gets enormous amounts of spam, and the only reason most don't see it is because private service providers like Google expend resources filtering them out. Why would a business not be able to charge for premium filter services on an email they don't host? Not to mention that private email services send you ads.
To be clear, I'm not saying we should shut down Gmail tomorrow, but having a free public email service option would give many people access to internet infrastructure they currently lack. It's an accessibility problem that should be addressed in the public's interest, not just shareholders'.
I'm not trying to take away from the thrust of your point. But pragmatically it seems like it could be in the scope of libraries to maintain some $4/mo prepaid SIMs to facilitate people signing up for new online accounts. Win-win for serving both the poor and people who care about privacy.
But what happens when the gov decides they don't want to fund it anymore? Or the gov decides something shouldn't be funded... say truckers on strike, or WikiLeaks? Well then boom, we have the same game, just a different player.
"flippantly tossing words around devalues them and debases the conversation." Agreed- and that's exactly what you are doing with the word, "no."
Soldiers are murdering an entire population - or as many of them as they can, seemingly - for political purposes that desire that population to simply not exist anymore. To say that is _not_ a genocide devalues the meaning of the word.
They're not "murdering an entire population"; although many thousands of Palestinians have been killed, it's still a tiny percentage of the total population.
But it's not necessary to murder an entire population for it to count as genocide. Any attempt to destroy a people counts, including forced sterilization, re-education, mass deportations, etc.
But it's also clear that Israel has explicitly targeted civilians, aid workers, journalists, refugee camps, and food distribution, and I've even read about them shooting people hiding in churches. None of those are valid targets.
* Hamas keeps its missiles, arms and other military equipment inside or underneath schools and hospitals
* UNRWA was functioning as an arms dealer by putting arms inside of bags of flour or other food items
* Hamas generally has its fighters not wear uniforms, but instead wear civilian clothes or even niqabs (where only the eyes are visible), making it extremely difficult for the IDF to determine who is a combatant and who isn't - and guaranteeing mistakes will be made.
* Hamas also uses child soldiers or orders children to throw stones at IDF soldiers - again ensuring IDF soldiers have to always be afraid the person in front of them is going to kill them and that they have to make split second decisions on what to do about it
Ah yes, the human shield argument. Like the "tunnels" and graphics provided by the IDF. Convenient, isn't it? Every hospital, apartment block, school, and refugee camp has Hamas in them, so everything is fair game.
ya it's pretty FUCKED UP that HAMAS does that, and Iran funds it, isn't it? or do you think Israel just wants to slaughter people weaker than them because they can? if that was their aim why did they wait until 10/8 to start doing it? they could have done it any time in the last 30 years.
> seemingly something happened by the democratically elected government of Gaza on 10/7
Gaza doesn't have a democratically elected government, and one of the reasons Palestine (of which Gaza is a region) does not have one is that Israel has exercised its power as an occupying power - administering large parts of Palestine directly and controlling the rest indirectly - to prevent elections that had been jointly agreed on by the two main factions.
And they’ve done that specifically to maintain the current violent and divided status quo, which they leverage as pretext to continue their long policy of genocide.
well that's not a very convincing argument. That's just a failure to recognize when the use of a tool - a base64 decoder - is needed, not a reasoning problem at all, right?
A moderately smart human who understands how Base64 works can decode it by hand without external tools other than pen and paper. Coming up with the exact steps to perform is a reasoning problem.
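Those steps are mechanical once you know them. A quick sketch of the by-hand procedure: each Base64 character encodes 6 bits, and you regroup the bit stream into 8-bit bytes:

```typescript
// Base64 decoding the way you'd do it on paper:
// look up each character's 6-bit value, concatenate the bits,
// then re-split the bit string into 8-bit bytes.
const ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

function base64Decode(s: string): Uint8Array {
  const bits = [...s.replace(/=+$/, "")] // trailing '=' padding carries no data
    .map((c) => ALPHABET.indexOf(c).toString(2).padStart(6, "0"))
    .join("");
  const bytes: number[] = [];
  for (let i = 0; i + 8 <= bits.length; i += 8) {
    bytes.push(parseInt(bits.slice(i, i + 8), 2));
  }
  return Uint8Array.from(bytes);
}

// base64Decode("SGk=") -> Uint8Array [72, 105], i.e. "Hi"
```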
That's not really a cop-out here: both models had access to the same tools.
Realistically there are many problems that non-reasoning models do better on, especially when the answer can't be reached by a thought process, like recalling internal knowledge.
You can try to teach the model the concept of a problem where thinking will likely steer it away from the right answer, but at some point it becomes like the halting problem... how does the model reliably think its way into the realization a given problem is too complex to be thought out?
Translating to Base64 is a good test of how well a model works as a language translator without changing things, because it's the same skill for an AI model.
If the model changes things, it means it didn't really capture the translation patterns for Base64, so then who knows what it will miss when translating between languages, if it can't even do Base64?
If the reasoning model was truly reasoning while the flash model was not then by definition shouldn’t it be better at knowing when to use the tool than the non-reasoning model? Otherwise it’s not really “smarter” as claimed, which seems to line up perfectly with the paper’s conclusion.
I don't know whether Flash uses a tool or not, but it answers pretty quickly. However, Pro opts to use its own reasoning, not a tool. When I look at the reasoning trace, it pulls and pulls knowledge endlessly, refining that knowledge and drifting away.
I understand that the core similarities are there, but I disagree. The comparisons have been around since I started browsing HN years ago. The moderation on this site, for one, emphasizes constructive conversation and discussion in a way that most subreddits can only dream of.
It also helps that the target audience has been filtered with that moderation, so over time this site (on average) skews more technical and informed.
This site's commenters attempt to apply technical solutions to social problems, then pat themselves on the back despite their comments being entirely inappropriate to the problem space.
There's also no actual constructive discussion when it comes to future-looking tech. The Cybertruck, the Vision Pro, and LLMs are some of the most recent items that the most popular comments called completely wrong, and the reasoning behind those predictions had no actual substance.
And the crypto asset discussions are very nontechnical here, veering into elementary and inaccurate philosophical discussions, despite this being a great forum to talk about technical aspects. Every network has pull requests and governance proposals worth discussing, and the deepest discussion here is resurrected from 2012, about the entire concept not having a licit use case that the poster could imagine.
HackerNews isn't exactly like Reddit, sure, but it's not much better. People are much better behaved, but still spread a great deal of misinformation.
One way to gauge this property of a community is whether people who are known experts in a respective field participate in it, and unfortunately there are very few of them on HackerNews (this was not always the case). I've had some opportunities to meet with people who are experts, usually at conferences/industry events, and while many of them tend to be active on Twitter... they all say the same things about this site, namely that it's simply full of bad information and the amount of effort needed to dispel that information is significantly higher than the amount of effort needed to spread it.
Next time someone posts an article about a topic you are intimately familiar with - like top-1% subject matter expert familiar - review the comment section and you'll find heaps of misconceptions, superficial knowledge, and (my favorite) the contrarians who take very strong opinions on a subject they have only passing knowledge of, and voice those contrarian opinions with a high degree of confidence.
One issue is you may not actually be a subject matter expert on a topic that comes up a lot on HackerNews, so you won't recognize that this happens... but while people here are a lot more polite and the moderation policies do encourage good behavior... moderation policies don't do a lot to stop the spread of bad information from poorly informed people.
There was a lot of pseudoscience being published and voted up in the comments around Ivermectin/HCQ/etc. and Covid, when those people weren't experts and before the Ivermectin paper got serious scrutiny.
The other aspect is that people on here think that if they are an expert in one thing, they instantly become an expert in another thing.
This is of course true in some cases and less true in others.
I consider myself an expert in one tiny niche field (computer-generated code), and when that field has come up (on HN and elsewhere) over the last 30 years, the general mood (from people who don't do it) is that it produces poor-quality code.
Pre-AI this was demonstrably untrue, but meh, I don't need to convince you, so I accept your point of view and continue doing my thing. Our company revenue is important to me, not the opinion of some guy on the internet.
(AI has freshened the conversation, and it is currently giving mixed results, which is to be expected since it is non-deterministic. But I've been doing deterministic generation for 35 years.)
So yeah. Lots of comments from people who don't do something, and I'm really not interested in taking the time to "prove" them wrong.
But equally, I think the general level of discussion in areas where I'm not an expert (but am experienced) is high. And around a lot of topics, experience can differ widely.
For example, companies, employees, and employers come in all sorts of varieties. Some folk have been burned and see (all) management in a certain light. Whereas of course, some are good, some are bad.
Yes, most people still use voting as a measure of "I agree with this", rather than the quality of the discussion, but that's just people, and I'm not gonna die on that hill.
And yeah, I'm not above joining in on a topic I don't technically use or know much about. I'll happily say that the main use for crypto (as a currency) is for illegal activity. Or that crypto in general is a ponzi scheme. Maybe I'm wrong, maybe it really is the future. But for now, it walks like a duck.
So I both agree, and disagree, with you. But I'm still happy to hang out here and get into (hopefully) illuminating discussions.
Frankly, no. As an obvious example that can be stated nowadays: Musk has always been an over-promising liar.
E.g. just look at the 2012+ videos from thunderf00t.
Yet people were literally banned here just for pointing out that he hasn't actually delivered on anything in the capacity he promised until he did the salute.
It's pointless to list other examples, as this page is, as dingnuts pointed out, exactly the same, and most people aren't actually willing to change their opinion based on arguments. They're set in their opinions and think everyone else is dumb.
> Yet people were literally banned here just for pointing out that he hasn't actually delivered on anything in the capacity he promised until he did the salute.
I'd be shocked if they (you?) were banned just for critiquing Musk. So please link the post. I'm prepared to be shocked.
I'm also pretty sure that I could make a throwaway account that only posted critiques of Musk (or about any single subject for that matter) and manage to keep it alive by making the critiques timely, on-topic and thoughtful or get it banned by being repetitive and unconstructive. So would you say I was banned for talking about <topic>? Or would you say I was banned for my behavior while talking about <topic>?
Aside from the fact that I highly doubt anyone was banned as you describe, EM’s stories have gotten more and more grandiose. So it’s not the same.
Today he’s pitching moonshot projects as core to Tesla.
10 years ago he was saying self-driving was easy, but he was also selling by far the best electric vehicle on the market. So lying about self driving and Tesla semis mattered less.
Fwiw I’ve been subbed to tf00t since his 50-part creationist videos in the early 2010s.
I don’t see how that example refutes their point. It can be true both that there have been disagreeable bans and that the bans, in general, tend to result in higher quality discussions. The disagreeable bans seem to be outliers.
> They're set in their opinions and think everyone else is dumb.
Well, anyway, I read and post comments here because commenters here think critically about discussion topics. It’s not a perfect community with perfect moderation but the discussions are of a quality that’s hard to find elsewhere, let alone reddit.