Kinda funny but I think LLM-assisted workflows are frequently slow -- that is, if I use the "refactor" features in my IDE it is done in a second, if I ask the faster kind of assistant it comes back in 30 seconds, if I ask the "agentic" kind of assistant it comes back in 15 minutes.
I asked an agent to write an HTTP endpoint at the end of the work day when I had just 30 min left -- my first thought was "it took 10 minutes to do what would have taken a day", but then I thought, "maybe it was 20 minutes for 4 hours' worth of work". The next day I looked at it and found the logic was convoluted; it had tried to write good error handling but didn't succeed. I went back and forth and ultimately wound up recoding a lot of it manually. In 5 hours I had it done for real, certainly with a better test suite than I would have written on my own, and probably better error handling.
As a counter example (re: agents), I routinely delegate simple tasks to Claude Code and get near-perfect results. But I've also had experiences like yours where I ended up wasting more time than saved. I just kept trying with different types of tasks, and narrowed it down to the point where I have a good intuition for what works and what doesn't. The benefit is I can fire off a request on my phone, stick it in my pocket, then do a code review some time later. This process is very low mental overhead for me, so it's a big productivity win.
Except the tokens you insert have meaning, and some yield better results than others. Not like a slot machine at all, really. Last I checked, those only have 1 possible input, no way to improve your odds.
Not really, it's not a zero-sum game. You're not competing against anything, you're working with something. It's just a tool that takes practice, has some variability and isn't free. Like most things in life. More like buying corn or having friends.
I bought a bunch of poker chips and taught Texas Hold'em to my kids. We have a fantastic time playing with no money on the line, just winning or losing the game based on who wins all the chips.
How's that different from a human developer? Give the same task to different developers and you'll get different levels of correctness and quality. Give the same task to the same developer on different days and the same is true.
It's a lot faster to give a task to an AI agent than to a developer. The agent is always at its desk, always listening, and will immediately prioritize whatever you tell it to do.
An AI agent always has capacity, does not have competing priorities, nor does it have ideas about what does or does not fall within its "scope of work".
I don't know how to do it with Claude Code, but I was on a beach vacation for the past few days and I was studying French on my phone with a webapp that I made. Sometimes I'd notice something bug me, and I used Cursor's "background agents" tool to ask it to make a change. This is essentially just a website where you can type in your request; they allocate a VM, check out your repository, run the Cursor LLM agent inside that VM to implement your requested changes, then push them and create a pull request to your repo. Because I have CI/CD set up, I then just merged the change and waited for it to deploy (usually going for a swim in between).
I realized as I was doing it that I wouldn't be able to tell anyone about it because I would sound like the most obnoxious AI bro ever. But it worked! (For the simple requests I used it on.) The most annoying part was that I had to tell it to run rustfmt every time, because otherwise it would fail CI and I wouldn't be able to merge it. And then it would take forever to install a rust toolchain and figure out how to run clippy and stuff. But it did feel crazy to be able to work on it from the beach. Anyway, I'm apparently not very good at taking vacations, lol
My dev environment works perfectly on Termux, and so does Claude Code. So I just run `claude` like normal, and everything is identical to how I do it on desktop.
The cost is in the context switching. Say you fire off three tasks that come back 15, 20, and 30 minutes later. The first is mostly OK, so you finish it by hand. The second has some problems, so you ask for a rework. Then the third comes back and, while OK, has some design problems, so you ask for another rework. Then the second one comes back, and you have to remember the original task and what changes you asked for.
I've already written about this several times here. I think the current trend of LLMs chasing benchmark scores is going in the wrong direction, at least for programming tools. In my experience they get it wrong often enough that I always need to check the work. So I end up in a back-and-forth with the LLM, and because of the slow responses it becomes a really painful process; I could often have done the task faster if I had sat down and thought about it. What I want is an agent that responds immediately (and I mean in subseconds), even if some benchmark score is 60% instead of 80%.
Programmers (and I'm including myself here) often go to great lengths to not think, to the point of working (with or without a coding assistant) for hours in the hope of avoiding one hour of thinking. What's the saying? "An hour of debugging/programming can save you minutes of thinking," or something like that. In the end, we usually find that we need to do the thinking after all.
I think coding assistants would end up being more helpful if, instead of trying to do what they're asked, they would come back with questions that help us (or force us) to think. I wonder if a context prompt that says, "when I ask you to do something, assume I haven't thought the problem through, and before doing anything, ask me leading questions," would help.
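Something along these lines, maybe (hypothetical wording, untested):

    When I ask you to implement something, assume I have not thought
    the problem through. Before writing any code, ask me a few leading
    questions: what are the edge cases, what invariants must hold,
    what should happen on failure? Only start implementing after I
    answer and explicitly tell you to go ahead.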
I think Leslie Lamport once said that the biggest resistance to using TLA+ - a language that helps you, and forces you, to think - is that thinking is the last thing programmers want to do.
I do both. I like to develop designs in my head, and there’s a lot of trial and error.
I think the results are excellent, but I can hit a lot of dead ends on the way. I just spent several days trying out all sorts of approaches to PassKeys/WebAuthn. I finally settled on an approach that I think will work great.
I have found that the old-fashioned “measure twice, cut once” approach is highly destructive. It was how I was trained, so walking away from it was scary.
> I have found that the old-fashioned “measure twice, cut once” approach is highly destructive. It was how I was trained, so walking away from it was scary.
To be fair it’s great advice when you’re dealing with atoms.
> Programmers (and I'm including myself here) often go to great lengths to not think, to the point of working (with or without a coding assistant) for hours in the hope of avoiding one hour of thinking. What's the saying? "An hour of debugging/programming can save you minutes of thinking," or something like that. In the end, we usually find that we need to do the thinking after all.
This is such a great observation. I'm not quite sure why this is. I'm not a programmer but a signal-processing/systems engineer/researcher. The weird thing is that it seems to be the process of programming itself that causes the "not-thinking" behaviour. E.g., when I program a simulation and find that I must have a sign error somewhere in my implementation (sometimes you can see this from the results), I end up switching every possible sign around instead of taking pen and paper and comparing theory and implementation. When I do other work, e.g. theory, that's not the case. I suspect we try to avoid the cost of the context switch and try to stay in the "programming flow".
This is your brain trying to conserve energy/time by recollecting/brute-forcing/following known patterns instead of diving into the unknown. Otherwise known as "being lazy" / procrastinating.
There is an illusion that the error is tiny and its nature obvious, so it could be fixed by an instant, effortless tweak. Sometimes that is so (when the compiler complains about a forgotten semicolon); sometimes it may be arbitrarily deeply wrong (even if it manifests just as a reversed sign).
Sometimes thinking and experimenting go together. I had to do some work on some Typescript/yum code that I didn't write but had previously done a little maintenance on.
TypeScript can produce astonishingly complex error messages when types don't match up, so I went through a couple of rounds of showing the errors to the assistant and getting suggested fixes that were wrong. But they gave me some ideas, and I did more experiments, and over the course of two days (making desired changes along the way) I figured out what was going wrong. I cleaned up the use of types to the point that I was really happy with my code; when I saw a red squiggle I usually knew right away what was wrong, and if I did ask the assistant, it would also get it right away.
I think there's no way I would have understood what was going on without experimenting.
Agreed. LLMs also change the balance of plan vs. do for me; sometimes it's cheaper to do and review than to plan up front.
When you can see what goes wrong with the naive plan you then have all the specific context in front of you for making a better plan.
If something is wrong with the implementation then I can ask the agent to then make a plan which avoids the issues / smells I call out. This itself could probably be automated.
The main thing I feel I'm "missing": I think it would be helpful if there were an easier way to back up in the conversation such that the state of the working copy is restored as well. Basically, I want the agent's work to be directly integrated with git, such that "turns" are commits and you can branch at any point.
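A minimal sketch of what I mean, assuming a hypothetical run_agent_turn() that applies the agent's edits for a prompt to the working copy:

    import subprocess

    def git(*args):
        # Run a git command in the repo and return its stdout.
        return subprocess.run(["git", *args], check=True,
                              capture_output=True, text=True).stdout.strip()

    def turn(prompt):
        # run_agent_turn() is hypothetical: it applies the agent's
        # edits for this prompt to the working copy.
        run_agent_turn(prompt)
        git("add", "-A")
        git("commit", "-m", "agent turn: " + prompt[:60])
        return git("rev-parse", "HEAD")  # this commit id *is* the turn's state

Backing up in the conversation is then just checking out a branch at the corresponding turn's commit and replaying from there.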
I like that prompt idea, because I hatehatehate when it just starts "doing work". These things are much better as a sounding board for ideas and for clarifying my thinking than for writing one-shot code.
I dictate rambling, disorganized, convoluted thoughts about a new feature into a text file.
I tell Claude Code or Gemini CLI to read my slop, read the codebase, and write a real functional design doc in Markdown, with a section on open issues and design decisions.
I'll take a quick look at its approach and edit the doc to tweak its approach and answer a few open questions, then I'll tell it to answer the remaining open questions itself and update the doc.
When that's about 90% good, I'll tell the local agent to write a technical design doc to think through data flow, logic, API endpoints and params and test cases.
I'll have it iterate on that a couple more rounds, then tell it to decompose that work into a phased dev plan where each phase is about a week of work, and each task in the phase would be a few hours of work, with phases and tasks sequenced to be testable on their own in frequent small commits.
Then I have the local agent read all of that again, the codebase, the functional design, the technical design, and the entire dev plan so it can build the first phase while keeping future phases in mind.
It's cool because the agent isn't only a good coder; it's a decent designer and planner too. It can read and write Markdown docs just as well as code, and it makes surprisingly good choices on its own.
And I have complete control to alter its direction at any point. When it methodically works through a series of small tasks it's less likely to go off the rails at all, and if it does it's easy to restore to the last commit and run it again.
1. Shame on you, that doesn't sound like fun vibe coding, at all!
2. Thank you for the detailed explanation, it makes a lot of sense. If AI is really a very junior dev that can move fast and has access to a lot of data, your approach is what I imagine works - and, crucially, it explains why there is such a difference in outcomes using it. Because what you're saying is, frankly, a lot of work. Based on that work you can probably double your output as a programmer, but considering the many code bases I've seen with zero documentation and zero tests, I think there is a huge chunk of programmers who would never do what you're doing because "it's boring".
3. Can you share maybe an example of this, please:
> and write a real functional design doc in Markdown, with a section on open issues and design decisions.
I agree with your comment in general; however, I would say that in my field, the resistance to TLA+ isn't having to think, but rather having to code twice without guarantees that the code actually maps to the theoretical model.
Tools like Lean and Dafny are much more appreciated, as they generate code from the model.
But both Dafny and Lean (which are really hard to put in the same category [1]) are used even less than TLA+, and the problem of formally tying a spec to code exists only when you specify at a level that's much higher than the code, which is what you want most of the time because that's where you get the most bang for your buck. It's a little like saying that the resistance to blueprints is that a rolled blueprint makes a poor hammer.
TLA+ is for when you have a 1MLOC database written in Java or a 100KLOC GC written in C++ and you want to make sure your design doesn't lead to lost data or to memory corruption/leak (or for some easier things, too). You certainly can't do that with Dafny, and while I guess you could do it in Lean (if you're masochistic and have months to spare), it wouldn't be in a way that's verifiably tied to the code.
There is no tool that actually formally ties spec to code in any affordable way and at real software scale, and I think the reason people say they want what doesn't exist is precisely because they want to avoid the thinking that they'll have to do eventually anyway.
[1]: Lean and TLA+ are sort-of similar, but Dafny is something else altogether.
> That is not the case for the TLA+ spec and your 1MLOC Java Database.
That is the case. Of course, nobody bothers to write the TLA+ proof that that is the case, because even if somebody had the resources to do it, the ROI on doing that is not good. If you can avoid 4 major bugs with 10 hours of work, you probably won't want to work an extra 10,000 hours to avoid two additional minor ones. That most people choose to stop when the ROI gets bad and not when they achieve perfection is not a problem.
The question isn't what tool guarantees perfection (there isn't one), but what toolset can reduce the greatest number of (costly) bugs with the least effort, and tools that help you think rigorously about design are a part of such a toolset.
> You hope with fingers crossed that you've implemented the design, but have you?
The same way you always validate that you've implemented what you intended - which is more than just keeping your fingers crossed - except that TLA+'s job is to make sure that what you intend actually works (if implemented).
> While Dafny might not be the answer we should strive to find a good way to do refinement.
TLA+ does refinement in a much more powerful way than Dafny. Neither is able to do it from a high-level design down to a large and realistic codebase, certainly not in any affordable way, but nothing can. I guess that is a problem, but it's not a problem we can solve, and there are other big problems that we can.
Too defeatist. If much of the software infrastructure of the world was built on say... Idris, we could do it. That's the promise of dependent types, proof carrying code.
Can we extend that to large scale software? There's no obvious barrier to it, beyond a lack of existing provably correct code to build upon.
I don't expect this to change, however, since the cost/benefit ratio just isn't there. And that makes me sad. We build everything on quicksand.
It's like saying that there's no barrier to turning lead into gold except for the fact that it's easier to get gold by other means; that's a good thing! The cost of deriving formal deductive proofs is high, while unsound methods are much cheaper and highly effective.
The reason proofs can be expensive is that proving things has an intrinsic computational complexity cost (it's a search problem in a large space). A decade ago I summarised some relevant results of the last three or four decades about the difficulty in proving programs correct: https://pron.github.io/posts/correctness-and-complexity
If software were generally simple enough for all of its interesting properties to be easily proven, that would mean that there wouldn't be much point in the software at all. I think it was a formal methods researcher/practitioner at NASA who once said something like, "computers were built to surprise us; if they didn't, there would be no point in building them in the first place."
> We build everything on quicksand.
You do realise that while an abstract algorithm or a program on the page can be proven correct, it is impossible to prove that a software system is correct, because it's a physical system, and not a mathematical object anymore (you cannot prove that hardware will behave as specified). If mathematical proofs were the only thing that's not "quicksand", then everything in the physical world is quicksand.
Appreciation isn't the same as market share; formal proofs in general are pretty much nonexistent in enterprise, unless there are legal requirements that say otherwise.
I fail to see how you validate that the TLA+ model is actually correctly mapped onto the written Java code.
> formal proofs in general are pretty much nonexistent in enterprise, unless there are legal requirements that say otherwise.
Formal proofs are rarely used when specifying with TLA+, too, BTW. Writing formal proofs (as you would in Lean) has a very low ROI, and even formal method fans (like me) would tell you that's a tool you should reach for very rarely, and only if you must.
> I fail to see how you validate that the TLA+ model is actually correctly mapped onto the written Java code.
You don't (not even with Lean), but that we can't have cars that are completely crash-proof doesn't mean that's the standard for accepting or rejecting a safety measure. With TLA+ you can make sure that the design that you have (and you can't validate is actually implemented in code with or without TLA+) is actually good.
In other words, it lets us think about design rigorously. Maybe that's not all we wish for, but it's a lot, and it's not like there are better, easier ways of doing that. Of course, if the goal is to avoid thinking hard about design, then a tool that helps us think even harder isn't what we want.
> "An hour of debugging/programming can save you minutes of thinking,"
I get what you're referring to here, when it's tunnel-vision debugging. Personally I usually find that coding/writing/editing is thinking for me. I'm manipulating the logic on screen and seeing how to make it make sense, like a math problem.
LLMs help because they immediately think through a problem and start raising questions and points of uncertainty. Once I see those questions in the <think> output, I cancel the stream, think through them, and edit my prompt to answer the questions beforehand. This often causes the LLM's responses to become much faster and shorter, since it doesn't need to agonise over those decisions any more.
it's funny, I feel like I'm the opposite and it's why I truly hate working with stuff like claude code that constantly wants to jump into implementation. I want to be in the driver's seat fully and think about how to do something thoroughly before doing it. I want the LLM to be, at most, my assistant. Taking on the task of being a rubber duck, doing some quick research for me, etc.
It's definitely possible to adapt these tools to be more useful in that sense... but it definitely feels counter to what the hype bros are trying to push out.
In general agreement about the need to think it through, though she should be careful not to praise the other extreme.
> "An hour of debugging/programming can save you minutes of thinking"
The trap so many devs fall into is assuming the code behaves the way they think it does. Or believing documentation or seemingly helpful comments. We really want to believe.
People's mental image is more often than not wrong, and debugging tremendously helps bridge the gap.
Absolutely!
I had used Copilot for a few weeks and then stopped when, working on a machine that didn't have Copilot installed, I immediately struggled with even basic syntax.
Now I often use LLMs as advanced rubber ducks. By describing my problems, the solution often comes to my mind on its own and sometimes the responses I get are enough for me to continue on my own.
In my opinion, letting LLMs write the code directly can be really harmful for software developers, because they forget how to think for themselves.
Maybe I'm wrong and I am just slow to accept the new reality, but I try to keep writing most of my code on my own and improve my coding skills more than my prompting skills (while still using these tools, of course).
For me, LLMs are like a grumpy and cynical old senior dev who is forced to talk in a very positive manner and who has fun trickling in some completely random bullshit between his actual helpful advice.
World of LLMs or not, development should always strive to be fast. In the LLM world, users should always have control over the accuracy-vs-speed trade-off (though we can try to improve both, not trade one for the other). For example, at rtrvr.ai we use Gemini Flash as our default, and in our benchmarking with Flash, at 0.9 min per task, it still yielded top results. That said, I have to accept there are certain web tasks on tail-end sites that need Pro to navigate accurately at this point. This is a limitation of our reliance on off-the-shelf Gemini models; once we move to our own models trained on web trajectories, this hopefully will not be a problem.
If you use off-the-shelf LLMs, their speed will always be your bottleneck.
The only way I've found that LLMs speed up my work is as a sort of advanced find-and-replace.
A prompt like: "I want to make this change in the code wherever any logic deals with XXX: be/do XXX instead/additionally/somelogicchange/whatever."
It has been pretty decent at these types of changes and saves the time of poking through and finding all the places I would have updated manually, in a way that find/replace never could. Though I've never tried this on a huge code base.
You would be right about the code but probably wrong about the you. I’ve done such requests to clean up code written over the years by dozens of other people copying patterns around because ship was king… until it wasn’t. (They worked quite well, btw.)
Sometimes you want a cutpoint for a refactor and only that refactor. And it turns out that there is no nice abstraction that is useful beyond that refactor.
I knew someone would make this comment. I almost added an "I'm probably not leet enough to avoid these situations" disclaimer, but it seemed a bit pointlessly self-deprecating.
You don't always get to choose the state of or the way a system you work in/with is designed. In this case I was working in a limited scripting language that I have no choice about.
Keep that nose turned up. I'm sure you are leet10xninja. Maybe work on your reading comprehension before you dump on someone though as I already specified that I greatly simplified for comment sake.
No slight was intended. I've learned a lot of techniques; I don't consider this a matter of elitism. I have generally been fortunate to have control over the architecture of my projects; when I encounter something like this in someone else's code, I can at least raise my concern. Sometimes the code is that way because doing it the other obvious way would lead to some other inconvenience. Hence "a sign", not proof. It's worth investigating signs.
I suppose you haven't tried emacs grep-mode or vim's quickfix? If the change is mechanical, you create a macro and are done in seconds. If it's not, you still get the high-level overview and quick navigation.
Finding and jumping to all the places is usually easy, but non-trivial changes often require some understanding of the code beyond just line-based regex replace. I could probably spend some time recording a macro that handles all the edge cases, or use some kind of AST-based search and replace, but cursor agent does it just fine in the background.
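For scale, even the easy half of the AST route - just finding the call sites - is already this much Python, and the rewriting is where it gets fiddly. A rough sketch (fetch_user is a made-up name):

    import ast, pathlib

    def find_calls(root, func_name):
        # Walk every .py file under root and print call sites of func_name.
        for path in pathlib.Path(root).rglob("*.py"):
            tree = ast.parse(path.read_text())
            for node in ast.walk(tree):
                if (isinstance(node, ast.Call)
                        and isinstance(node.func, ast.Name)
                        and node.func.id == func_name):
                    print(f"{path}:{node.lineno}")

    find_calls("src", "fetch_user")  # "fetch_user" is a made-up example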
Code structure is simple; semantics is where it gets tough. So if you have a good understanding of the code (and even when you don't), the overview you get from one of those tools (and the added interactivity) is nice for confirming your understanding of the actions that need to be done.
> cursor agent does it just fine in the background
That's for a very broad definition of "fine". And you still need to review the diff and check the surrounding context of each chunk. I don't see the improvement in metrics like productivity and cognitive load, especially if you need to do several rounds.
You mentioned grep-mode, which to my knowledge is just bringing up a buffer with all the matches for a regex and easily jumping to each point (I use rg.el myself). For the record, this is basically the same thing as VSCode's search tool.
Now, once you have that, to actually make edits, you have to record a macro to apply at each point or just manually do the edit yourself, no? I don't pretend LLMs are perfect, but I certainly think using one is a much better experience for this kind of refactoring than those two options.
Maybe it's my personal workflow, but I either have sweeping changes (variable names, removing dependencies), which are easily macroable, or very targeted ones (extracting functions, decoupling stuff, ...). For both, this navigation is a superpower, and coupled with the other tools of emacs/vim, editing is very fast. That relies on a very good mental model of the code, but any question can be answered quickly with the above tools.
For me, it's like having a moodboard with code listings.
Yes, I've done this kind of refactoring for ages using emacs macros and grep. Language servers and tree-sitter in emacs have made this faster (when I can get all the dependencies set up correctly, that is). Variable-name edits and function extraction are pretty much table stakes in most modern editors like IntelliJ, VSCode, Zed, etc. IIRC Eclipse had this capability 15-20 years ago.
I used to have more patience for doing it the grep/macro way in emacs. It used to feel a bit zen, like going through the code and changing all the call-sites to use my new refactor or something. But I've been coding for too long to feel this zen any longer, and my own expectations for output have gotten higher with tools like language-server and tree-sitter.
The kind of refactorings I turn to an LLM for are different, like creating interfaces/traits out of structs or joining two different modules together.
I'm decent at that kind of stuff. However, that's not really what I'm talking about. For instance, today I needed two logic flows: one for data flowing in one direction, then a basically-but-not-quite-reversed version of the same logic for when the data comes back. I was able to write the first version, then tell the LLM:
"Now duplicate this code but invert the logic for data flowing in the opposite direction."
I'm simplifying this whole example obviously, but that was the basic task I was working on. It was able to spit out in a few seconds what would have taken me probably more than an hour and at least one tedium-headache break. I'm not aware of any pre-LLM way to do something like that.
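To give a flavor of the shape (a made-up miniature, not the actual code): outbound packing and its not-quite-mirror inbound unpacking.

    from datetime import datetime

    def pack_outbound(record):
        # App record -> wire format: rename keys, convert units.
        return {
            "id": str(record["user_id"]),
            "amount_cents": int(round(record["amount"] * 100)),
            "sent_at": record["timestamp"].isoformat(),
        }

    def unpack_inbound(msg):
        # Wire format -> app record: mostly the inverse, except
        # non-numeric ids map to None instead of raising.
        return {
            "user_id": int(msg["id"]) if msg["id"].isdigit() else None,
            "amount": msg["amount_cents"] / 100,
            "timestamp": datetime.fromisoformat(msg["sent_at"]),
        }

The interesting part is exactly the not-quite-symmetry, which is what plain find/replace can't produce.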
Or a little while back, I was implementing a basic login/auth for a website. I was experimenting with high-output-token LLMs (I'm not sure that's the technical term) and asked one to make a very comprehensive login handler. I had to stop it somewhere in the triple digits of cases and functions. Perhaps not a great "pro" example of LLMs, but even though it was a hilariously over-complex setup, it did give me some ideas I hadn't thought about. I didn't use any of the code, though.
It's far from the magic the LLM sellers want us to believe in, but it can save time, the same as various emacs/vim tricks can for devs who want to learn them.
emacs macros aren't the same. You need to look at the file, observe a pattern, then start recording the macro and hope the pattern holds. An LLM can just do this.
I am familiar with grep-mode and have used it and macro recording for years; I've been using emacs for 20 years. grep-mode (these days I use rg) just brings up all the matches, which lets me apply a macro that I recorded. That's not the same as telling Claude Code to just make the change. Macros aren't table stakes, but find-and-replace across projects is table stakes in pretty much any post-emacs/vim code editor (and both emacs and vimlikes obviously have plenty of support for it).
I guess it depends? For the "refactor" stuff, if your IDE or language server can handle it, then yeah, I find the LLM slower for sure. But there are other cases where an LLM helps a lot.
I was writing some URL canonicalization logic yesterday. Because we rolled this out as an MVP, customers entered URLs in all sorts of formats and we stored them in the DB as-is. My initial pass at the logic failed on some cases. Luckily, URL canonicalization is pretty trivially testable. So I took the most-used customer URLs from our DB, sent them to Claude, and told it to come up with the "minimum spanning test cases" that cover this behavior. This took maybe 5-10 sec. I then told Zed's agent mode, using Opus, to make me a test file and use these test cases to call my function. I audited the test cases and ended up removing some silly ones. I iterated on my logic and that was that. Definitely faster than having to do this myself.
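The shape of it looked something like this (a hypothetical canonicalize(), not our real one, plus the kind of spanning cases it proposed):

    import pytest
    from urllib.parse import urlsplit, urlunsplit

    def canonicalize(url):
        # Stand-in for the real logic: force https, lowercase the host,
        # drop "www.", trailing slashes, and fragments.
        if "://" not in url:
            url = "https://" + url
        s = urlsplit(url)
        host = s.netloc.lower().removeprefix("www.")
        return urlunsplit(("https", host, s.path.rstrip("/"), s.query, ""))

    @pytest.mark.parametrize("raw,expected", [
        ("Example.com",              "https://example.com"),
        ("http://www.example.com/",  "https://example.com"),
        ("https://example.com/a/b/", "https://example.com/a/b"),
        ("example.com/a?q=1#frag",   "https://example.com/a?q=1"),
    ])
    def test_canonicalize(raw, expected):
        assert canonicalize(raw) == expected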
All the references to LLMs in the article seemed out-of-place like poorly done product placement.
LLMs are the antithesis of fast. In fact, being slow is a perceived virtue with LLM output: some sites like Google and Quora (until recently) simulate the slow typed-output effect for their pre-cached LLM answers, just for credibility.
Eeeh, I spend less time writing code, but way more time reviewing and correcting it. I'm not sure I come out ahead overall, but it does make development less boilerplate-heavy and more high-level, which leads to code that otherwise wouldn't have been written.
I wonder if you observe this when you use it in a domain you know well versus a domain you know less well.
I think LLM assistants help you become functional across a broader context -- and I completely agree that testing and reviewing become much, much more important.
E.g., a front-end dev optimizing database queries, but also being given nonsensical query parameters that don't exist.
That sounds plausible if the senior does lots of simple coding tasks and moves that work to an agent. Then the senior basically has to be a team lead and do code reviews/QA.
A senior can write, test, deploy, and possibly maintain a scalable microservice or similar sized project without significant hand-holding in a reasonable amount of time.
A junior might be able to write a method used by a class but is still learning significant portions and concepts either in the language, workflow orchestration, or infrastructure.
A principal knows how each microservice fits into the larger domain it serves, whether or not they understand all the services and all the domains they serve.
A staff engineer has significant principal-level understanding across many or all of the domains an organization uses, builds, and maintains.
AI code assistance help increase breadth and, with oversight, improve depth. One can move from the "T" shape to "V" shape skillset far easier, but one must never fully trust AI code assistants.
I switch from Cursor to VS Code many times a day just to use its Python refactoring feature. The Pylance server that comes with Cursor doesn't support refactoring.
> I asked an agent to write an HTTP endpoint at the end of the work day when I had just 30 min left -- my first thought was "it took 10 minutes to do what would have taken a day", but then I thought, "maybe it was 20 minutes for 4 hours' worth of work". The next day I looked at it and found the logic was convoluted; it had tried to write good error handling but didn't succeed. I went back and forth and ultimately wound up recoding a lot of it manually. In 5 hours I had it done for real, certainly with a better test suite than I would have written on my own, and probably better error handling.
See https://www.reddit.com/r/programming/comments/1lxh8ip/study_...