As someone that started using Co-work, I feel like I am going insane with the frequency that I have to keep telling it to stay on task.
If you ask it to do something laborious, like review a bunch of websites for specific content, it will constantly give up, providing you information on how you can continue the process yourself to save time. It's maddening.
That’s pretty funny when compared with the rhetoric like “AI doesn’t get tired like humans.” No, it doesn’t, but it roleplays like it does. I guess there is too much reference to human concerns like fatigue and saving effort in the training.
This is what happens when a bunch of billionaires convince people autocomplete is AI.
Don't get me wrong, it's very good autocomplete and if you run it in a loop with good tooling around it, you can get interesting, even useful results. But by its nature it is still autocomplete and it always just predicts text. Specifically, text which is usually about humans and/or by humans.
You are not wrong, but after having started working with LLMs, I have this feeling that many humans are simply autocomplete engines too. So LLMs might be actually close to AGI, if you define "general" as "more than 50% of the population".
Humans are absolutely auto-complete engines, and regularly produce incorrect statements and actions with full confidence that they are precisely correct.
Just think about how many thousands of times you've heard "good morning" after noon both with and without the subsequent "or I guess I should say good afternoon" auto-correct.
Well, the essence of software engineering is taking complex real-world tasks and breaking them down into simpler parts until they can be done by (conceptually) simple digital circuits.
So it's not surprising that eventually autocomplete can reach up from those circuits and take on some tasks that have already been made simple enough.
I think what's so interesting is how uneven that reach is. At some tasks it is better than at least 90% of devs and maybe even superhuman (by which, in this case, I mean better than any single human; I've never seen an LLM do something that a small team couldn't do better if given a reasonable amount of time). In other cases actual old-school autocomplete might do a better job: the extra capabilities add up to negative value and the LLM's presence is a distraction.
Sometimes there is an obvious reason why (solving a problem with lots of example solutions online vs. working with poorly documented proprietary technologies), but other times there isn't. They have certainly raised the floor somewhat, but the peaks and valleys remain enormous, which is interesting.
To me that implies there are both lots of untapped potential and challenges that the LLM developers have not even begun to face.
Yep. The veil of coherence extends convincingly far by means of absurd statistical power, but the artifacts of next-token prediction become far more obvious when you're running models that can work on commodity hardware.
> As someone that started using Co-work, I feel like I am going insane with the frequency that I have to keep telling it to stay on task.
Used to have the same thing happening when using Sonnet or Opus via Windsurf.
After switching to Claude Code directly though (and using "/plan" mode), this isn't a thing any more.
So I reckon the problem is in some of these UIs and wrappers, and probably not in the models they're sending the data to. Windsurf is one example, and we no longer use it due to the inferior results.
In my experience all of the models do that. It's one of the most infuriating things about using them, especially when I spend hours putting together a massive spec/implementation plan and then have to sit there babysitting it, going "are you sure phase 1 is done?" and "continue to phase 2".
I tend to work on things where there is a massive amount of code to write but once the architecture is laid down, it's just mechanical work, so this behavior is particularly frustrating.
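For what it's worth, that babysitting can be scripted. Here is a minimal sketch, not a definitive implementation: `call_model` and `is_done` are hypothetical callables you would supply yourself (the first sends a prompt to whatever model or agent you use and returns its reply, the second checks the output, for example by running tests), and the prompt wording is only an assumption.

```python
from typing import Callable, List


def run_phases(
    phases: List[str],
    call_model: Callable[[str], str],     # hypothetical: sends a prompt, returns the reply
    is_done: Callable[[str, str], bool],  # hypothetical: checks a phase's output
    max_attempts: int = 3,
) -> List[str]:
    """Run each phase in order, re-prompting until its check passes."""
    results = []
    for number, phase in enumerate(phases, start=1):
        prompt = f"Complete phase {number} in full: {phase}"
        for _ in range(max_attempts):
            reply = call_model(prompt)
            if is_done(phase, reply):
                results.append(reply)
                break
            # The scripted version of "are you sure phase 1 is done?"
            prompt = (
                f"Phase {number} is not finished: {phase}. "
                "Do not summarize or hand the work back; continue it."
            )
        else:
            # No attempt passed the check, so stop instead of moving on.
            raise RuntimeError(
                f"phase {number} still incomplete after {max_attempts} attempts"
            )
    return results
```

The point of the design is that "done" is decided by a check outside the model rather than by the model's own claim, so "continue to phase 2" only happens once phase 1 has actually passed.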
I hope you will excuse my ignorance on this subject, so as a learning question for me: is it possible to add what you put there as an absolute condition, that all available functions and data are present as an overarching mandate, and it’s simply plug and chug?
Recently it seems that even if you add those conditions, the LLMs will tend to ignore them, so you have to prompt them repeatedly. Sometimes strong or emphatic language will help them keep it “in mind”.
I've found it better to start with an overall analysis, split the work into smaller tasks, and have it do only one subtask at a time, giving me the next prompt once it has finished (or feed that prompt to a system of agents). There is a real threshold beyond which quality is lost.
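Roughly, that chaining could look like the sketch below. It rests on assumptions: `call_model` is a hypothetical function that sends a prompt to your model and returns its reply, and the `NEXT PROMPT:` marker is a made-up convention the model is asked to follow, not a feature of any tool.

```python
from typing import Callable, List, Optional, Tuple

NEXT_MARKER = "NEXT PROMPT:"  # hypothetical convention the model is asked to follow


def split_reply(reply: str) -> Tuple[str, Optional[str]]:
    """Separate the subtask result from the follow-up prompt, if any."""
    result, marker, next_prompt = reply.partition(NEXT_MARKER)
    next_prompt = next_prompt.strip()
    return result.strip(), (next_prompt if marker and next_prompt else None)


def run_chain(
    first_prompt: str,
    call_model: Callable[[str], str],  # hypothetical: sends a prompt, returns the reply
    max_steps: int = 20,
) -> List[str]:
    """Do one subtask per call, using each reply's suggested next prompt."""
    results = []
    prompt: Optional[str] = first_prompt
    for _ in range(max_steps):
        if prompt is None:
            break  # the model reported that no subtasks remain
        reply = call_model(
            "Do only the subtask below. End your reply with "
            f"'{NEXT_MARKER}' followed by the prompt for the next subtask, "
            "or leave it empty if none remain.\n\n" + prompt
        )
        result, prompt = split_reply(reply)
        results.append(result)
    return results
```

The `max_steps` cap is there so a model that never stops proposing follow-ups cannot run forever; the same loop could hand each next prompt to a different agent instead of the same one.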
> If you ask it to do something laborious, like review a bunch of websites for specific content, it will constantly give up, providing you information on how you can continue the process yourself to save time. It's maddening.