I’ve had some success running Claude CLI agents at some scale with a memory architecture, but it roughly reads like a massive index: the agent crawls through a trail of breadcrumbs to piece together all the info it needs to do a task. It’s fairly tedious to maintain, and it’s always a battle to keep context size and token spend reasonable.
I’d say it’s like 85% reliable on any given task, and since I supervise it, this is good enough for me. But for something to be useful autonomously, that number needs to be several 9’s, and we’re nowhere near that yet.
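To put rough numbers on why per-task reliability matters so much without supervision, here is a back-of-the-envelope sketch. It assumes tasks fail independently and that an unsupervised run only counts as a success if every task in the chain succeeds; both are simplifications of mine, and `chain_success` is just an illustrative helper, not part of any tool.

```python
def chain_success(per_task: float, n_tasks: int) -> float:
    """Probability that n_tasks tasks all succeed.

    Assumption: each task succeeds independently with probability per_task.
    """
    return per_task ** n_tasks


# Compare my rough 85% figure against two and three nines.
for p in (0.85, 0.99, 0.999):
    row = [round(chain_success(p, n), 3) for n in (1, 10, 50)]
    print(f"per-task {p}: success over 1/10/50 tasks = {row}")
```

At 85% per task, a chain of 10 tasks with nobody checking comes out right only about 20% of the time under these assumptions, which is why supervision is doing so much of the work.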
I’m currently watching someone try, and fail, to roll OpenClaw out at scale in an org. They believe in it so much that it’s very difficult to convince them it will not work, even with glaring evidence staring them in the face.