That’s bad because a malicious site you merely visit can trick your browser into sending one of those requests with your own credentials attached (classic CSRF). CORS only stops the attacker’s page from reading the response; it doesn’t stop the request from reaching the backend and changing state.
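A minimal sketch of the problem, assuming a hypothetical endpoint that trusts a session cookie alone (the URL and parameters are made up):

```ts
// Runs on an attacker-controlled page. With mode: "no-cors" the browser will
// happily send this "simple" POST, cookies included (unless they're
// SameSite-restricted), even though the attacker can never read the response.
// CORS blocks the read, not the request, so the server-side state change still
// happens unless the backend also checks a CSRF token or the Origin header.
async function csrfSketch(): Promise<void> {
  await fetch("https://bank.example/transfer", {
    method: "POST",
    mode: "no-cors",
    credentials: "include",
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body: "to=attacker&amount=1000",
  });
}

csrfSketch();
```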
FYI, Kilocode has low credibility. They’ve been blasting AI subreddits with clickbaity ads and posts, sometimes claiming things that are outright false.
As far as spend per dev goes, I can’t even manage to use up the limits on my $100 Claude plan. It gets everything done and I run out of things to ask it. Considering that the models will get better and cheaper over time, I personally don’t see a future where I need to spend much more than $100 a month.
Lots of signs point to the conclusion that the Opus and Sonnet models are fundamentally better at coding, tool usage, and general problem solving across long contexts. There is some kind of secret sauce in the way they train the models; Dario has mentioned in interviews that this strength is one of the company's closely guarded secrets.
And I don't think we have a great eval benchmark that measures this capability precisely yet. SWE-bench seems pretty good, but there are already a lot of anecdotal reports that Claude is still better at coding than GPT-5, despite the two having similar SWE-bench scores.
I've been testing AI as a beta reader for >100k-word novels, and I can tell you with 100% certainty that Claude gets confused about things across long contexts much sooner than either o3/GPT-5 or Gemini 2.5. In my experience Gemini 2.5 and o3/GPT-5 run neck and neck until around 80-100k tokens, then Gemini 2.5 starts to pull ahead, and by 150k tokens it's absolutely dominant. Claude is respectable but clearly in third place.
Yeah, agreed that the benchmarks don't really seem to reflect the community consensus. I wonder if part of it is the tighter symbiosis between the agent (Claude Code) and the Opus and Sonnet models it uses, which are supposedly fine-tuned on Claude Code tool calls? But yes, there is probably some additional secret sauce in the training, perhaps to do with RL on multi-step problems...
I've had days where it really does feel like 5x or 10x...
Here's what the 5x to 10x flow looks like:
1. Plan out the tasks (maybe with the help of AI)
2. Open a Git worktree, launch Claude Code in the worktree, and give it the task, with instructions to open a GitHub pull request when it's done. Then let it work: it has access to a whole bunch of local tools, test suites, and lots of documentation. (A rough sketch of this step is below the list.)
3. While that terminal is running, I go start more tasks. Ideally there are 3 to 5 tasks running at a time.
4. Periodically check on the tabs to make sure they haven't gotten stuck or lost their minds.
5. Finally, review the finished pull requests and merge them when they're ready. If a PR has issues, go back to the related chat and tell the agent to keep working on it.
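Here's roughly what step 2 could look like scripted. This assumes the `claude` CLI's non-interactive prompt flag (`-p`); the task names and prompts are illustrative, and for unattended runs you'll likely need to pre-approve the tools the agent is allowed to use:

```ts
// launch-task.ts: spin up one Claude Code agent per task in its own worktree.
import { execFileSync, spawn } from "node:child_process";

function launchTask(taskName: string, taskPrompt: string): void {
  const dir = `../wt-${taskName}`;

  // One worktree (and branch) per task keeps parallel agents from stepping on
  // each other's files.
  execFileSync("git", ["worktree", "add", dir, "-b", `task/${taskName}`]);

  // Fire and forget; the prompt tells the agent to open a PR when it finishes.
  const child = spawn(
    "claude",
    ["-p", `${taskPrompt}\n\nWhen finished, push the branch and open a GitHub pull request.`],
    { cwd: dir, stdio: "inherit" }
  );
  child.on("exit", (code) => console.log(`${taskName} exited with code ${code}`));
}

launchTask("rate-limit", "Add per-user rate limiting to the API, with tests.");
launchTask("csv-export", "Add CSV export to the reports page, with tests.");
```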
With that flow it's reasonable to merge 10 to 20 pull requests every day. I'm sure someone will respond "oh, just because there are a lot of pull requests doesn't mean you are productive!" I don't know how to prove that the PRs are productive other than to say that each one is basically equivalent to what one human would do in one small PR.
A few notes about the flow:
- For the AI to work independently, it really needs tasks of easy to medium difficulty. There are definitely 'hard' tasks that need a lot of human attention to get done successfully.
- This does take a lot of initial investment in tooling and documentation. Basically every "best practice" or code pattern that you want used in the project must be written down. And the tests must be as extensive as possible.
Anyway, the linked article talks about the time it takes to review pull requests. I don't think it needs to take that long, because you can automate a lot:
- Code style issues are handled entirely by the linter.
- Other checks, like unit test coverage, can run on the PR as well.
- When a ton of automated tests run on every PR, that also reduces how much you have to worry about as a code reviewer.
With all those checks in place, I think reviewing a PR can be pretty fast. As the human you just need to scan for really bad code patterns, and maybe zoom in on highly critical areas, but most of the code can be eyeballed pretty quickly.
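As one example, a coverage gate can be a tiny script wired into CI. This sketch assumes your test runner emits an Istanbul-style coverage/coverage-summary.json; the path and threshold are made up:

```ts
// coverage-gate.ts: fail the PR check if line coverage drops below a threshold.
import { readFileSync } from "node:fs";

const THRESHOLD = 90; // required percentage of covered lines

const summary = JSON.parse(
  readFileSync("coverage/coverage-summary.json", "utf8")
);
const pct: number = summary.total.lines.pct;

if (pct < THRESHOLD) {
  console.error(`Line coverage ${pct}% is below the required ${THRESHOLD}%`);
  process.exit(1); // a non-zero exit fails the CI job, which fails the PR check
}
console.log(`Coverage OK: ${pct}%`);
```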
What type of software are you building with this workflow? Does it handle PII, need data to be exact, or have any security implications?
Maybe I just don't have a great imagination, but it's very hard for me to see how you can basically automate the review process for anything that is business-critical or carries legal risk.
Mainly working on a dev tool / SaaS app right now. The PII is user names and email addresses.
On the security layer, I wrote that code mostly by hand, with some 'pair programming' with Claude to get the OAuth handling working.
When I have the agent working on tasks independently, it's usually working on feature-specific business logic in the API and frontend. For that work it has a lot of standard helper functions to read/write data for the currently authenticated user. With that scaffolding it's harder (not impossible) for the bot to mess up.
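To illustrate the kind of scaffolding I mean (all names here are hypothetical, and the in-memory array just stands in for a real database):

```ts
// Feature code, often agent-written, only touches rows through helpers that
// are already bound to the authenticated user, so it can't forget the
// ownership check or accept an ownerId from request input.
import { randomUUID } from "node:crypto";

interface Project {
  id: string;
  ownerId: string;
  name: string;
}

// Stand-in for the real data layer.
const projects: Project[] = [];

export function scopedData(currentUserId: string) {
  return {
    // Every read is filtered by ownerId.
    listProjects: async (): Promise<Project[]> =>
      projects.filter((p) => p.ownerId === currentUserId),

    // Every write stamps ownerId from the session, never from the request body.
    createProject: async (name: string): Promise<Project> => {
      const project: Project = { id: randomUUID(), ownerId: currentUserId, name };
      projects.push(project);
      return project;
    },
  };
}
```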
It's definitely a concern, though; I've been brainstorming some creative ways to add extra tests and more auditing to look out for security issues. Overall I think the key to extremely fast development is an extremely good testing strategy.
I appreciate the helpful reply, honestly. One other question - are people currently using the app?
I think where I've become very hesitant is that a lot of the programs I touch handle customer data belonging to clients with pretty hard-nosed legal teams. So it's quite difficult for me to imagine not reviewing the production code by hand.
Yeah the outrage is a little artificial and definitely premature.
Some facts for sanity:
1- The poster of this blog article is Kilocode, who makes a (worse) competitor to Claude Code. They are definitely capitalizing on this drama as much as they can. I’ve been getting hit by Reddit ads from Kilocode all day, all blasting Anthropic, with the false claim that their plan was "unlimited".
2- No one has any idea yet what the new limits will be, or how much usage it actually takes to land in the affected top 5%. The limits go into effect in a few days; we'll see then whether all the drama was warranted.
There have been a ton of ‘service overloaded’ errors this week, so it makes sense that they had to adjust the limits.
Personally I’ve never hit a usage limit on the $100 plan even when running several Claude tabs at once. I can’t imagine how people can max out the $200 plan.
The AI definitely gets confused when there is a lot of stuff happening. It helps if you make the commands as easy as possible: change 'make' so that '-j8' is the default, add a script like make-check.sh that runs 'make -j check', or add an MCP server that exposes the most common actions (tell the AI to write the MCP server for you).
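For the MCP route, something like this might do it. This is a rough sketch based on my understanding of the TypeScript SDK (@modelcontextprotocol/sdk), so check the current SDK docs for the exact API; the tool name and commands are just examples:

```ts
// project-tools.ts: expose the "blessed" build/test command as a single MCP tool.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

const server = new McpServer({ name: "project-tools", version: "0.1.0" });

// One tool that always builds and tests with the right flags, so the agent
// never has to remember them.
server.tool("run_checks", "Build the project and run the test suite", async () => {
  try {
    const { stdout, stderr } = await run("make", ["-j8", "check"]);
    return { content: [{ type: "text", text: stdout + stderr }] };
  } catch (err: any) {
    // A failing build/test exits non-zero; return its output instead of crashing.
    return { content: [{ type: "text", text: `${err.stdout ?? ""}${err.stderr ?? ""}` }] };
  }
});

await server.connect(new StdioServerTransport());
```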
Hooks would probably help too; I think you could add a hook that auto-rejects the call when the bot runs the wrong thing.
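Something like this could work as a PreToolUse hook on the Bash tool. This assumes my reading of Claude Code's hook contract (JSON payload on stdin, exit code 2 to block with stderr fed back to the model), so verify against the hooks docs; the policy itself is just an example:

```ts
// pretooluse-hook.ts: reject bare `make` invocations and point the agent at
// the wrapper script instead.
import { readFileSync } from "node:fs";

const payload = JSON.parse(readFileSync(0, "utf8")); // fd 0 = stdin
const command: string = payload?.tool_input?.command ?? "";

if (/^make(\s|$)/.test(command) && !command.includes("-j")) {
  // stderr is shown to the model so it knows what to do instead.
  console.error("Don't run bare `make`; use ./make-check.sh (it runs `make -j check`).");
  process.exit(2); // exit code 2 = block this tool call
}

process.exit(0); // allow everything else
```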
I'm pretty sure that the $100 / $200 plans are a net profit for Anthropic. Most users (including myself) don't come close to the usage limit.
I've been trying to max out the $100 plan and I have a problem where I run out of stuff to ask it to do. Even when I try to have multiple side projects at once, Claude just gets stuff done, and then sits idle.