That’s bad because a malicious site you merely visit can trick your browser into sending one of those requests with your own credentials attached (classic CSRF). CORS only stops the attacker’s page from reading the response; it doesn’t stop the request from reaching the backend and changing state.
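A minimal sketch of the problem, assuming a hypothetical endpoint that trusts a session cookie alone (the URL and parameters are made up):

```ts
// Runs on an attacker-controlled page. With mode: "no-cors" the browser will
// happily send this "simple" POST, cookies included (unless they're
// SameSite-restricted), even though the attacker can never read the response.
// CORS blocks the read, not the request, so the server-side state change still
// happens unless the backend also checks a CSRF token or the Origin header.
async function csrfSketch(): Promise<void> {
  await fetch("https://bank.example/transfer", {
    method: "POST",
    mode: "no-cors",
    credentials: "include",
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body: "to=attacker&amount=1000",
  });
}

csrfSketch();
```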
FYI, Kilocode has low credibility. They’ve been blasting AI subreddits with clickbaity ads and posts, sometimes claiming things that are outright false.
As far as spend per dev goes, I can’t even manage to use up the limits on my $100 Claude plan. It gets everything done and I run out of things to ask it. Considering that the models will get better and cheaper over time, I personally don’t see a future where I need to spend much more than $100 a month.
Lots of signs point to the conclusion that the Opus and Sonnet models are fundamentally better at coding, tool usage, and general problem solving across long contexts. There is some kind of secret sauce in the way they train the models; Dario has mentioned in interviews that this strength is one of the company's closely guarded secrets.
And I don't think we have a great eval benchmark that measures this capability precisely yet. SWE-bench seems pretty good, but there are already a lot of anecdotal reports that Claude is still better at coding than GPT-5, despite the two having similar SWE-bench scores.
I've been testing AI as a beta reader for >100k-word novels, and I can tell you with 100% certainty that Claude gets confused about things across long contexts much sooner than either o3/GPT-5 or Gemini 2.5. In my experience Gemini 2.5 and o3/GPT-5 run neck and neck until around 80-100k tokens, then Gemini 2.5 starts to pull ahead, and by 150k tokens it's absolutely dominant. Claude is respectable but clearly in third place.
Yeah, agreed that the benchmarks don't really seem to reflect the community consensus. I wonder if part of it is the tighter symbiosis between the agent (Claude Code) and the Opus and Sonnet models it uses, which are supposedly fine-tuned on Claude Code tool calls? But yes, there is probably some additional secret sauce in the training, perhaps to do with RL on multi-step problems...
I've had days where it really does feel like 5x or 10x...
Here's what the 5x to 10x flow looks like:
1. Plan out the tasks (maybe with the help of AI)
2. Open a Git worktree, launch Claude Code in the worktree, and give it the task, with instructions to open a GitHub pull request when it's done. Then let it work: it has access to a whole bunch of local tools, test suites, and lots of documentation. (A rough sketch of this step is below the list.)
3. While that terminal is running, I go start more tasks. Ideally there are 3 to 5 tasks running at a time.
4. Periodically check on the tabs to make sure they haven't gotten stuck or lost their minds.
5. Finally, review the finished pull requests and merge them when they're ready. If a PR has issues, go back to the related chat and tell the agent to keep working on it.
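Here's roughly what step 2 could look like scripted. This assumes the `claude` CLI's non-interactive prompt flag (`-p`); the task names and prompts are illustrative, and for unattended runs you'll likely need to pre-approve the tools the agent is allowed to use:

```ts
// launch-task.ts: spin up one Claude Code agent per task in its own worktree.
import { execFileSync, spawn } from "node:child_process";

function launchTask(taskName: string, taskPrompt: string): void {
  const dir = `../wt-${taskName}`;

  // One worktree (and branch) per task keeps parallel agents from stepping on
  // each other's files.
  execFileSync("git", ["worktree", "add", dir, "-b", `task/${taskName}`]);

  // Fire and forget; the prompt tells the agent to open a PR when it finishes.
  const child = spawn(
    "claude",
    ["-p", `${taskPrompt}\n\nWhen finished, push the branch and open a GitHub pull request.`],
    { cwd: dir, stdio: "inherit" }
  );
  child.on("exit", (code) => console.log(`${taskName} exited with code ${code}`));
}

launchTask("rate-limit", "Add per-user rate limiting to the API, with tests.");
launchTask("csv-export", "Add CSV export to the reports page, with tests.");
```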
With that flow it's reasonable to merge 10 to 20 pull requests every day. I'm sure someone will respond "oh, just because there are a lot of pull requests doesn't mean you are productive!" I don't know how to prove that the PRs are productive other than to say that each one is basically equivalent to what one human would do in one small PR.
A few notes about the flow:
- For the AI to work independently, it really needs tasks of easy to medium difficulty. There are definitely 'hard' tasks that need a lot of human attention to get done successfully.
- This does take a lot of initial investment in tooling and documentation. Basically every "best practice" or code pattern that you want used in the project must be written down. And the tests must be as extensive as possible.
Anyway, the linked article talks about the time it takes to review pull requests. I don't think it needs to take that long, because you can automate a lot:
- Code style issues are handled entirely by the linter.
- Other checks, like unit test coverage, can run on the PR as well.
- When a ton of automated tests run on every PR, that also reduces how much you have to worry about as a code reviewer.
With all those checks in place, I think reviewing a PR can be pretty fast. As the human you just need to scan for really bad code patterns, and maybe zoom in on highly critical areas, but most of the code can be eyeballed pretty quickly.
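As one example, a coverage gate can be a tiny script wired into CI. This sketch assumes your test runner emits an Istanbul-style coverage/coverage-summary.json; the path and threshold are made up:

```ts
// coverage-gate.ts: fail the PR check if line coverage drops below a threshold.
import { readFileSync } from "node:fs";

const THRESHOLD = 90; // required percentage of covered lines

const summary = JSON.parse(
  readFileSync("coverage/coverage-summary.json", "utf8")
);
const pct: number = summary.total.lines.pct;

if (pct < THRESHOLD) {
  console.error(`Line coverage ${pct}% is below the required ${THRESHOLD}%`);
  process.exit(1); // a non-zero exit fails the CI job, which fails the PR check
}
console.log(`Coverage OK: ${pct}%`);
```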
What type of software are you building with this workflow? Does it handle PII, need data to be exact, or have any security implications?
Maybe I just don't have a great imagination, but it's very hard for me to see how you can basically automate the review process for anything that is business-critical or carries legal risk.
Mainly working on a dev tool / SaaS app right now. The PII is user names and email addresses.
On the security layer, I wrote that code mostly by hand, with some 'pair programming' with Claude to get the OAuth handling working.
When I have the agent working on tasks independently, it's usually working on feature-specific business logic in the API and frontend. For that work it has a lot of standard helper functions to read/write data for the currently authenticated user. With that scaffolding it's harder (not impossible) for the bot to mess up.
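To illustrate the kind of scaffolding I mean (all names here are hypothetical, and the in-memory array just stands in for a real database):

```ts
// Feature code, often agent-written, only touches rows through helpers that
// are already bound to the authenticated user, so it can't forget the
// ownership check or accept an ownerId from request input.
import { randomUUID } from "node:crypto";

interface Project {
  id: string;
  ownerId: string;
  name: string;
}

// Stand-in for the real data layer.
const projects: Project[] = [];

export function scopedData(currentUserId: string) {
  return {
    // Every read is filtered by ownerId.
    listProjects: async (): Promise<Project[]> =>
      projects.filter((p) => p.ownerId === currentUserId),

    // Every write stamps ownerId from the session, never from the request body.
    createProject: async (name: string): Promise<Project> => {
      const project: Project = { id: randomUUID(), ownerId: currentUserId, name };
      projects.push(project);
      return project;
    },
  };
}
```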
It's definitely a concern, though; I've been brainstorming some creative ways to add extra tests and more auditing to look out for security issues. Overall I think the key to extremely fast development is an extremely good testing strategy.
I appreciate the helpful reply, honestly. One other question - are people currently using the app?
I think where I've become very hesitant is that a lot of the programs I touch handle customer data belonging to clients with pretty hard-nosed legal teams. So it's quite difficult for me to imagine not reviewing the production code by hand.
Yeah the outrage is a little artificial and definitely premature.
Some facts for sanity:
1- The poster of this blog article is Kilocode, who makes a (worse) competitor to Claude Code. They are definitely capitalizing on this drama as much as they can. I’ve been getting hit by Reddit ads from Kilocode all day, all blasting Anthropic, with the false claim that their plan was "unlimited".
2- No one has any idea yet what the new limits will be, or how much usage it actually takes to land in the affected top 5%. The limits go into effect in a few days; we'll see then whether all the drama was warranted.
There have been a ton of ‘service overloaded’ errors this week, so it makes sense that they had to adjust the limits.
Personally I’ve never hit a usage limit on the $100 plan even when running several Claude tabs at once. I can’t imagine how people can max out the $200 plan.
The AI definitely gets confused when there is a lot of stuff happening. It helps if you make the commands as easy as possible: change 'make' so that '-j8' is the default, add a script like make-check.sh that runs 'make -j check', or add an MCP server that exposes the most common actions (tell the AI to write the MCP server for you).
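For the MCP route, something like this might do it. This is a rough sketch based on my understanding of the TypeScript SDK (@modelcontextprotocol/sdk), so check the current SDK docs for the exact API; the tool name and commands are just examples:

```ts
// project-tools.ts: expose the "blessed" build/test command as a single MCP tool.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

const server = new McpServer({ name: "project-tools", version: "0.1.0" });

// One tool that always builds and tests with the right flags, so the agent
// never has to remember them.
server.tool("run_checks", "Build the project and run the test suite", async () => {
  try {
    const { stdout, stderr } = await run("make", ["-j8", "check"]);
    return { content: [{ type: "text", text: stdout + stderr }] };
  } catch (err: any) {
    // A failing build/test exits non-zero; return its output instead of crashing.
    return { content: [{ type: "text", text: `${err.stdout ?? ""}${err.stderr ?? ""}` }] };
  }
});

await server.connect(new StdioServerTransport());
```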
Hooks would probably help too; I think you could add a hook that auto-rejects the call when the bot runs the wrong thing.
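Something like this could work as a PreToolUse hook on the Bash tool. This assumes my reading of Claude Code's hook contract (JSON payload on stdin, exit code 2 to block with stderr fed back to the model), so verify against the hooks docs; the policy itself is just an example:

```ts
// pretooluse-hook.ts: reject bare `make` invocations and point the agent at
// the wrapper script instead.
import { readFileSync } from "node:fs";

const payload = JSON.parse(readFileSync(0, "utf8")); // fd 0 = stdin
const command: string = payload?.tool_input?.command ?? "";

if (/^make(\s|$)/.test(command) && !command.includes("-j")) {
  // stderr is shown to the model so it knows what to do instead.
  console.error("Don't run bare `make`; use ./make-check.sh (it runs `make -j check`).");
  process.exit(2); // exit code 2 = block this tool call
}

process.exit(0); // allow everything else
```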
I'm pretty sure that the $100 / $200 plans are a net profit for Anthropic. Most users (including myself) don't come close to the usage limit.
I've been trying to max out the $100 plan and I have a problem where I run out of stuff to ask it to do. Even when I try to have multiple side projects at once, Claude just gets stuff done, and then sits idle.