I asked Claude Code to remove jQuery. It failed miserably

simonw · 2026-02-13T14:12:52 1770991972

How did you have it testing its code changes? Did you tell it to use Playwright or agent-browser or anything like that?

If coding agents can't test the code as they're editing it they're no different from pasting your entire codebase into ChatGPT and crossing your fingers.

At one point you mention it hadn't run "npm test" - did it run that once you directly told it to?

I start every one of my coding agent sessions with "run uv run pytest" purely to confirm that it can run the tests and seed the idea with it that tests exist and matter to me.

Your post ends with a screenshot showing you debating a C# syntax thing with the bot. I recommend telling it "write code that demonstrates if this works or not" in cases like that.

aurareturn · 2026-02-13T14:19:17 1770992357

  If coding agents can't test the code as they're editing it they're no different from pasting your entire codebase into ChatGPT and crossing your fingers.

Out of curiosity, how do you get Claude Code or Codex to actually do this? I asked this question here before:

https://news.ycombinator.com/item?id=46792066

simonw · 2026-02-13T14:34:19 1770993259

I don't use CLAUDE.md, I instead use simple token-efficient conventions.

Most importantly all of my Python projects use a pyproject.toml file with this pattern:

  [dependency-groups]
  dev = ["pytest"]

Which means I can tell the agent:

  Run "uv run pytest"

And it will run the tests - without first needing to setup a virtual environment or install dependencies or anything like that. I wrote more about that pattern here: https://til.simonwillison.net/uv/dependency-groups

For more complex test suites I'll give it more detailed instructions.

For testing web apps I used to tell it "use playwright" or "use playwright Python".

I'm currently experimenting with my own simple CLI browser automation tool. This means I can tell it:

  Run "uvx rodney --help" and then use 
  rodney to test this change

The --help output tells it everything it needs to use the tool - here's that document in the repo: https://github.com/simonw/rodney/blob/10b2a6c81f9f3fb36ce4d1...

I've recently started having the bots "manually" test changes with a new tool I built called Showboat. It's less than a week old but it's so far been working really well: https://simonwillison.net/2026/Feb/10/showboat-and-rodney/

maleldil · 2026-02-14T00:19:33 1771028373

If you don't use CLAUDE.md, do you tell the agent to run pytest every single session?

SJMG · 2026-02-13T14:28:50 1770992930

Instruct it to test as it goes along. Add whatever testing base command to your list of trusted tools.

littlecranky67 · 2026-02-13T14:34:55 1770993295

Not surprised. The amount of jQuery pasta code from the 2010s the models are trained on probably makes it look like all jQuery-specific stuff is plain JavaScript. Plus in my experience (and lucky for me as a mostly FE dev) AIs suck at all things frontend (relative to other scenarios). They just never got trained on the real, rendered output in the browser so they can't "see" and complete the feedback loop during training. Most tests in JavaScript projects generate <div>-soup - so the AI gets trained on that output as feedback, vs. the actual browser rendered image.

re-thc · 2026-02-13T14:00:29 1770991229

You don't remove jQuery. EVER. You'll lose all the $.

SAI_Peregrinus · 2026-02-13T16:38:48 1771000728

You can use any POSIX shell to get lots of $ back in your code.

re-thc · 2026-02-13T16:58:09 1771001889

That's not Webscale.

lenerdenator · 2026-02-13T14:23:58 1770992638

> Why AI is so bad at vanilla JS and HTML, when there's no React/Vue in a project?

Because we're still paying for Brendan Eich's mistakes 30 years later (though Brendan isn't, apparently), and even an LLM trained on an unfathomably-large corpus of code by experts at hundreds of millions of dollars of expense can't unscrew it. What, like, even is a language's standard library, man?

> The moment you point it at a real, existing codebase - even a small one - everything falls apart

That's not been my experience with running Claude to create production code. Plan mode is absolutely your friend, as is tuning your memory files and prompts. You'll need to do code reviews as before, and when it makes changes that you don't like (like patching in unit tests), you need to correct it.

Also, we use hexagonal architecture, so there are clean patterns for it to gather context from. FWIW, I work in Python, not JS, so when Claude was trained on it, there weren't twenty wildly different flavor-of-the-week-fifteen-years-ago frameworks and libraries to confuse it.

If JS sucks to write as a human, it will suck even more to write as a LLM.

coldcode · 2026-02-13T14:08:57 1770991737

For any AI post, there seems to be that one person for whom it worked great, and a whole lot for whom it didn't. Your mileage may vary...

Some things AI does well, for many things it may not be worth the effort entailed, and for some it downright sucks and may even be harmful. The question is: will it ever change the curve to where it is useful most of the time?

mingus88 · 2026-02-13T15:22:23 1770996143

Like any tool, you get better at using it. YMMV indeed.

The author of this article could probably have, for example, written most of this into the project’s Claude.md and the AI would learn what not to do.

Instead they wrote it up as a blog post which is unsurprisingly not going to net quality software.

Having some way for Claude to test what it wrote is critical as well. It will learn on its own very fast if it can see the error messages and iterate on it like any other developer would

Sounds like the author had tests that Claude never ran. Sounds misconfigured to me. Again, did the author learn how to use the tool?

rado · 2026-02-13T14:07:29 1770991649

Refactoring jQuery to vanilla JS was one of my first AI dev experiences a couple of years ago and it was great.

aadarshkumaredu · 2026-02-13T19:40:38 1771011638

Removing jQuery isn’t a mechanical find-and-replace task.

jQuery is often deeply intertwined with:

• Event delegation patterns
• Implicit DOM readiness assumptions
• Legacy plugin ecosystems
• Cross-browser workarounds

An LLM can translate $(selector).addClass() to element.classList.add(). But it struggles when behavior depends on subtle timing, plugin side effects, or undocumented coupling.

The hard part isn’t syntax replacement. It’s preserving behavioral invariants across the app.

AI is decent at scaffolding migrations, but for legacy front-end codebases, you still need test coverage and incremental refactors. Otherwise it’s easy to “remove jQuery” and silently break UX flows.

kittikitti · 2026-02-13T14:19:09 1770992349

Removing jQuery is a great task and one I hope to implement in some of my JavaScript code bases. Thank you for this post. I don't know exactly why but I've found these agents to be less useful when it's counterintuitive from popular coding methods. Although there are many reasons why replacing jQuery is a great idea, coding agents may fail on this because so much of their training data requires jQuery. For example, many top comments on StackOverflow utilize jQuery, perhaps to address the same logic you are trying to replace.

Arubis · 2026-02-13T14:12:17 1770991937

That sounds like a realistic outcome for a real engineer, too.

simonw · 2026-02-13T14:42:47 1770993767

Were you using --dangerously-skip-permissions or were you approving every edit and every tool use?

Which tools did it use?

tommy_axle · 2026-02-13T15:01:54 1770994914

If doing it directly fails (not surprising), wouldn't the next thing (maybe the first thing) to do be to have AI write a codemod to do what needs to be done, then apply the codemod? Then all you need to do is get the codemod right and apply it to as many files as you need. Seems much more predictable and context-efficient.
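
A minimal sketch of what such a codemod might look like, written in Python for illustration. Everything here is an assumption rather than anything from the original post: the function name rewrite_class_calls is hypothetical, and the rule is deliberately narrow. It only touches $("selector").addClass("name") and .removeClass("name") calls with string literals and a single class name, and it skips chained calls, since forEach returns undefined rather than a jQuery object. A regex stands in for a real AST-based tool here, and selectors relying on jQuery-only extensions such as :visible would still need manual review.

```python
import re

# Hypothetical, deliberately narrow rule. Matches only calls of the form
#   $("selector").addClass("name")  or  $("selector").removeClass("name")
# with string-literal arguments and exactly one class name.
PATTERN = re.compile(
    r"""\$\(\s*(?P<sel>"[^"]*"|'[^']*')\s*\)     # the $("selector") part
        \.(?P<method>addClass|removeClass)        # the jQuery method
        \(\s*(?P<cls>"[^"\s]+"|'[^'\s]+')\s*\)    # one class name, no spaces
        (?!\s*\.)                                 # skip chained calls
    """,
    re.VERBOSE,
)

# jQuery method name -> classList method name
CLASSLIST_METHOD = {"addClass": "add", "removeClass": "remove"}


def rewrite_class_calls(source: str) -> tuple[str, int]:
    """Return the rewritten source and the number of call sites changed."""

    def replace(match: re.Match) -> str:
        method = CLASSLIST_METHOD[match["method"]]
        return (
            f"document.querySelectorAll({match['sel']})"
            f".forEach((el) => el.classList.{method}({match['cls']}))"
        )

    return PATTERN.subn(replace, source)


if __name__ == "__main__":
    sample = '$(".item").addClass("active");'
    rewritten, count = rewrite_class_calls(sample)
    print(count, rewritten)
```

Returning the count alongside the new source makes it easy to report how many call sites were touched per file, and anything the pattern skips stays as jQuery for a human or a later rule to handle.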

simonw · 2026-02-13T15:04:03 1770995043

This should work really well, but you still need to first ensure the agent is able to test the code (both through automated tests and "manually" poking at it) so it can verify the changes made actually work.

lenerdenator · 2026-02-13T14:15:28 1770992128

jQuery simply turned the tables and executed a `$( ".Claude_Code" ).remove();`. Now Anthropic's services are down across several regions and emergency meetings are being held with stakeholders.

jQuery: It's Going Absolutely Nowhere™

dana321 · 2026-02-13T14:15:03 1770992103

It's a slot machine, you need to revert the changes and try again!

cbg0 · 2026-02-13T14:16:38 1770992198

Seeing some of the pictures where OP says "MOTHERFUCKER" in the prompts and how simplistic some of the questions provided are gives me a feeling that CC is being used incorrectly.

My experience with 4.6 has been that it gobbles up tokens like crazy but it's pretty smart otherwise. Even the latest LLMs need a lot of context to know what they're working on, which versions to target, and access to some MCP like Context7 to get up-to-date documentation (especially for JS/TS).

My non-tech friends have a tendency to talk to AI like a person and then complain about the quality of the answers and I always tell them: ask your question, with one or two follow-ups max then start a new conversation. Also, provide as much relevant context as possible to get the best answer, even if it seems obvious. I'd expect a SWE to already be aware of this stuff.

I've been able to find obscure edge cases thanks to Claude and I've also had it produce code that does the opposite of what I asked even with a clear prompt, but that's the nature of LLMs.

padjo · 2026-02-13T14:06:51 1770991611

This sounds like something I would have done with sed

gitaarik · 2026-02-14T06:56:08 1771052168

This sounds like something my AI agent would say

andai · 2026-02-13T22:20:33 1771021233

>Not exactly rewriting a fucking C compiler in Rust from scratch or whatever they claimed it did.

Proof that web dev is harder than compiler dev ;)

For real though, I do think web has a higher cognitive load than other types of programming. I always thought it was weird, but the stuff people said was hard (making an online multiplayer game) turned out to be way easier than the stuff people said was easy (I still have no idea how React works after learning it 9 times).

Also I think compilers are surprisingly straightforward. At least unoptimized ones. It's about translation, which is basically a functional thing. Whereas frontend web dev is all about infinite global implementation details screwing each other over in real time with race conditions.

Anon1096 · 2026-02-13T14:18:25 1770992305

> Also, why not run "npm run test" at some point? We have tons of tests. I even have an integration test that crawls the entire fucking app recursively link-by-link in a headless browser and reports on JS errors. CLAUDE.md has all the info.

I'm a little baffled by this post. The author claims to have "Wrote a comprehensive CLAUDE.md with detailed instructions." and yet didn't have "run the tests" anywhere? I realize this post is going to be a playground for bashing on AI but I just wish the prompt was published or, even better, if it's open source, let other people try. Seems like the perfect case to throw Claude Code at in a wiggum loop overnight.

mingus88 · 2026-02-13T15:29:29 1770996569

Exactly, if Claude is making these types of mistakes, write a better claude.md instead of a blog post.

My company uses an obscure DSL with a name shared with a popular OSS project. Claude was worthless because it kept suggesting code in that other language.

Well, we wrote an MCP so Claude could run and test its code and reference the language docs. It’s amazing now. It makes mistakes like this post and then just fixes it and tests again.

pllbnk · 2026-02-13T23:11:07 1771024267

Your quoted excerpt implies that CLAUDE.md had this information. Having used Claude Code more than enough I have faced so many issues like the blog's author that I could have written a very similar post (I am not an FE dev though).

suddenlybananas · 2026-02-13T16:58:11 1771001891

It's super intelligent but it can't be bothered to run tests unless specifically told to?

simonw · 2026-02-13T17:42:33 1771004553

Personally I prefer my agents not to run random commands on my machine without me telling them to first.

Imagine you just cloned some random project from GitHub and fired up Claude Code in that folder, but it turned out to be malicious and running 'npm test' stole all your files.

suddenlybananas · 2026-02-14T08:56:23 1771059383

If it's super intelligent, surely it could glance at tests before running them and figure whether it was malicious or not.

simonw · 2026-02-14T13:44:44 1771076684

Tests have dependencies. Crawling all of those dependencies to check for malicious code could require inspecting millions of lines of code, if you could even obtain the code.

It's also beginning to sound like needing to solve the halting problem.

suddenlybananas · 2026-02-14T17:11:26 1771089086

Come on man. You're being unserious here.

simonw · 2026-02-14T17:32:03 1771090323

I'm really not. You're the one arguing about a "super intelligent" strawman.

suddenlybananas · 2026-02-15T13:18:47 1771161527

Look, I know you have a lot invested in this project but I don't see why you think it is somehow unreasonable to expect an AI agent to run tests in a repository. You don't need super intelligence for that.

simonw · 2026-02-15T15:15:38 1771168538

Of course I want agents to run tests in a repository - I do that all the time.

I don't want the agent to run tests in a new repository until I've given it the go-ahead to do that.

josefritzishere · 2026-02-13T14:23:23 1770992603

surprise factor zero.

q3k · 2026-02-13T13:53:34 1770990814

You're holding it wrong. I just spent 14 hours (high on coke) working with Claude to generate an agent orchestration framework that has already increased my output to 20x over just using Copilot. Adapt or you'll be left behind and forever part of the permanent underclass.

chasd00 · 2026-02-13T14:11:55 1770991915

That’s nothing. I used Claude Code to put together a totally new agent harness model architecture that can cook 30min brownies in only 20 minutes!

bogzz · 2026-02-13T14:17:52 1770992272

CDDOL is undoubtedly the future, it is just sad seeing all these negative comments. It's like those people don't even know they've been made redundant already.

It's not too late to jump on the Cocaine-Driven Development Orchestrated by LLMs train.

nananana9 · 2026-02-13T14:07:31 1770991651

Tomorrow you'll write 20 agent orchestration frameworks in 14 hours!

q3k · 2026-02-13T14:12:03 1770991923

Amen! I'm pissing blood faster than I can increase my credit card limit for token use, but we'll make it. The 200x (10x from LLM + 20x from orchestration) means that by the end of 2026 we'll all be building $1MM ARR side projects daily.

esseph · 2026-02-13T14:51:05 1770994265

I would love to subscribe to your newsletter to hear more about this topic.

neya · 2026-02-13T14:20:18 1770992418

I built a windmill with Claude. I created a skills.md and followed everything by the book. But now, I have to supply power to keep the windmill running. What am I doing wrong?

ladyprestor · 2026-02-13T14:29:37 1770992977

You didn't mention the $1M ARR!

xcubic · 2026-02-13T14:07:31 1770991651

Can you share details about this? Do you have a repo?

gherkinnn · 2026-02-13T14:09:46 1770991786

Doesn't coke come with mania?

Either way, OP is holding it wrong and vague hypebro comments like yours don't help either. Be specific.

Here's an example: I told Claude 4.5 Opus to go through our DB migration files and the ORM model definitions and point out any DB indexes we might be missing based on how the data is being accessed. It did so, ingested all the controllers as well and a short while later presented me with a list of missing indexes, ordered by importance and listing why each index would speed up reads and how to test the gains.

Now, I have no way of knowing how exhaustive the analysis was, but the suggestions it gave were helpful, Claude did not recommend over-indexing, and considered read vs write performance.

The equivalent work would have taken me a day, Claude gave me something helpful in a matter of minutes.

Now, I for one could not handle the information stream of 20 such analyses coming in. I can't even handle 2 large feature PRs in parallel. This is where I ask for more specifics.

dmbche · 2026-02-13T14:13:06 1770991986

Parent comment seems sarcastic

morkalork · 2026-02-13T14:48:02 1770994082

I believe it's in reference to things like this:

https://steve-yegge.medium.com/gas-town-emergency-user-manua...

weakfish · 2026-02-13T14:15:28 1770992128

Parent comment is a joke I think, but there’s something ironic (Poe’s law?) about it being possibly _not_ a joke

beepbooptheory · 2026-02-13T14:18:18 1770992298

Why go through all migration files if you're looking for missing indices in the present? That doesn't seem to make sense when you could just look at the schema as it stands? Either way, why would this take you a day? How many tables do you have?

SJMG · 2026-02-13T14:26:51 1770992811

There's a parenthetical offset about being high on coke for 14 hours. It's obviously a joke.

bogzz · 2026-02-13T14:20:35 1770992435

Sniped.

netdevphoenix · 2026-02-13T14:03:05 1770991385

For the oblivious: /s

snarf21 · 2026-02-13T14:18:58 1770992338

This one is a lot harder to tell because there are some AI bros who claim similar things but are completely serious. Even look at Show HN now: There used to be ~20-40 posts per day but now there are 20 per HOUR.

(Please oh please can we have a Show HN AI. I'm not interested in people's weekend vibe coded app to replace X popular tool. I want to check out cool projects where people invested their passion and time.)

defraudbah · 2026-02-13T14:02:53 1770991373

that's a pretty long time to be on someones cok

bdangubic · 2026-02-13T14:04:26 1770991466

the time you are on coke = the time there is coke around to be had :)

re-thc · 2026-02-13T14:12:55 1770991975

It’s Claude Coke

Insanity · 2026-02-13T14:22:55 1770992575

Well, it’ll definitely make you hallucinate!

aurareturn · 2026-02-13T13:58:32 1770991112

  The moment you point it at a real, existing codebase - even a small one - everything falls apart.

Not my experience. It excels in existing codebases too.

I often ask it "I have this bug. Why?" And it almost always figures it out and fixes it. Huge code base.

Codex user, not Claude Code.

netdevphoenix · 2026-02-13T14:06:54 1770991614

> Not my experience. It excels in existing codebases too.

Why don't you prove it?

1. Find an old large codebase in codeberg (avoiding the octopus for obvious reasons)

2. Video stream the session and make the LLM convo public

3. Ask your LLM to remove jQuery from the db and submit regular commits to a public remote branch

Then we will be able to judge if the evidence stands

aurareturn · 2026-02-13T14:09:15 1770991755

I don't have to prove it. I do it every single day at work in a real production codebase that my business relies on.

And I don't remove jQuery every day. Maybe the OP is right that Opus 4.6 sucks at removing jQuery. I don't know. I've never asked an AI to do it.

    The moment you point it at a real, existing codebase - even a small one - everything falls apart.

This statement is absolutely not true based on my experience. Codex has been amazing for me at existing code bases.

netdevphoenix · 2026-02-13T14:12:11 1770991931

Extraordinary claims require extraordinary evidence. "Works on my machine" ain't it.

aurareturn · 2026-02-13T14:15:23 1770992123

Is it an extraordinary claim that Opus 4.6 or GPT 5.3 works amazing on existing code bases in my experience?

That's funny. I feel like it's the opposite. Claiming that Opus 4.6 or GPT 5.3 fails as soon as you point them to an existing code base, big or small, is a much more extraordinary claim.

simonw · 2026-02-13T14:14:02 1770992042

What are the obvious reasons?

netdevphoenix · 2026-02-16T13:21:22 1771248082

I thought it would be obvious: OpenAI has used repos on GitHub as training data. Would be like testing someone using a past paper publicly available.

Are you planning on carrying out the experiment? Regardless of the outcome, it would be of value to developers.

simonw · 2026-02-16T15:53:57 1771257237

Why wouldn't they train on Codeberg too?

It's pretty hard to block automated uses of "git clone".

netdevphoenix · 2026-02-17T10:19:37 1771323577

Why would they? Github has 28 million public repos, Codeberg only hit 300k last year. Anyway, Codeberg was just a placeholder for 'repo source _less_ likely to be in their training data'. Codeberg was a quick candidate for a place to find a big old codebase with non-sensitive data.

It is indeed hard but the guys at Codeberg are certainly an order of magnitude better than Github as they opted out of the main AI crawlers, regularly block IPs known to belong to AI startups and they allow you to make your repos only be accessible to logged in users.

You seem to be going on a tangent here. Main point was about performing a well documented test anyway.

simonw · 2026-02-17T12:33:14 1771331594

My question about the "obvious" thing was genuine - it wasn't obvious to me.

bsaul · 2026-02-13T14:02:20 1770991340

Not my experience either, and I'm on Claude Code. I'd be really curious to see what went wrong in OP's case. Maybe too much indication? Could it be that it used a fast model instead of the deep ones?

aurareturn · 2026-02-13T14:06:25 1770991585

No, OP said he used the Max Opus 4.6.

Anyways, I think one area where Codex and Claude Code fall short is that they do not test the changes they made by using the app.

In this case, the LLM should ideally render the page in a real browser, and actually click on the buttons to verify. Best if the LLM tests it before the changes, and then after, to confirm the behavior is the same. Maybe it should take a screenshot before the change, then take a screenshot after. And match.

I asked why Codex and Claude don't do this here: https://news.ycombinator.com/item?id=46792066
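
A rough sketch of that before/after screenshot idea, using Playwright's Python sync API. The function names capture and screenshots_identical are hypothetical, and the comparison is an assumption made for simplicity: it checks that the two PNG files are byte-for-byte identical, which is far stricter than a perceptual diff and will flag harmless differences such as a blinking cursor or a timestamp on the page.

```python
import hashlib
from pathlib import Path


def capture(url: str, out_path: Path) -> Path:
    """Save a full-page screenshot of `url` to `out_path`."""
    # Imported lazily so the comparison helper works without Playwright.
    # Requires `pip install playwright` and `playwright install chromium`.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.screenshot(path=str(out_path), full_page=True)
        browser.close()
    return out_path


def screenshots_identical(before: Path, after: Path) -> bool:
    """Strict check: True only if both files have exactly the same bytes."""
    digest_before = hashlib.sha256(before.read_bytes()).hexdigest()
    digest_after = hashlib.sha256(after.read_bytes()).hexdigest()
    return digest_before == digest_after
```

Usage would be something like calling capture("http://localhost:8000/", Path("before.png")) before the change (the URL is a placeholder), making the edit, capturing after.png the same way, and then checking screenshots_identical(Path("before.png"), Path("after.png")).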

threetonesun · 2026-02-13T14:12:28 1770991948

Yeah, if you have these tools in place to validate its changes you can quickly iterate with it to the right results. But think through how it's making UI changes and it becomes obvious quickly why it can make absolutely wrong and terrible guesses about the implementation details: it can't _see_ what it's doing, or interact with it, it's just pattern matching other implementations it's seen.

aurareturn · 2026-02-13T14:17:35 1770992255

Yea, the next breakthrough for Codex or Claude Code would be to actually use/test the app like a real human would during the development process.

simonw · 2026-02-13T14:41:01 1770993661

Here's a document produced by Claude Code using my Showboat testing tool this morning to help explore SeaweedFS (a local S3 clone) - it includes trying things out with curl and getting screenshots from Chrome using my Rodney tool: https://github.com/simonw/research/blob/main/seaweedfs-testi...

mwigdahl · 2026-02-13T14:24:15 1770992655

You can easily do this, at least with Claude Code. Ask it to install and use Playwright to confirm rendering and flow. You're correct that it is a failing to not do this. When you do, it definitely helps cut down on bugs.

EDIT: Sorry, just noticed you said "real browser". Haven't tried this but Playwright gets you a long way down the road.

aurareturn · 2026-02-13T14:26:58 1770992818

Will check it out. Looks like there is also chrome-devtools-mcp for Codex.

lenerdenator · 2026-02-13T14:26:11 1770992771

FWIW, I've found Playwright tests to be a decent way of getting Claude to do what you're talking about.

throwup238 · 2026-02-13T14:17:38 1770992258

See the /chrome command in Claude code.

n4r9 · 2026-02-13T14:06:03 1770991563

They say explicitly what model they're using.

uludag · 2026-02-13T14:07:57 1770991677

There could be a whole spectrum of types of repositories where these tools excel and fail. I can imagine that a large repository, poorly documented, with confusing inconsistent usages/patterns, in a dynamic language, with poor tests will almost always lead to failure.

I honestly think that size and age alone are sufficient to lead these tools into failure cases.

aurareturn · 2026-02-13T14:10:56 1770991856

It could be. I mainly use LLMs with Typescript and Go, both typed languages.

netdevphoenix · 2026-02-13T14:09:51 1770991791

> I often ask it "I have this bug. Why?" And it almost always figures it out and fixes it. Huge code base.

Is your AI PR publicly available in github?

aurareturn · 2026-02-13T14:11:31 1770991891

No. I don't do any open source work. I work for a private company.

whiplash451 · 2026-02-13T14:19:55 1770992395

These two things are not mutually exclusive.