
A few thoughts on this:

- I wouldn’t outsource my brain to CC when it comes to checking CC’s output. Very mixed results in my experience, and it might discourage further exploration/thinking if you’ve already performed the checklist CC has given you (satisficing).

- Slash commands are the idiomatic way to memoize often-used prompts; I wonder why the author put them in CLAUDE.md instead?

- I’m also a bit skeptical that, aside from strict rules CC needs to follow, the encouragements/enchantments for writing good code that the author put in CLAUDE.md really work. But who knows.

- I DO like the caveats section at the end a lot. This is probably the most important piece of the article, when it comes to large codebases. Never just accept the first draft. Review everything with high suspicion and bend the output to your own style and taste. Otherwise, you’re pushing legacy code.


I know what you’re writing is the whole point of vibe coding, but I’d strongly urge you to not do this. If you don’t review the code an LLM is producing, you’re taking on technical debt. That’s fine for small projects and scripts, but not for things you want to maintain for longer. Code you don’t understand is essentially legacy code. LLM output should be bent to our style and taste, and ideally look like our own code.

If that helps, call it agentic engineering instead of vibe coding, to switch to a more involved mindset.


Since agents are good only at greenfield projects, the logical conclusion is that existing codebases have to be prepared such that new features are (opinionated) greenfield projects - let all the wiring dangle out of the wall so the intern just has to plug in the appliance. All the rest has to be done by humans, or the intern will rip open the wall to hang a picture.

Hogwash. If you can't figure out how to do something with project Y from npm, try checking it out from GitHub with WebStorm and asking Junie how to do it -- often you get a good answer right away. If not, you can ask questions that help you understand the code base. Don't understand some data structure that's a maze of Map<String, Object>s? It will scan how it's used and give you draft documentation.

Sure, you can't point it at a Jira ticket and get a PR, but you certainly can use it as a pair programmer. I wouldn't say it is much faster than working alone, but I end up writing more tests, and arguing with it over error handling means I do a better job in the end.


> Sure you can't point it to a Jira ticket and get a PR

You absolutely can. This is exactly what SWE-Bench[0] measures, and I've been amazed at how quickly AIs have been climbing those ladders. I personally have been using Warp [1] a lot recently and in quite a lot of low-medium difficulty cases it can one-shot a decent PR. For most of my work I still find that I need to pair with it to get sufficiently good results (and that's why I still prefer it to something cloud-based like Codex [2], but otherwise it's quite good too), and I expect the situation to flip over the coming couple of years.

[0] https://www.swebench.com/

[1] https://www.warp.dev/

[2] https://openai.com/index/introducing-codex/


How does Warp compare to others you have tried?

I've not used it for long enough yet for this to be a strong opinion, but so far I'd say that it is indeed a bit better than Claude Code, as per the results on Terminal Bench[0]. And on a side note, I quite like the fact that I can type shell commands and chat commands interchangeably into the same input and it just knows whether to run it or respond to it (accidentally forgetting the leading exclamation mark has been a recurring mistake for me in Claude Code).

[0] https://www.tbench.ai/


What you describe is not using agents at all, which is what my comment was aimed at; see its first sentence.

Junie is marketed as an “agent”, and it definitely works harder than the JetBrains AI Assistant.

They’re not. They’re good at many things and bad at many things. The more I use them the more I’m confused about which is which.

They are called slot machines for a reason.

I think agents have a curve: they're kinda bad at bootstrapping a project, very good in a small-to-medium-sized existing project, and then it slowly goes downhill from there as size increases.

Something about a brand-new project often makes LLMs drop to "example grade" code, the kind you'd never put in production. (An example: Claude implemented per-task file logging in my prototype project by pushing to an array of log lines, serializing the entire thing to JSON, and rewriting the entire file, for every logged event.)
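
Roughly what that looked like, paraphrased in Python (not the actual code it wrote), next to the boring append-only version you'd actually ship:

  import json

  # Paraphrase of the LLM's approach: keep every log line in memory and
  # rewrite the whole JSON file on each event, i.e. O(n) work per log call.
  def log_event_naive(path: str, lines: list[str], message: str) -> None:
      lines.append(message)
      with open(path, "w") as f:
          json.dump(lines, f)

  # The boring production version: append one JSON line per event.
  def log_event_append(path: str, message: str) -> None:
      with open(path, "a") as f:
          f.write(json.dumps({"message": message}) + "\n")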


Better to tell them exactly how this and that is done, with some examples.

The main gripe seems to be duplication of semantics across the codebase, and loss of centralized configurability. Makes sense, since LLMs can’t fit a whole codebase into their context and are not aware of shared behavior unless you tell them it exists.

Sleep is all you need, then?

What kind of bugs do you find this way, besides missing sanitization?

Pointer errors. Null pointer returns instead of using the correct types. Flow/state problems. Multithreading problems. I/O errors. Network errors. Parsing bugs... etc

Basically the whole world of bugs introduced by someone being too smart a C/C++ coder. You can battle-test parsers quite nicely with fuzzers, because parsers often have multiple states that assume naive input data structures.
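
A minimal sketch of what that can look like, here as property-based testing in Python with Hypothesis (coverage-guided fuzzers like libFuzzer/AFL apply the same idea with instrumentation); parse_config is a made-up stand-in for whatever parser you want to battle-test:

  import json
  from hypothesis import given, strategies as st

  def parse_config(raw: bytes) -> dict:
      # Made-up stand-in parser; in practice this would be your own parser,
      # called directly or through bindings.
      return json.loads(raw.decode("utf-8"))

  @given(st.binary(max_size=1024))
  def test_parser_never_crashes(raw: bytes) -> None:
      # Property: arbitrary input may be rejected, but only with the errors
      # we document; never a segfault, hang, or surprise exception type.
      try:
          parse_config(raw)
      except ValueError:  # JSONDecodeError and UnicodeDecodeError are subclasses
          pass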


You can use the fuzzer to generate test cases instead of writing test cases manually.

For example you can make it generate queries and data for a database and generate a list of operations and timings for the operations.

Then you can mix assertions into the test so you make sure everything is going as expected.

This is very useful because there can be many combinations of inputs, timings, etc., and it tests basically everything for you without you needing to write a million unit tests.
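
For a concrete flavour of this, Hypothesis's stateful testing does roughly that: it generates sequences of operations and checks your assertions after every step. A sketch, where KVStore is just a made-up in-memory stand-in for the real system:

  from hypothesis import strategies as st
  from hypothesis.stateful import RuleBasedStateMachine, invariant, rule

  class KVStore:
      # Made-up in-memory stand-in for the real system under test.
      def __init__(self):
          self._data = {}
      def put(self, key, value):
          self._data[key] = value
      def delete(self, key):
          self._data.pop(key, None)
      def get(self, key):
          return self._data.get(key)

  class KVStoreOps(RuleBasedStateMachine):
      def __init__(self):
          super().__init__()
          self.store = KVStore()  # system under test
          self.model = {}         # plain dict as the reference model

      @rule(key=st.text(min_size=1), value=st.integers())
      def put(self, key, value):
          self.store.put(key, value)
          self.model[key] = value

      @rule(key=st.text(min_size=1))
      def delete(self, key):
          self.store.delete(key)
          self.model.pop(key, None)

      @invariant()
      def agrees_with_model(self):
          # The assertions mixed into the generated operation sequences.
          for key, value in self.model.items():
              assert self.store.get(key) == value

  TestKVStoreOps = KVStoreOps.TestCase  # collected by pytest/unittest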


That sounds worse than letting an LLM dream up tests, tbh. I wouldn’t want to groom a huge number of tests for their usefulness after they’ve been generated randomly. And just keeping all of them will lock the implementation in place where it currently is, not validate its correctness.

You can often find memory errors not directly related to string handling with fuzz testing. More generally, if your program embodies any kind of state machine, you may find that a good fuzzer drives it into states that you did not think should exist.

That sounds a bit like using a jackhammer to drive in a nail. Wouldn’t it be smarter to enumerate edge cases and test all permutations of those?

Would it even be possible to enumerate all edge cases and test all the permutations of them in non-trivial codebases or interconnected systems? How do you know when you have all of the edge cases?

With fuzzing you can randomly generate bad input that gets past all of the test cases you've written by whatever method you've been using so far, but that still causes the application to crash or behave badly. This may mean that there are more tests you could write that would catch the issue related to the fuzz case, or the fuzz case itself could be used as a test.

Using probability, you can get to 90%, 99%, 99.999%, or whatever confidence level you need that the software is unaffected by bugs, based on the input size / number of fuzz test cases. In many non-critical situations the goal may not be 100%, but 'statistically very unlikely, with a known probability and error'.
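
And once the fuzzer does find a failing input, pinning it as an ordinary regression test is cheap. A hypothetical example (deeply nested JSON that used to blow up the parser):

  import json

  # Hypothetical fuzzer-found input, kept forever as a regression test.
  FUZZER_FOUND_INPUT = b"[" * 10_000

  def test_pathological_nesting_is_rejected_cleanly():
      try:
          json.loads(FUZZER_FOUND_INPUT)
      except (ValueError, RecursionError):
          pass  # a clean, documented rejection is fine; a crash or hang is not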


Thanks for elaborating, I might start fuzzing.

> Tests are the source of truth more so than your code

Tests poke and prod with a stick at the SUT, and the SUT's behaviour is observed. The truth lives in the code, the documentation, and, unfortunately, in the heads of the dev team. I think this distinction is quite important, because this question:

> Do we have a bug? Or do we have a bad test?

cannot be answered by looking at the test + the implementation. The spec or people have to be consulted when in doubt.


> The spec

The tests are your spec. They exist precisely to document what the program is supposed to do for other humans, with the secondary benefit of also telling a machine what the program is supposed to do, allowing implementations to automatically validate themselves against the spec. If you find yourself writing specs and tests as independent things, that's how you end up with bad, brittle tests that make development a nightmare — or you simply like pointless busywork, I suppose.

But, yes, you may still have to consult a human if there is reason to believe the spec isn't accurate.


Unfortunately, tests can never be a complete specification unless the system is simple enough to have a finite set of possible inputs.

For all real-world software, a test suite tests a number of points in the space of possible inputs and we hope that those points generalize to pinning down the overall behavior of the implementation.

But there's no guarantee of that generalization. An implementation that fails a test is guaranteed to not implement the spec, but an implementation that passes all of the tests is not guaranteed to implement it.


> Unfortunately, tests can never be a complete specification

They are for the human, which is the intended recipient.

Given infinite time the machine would also be able to validate against the complete specification, but, of course, we normally cut things short because we want to release the software in a reasonable amount of time. But, as before, that this ability exists at all is merely a secondary benefit.


  > The tests are your spec.
That's not quite right, but it's almost right.

Tests are an *approximation* of your spec.

Tests are a description, and like all descriptions are noisy. The thing is it is very very difficult to know if your tests have complete coverage. It's very hard to know if your description is correct.

How often do you figure out something you didn't realize previously? How often do you not realize something and it's instead pointed out by your peers? How often do you realize something after your peers say something that sparks an idea?

Do you think that those events are over? No more things to be found? I know I'm not that smart because if I was I would have gotten it all right from the get go.

There are, of course, formal proofs but even they aren't invulnerable to these issues. And these aren't commonly used in practice and at that point we're back to programming/math, so I'm not sure we should go down that route.


> Tests are a description

As is a spec. "Description" is literally found in the dictionary definition. Which stands to reason as tests are merely a way to write a spec. They are the same thing.

> The thing is it is very very difficult to know if your tests have complete coverage.

There is no way to avoid that, though. Like you point out, not even formal proofs, the closest speccing methodology we know of to try and avoid this, is immune.

> Tests are an approximation of your spec.

Specs are an approximation of what you actually want, sure, but that does not change that tests are the spec. There are other ways to write a spec, of course, but if you went down that road you wouldn't also have tests. That would be not only pointless, but a nightmare due to not having a single source of truth which causes all kinds of social (and sometimes technical) problems.


  > that does not change that tests are the spec.
I disagree. It's, like you say, one description of your spec but that's not the spec.

  > not having a single source of truth
Well that's the thing, there is no single source of truth. A single source of truth is for religion, not code.

The point of saying this is to ensure you don't fall prey to fooling yourself. You're the easiest person for you to fool, after all. You should always carry some doubt. Not so much it is debilitating, but enough to keep you from being too arrogant. You need to constantly check that your documentation is aligned to your specs and that your specs are aligned to your goals. If you cannot see how these are different things then it's impossible to check your alignment and you've fooled yourself.


> You need to constantly check that your documentation is aligned to your specs

Documentation, tests, and specs are all ultimately different words for the same thing.

You do have to check that your implementation and documentation/spec/tests are aligned, which can be a lot of work if you do so by hand, but that's why we invented automatic methods. Formal verification is theoretically best (that we know of) at this, but a huge pain in the ass for humans to write, so that is why virtually everyone has adopted tests instead. It is a reasonable tradeoff between comfort in writing documentation while still providing sufficient automatic guarantees that the documentation is true.

> If you cannot see how these are different things

If you see them as different things, you are either pointlessly repeating yourself over and over or inventing information that is, at best, worthless (but often actively harmful).


  > different words for the same thing
You're still misunderstanding and missing the layer of abstraction, which is what I'm (and others are) talking about

We have 3 objects: doc, test, spec. How do you prove they are the same thing?

You are arguing that they all point to the same address.

I'm arguing they all have the same parent.

I think it's pretty trivial to show that they aren't identical, so I'll give two examples (I'm sure you can figure out a few more trivial ones):

  1) the documentation is old and/or incorrect, therefore isn't aligned with tests. Neither address nor value are equivalent here.
  2) docs are written in natural language, tests are written in programming languages. I wouldn't say that the string "two" (or even "2") is identical to the integer 2 (nor the float 2). Duck typing may make them *appear* the same and they may *reference* the same abstraction (or even object!), but that is a very different thing than *being* the same. We could even use the classic Python mistake of confusing "is" with "==" (though that's a subset of the issue here).
Yes, you should simplify things as much as possible, but be careful not to simplify further than that.

> We have 3 objects: doc, test, spec. How do you prove they are the same thing?

You... don't? There is nothing good that can come from trying to understand crazy. Best to run away as fast as possible if you ever encounter this.

> You are arguing that they all point to the same address.

Oh? I did say if you document something the same way three different times (even if you give each time a different name, as if that somehow makes a difference), you are going to pointlessly end up with the same thing. I am not sure that necessarily equates to "the same address". In fact,

> I'm arguing they all have the same parent.

I also said that if they don't end up being equivalent documentation then you will only find difference in information that isn't useful. And that often that information becomes detrimental (see some of the adjacent comments that go into that problem). This is "having the same parent".

In reality, I "argued" both. You'd have better luck if you read the comments before replying.

> you should simplify things as much as possible but be careful to not simplify further

Exactly. Writing tests, documentation, or specs (whatever you want to call it; it all carries the same intent) in natural language certainly feels simpler in the moment, but you'll pay the price later. In reality, you at the very least need a tool that supports automatic verification. That could mean formal verification, but, as before, it's a beast that is tough to wrangle. More realistically, tests are going to be the best choice amid all the tradeoffs. Industry (including the Haskell fanbois, even) has settled on them for good reason.

> docs are written in natural language, tests are written in programming languages.

Technically "docs" is a concept of less specificity. Documentation can be written in natural language, that is true, but it can also be written in code (like what we call tests), or even pictures or video. "Tests" carries more specificity, being a particular way to write documentation — but ultimately they are the same thing. Same goes for "spec". It describes a different level of specificity (less specific than "tests", but more specific than "docs"), but not something entirely different. It is all documentation.


  > In reality, I "argued" both. 
I mean it is hard to have this conversation because you will say that they are the same thing and then leverage the fact that they aren't while disagreeing with me but using nearly identical settings to my examples.

I mean if your argument is that a mallard (test) and a muscovy (docs) are both types of ducks but a mallard is not a muscovy and a muscovy is not a mallard, then I fail to see how we aren't on the same page. I can't put it any clearer than this: all mallards are ducks but not all ducks are mallards. In other words, a mallard is a duck, but it is not representative of all ducks. You can't look at a mallard and know everything there is to know about ducks. You'll be missing stuff. If you treat your mallard and duck as isomorphic you're going to land yourself into trouble, even if most (domesticated) ducks are mallards.

It isn't that complex, and saying "don't be overly confident" isn't adding crazy amounts of complexity that is going to overwhelm you. It's simply a recognition that you can't write a perfect spec.

Look, munificent[0] is saying the same thing. So is Kinrany[1], and manmal[2]. Do you think we're all wrong? In exactly the same way?

Besides, this whole argument is literally a demonstration of our claim. If you could write a perfect spec you'd (and we'd) be communicating perfectly and there'd be no hangup. But if that were possible we wouldn't need to write code in programming languages in the first place![3]

[0] https://news.ycombinator.com/item?id=44713138

[1] https://news.ycombinator.com/item?id=44713314

[2] https://news.ycombinator.com/item?id=44712266

[3] https://www.cs.utexas.edu/~EWD/transcriptions/EWD06xx/EWD667...


> I mean if your argument is that a mallard (test) and a muscovy (docs) are both types of ducks

To draw a more reasonable analogy with how the words are actually used on a normal basis, you'd have fowl (docs), ducks (specs), and mallards (tests). As before, the terms change in specificity, but do not refer to something else entirely. Pointing at a mallard and calling it a duck, or fowl, doesn't alter what it is. It is all the very same animal.

Yes, fowl could also refer to chickens just as documentation could refer to tax returns. 'Tis the nature of using a word lacking specificity. But from context one should be able to understand that we're not talking about tax returns here.

But I don't have an "argument". High school debate team is over there.

> It's simply a recognition that you can't write a perfect spec.

That was recognized from the onset. What is the purpose of adding this again?

> Do you think we're all wrong?

We're all bad at communicating, if that's what you are straining to ask. Which isn't exactly much of a revelation. We've both already indicated as such, as have many commenters that came before us.


None of the four (code, tests, spec, people's memory) is the single source of truth.

It's easy to see them as four cache layers, but empirically it's almost never the case that the correct thing to do when they disagree is to blindly purge and recreate levels that are farther from the "truth" (even ignoring the cost of doing that).

Instead, it's always an ad-hoc reasoning exercise in looking at all four of them, deciding what the correct answer is, and updating some or all of them.


What does SUT stand for? I'm not familiar with the acronym

Is it "System Under Test"? (That's Claude.ai's guess)


That's what Wiktionary says too. Lucky guess, Claude.

It is.

It’s definitely geriatric.

Geriatric spice was the worst spice girl.

ChatGPT lists clickable sources for a lot of nontrivial queries. Those sites don’t even need to pay OpenAI for the traffic (yet). If you ask “what’s happening in the world today”, you might get 20 links. How is this worse, exactly?

How many people click the links? What happens to LLMs if people don’t provide training data anymore because nobody visits their sites?

Cloudflare publishes a "crawl-to-refer" ratio, which can be used to estimate the traffic from LLMs:

https://radar.cloudflare.com/ai-insights#crawl-to-refer-rati...


They will either pay for it to be generated or get good enough at producing synthetic data that actually improves LLM quality.

So either even higher costs, or hoping that a big problem of LLMs somehow gets solved.

Given how much data they need, that will be pretty expensive, I mean really, really expensive. How many people can write good training data, and how much per day?

Doesn’t sound sustainable.

