
> there really is no moat.

For ChatGPT and Gemini, yes.

But for Claude, they have a very deep & big one: It's the only model that gets production ready output on the first detailed prompt. Yesterday I used up my tokens by noon, so I tried some output from Gemini & Co. I presented a working piece of code which is already in production:

1. Without mentioning it, it changed things like "Touple.First.Date.Created" and "Touple.Second.Date.Created" to "Touple.FirstDate" and "Touple.SecondDate", which broke the code.

2. There was a const list of 12 definitions for a given context. When told to rewrite the function, it simply cut 6 of the 12 definitions, so the code no longer compiled. I asked why they were cut: "Sorry, I was just too lazy typing" ?? LOL

3. There is a list, "_allGlobalItems", holding some items. In the function it simply renamed it to "_items", so the code didn't compile.
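Silent renames and dropped names like the ones in this list can be caught mechanically before anything is compiled. Below is a minimal sketch, not anything the commenter described using: the helper name `dropped_identifiers` and the two sample strings are made up for illustration, and the check is purely textual (it also sees keywords and names inside strings or comments, and it will flag intentional renames too).

```python
import re

# Matches plain identifiers and dotted member chains such as A.B.C.
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)*")


def dropped_identifiers(original: str, rewritten: str) -> set[str]:
    """Return identifiers (including dotted chains) that occur in
    `original` but nowhere in `rewritten`.

    Hypothetical helper for illustration; a textual diff of names,
    not a real parser for C# or any other language.
    """
    before = set(IDENT.findall(original))
    after = set(IDENT.findall(rewritten))
    return before - after


# Made-up snippets that mirror the renames described above.
original = (
    "var a = Touple.First.Date.Created; "
    "foreach (var i in _allGlobalItems) Use(i, a);"
)
rewritten = (
    "var a = Touple.FirstDate; "
    "foreach (var i in _items) Use(i, a);"
)

missing = dropped_identifiers(original, rewritten)
print(sorted(missing))
# prints ['Touple.First.Date.Created', '_allGlobalItems']
```

Running a check like this on every model rewrite would at least surface names that vanished, so the diff can be reviewed before the build fails.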

As said, a working version of a similar function was given upfront.

With Claude, I never have such issues.



I have used Claude (incl. Opus 4.6) fairly extensively, and Claude still spits out quality that is far below what I would call production ready: littered with smaller issues, plus the occasional larger blunder. Particularly when doing anything non-trivial, and even when guiding it in detail (although that admittedly reduces the number of larger structural issues).

Maybe it is tech stack dependent (I have mostly used it with C#/.NET), but I have heard people say the same for C#. The only conclusion I have been able to draw from this is that people have very different definitions of production ready, but I would really like to see some concrete evidence where Claude one-shots a larger/complex C# feature or the like (with or without detailed guidance).


> C#/.NET

same here :)

> one-shots a larger/complex C# feature

I can show you a timeseries data-renderer which was created with 1 initial very large prompt and then 3 following "change this and that" prompts. The file is around 5000 lines and everything works fine & exactly as specified.


> The file is around 5000 lines

Yep, this is another case of different standards for "production ready."


Caught, good one! :-))

++1


Feel free to share it, would be very curious - ideally alongside the prompts.


Do you have an email address?


You can use this: hnthrowaway.outboard407@passmail.net


Sent


I see this over and over again. I don't dispute your experience. My experience with ESP32 development has been unreasonably positive. My codebase is sitting around 600k LoC and is the product of several hundred Opus 4.x Plan -> Agent -> Debug loops. I review everything that goes through, but I'm reviewing the business logic and domain gotchas, not dumb crap like what you and so many others describe.

What is so strange to me is that surely there is more C# out there than ESP-IDF code? I don't have a good explanation beyond saying that my codebase is extensively tested and used; I would know very quickly if it suddenly started shitting the bed in the way you explain.


600k lines of code for anything on the ESP32 sounds like the absolute polar opposite of “good”


Tell us you've never built anything significant without telling us?


Tell us you know nothing about embedded without telling us


Okay Mr. account created 22 days ago...

https://news.ycombinator.com/item?id=47213963


Everyone knows internet points make someone more of an expert. Especially on websites that frequently host the most inane political discussions and have tanked in quality to only marginally better than Reddit.


Isn't it funny that you're literally the problem that you're describing?


> My experience with ESP32 development has been unreasonably positive. My codebase is sitting around 600k LoC and is the product of several hundred Opus 4.x Plan -> Agent -> Debug loops.

I feel like this is an example of people having different standards of what “good” code is and hence the differing opinions of how good these tools are. I’m not an embedded developer but 600K LOC seems like a lot in that context, doesn’t it? Again I could be way off base here but that sounds like there must be a lot of spaghetti and copy-paste all over the codebase for it to end up that large.


I don't think it's that large. Keep in mind embedded projects take few if any dependencies. The standard library in most languages is far bigger than 600k loc.


I work with ESP32 devices and 600k lines of code is insane.


I'm curious: What does this device do?


It's wild to come back to this after a day away and find that the takeaway from my attempt to answer the question is punditry about the size of my codebase from people who don't have any idea what my device does.

Answering this question directly puts me in an awkward spot because I realized last fall that there was absolutely no way that I could talk about what I'm working on in a way that can be associated with my product because there's so much anti-AI activism right now. That sucks, because I'd like to be "loud and proud" but I have a family to feed. I strongly suspect that versions of my story are playing out for hundreds of entrepreneurs right now.

Here's what I can describe: it's an ESP32-P4 based consumer device with about 45 ESP-IDF components that all communicate over an event bus. There's a substantially modified LVGL front-end with a 3D rendering engine and SVG-like 2D animation in front of a driver for a customized variation of the ST7789. There is substantial custom code for both USB host and client functions across various modes of operation. There's custom drivers for several sensors and haptic feedback. There's a very elaborate menu UI system which is also backed by a BBS style terminal configuration system for power users. There's an assignable action system with about 40 actions that all have their own state machines and a lot of mutex locking. There's a very involved and feature-dense trigger scheduling system. There's a very flexible data stream routing matrix. There's a full suite of command line scripts for most functions. There's a self-hosted web app for configuration that also implements a screen share functionality via an HTML canvas object so that I can record videos of what's happening on the device with OBS without having to point a DSLR at it from a gantry.

Honestly, I could go on and on, but all of the people who think that 600kloc is a lot [sight unseen] are following YouTube tutorials and can eat me.

I responded to you because you asked politely. I hope it was an interesting reply.


The more code is out there, the worse the average in the training dataset. There will be legacy approaches and APIs, poor design choices, popular use cases irrelevant to your context, etc. that increase the chances of output not matching your expectations. In the Java world this is exactly how it works. I need 3-5 iterations with Claude to get things done the way I expect, sometimes jumping straight to manual refactoring and then returning the result to Claude for review and learning. My CLAUDE.md files (I have several) are growing large with all the patterns and anti-patterns identified this way. To overcome this problem the model needs specialized training, which I don’t think the industry knows how to approach (it has to beat the effort put into the education system for humans).


> To overcome this problem the model needs specialized training, which I don’t think the industry knows how to approach

We already have coding-tuned models, e.g. Codex. We should just have language / technology specific models with a focus on recent / modern usage.

The problem with something like Java is that it is too old and has too many variants. Set a cutoff, for example only Java 8 or Java 17 and above.


> We should just have language / technology specific models with a focus on recent / modern usage.

The “just” part is a big assumption. It is far from easy, given that modern best practices are always underspecified. An effective model for coding must have reasoning signals that are much stronger than coding patterns, and that, I suspect, requires a very different architecture.


I also believe this must be true. Try asking Claude to program in Forth; I find the results to be unreasonably good. That's probably because most of the available Forth to train on is high quality.


Interesting - what kind of structural issues have you encountered?

Are these more related to the existing source code, or are they bad patterns that you would never use regardless of the existing code?


I don't get it though. Why do you expect perfect responses? Humans continually make mistakes, and AI is trained on human data. Yet there seems to be this higher bar of expectation for the latter. Somehow people expect this thing that's been around for a few weeks/months, and cannot learn anything more beyond its training cutoff date, to always do a better job than a human who's been around for 20+ years and is able to learn on their own until death.


I don't expect that. I am merely responding to the parent comment's claim that Claude consistently one-shots production ready code (which does not at all match my observations).


> It's the only model that gets production ready output on the first detailed prompt. Yesterday I used up my tokens by noon, so I tried some output from Gemini & Co. I presented a working piece of code which is already in production:

One does often hear that where LLMs shine is with greenfield code generation but they all start to struggle working with pre-existing code. It could be that this wasn't a like for like comparison.

That said, I do personally feel Claude produces far better results than competitors.


> One does often hear that where LLMs shine is with greenfield code generation but they all start to struggle working with pre-existing code. It could be that this wasn't a like for like comparison.

In my experience working in a large codebase with a good set of standards, that's not the case. I can supply examples already existing in the codebase for Claude to use as guidance, and it generates quite decent code.

I think it's because there's already a lot of decent code for it to slurp and derive from, good quality tests at the functional level (so regressions are caught quickly).

I do understand though that on codebases with a hodge podge of styles, varying quality of tests, etc. it probably doesn't work as well as in my experience but I'm quite impressed about how I can do the thinking, add relevant sections of the code to the context (including protocols, APIs, etc.), describe what I need to be done, and get a plan back that most times is correct or very close to correct, which I can then iterate over to fix gaps/mistakes it made, and get it implemented.

Of course, there are still tasks it fails at, and I don't like doing multiple iterations to correct course. Those I do manually, with the odd usage here and there to refactor bits and pieces.

Overall I believe that if your codebase is already healthy, you can have LLMs work quite well with pre-existing code.


> One does often hear that where LLMs shine is with greenfield code generation but they all start to struggle working with pre-existing code.

Don't we all?


Whether we do or not is beside the point. The comparison was between Claude, which produced competent greenfield code, and Gemini, which struggled with brownfield. The comparison is stacked in Claude's favour.


I'm better at pre-existing code, if only because empty text files give me writer's block.


Nope.


Greenfield implementation is not flawless either.


The only sources of these “it works flawlessly” I know of are:

- literal Claude ads I see online

- my underperforming coworkers whose code I’ve had to clean up, so I know firsthand that no, it wasn’t flawless

This kind of sentiment is gaslighting CTOs everywhere though. Very annoying.


"better at" != "flawless"


That's been my experience too. I'm using the recent free trial of OpenAI Plus to vibe code, and from this I would say that if Claude Code is a junior with 1-3 years of experience, OpenAI's Codex is like a student coder.


Does it depend on what type of programming you do? Doing Swift/SwiftUI work, I have exactly the opposite experience. I’ve been using both recently, and I want to use Claude alone (especially after the last week’s events), but Codex is just so much faster and better.


Swift/SwiftUI are two of the three experimental projects I'm using Codex on; the other is a physics simulation in Python.

It keeps trying to re-invent the wheel, and does a bad job of it.

The physics sim was supposed to be a thin wrapper around existing libraries, but instead of that it tried to write all the simulation code itself as a "fallback" (but it was broken), and never actually installed the real simulators that already did this stuff despite being told to use them in the first place. The last few dozen(!) prompts from me have been pairs of ~["Find all cases where you've re-invented the wheel, add them to the planning document", "now do them"]. And it's still not finished removing the original nonsense, so far as I can tell.
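One way to make "thin wrapper, no home-grown fallback" enforceable is to have the wrapper refuse to run when the real library is missing. The following is only a sketch of that idea under stated assumptions: `require_backend` is a hypothetical helper, and the module name passed to it is whatever the chosen library's Python package is actually called (no particular package name is assumed here).

```python
import importlib
import importlib.util
from types import ModuleType


def require_backend(module_name: str) -> ModuleType:
    """Import a required top-level backend module or fail loudly.

    Hypothetical helper for illustration. There is deliberately no
    fallback path: if the library is not installed, raise instead of
    silently substituting hand-written simulation code.
    """
    if importlib.util.find_spec(module_name) is None:
        raise RuntimeError(
            f"required backend '{module_name}' is not installed; "
            "refusing to fall back to a custom implementation"
        )
    return importlib.import_module(module_name)


# Usage with a standard-library module so the sketch runs anywhere.
backend = require_backend("json")
print(backend.__name__)
# prints json
```

With a guard like this at the top of the wrapper, a missing simulator shows up as an immediate error during the first run instead of as a quietly broken reimplementation.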

One of the two Swift experiments is just a dice roller. It took about 10 rounds of non-compiling Metal shaders (I don't know Metal, which is why I didn't give up and do that by hand after 4) before I managed to get that to work, and when it did work it immediately broke it again on the next four rounds. It wrote its own chart instead of using Swift Charts, and did it badly. It tried to put all the hamburger menu options into a UIAlertController. Something blocks the UI for several seconds when you change the dice font. I didn't count how many attempts it took to correctly label the D4.

The other Swift experiment was a musical instrument app, that got me to the prototype stage, eventually, but in a way that still felt like a student's project rather than a junior's project.


> Find all cases where you've re-invented the wheel

Did you put in the original prompt the "wheels" you wanted it to use? It's a toss-up when you aren't very specific about what you want.


For the Swift apps, at least half of the errors are of a type where I wouldn't expect to have needed to tell someone to not do it like that, and only a student could reasonably be expected to not know better.

For the Python physics sim, step 1 was to generate the plan. The prompt included "I want actual plasma physics, including high-density, high-field regimes, externally applied fields, etc., so consider which FOSS libraries would suit this.", and it then chose some existing libraries itself, and I made sure those specific named FOSS libraries actually ended up in the plan.

My first clue this wasn't going to work was that even from step 1 it was pushing for writing all the simulation code and not actually using e.g. WarpX, even though it had suggested WarpX itself. In fact, even when WarpX was in the plan, it was "integrate" rather than "just use this from the get-go".

I may well throw the whole thing out and try again with Claude when this trial expires. Most of the runs have been comically non-physical, to the extent you don't even need a physics degree to notice, or even a physics GCSE.


Definitely give Claude a go. I've been nothing but impressed by the performance thus far. My only gripe is the usage limit that I keep hitting.


Already have done, hence earlier comment: https://news.ycombinator.com/item?id=47204959

Claude made far fewer mistakes in general, never gave me non-compiling code.


(Just outside edit window, I now realise I was ambiguous in this comment, it was more like "Find all cases where you've re-invented the wheel, add their removal to the planning document")


I find it very much matters. I find Gemini better for pretty frontends, Claude Opus for planning. Gemini and Opus for code reviews. Codex is great when I want the LLM to follow instructions more strictly; good if you already have a detailed design.

Definitely depends on your use.


Do you use GPT-5.3-Codex Extra High or another model?


> But for Claude, they have a very deep & big one: It's the only model that gets production ready output on the first detailed prompt

That's not a moat though. Claude itself wasn't there 6 months ago and there's no reason to think Chinese open models won't be at this level in a year at most.

To keep its current position Claude has to keep improving at the same pace as the competitor.


> It's the only model that gets production ready output on the first detailed prompt.

That's, just, like, your opinion, man.


...and of a lot of colleagues in and out of my sector :)



