
I always see these reports about how much better AI is than humans now, but I can't even get it to help me with pretty mundane problem solving. Yesterday I gave Claude a file with a few hundred lines of code, what the input should be, and told it where the problem was. I tried until I ran out of credits and it still could not work backwards to tell me where things were going wrong. In the end I just did it myself and it turned out to be a pretty obvious problem.

The strange part with these LLMs is that they get weirdly hung up on things. I try to direct them away from a certain type of output and somehow they keep going back to it. It's like the same problem I have with Google where if I try to modify my search to be more specific, it just ignores what it doesn't like about my query and gives me the same output.


It's overfitting.

Some people say they find LLMs very helpful for coding, some people say they are incredibly bad.

I often see people wondering whether some coding task is performed well or not because of the availability of code examples in the training data. It's way worse than that. It's overfitting to the diffs it was trained on.

"In other words, the model learns to predict plausible changes to code from examples of changes made to code by human programmers."

https://arxiv.org/abs/2206.08896
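For a sense of what "trained on diffs" means here, the training examples pair code with the changes humans made to it. Schematically (my own illustration, not the paper's actual serialization format):

    # Schematic of one diff-model training example (illustrative only; the
    # exact format used in the paper may differ).
    training_example = {
        "file_before": "def add(a, b):\n    return a - b\n",
        "commit_message": "Fix subtraction bug in add()",
        "target_diff": (
            "@@ -1,2 +1,2 @@\n"
            " def add(a, b):\n"
            "-    return a - b\n"
            "+    return a + b\n"
        ),
    }
    # The model is trained to predict target_diff given file_before and
    # commit_message, i.e. plausible changes to code, not the code itself.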


... which explains why some models are better at code than others. The best coding models (like Claude 3.7 Sonnet) are likely that good because Anthropic spent an extraordinary amount of effort cultivating a really good training set for them.

I get the impression one of the most effective tricks is to load your training set up with as much code as possible that has comprehensive automated tests that pass already.


I've often had what I thought was an obscure and very intellectually challenging coding problem, and after prompting the LLM, it basically one-shotted it.

I've been profoundly humbled by the experience, but then it occurred to me that what I thought was a unique problem had been solved by quite a few people before, and the model had plenty of references to pull from.


Do you have any examples?

Yeah, for the positive example, I described the syntax of a domain-specific language, and the AI basically one-shotted the parsing rules, which only needed minor fixes.
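To give a rough sense of it (this is a hypothetical toy DSL, not the real one, which I can't share):

    import re

    # Hypothetical toy DSL: lines of `name = value [unit]`, with `#` comments.
    # A stand-in for the kind of parsing rules the model produced.
    RULE = re.compile(r"^\s*(?P<name>\w+)\s*=\s*(?P<value>-?\d+(?:\.\d+)?)\s*(?P<unit>\w+)?\s*$")

    def parse(text: str) -> dict:
        out = {}
        for line in text.splitlines():
            line = line.split("#", 1)[0]   # strip comments
            if not line.strip():
                continue
            m = RULE.match(line)
            if not m:
                raise ValueError(f"bad line: {line!r}")
            out[m["name"]] = (float(m["value"]), m["unit"])
        return out

    print(parse("width = 12.5 cm  # page width\nmargin = 2 cm"))
    # => {'width': (12.5, 'cm'), 'margin': (2.0, 'cm')}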

For a counterexample, working on any part of a codebase that's 100% application-specific business logic, with our custom abstractions, the AI is usually so lost that it's basically not even worth using, as the chances of it writing correct and usable code are next to zero.


> ... which explains why some models are better at code than others.

No. It explains why models seem better at code in certain situations. When your prompt maps to diffs in the training data that are useful to you, they seem great.


I've been writing code with LLM assistance for over two years now and I've had plenty of situations where I am 100% confident the thing I am doing has never been done by anyone else before.

I've tried things like searching all of the public code on GitHub for every possible keyword relevant to my problem.

... or I'm writing code against libraries which didn't exist when the models were trained.

The idea that models can only write code if they've seen code that does the exact same thing in the past is uninformed in my opinion.


Strongly agree.

This seems to be very hard for people to accept, per the other comments here.

Until recently I was willing to accept an argument that perhaps LLMs had mostly learned the patterns; e.g. to maybe believe 'well there aren't that many really different leetcode questions'.

But with recent models (eg sonnet-3.7-thinking) they are operating well on such large and novel chunks of code that the idea they've seen everything in the training set, or even, like, a close structural match, is becoming ridiculous.


All due respect to Simon but I would love to see some of that groundbreaking code that the LLMs are coming up with.

I am sure the functionality implemented is novel, but do you really think the training data cannot possibly have contained the patterns being used to deliver these features? How is it that in the past few months or years people suddenly found the opportunity and motivation to write code that cannot possibly be represented, in any way, shape, or form, by patterns in the diffs that have been pushed over the past 30 years?


When I said "the thing I am doing has never been done by anyone else before" I didn't necessarily mean groundbreaking pushes-the-edge-of-computer-science stuff - I meant more pedestrian things like "nobody has ever published Python code to condense and uncondense JSON using this new format I just invented today": https://github.com/simonw/condense-json

I'm not claiming LLMs can invent new computer science. I'm saying it's not accurate to say "they can only produce code that's almost identical to what's in their training data".


> "they can only produce code that's almost identical to what's in their training data"

Again, you're misinterpreting in a way that seems like you are reacting to the perception that someone attacked some of your core beliefs rather than considering what I am saying and conversing about that.

I never even used the words "exact same thing" or "almost identical". Not even synonyms. I just said overfitting and quoted from an OpenAI/Anthropic paper that said "predict plausible changes to code from examples of changes"

Think about that. Don't react, think. Why do you equate overfitting and plausibility prediction with "exact" and "identical"? It is very obviously not what I said.

What I am getting at is that a cannon will kill the mosquito. But painting a fly swatter on the cannonball and saying the plastic ones are now obsolete would be in bad faith. And there is no need to tell someone pointing that out that they are claiming the cannon can only fire on mosquitoes that have been swatted before.


I don't think I understood your point then. I matched it with the common "LLMs can only produce code that's similar to what they've seen before" argument.

Reading back, you said:

> I often see people wondering if the some coding task is performed well or not because of availability of code examples in the training data. It's way worse than that. It's overfitting to diffs it was trained on.

I'll be honest: I don't understand what you mean by "overfitting to diffs it was trained on" there.

Maybe I don't understand what "overfitting" means in this context?

(I'm afraid I didn't understand your cannon / fly swatter analogy either.)


It's overkill. The models do not capture knowledge about coding. They overfit to the dataset. When one distills data into a useful model, the model can be used to predict the future behavior of the system.

That is the premise of LLM-as-AI. By training these models on enough data, knowledge of the world is purported to have been captured, creating something useful that can be leveraged to process new input and get a prediction of the trajectory of the system in some phase space.

But this, I argue, is not the case. The models merely overfit to the training data, hence the variable results people perceive. When their intentions and prompt fit the training data, the model appears to give good output. But when the situation and prompt do not, the models do not "reason" about it or "infer" anything. They fail. They give you gibberish or go in circles, or worse, if there is some "agentic" arrangement, fail to terminate and burn tokens until you intervene.

It's overkill. And I am pointing out that it is overkill. It's not a clever system for creating code for any given situation; it overfits to the training data set. And your response is to claim that my argument is something else - not that it's overkill, but that it can only kill dead things. I never said that. I see it's more than capable of spitting out useful code even if that exact code is not in the training dataset. But it is just automating the process of going through Google, docs and Stack Overflow and assembling something for you. You might be good at searching, and lucky, and it is just what you need. Or you might not be used to using the right keywords, or be using some uncommon language, or be in a domain that happens to not be well represented, and then it feels less useful. But instead of just coming up short the way a search would, the model overkills and wastes your time and god knows how much subsidized energy and compute. Lucky you if you're not burning tokens on some agentic monstrosity.


You are correct that variable results could be a symptom of a failure to generalise well beyond the training set.

Such failure could happen if the models were overfit, or for other reasons. I don't think 'overfit', which is pretty well defined, is exactly the word you mean to use here.

However, I respectfully disagree with your claim. I think they are generalising well beyond the training dataset (though not as far beyond as say a good programmer would - at least not yet). I further think they are learning semantically.

Can't prove it in a comment except to say that there's simply no way they'd be able to successfully manipulate such large pieces of code, using English-language instructions, if they weren't great at generalisation and OK at understanding semantics.


I understand your position. But I think you're underestimating just how much training data is used and how much information can be encoded in hundreds of billions of parameters.

But this is the crux of the disagreement. I think the models overfit to the training data, hence the fluctuating behavior. And you think they show generalization and semantic understanding - which, yeah, they apparently do. But the failure modes, in my opinion, show that they don't, and would be explained by overfitting.


If that's the case, it turns out that what I want is a system that's "overfitted to the dataset" on code, since I'm getting incredibly useful results for code out of it.

(I'm not personally interested in the whole AGI thing.)


Good man I never said anything about AGI. Why do you keep responding to things I never said?

This whole exchange was you having knee-jerk reactions to things you imagined I said. It has been incredibly frustrating. And at the end you shrug and say "eh it's useful to me"??

I am talking about this because of deceitfulness, resource efficiency, societal implications of technology.


"That is the premise of LLM-as-AI" - I assumed that was an AGI reference. My definition of AGI is pretty much "hyped AI". What did you mean by "LLM-as-AI"?

In my own writing I don't even use the term "AI" very often because its meaning is so vague.

You're right to call me out on this: I did, in this earlier comment - https://news.ycombinator.com/item?id=43644662#43647037 - commit the sin of responding to something you hadn't actually said.

(Worse than that, I said "... is uninformed in my opinion" which was rude because I was saying that about a strawman argument.)

I did that thing where I saw an excuse to bang on one of my pet peeves (people saying "LLMs can't create new code if it's not already in their training data") and jumped at the opportunity.

I've tried to continue the rest of the conversation in good faith though. I'm sorry if it didn't come across that way.


> My definition of AGI is pretty much

Simon, intelligence exists (and unintelligence exists). When you write «I'm not claiming LLMs can invent new computer science», you imply intelligence exists.

We can implement it. And it is somewhat urgent, because intelligence is a very desirable form of wealth - there is a definite scarcity of it. It is even more urgent now that the recent hype has made some people perversely confused about the idea of intelligence.

We can and must go well beyond the current state.


I’ve spent a fair amount of time trying to coax assistance out of LLMs when trying to design novel or custom neural network architectures. They are sometimes helpful with narrow aspects of this. But more often, they disregard key requirements in favor of the common patterns they were trained on.

> The idea that models can only write code if they've seen code that does the exact same thing in the past is deeply uninformed in my opinion.

This is a conceited interpretation of what I said.


If this isn't what you meant, then what did you mean? To me, it's exactly how I read what you said.

I am sorry but that's nonsense.

I quoted the paper "Evolution through Large Models", written in collaboration between OpenAI and Anthropic researchers.

"In other words, the model learns to predict plausible changes to code from examples of changes made to code by human programmers."

https://arxiv.org/pdf/2206.08896

> The idea that models can only write code if they've seen code that does the exact same thing in the past

How do you get "code that does the exact same thing" from "predicting plausible changes?"


That paper describes an experimental diff-focused approach from 2022. It's not clear to me how relevant it is to the way models like Claude 3.7 Sonnet (thinking) and o3-mini work today.

If you do not think past research by OpenAI and Anthropic on how to use LLMs to generate code is relevant to how Anthropic LLMs generate code 3 years later, I really don't think it is possible to have a reasonable conversation about this topic with you.

Can we be sure that research became part of their mainline model development process as opposed to being an interesting side-quest?

Are Gemini and DeepSeek and Llama and other strong coding models using the same ideas?

Llama and DeepSeek are at least slightly more open about their training processes so there might be clues in their papers (that's a lot of stuff to crunch through though).


> overfitting

Are you sure it's not just a matter of being halfwitted?


The PR articles and astroturfing will continue until investors get satisfactory returns on their many billions dumped into these things.

LLMs are difficult to use. Anyone who tells you otherwise is being misleading.

I myself also think LLMs are more difficult to use for most tasks than is often touted, but I don't really jibe with statements like "Anyone who tells you otherwise is being misleading". Most of the time I find they are just using them in a very different capacity.

I intended those words to imply "being misleading even if they don't know they are being misleading" - I made a better version of that point here: https://simonwillison.net/2025/Mar/11/using-llms-for-code/

> If someone tells you that coding with LLMs is easy they are (probably unintentionally) misleading you. They may well have stumbled on to patterns that work, but those patterns do not come naturally to everyone.


"Hey these tools are kind of disappointing"

"You just need to learn to use them right"

Ad infinitum as we continue to get middling results from the most overhyped piece of technology of all time.


That's why I try not to hype it.

Uh... You don't do anything but hype them.

I literally don't know who anyone on HN is except you and dang, and you're the one who constantly writes these ads for your LLM database product.


I think you and I must have different definitions of the word "hype".

To me, it means LinkedIn influencers screaming "AGI is coming!", "It's so over", "Programming as a career is dead" etc.

Or implying that LLMs are flawless technology that can and should be used to solve every problem.

To hype something is to provide a dishonest impression of how great it is without ever admitting its weaknesses. That's what I try to avoid doing with LLMs.


> without ever admitting its weaknesses

I don't think this part is necessary

"To hype something is to provide a dishonest impression of how great it is" is accurate.

Marketing hype is all about "provide a dishonest impression of how great it is". Putting the weaknesses in fine print doesn't change the hype

Anyways I don't mean to pile on but I agree with some of the other posters here. An awful lot of extremely pro-AI posts that I've noticed have your name on them

I don't think you are as critical of the tech as you think you are.

Take that for what you will


You're the biggest hype merchant for this technology on this entire website. Please.

I've been banging the drum about how unintuitive and difficult this stuff is for over a year now: https://simonwillison.net/2025/Mar/11/using-llms-for-code/

I'm one of the loudest voices about the so-far unsolved security problems inherent in this space: https://simonwillison.net/tags/prompt-injection/ (94 posts)

I also have 149 posts about the ethics of it: https://simonwillison.net/tags/ai-ethics/ - including one of the first high profile projects to explore the issue around copyrighted data used in training sets: https://simonwillison.net/2022/Sep/5/laion-aesthetics-weekno...

One of the reasons I do the "pelican riding a bicycle" thing is that it's a great way to deflate the hype around these tools - the supposedly best LLM in the world still draws a pelican that looks like it was done by a five year old! https://simonwillison.net/tags/pelican-riding-a-bicycle/

If you want AI hype there are a thousand places on the internet you can go to get it. I try not to be one of them.


I agree - the content you write about LLMs is informative and realistic, not hyped. I get a lot of value from it, especially because you write mostly as a stream of consciousness and explain your approach and/or reasoning. Thank you for doing that.

The prompt injection articles you wrote really early in the tech cycle were really good and I appreciated them at the time.

Could a five year old do it in XML (SVG)? Could an artist? In one shot?

It's true that simonw writes a lot about LLMs, but I find his content to be mostly factual. Much of it is positive, but that doesn't mean it's hype.

In my experience, most people who say "Hey these tools are kind of disappointing" either refuse to provide a reproducible example of how it falls short, or if they do, it's clear that they're not using the tool correctly.

I'd love to see a reproducible example of these tools producing something that is exceptional. Or a clear reproducible example of using them the right way.

I've used them some (sorry, I didn't make detailed notes about my usage; I probably used them wrong), but pretty much there are always subtle bugs that, if I didn't know better, I would have overlooked.

I don't doubt people find them useful, personally I'd rather spend my time learning about things that interest me instead of spending money learning how to prompt a machine to do something I can do myself that I also enjoy doing.

I think a lot of the disagreement on HN about this tech is that both sides are mostly at the extremes of either "it doesn't work at all and is pointless" or "it's amazing and makes me 100x more productive", with not much discussion of the middle ground: it works for some stuff, and knowing what stuff it works well on makes it useful, but it won't solve all your problems.


Why are you setting the bar at "exceptional"? If it means that you can write your git commit messages more quickly and with fewer errors, then that's all the payoff most orgs need to make them worthwhile.

> Why are you setting the bar at "exceptional"

Because that is how they are being sold to us and hyped

> If it means that you can write your git commit messages more quickly and with fewer errors then that's all the payoff most orgs need to make them worthwhile.

This is so trivial that it wouldn't even be worth looking into, it's basically zero value


> I'd love to see a reproducible example of these tools producing something that is exceptional.

I’m happy that my standards are somewhat low, because the other day I used Claude Sonnet 3.7 to refactor around 70 source files for me and it worked out really nicely - with a bit of guidance along the way it got me a bunch of correctly architected interfaces and base/abstract classes and made the otherwise tedious task take much less time and effort, with some cleanup and improvements along the way. It all works okay too, after the needed amount of testing.

I don’t need exceptional, I need meaningful productivity improvements that make the career less stressful and frustrating.

Historically, that meant using a good IDE. Along the way, that also started to mean IaC and containers. Now that means LLMs.


I honestly think the problem is you are just a lot smarter than I am.

I find these tools wonderful but I am a lazy, college drop out of the most average intelligence, a very shitty programmer who would never get paid to write code.

I am intellectually curious though and these tools help me level up closer to someone like you.

Of course, if I had 30 more IQ points I wouldn't need these tools but I don't have 30 more IQ points.


The latest example for me was trying to generate a thumbnail of a PSD in plain C and figure out the layers in there, as I was too lazy to read the specs, with the objective of bundling it as wasm and executing it in a browser. It never managed to extract a thumbnail from a given PSD. It's very confident at making stuff, but it never got anywhere despite my spending a couple of hours on it, which would have been better spent reading the specs and existing code on the topic.

How are we supposed to give a reproducible example with a non-deterministic tool?

Ad infinitum

also "They will get better in no time"

That one's provably correct. Try comparing 2023-era GPT-3.5 with 2025's best models.

It's not provably correct if the comment is made toward 2025 models.

Gemini 2.5 came out just over two weeks ago (25th March) and is a very significant improvement on Gemini 2.0 (5th February), according to a bunch of benchmarks but also the all-important vibes.

LLMs are a casino. They're probabilistic models which might come up with incredible solutions at a drop of a hat, then turn around and fumble even the most trivial stuff - I've had this same experience from GPT3.5 to the latest and greatest models.

They come up with something amazing once, and then never again, leading me to believe it's operator error, not pure dumb luck or slight prompt wording, that led me to be humbled once and then tear my hair out in frustration the next time.

Granted, newer models tend to do more hitting than missing, but it's still far from a certainty that it'll spit out something good.


> "Hey these tools are kind of disappointing"

> "You just need to learn to use them right"

Admittedly, the first line is also my reaction to the likes of ASM or system level programming languages (C, C++, Rust…) because they can be unpleasant and difficult to use when compared to something that’d let me iterate more quickly (Go, Python, Node, …) for certain use cases.

For example, building a CLI tool in Go vs C++. Or maybe something to shuffle some data around and handle certain formatting in Python vs Rust. Or a GUI tool with Node/Electron vs anything else.

People telling me to RTFM and spend a decade practicing to use them well wouldn’t be wrong though, because you can do a lot with those tools, if you know how to use them well.

I reckon that it applies to any tool, even LLMs.


No, it's just you and yours.

IDK, maybe there's a secret conspiracy of major LLM providers to split users into two groups, one that gets the good models, and the other that gets the bad models, and ensure each user is assigned to the same bucket at every provider.

Surely it's more likely that you and me got put into different buckets by the Deep LLM Cartel I just described, than it is for you to be holding the tool wrong.


Was that on 3.7 Sonnet? I feel it's a lot worse than 3.5. If you can, try again but on Gemini 2.5.

I'm glad I'm not the only one that has found 3.5 to be better than 3.7.

When did 3.7 come out? I might have had the same experience. I think I have been using 3.5 with success, but I cannot remember exactly. I may have not used 3.7 for coding (as I had a couple of months break).


I will have to check, but apparently I have been using 3.5 with success, then. I will give 3.7 a try later, I hope it is really not that much worse, or is it? :(

This was 3.7. I did give Gemini a shot for a bit but it couldn’t do it either and the output didn’t look quite as nice. Also, I paid for a year of Claude so kind of feel stuck using it now.

Maybe I will give 3.5 a shot next time though.


Why do you say that?

This sounds more like a problem with you specifically not seeing value in art. Why would you want the incentives not to work in their favor?

Studio Ghibli might not have been affected yet, but only because the technology is not there yet. What's going to happen when someone can make a competing movie in their style with just a prompt? Should we all just be okay with it because it's been decided that Studio Ghibli has made enough money?

If the effort required to create that can just be ingested by a machine and replicated without consequence, how would it be viable for someone to justify that kind of investment? Where would the next evolution of the art form come from? Even if some company put in the time to create something amazing using AI that does require an investment, the precedent is that it can just be ingested and copied without consequence.

I think aside from what is legal, we need to think about what kind of world we want to live in. We can already plainly see what social media has done to the world. What do you honestly think the world will look like once this plays out?


> What's going to happen when someone can make a competing movie in their style with just a prompt?

Nothing? Just like how if some studio today invests millions of man-hours and does a competing movie in Studio Ghibli's aesthetic (but not including any of Studio Ghibli's characters, branding, etc. - basically, not the copyrightable or trademarkable stuff), nothing out of the ordinary is going to happen.

I mean, artistic style is not copyrightable, right?


You are missing the point entirely. If you can make a movie with just a prompt, who is going to invest the money creating something like a Ghibli movie just to have it ripped off? Instead people will just rip off what has already been done and everything just stagnates.

How is "movies and other great works of art that used to cost tens of millions of dollars to make now cost tens of dollars to make" a bad thing?

It means art can get more ambitious. Ghibli made their mark, and made their money. Now it's time for the next generation to have a turn.


The lower cost is not the bad thing. Allowing an AI to learn from it and regurgitate is the bad thing. If we can put anything into an AI and then say whatever it spits out is "clean", even though it is obviously imitating what it learned from, whoever puts the investment into trying something new becomes the sucker.

Also, I don't get this weird sense of entitlement people have over someone else's work. Just because it can be copied means it should belong to everyone?


> How is "movies and other great works of art that used to cost tens of millions of dollars to make now cost tens of dollars to make" a bad thing?

It's bad because you will never get an original visual style from now on. Everything will be copy-paste of existing styles, forever.


Can you please explain how you jumped to this conclusion?

I fail to see how artistic expression would cease to be a thing and why people would stop liking novelty. And as long as those are a thing, original styles will also be a thing.

If anything, making the entry barriers lower would result in more original styles, as art is [at least] frequently an evolutionary process, where existing ideas meet novel ones and mix in interesting ways. And entirely novel (from-scratch, if that's a thing) ideas will still keep appearing - if someone thinks of something, they're still free to express themselves, as has always been the case. I cannot think of why people would stop painting with brushes, fingers or anything else.

Art exists because of human nature. Nothing changes in this regard.


I'm sorry, but I do not think I understand the idea why and how Studio Ghibli is being "ripped off" in this scenario.

As I've said, art styles are not considered copyrightable. You say I'm missing the point but I fail to see why. I've used the lack of copyright protection as a reality check, a verifiable fact that can be used to determine the current consensus on the matter. Based on this lack of legal protection, I'm concluding that societies have decided it's not something that needs to be protected, and thus that there is no "ripping off" in replicating a successful style. I have no doubt there are plenty of people who would think otherwise (and e.g. say that the current state of copyright is not optimal - which can be very true), but they need to argue about copyright protections, not technological accessibility. The latter merely exposes the former (by drastically lowering the cost barriers), but is not the root issue.

I also have doubts about your prediction of stagnation, particularly because you seem to ignore the demand side. People want novelty and originality; that has always been the case and always will be (or at least for as long as human nature doesn't change). Things will change for sure (they always do), but I don't think stagnation is a realistic scenario.


From the way Satya explains it, you can’t, but Microsoft plans to.

I only worry about organizing the things that cannot be forgotten, like customer communication and feature requests. For customer communication I use Intercom, and for features I use Linear. I also use Linear to plan out the product roadmap and organize my thoughts on why I am building certain features.

Everything else is a bit chaotic because there is so much to do. In the morning I take care of any customer communication that happened overnight, and always prioritize taking care of the customers first throughout the day. That way nothing is forgotten and if you answer customers quickly you can usually resolve matters quickly instead of them dragging on for multiple days.

For features, I prioritize anything that is blocking my sales pipeline, then features that were requested by customers, and after both of those are taken care of I focus on the larger long-term projects.

If there is any time left over, that is where the marketing takes place. This usually only happens once or twice a year for a day. Not saying it's right, but it is the reality. I always have big plans for marketing, but just need to get that new large system pushed first to really tie the messaging together.


Build something that there is already a market for but focus on a specific niche of customers and build it only for them. There are a lot of businesses that struggle to get software to work the way they need it to. For example, inventory tracking for flower shops.


Fully agree with what you said. The challenge is finding businesses that struggle so hard, and have an urgent need, such that they are willing to pay for a product with rough edges built by a rando dude. And for there to be enough revenue potential to sustain myself (I live in the UK).

Continuing on the example - a flower shop probably manages their inventory via spreadsheets. They're not happy with it, but I doubt they'll fork out $600/year for an inventory tracking app. And I think it'll be very hard for me to get in touch with flower shop owners to validate the idea - they live in a very different world to mine. So the only path to success is to publish an app on the Apple App Store and hope it takes off.

That's similar to what I ended up doing with the last startup project I had, unfortunately it didn't work out.


You said you worked in sales, right? In order to validate your idea you need to write a list of every flower shop in town and visit them in person. 90% will be too busy, but the rest will drop priceless info.

Sounds like your main problem is that you're not going after a specific problem people have actually expressed, but rather one you think would work.


I've tried that. I had an idea for a process checklist for GitHub Actions pipelines. Came up with it after seeing people complain online, also had the issue myself. I reached out to everyone I could find who complained about it. The 30% who replied said that they wouldn't be interested in trying out a solution.

I've come to realize that just because people express problems, doesn't mean they'll do something about it. You need to find something where the pain is urgent, and unsurmountable. And that's very tricky.


>> You need to find something where the pain is urgent, and unsurmountable. And that's very tricky.

As you noted earlier, the tech is the easy part. As you note here, finding customers is the hard part.

Naturally, doing the easy part first can often turn out to be a waste of time.

You talked above about "your ideas". Don't start there. Start by walking around talking to people. Start in your local community. You're looking for pain, and a willingness to pay. Yes, it's the hard part. Yes it's work.


You need to find hair on fire problems. The classic vitamin vs. pain killer. You need to be a pain killer.


I highly recommend the book "The Mom Test" for tips on how to speak to customers and what kind of information you can get.

Just a few of the things I learnt from that book:

Don't ask if people would use your product. It's too hypothetical and people are too nice. "Sure, I'd probably use it. Looks useful. Keep working on it!". Then you never hear from them again. If you really think they'd be interested, ask for money.

Don't ask about people's problems in an abstract way. "Tell me about your business problems" is just too vague and wishy-washy. Ask about more specific things. Example: "You're having issues with double booking? Tell me about the last time it happened." Asking about the most recent time something happened is a great way to get concrete details.

Ask people how they have solved the problems or tried to solve the problem they talk about. "It was a major issue so we hired an extra member of staff to confirm all the stock lists". (Great, they're willing to spend real money on fixing this). "I searched around and tried out ABCSoft and XYZSoft but they were designed for a different kind of business and didn't really solve it for us" (pretty good, they're putting effort into finding solutions). "Oh, well, it's always been a big issue but we haven't tried to solve it yet" (Uh-oh, it's probably not really a big problem for them. It's not painful enough for them to try to solve).


I'm trying to switch to using preview environments right now to optimize my git workflow and it is a bit of a bummer that I can't just assign preview environments their own environment group. I am still trying to wrap my head around the best way to do this because I have a lot of environment variables.

I also have my object storage with Digital Ocean at the moment. Would love to have that all under one roof.


Preview environments need some polish. We'll get there.


A lot of people here seem to think this is somehow for their benefit, or that OpenAI and friends are trying to make something useful for the average person. They aren't spending billions of dollars to give everyone a personal assistant. They are spending billions now to save even more in wages later, and we are paying for the privilege of training their AI to do it. By the time this thing is useful enough to actually be a personal assistant, they will have released that capability in a model that is far too expensive for the average person.


This seems unreasonably pessimistic (or unreasonably optimistic about OpenAI's moat?). There are so, so many companies competing in this space. The cost will reflect the price of the hardware needed to run it: if it doesn't, they'll just lose to one of their many competitors who offer something similar for cheaper, e.g. whatever DeepSeek or Meta releases in the same space, with the cost driven to the bottom by commoditized inference companies like Together and Fireworks. And hardware cost goes down over time: even if it's unaffordable at launch, it won't be in five years.

They're not even the first movers here: Anthropic's been doing this with Claude for a few months now. They're just the first to combine it with a reasoning-style model, and I'd expect Anthropic to launch a similar model within the next few months if not sooner, especially now that there's been open-source replication of o1-style reasoning with DeepSeek R1 and the various R1-distills on top of Llama and Qwen.


And none of the competitors can make this technology profitable, either.


Isn't there every reason to believe the cost will come down?


Is there actually reason to believe costs will come down significantly? I've been under the impression that companies like OpenAI and Google have been selling this stuff at well below cost to drive adoption, with the idea that over time efficiency improvements would make the economics work, but those improvements don't seem to be materializing. I'm not particularly informed on this, so I'd love to hear a more informed take.


The costs for OpenAI and Google aren't public, but if you look at the open-source models, inference is very cheap: for example, you can generally beat the public serverless prices by a factor of ~2 by using dedicated GPUs [1], and given that a 70b model costs about $1/million tokens serverless — and tends to perform similarly to 4o on benchmarks — OpenAI is most likely getting very fat profit margins at $2.50/million input tokens and $10/million output tokens.
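As a back-of-the-envelope check on those figures (the serving cost is an assumption based on the open-weights serverless price above, not a known OpenAI number):

    # Rough margin estimate for a typical request, using the numbers above.
    serving_cost_per_m = 1.00    # assumed $ per million tokens (70b-class, serverless)
    price_input_per_m = 2.50     # 4o list price, $ per million input tokens
    price_output_per_m = 10.00   # 4o list price, $ per million output tokens

    input_tokens, output_tokens = 1000, 500   # a typical-ish chat request
    revenue = (input_tokens * price_input_per_m + output_tokens * price_output_per_m) / 1e6
    cost = (input_tokens + output_tokens) * serving_cost_per_m / 1e6
    print(f"revenue ${revenue:.4f}, cost ${cost:.4f}, gross margin {1 - cost / revenue:.0%}")
    # => revenue $0.0075, cost $0.0015, gross margin 80%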

The problem for them is making enough money for the training runs (where it seems like their strategy is to raise money on the hope they achieve some kind of runaway self-improving effect that grants them an effective monopoly on the leading models, combined with regulatory pushes to ban their competitors) — but it seems very unlikely to me that they're losing money serving the models.

1: https://fireworks.ai/blog/why-gpus-on-demand


The cost of what? Training a model, or serving a trained model? The costs of both benefit from economies of scale. If I had what OpenAI has, I could imagine how to make it profitable tomorrow - and the fact that I could is exactly why they HAVE to make it free without an account: to prevent anyone new from meaningfully entering the $0 to $20/mth segment. They already know nobody can compete with the most advanced model.

If you look at their business strategy, it's top notch: anchor pricing at $200 with the $20 sweet spot, and it probably costs them on average $5/mth to serve the $20/mth customers. Take your $50m-a-year marketing budget and use it to buy servers, run a highly optimized "good enough" model that is basically just Wikipedia as a chatbot, and you don't need to spend a dime on marketing if you don't want to - an amazing top of funnel for the rest of your product line. I believe Sam when he says they're losing money on the $200/mth product, but it makes the $20/mth product look so good...

They're really playing business very well.


NNs don't benefit from economies of scale - or rather, specifically, from the way a majority of low-utilization users can subsidize high-utilization users. In the NN world, every new free-tier user adds the same additional performance demand as the previous free users; every free-user query needs a lot of compute.

So, for example, say there is a ratio of 10% paid users to 90% free users (just random numbers, not real). If they want more revenue they want to add more paid users, for example to double them. But this means the free users roughly double too, and every real free user requires a lot of compute for their queries. Nothing can be cached, because all the queries are different. There is no way to meaningfully offer "limited" features, because the main feature is the LLM; maybe it is previous-gen and a little cheaper to run, but not much. They can't offer a model that's too old, because competitors will offer better quality and win.

So there is no realistic way to bring costs down. Analysts forecast that OAI actually needs to increase prices a lot to meet its targets, or it needs a constant financial intravenous line, like the $500B announced by Trump.
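A toy illustration of that ratio argument, with made-up numbers (only the $20 subscription is real; the sign and size of the result are not claims, just an illustration of how the scaling works):

    # Made-up numbers to illustrate why adding users doesn't improve the
    # economics when every user, free or paid, consumes similar compute.
    paid_share = 0.10              # assumed: 10% paid, 90% free
    price_per_paid = 20.0          # $/month subscription
    compute_per_user = 5.0         # assumed $/month of inference per active user

    def monthly_margin(total_users: int) -> float:
        revenue = total_users * paid_share * price_per_paid
        cost = total_users * compute_per_user   # free users burn compute too
        return revenue - cost

    # Doubling the user base doubles revenue AND cost; per-user economics never improve.
    print(monthly_margin(1_000_000), monthly_margin(2_000_000))
    # => -3000000.0 -6000000.0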


The data is the moat.


Not for much longer, perhaps not even now. There's plenty of data available to anyone, and people are finding ways to use that data more effectively.

Mid-term, I believe the only real moat is going to be human labor - that is, RLHF and other funny acronyms that boil down to getting people to chat with the model and rate how they feel about its answers.

Software improvements (architecture, training process, inference) are always one public paper or leak away from being available for free to anyone. Hardware improvements will spread too, because NVIDIA et al. would prefer to sell more chips rather than fewer. Meanwhile, human labor is notoriously expensive, only getting more expensive as people's economic conditions improve, and most importantly, whatever "spark" of human intelligence/consciousness there is, this is where it cannot be automated away - not until we get to human-level AGI.

Human labor is the one thing that you can only scale by throwing more money at it - which is why modern businesses seek to remove it from the equation as much as possible. Hell, the whole pursuit of AGI is in big part motivated by hope of eliminating labor costs entirely. Except, in this one pursuit, until AGI is reached, labor is a critical resource that has no substitute.

That's my mid-term prediction. Long-term, we'll hit AGI and moats won't matter anymore.


> There's plenty of data avaliable to anyone

I think you are just wrong here and so everything that follows is wishful.


Wikipedia exists and can be downloaded, and the full archive includes talk pages, CC-BY-SA: https://en.wikipedia.org/wiki/Wikipedia:Database_download

"Plenty" may be vague, but it's not wrong.


Does wikipedia help bots make coherent decisions for my personal grocery shopping based on my own habits, price point, allergies, and preferences?


Why wouldn't it? Are your allergies not listed, is your local economy not explained and thus median expectations of price sensitivity not inferrable, are the words you use to describe your preferences and habits so out of distribution?


Considering that OpenAI is barely able to stay ahead of open-source, and in some cases has fallen behind (i.e. 4o vs DeepSeek V3), I think the data moat doesn't really exist. And the difference between 4o and o1 isn't training on more data: quite famously it's scaling test-time compute, not scaling training.

"The data is the moat" was a pretty common belief a few years ago, but not anymore.


And you think they throw in $500B because they are altruistic?


I think it's less a problem of cost for the average person and more a problem of setting the market price for them at a fraction of the current one. This has such a deflationary impact that it's unlikely to be captured, or even conceived of, by current economic models.

There's a problem of "target fixation" about the capabilities and it captures most conversation, when in fact most public focus should be on public policy and ensuring this has the impact that the society wants.

IMO whether things are going to be good or bad depends on having a shared understanding, thinking, discussion and decisions around what's going to happen next.


Exactly, every country should urgently have a public debate on how best to use this technology and make sure it's beneficial to society as a whole. Social media is a good example of how a technology can have a net negative impact if we don't deploy it carefully.


OK, this conversation about social media has cropped up time and time again, and things haven't improved but have gotten even worse. I don't expect we'll be able to solve this problem with discussion alone; so much money is being poured in that any discussion is likely to be completely ignored. I'm not saying that we shouldn't discuss this, but more action is needed. I think the tech sector needs to be stripped of political power, as it has gotten way too powerful and is interfering with everything else.


I agree, though while everyone is having public debates, these companies are already in there greasing palms. I personally think the fact we are allowing them to extract so much value from our collective work is perverse. What I find even more sickening is how many people are cheering them on.

Let them make their AI if we have to. Let them use it to cure cancer and whatever other disease, but I don't think we should be allowing it to be used for commercial purposes.


For better or worse, there's a system and a range of possibilities, and any actionable steps need to be within the realm of this reality, regardless of how anyone feels about it.

Public information, and the ability for the public to analyze, understand and eventually decide what's best for them, is by and large the most relevant aspect. Your decisions are drastically different if you learn something can or cannot be avoided.

You can't disallow commercial purposes. You can't even realistically enforce property rights for illegal training data, but maybe you can argue that the totality of human knowledge should go towards the benefit of humans, regardless of who organizes it.

However, there's a lot that can be done, like understanding the implications of the (close to) zero-sum game that's about to happen and whether they are solvable within the current framework, without a first-principles approach.

Ultimately, it's a resource ownership and resource utilization efficiency game. Everyone's resource ownership can't be drastically changed, but their resource utilization efficiency can, as long as the implications are made clear.


I think this is a misread of the economics. Human level AI will be expensive at first, but eventually nearly free. OpenAI will have no say in whether this happens. Competition between AI firms means OpenAI has no pricing power. And cost decreases from hardware and software (for a fixed level of intelligence) will allow competition to deliver those lower costs.

This won't mean humans can't earn wages by selling their labor. But it will mean that human intellectual labor will not be valuable in the labor market. Humans will only earn an income by differentiated activity. Initially that will be manual labor. Once robotics catches up, probably only labor tied to people's personality and humanness.


Don't worry, it'll never be good enough to actually be a personal assistant.


Not this version, but in 3 years time. Promise.

Just keeping sending us money...


Same as self-driving cars 10 years ago? Yeah...


Self-driving cars have actually made progress. Better sensors, better programming, etc. Tesla can't do it but clearly Waymo can. It's not perfect, but with enough time and effort it can get to the point where it'll be regularly usable in most cases.

But LLMs? Those have already scraped all the data they're going to, and bigger models have less and less impact. They're about as good as they're ever going to be.


Another angle is that they might use it to scrape more content for internal use. That’s on top of the training data that the user interactions generate. Which was your point, I believe.


>we are paying for the privilege of training their AI

this was, is, and is going to be a constant thing with every AI company


Everything they’ve released is affordable to the average person. Why would this one be different?


I think the answer here speaks to the intentions of these companies. The focus is on having the AI act like a human would in order to cut humans out of the equation.



