If o1-pro is 10% better than Claude, but you are a guy who makes $300,000 per year, but now can make $330,000 because o1-pro makes you more productive, then it makes sense to give Sam $2,400.
The above example makes no sense: it starts by saying o1-pro is 10% better than Claude, then pivots to treating it as a 10% total productivity enhancer. Which is it?
It's never this clean, but it is directionally correct. If I make $300k/year, and I can tell that ChatGPT already saves me hours or even days per month, $200 is a laughable amount. If I feel like Pro is even slightly better, it's worth $200 just to know that I always have the best option available.
Heck, it's probably worth $200 even if I'm not confident it's better just in case it is.
For the same reason I don't start with the cheapest AI model when asking questions and then switch to the more expensive if it doesn't work. The more expensive one is cheap enough that it doesn't even matter, and $200 is cheap enough (for a certain subsection of users) that they'll just pay it to be sure they're using the best option.
That's only true if your time is metered by the hour, and the vast majority of roles that find some benefit from AI, at this time, are not compensated hourly. This plan might be beneficial to e.g. CEO-types, but then I question who at OpenAI thought it would be a good idea to lead their 12 days of hollowhype with this launch, unless this is the highest-impact release they've got (one hopes it is not).
>This plan might be beneficial to e.g. CEO-types, but then I question who at OpenAI thought it would be a good idea to lead their 12 days of hollowhype with this launch, unless this is the highest-impact release they've got (one hopes it is not).
In previous multi-day marketing campaigns I've run or helped run (specifically on well-loved products), we've intentionally announced a highly-priced plan early on without all of its features.
Two big benefits:
1) Your biggest advocates get to work justifying the plan/product as-is, anchoring expectations to the price (which already works well enough to convert a slice of potential buyers)
2) Anything you announce afterward now gets seen as either a bonus on top (e.g. if this $200/mo plan _also_ includes Sora after they announce it...), driving value per price up compared to the anchor; OR you're seen as listening to your audience's criticisms ("this isn't worth it!") by adding more value to compensate.
I work from home and my time is accounted for by way of my productive output because I am very far away from a CEO type. If I can take every Wednesday off because I’ve gained enough productivity to do so, I would happily pay $200/mo out of my own pocket to do so.
$200/user/month isn’t even that high of a number in the enterprise software world.
Employers might be willing to get their employees a subscription if they believe it makes the employees they are paying $$$$$ X% more productive (where X% of their salary works out to more than $2,400/year).
There is only so much time in the day. If you have a job where increased productivity translates to increased income (not just hourly metered jobs), then you will see a benefit.
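A rough back-of-the-envelope version of that break-even, with an assumed salary figure purely for illustration:

```
# Break-even productivity gain for a $200/mo subscription.
# The $150k salary is an assumption for illustration, not a figure from the thread.
subscription_cost = 200 * 12          # $2,400 per year
salary = 150_000                      # assumed fully-loaded annual cost

break_even_pct = subscription_cost / salary * 100
print(f"Pays for itself above a {break_even_pct:.1f}% productivity gain")  # ~1.6%
```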
> cheapest AI model when asking questions and then switch to the more expensive if it doesn't work.
The thing is, more expensive isn't guaranteed to be better. The more expensive models are better most of the time, but not all the time. I talk about this more in this comment https://news.ycombinator.com/item?id=42313401#42313990
Since LLMs are non-deterministic, there is no guarantee that GPT-4o is better than GPT-4o mini. GPT-4o is most likely going to be better, but sometimes the simplicity of GPT-4o mini makes it better.
As you say, the more expensive models are better most of the time.
Since we can't easily predict which model will actually be better for a given question at the time of asking (we could try, but that would be a complex and expensive endeavor), it makes sense to stick with the most expensive/powerful models. Both weak and powerful models are already too cheap to meter in direct, regular use, and you're always going to come out ahead with the more powerful ones, per the very definition of "most of the time", so it doesn't make sense to default to a weaker model.
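For reference, the escalation strategy being argued against looks something like this sketch; the awkward part is the quality check, which is exactly the "can't easily predict" problem. Model names, the openai client usage, and the heuristic are all assumptions:

```
# Sketch of "start cheap, escalate if it doesn't work" - the strategy argued
# against above.  Model names and the quality check are illustrative assumptions.
from openai import OpenAI

client = OpenAI()
MODELS = ["gpt-4o-mini", "gpt-4o", "o1-preview"]  # cheapest to most expensive

def good_enough(answer: str) -> bool:
    # Placeholder heuristic; reliably judging answer quality is the hard part,
    # which is one reason defaulting to the strongest model is simpler.
    return len(answer) > 200

def ask(question: str) -> str:
    answer = ""
    for model in MODELS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],
        )
        answer = resp.choices[0].message.content or ""
        if good_enough(answer):
            break
    return answer
```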
TBH it's easily in the other direction. If I can get something to clients quicker that's more valuable.
If paying this gets me two days of consulting it's a win for me.
Obvious caveat if cheaper setups get me the same, although I can't spend too long comparing or that time alone will cost more than just buying everything.
The number of times I've heard all this about some other groundbreaking technology... most businesses just went meh and moved on. But for self-employed, if those numbers are right, it may make sense.
It's not worth it if you're a W2 employee and you'll just spend those 2 hours doing other work. Realistically, working 42 hours a week instead of 40 will not meaningfully impact your performance, so doing 42 hours a week of work in 40 won't, either.
I pay $20/mo for Claude because it's been better than GPT for my use case, and I'm fine paying that but I wouldn't even consider something 10x the price unless it is many, many times better. I think at least 4-5x better is when I'd consider it and this doesn't appear to be anywhere close to even 2x better.
That's also not how pricing works; it's about perceived incremental increases in how useful it is (marginal utility), not about the actual additional money you make.
Yeah, the $200 seems excessive and annoying, until you realise it depends on how much it saves you. For me it needs to save me about 6 hours per month to pay for itself.
Funny enough I've told people that baulk at the $20 that I would pay $200 for the productivity gains of the 4o class models. I already pay $40 to OpenAI, $20 to Anthropic, and $40 to cursor.sh.
Ah yes, you must work at the company where you get paid per line of code. There's no way productivity is measured that accurately and rewarded that directly in any job, unless you are self-employed and get paid per website or something.
Being in an AI domain does not invalidate the fundamental logic. If an expensive tool can make you productive enough to offset the cost, then the tool is worth it for all intents and purposes.
I think of them as different people -- I use them in "ensemble mode" for coding. The workflow is Claude 3.5 by default; when Claude is spinning, o1-preview to discuss, then Claude to implement. Worst case, o1-preview to implement -- I think its natural coding style is slightly better than Claude's, but the speed difference isn't worth it.
The intersection of problems I have where both have trouble is pretty small. If this closes the gap even more, that's great. That said, I'm curious to try this out -- the ways in which o1-preview fails are a bit different than prior gpt-line LLMs, and I'm curious how it will feel on the ground.
Okay, tried it out. Early indications - it feels a bit more concise, thank god, certainly more concise than 4o -- it's s l o w. Getting over 1m times to parse codebases. There's some sort of caching going on though, follow up queries are a bit faster (30-50s). I note that this is still superhuman speeds, but it's not writing at the speed Groqchat can output Llama 3.1 8b, that is for sure.
Code looks really clean. I'm not instantly canceling my subscription.
When you say "parse codebases" is this uploading a couple thousand lines in a few different files? Or pasting in 75 lines into the chat box? Or something else?
$ find web -type f \( -name '*.go' -o -name '*.tsx' \) | tar -cf code.tar -T -; cat code.tar | pbcopy
Then I paste it in and say "can you spot any bugs in the API usage? Write out a list of tasks for a senior engineer to get the codebase in basically perfect shape," or something along those lines.
Alternately: "write a go module to support X feature, and implement the react typescript UI side as well. Use the existing styles in the tsx files you find; follow these coding guidelines, etc. etc."
I pay for both GPT and Claude and use them both extensively. Claude is my go-to for technical questions, GPT (4o) for simple questions, internet searches and validation of Claude answers. GPT o1-preview is great for more complex solutions and work on larger projects with multiple steps leading to finish. There’s really nothing like it that Anthropic provides.
But $200/mo is way above what I’m willing to pay.
I have several local models I hit up first (Mixtral, Llama), if I don’t like the results then I’ll give same prompt to Claude and GPT.
Overall though it’s really just for reference and/or telling me about some standard library function I didn’t know of.
Somewhat counterintuitively I spend way more time reading language documentation than I used to, as the LLM is mainly useful in pointing me to language features.
After a few very bad experiences I never let an LLM write more than a couple lines of boilerplate for me, but as well-read assistants they are useful.
But none of them are sufficient alone, you do need a "team" of them - which is why I also don't see the value in spending this much on one model. I'd spend that much on a system that polled 5 models concurrently and came up with a summary of sorts.
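A minimal sketch of that "poll several models concurrently and summarize" idea, assuming the openai and anthropic Python clients; the model names are placeholders, not recommendations:

```
# Sketch: ask a few models the same question concurrently, then have one of them
# summarize the answers.  Clients and model names are assumptions.
from concurrent.futures import ThreadPoolExecutor

import anthropic
from openai import OpenAI

openai_client = OpenAI()
anthropic_client = anthropic.Anthropic()

def ask_openai(model: str, question: str) -> str:
    resp = openai_client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": question}]
    )
    return resp.choices[0].message.content or ""

def ask_claude(question: str) -> str:
    msg = anthropic_client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=1024,
        messages=[{"role": "user", "content": question}],
    )
    return msg.content[0].text

def poll_and_summarize(question: str) -> str:
    with ThreadPoolExecutor() as pool:
        answers = list(pool.map(lambda f: f(), [
            lambda: ask_openai("gpt-4o", question),
            lambda: ask_openai("o1-preview", question),
            lambda: ask_claude(question),
        ]))
    combined = "\n\n---\n\n".join(answers)
    return ask_openai(
        "gpt-4o",
        f"Question: {question}\n\nAnswers from several assistants:\n{combined}\n\n"
        "Summarize where they agree and where they disagree.",
    )
```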
People keep talking about using LLMs for writing code, and they might be useful for that, but I've found them much more useful for explaining human-written code than anything else, especially in languages/frameworks outside my core competency.
E.g. "why does this (random code in a framework I haven't used much) code cause this error?"
About 50% of the time I get a helpful response straight away that saves me trawling through Stack Overflow and random blog posts. About 25% of the time the response is at least partially wrong, but it still helps me get on the right track.
25% of the time the LLM has no idea and won't admit it so I end up wasting a small amount of time going round in circles, but overall it's a significant productivity boost when I'm working on unfamiliar code.
Right on, I like to use local models - even though I also use OpenAI, Anthropic, and Google Gemini.
I often use one- or two-shot examples in prompts, but with small local models it is also fairly simple to do fine-tuning, if you have fine-tuning examples and you're enough of a developer to get the training data into the correct format; the correct format changes for different models you are fine-tuning.
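As a concrete example of the format problem: OpenAI-style chat fine-tuning expects JSONL with a messages array per line, while local models usually want their own schema or prompt template. A minimal sketch, with made-up example data:

```
# Sketch: writing fine-tuning examples as JSONL in the OpenAI-style chat format.
# Local models (Llama, Mistral, etc.) typically want a different schema or prompt
# template, so this exact shape is only one of several you may need.
import json

examples = [  # made-up training pairs
    {"prompt": "Summarize: the meeting moved to Friday.",
     "completion": "Meeting moved to Friday."},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        record = {"messages": [
            {"role": "system", "content": "You are a terse summarizer."},
            {"role": "user", "content": ex["prompt"]},
            {"role": "assistant", "content": ex["completion"]},
        ]}
        f.write(json.dumps(record) + "\n")
```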
> But none of them are sufficient alone, you do need a “team” of them
Given the sensitivity to parameters and prompts the models have, your "team" can just as easily be querying the same LLM multiple times with different system prompts.
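Something like this sketch, where the "team" is one model behind a few different system prompts; the personas and model name are made up for illustration:

```
# Sketch: a "team" made from one model by varying the system prompt.
# The personas and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()
PERSONAS = [
    "You are a cautious reviewer who hunts for bugs and edge cases.",
    "You are an architect who proposes the simplest workable design.",
    "You are a performance engineer who cares mostly about hot paths.",
]

def team_opinions(question: str, model: str = "gpt-4o") -> list[str]:
    opinions = []
    for system_prompt in PERSONAS:
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": question},
            ],
        )
        opinions.append(resp.choices[0].message.content or "")
    return opinions
```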
I haven't used ChatGPT in a few weeks now. I still maintain subscriptions to both ChatGPT and Claude, but I'm very close to dropping ChatGPT entirely. The only useful things it provides over Claude are a decent mobile voice mode and web search.
If you don't want to necessarily have to pick between one or the other, there are services like this one that let you basically access all the major LLMs and only pay per use: https://nano-gpt.com/
I've used TypingMind and it's pretty great, I like the idea of just plugging in a couple API keys and paying a fraction, but I really wish there was some overlap.
If a random query via the API costs a fifth of a cent, why can't I get 10 free API calls w/ my $20/mo premium subscription?
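The "fifth of a cent" arithmetic is roughly this; the per-token prices and token counts below are assumptions for illustration, not current list prices:

```
# Rough per-query cost estimate.  Prices and token counts are assumptions for
# illustration, not current published rates.
price_in_per_1m = 0.15    # $ per 1M input tokens (small-model tier, assumed)
price_out_per_1m = 0.60   # $ per 1M output tokens (assumed)

input_tokens, output_tokens = 2_000, 500
cost = input_tokens / 1e6 * price_in_per_1m + output_tokens / 1e6 * price_out_per_1m
print(f"~${cost:.4f} per query")  # well under a cent
```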
I'm in the same boat — I maintain subscriptions to both.
The main thing I like OpenAI for is that when I'm on a long drive, I like to have conversations with OpenAI's voice mode.
If Claude had a voice mode, I could see dropping OpenAI entirely, but for now it feels like the subscriptions to both is a near-negligible cost relative to the benefits I get from staying near the front of the AI wave.
I've heard so much about Claude and decided to give it a try, and it has been rather a major disappointment. I ended up using ChatGPT as an assistant for Claude's code writing because it just couldn't get things right. Had to cancel my subscription; no idea why people still promote it everywhere like it is 100x better than ChatGPT.
I've heard this a lot and so I switched to Claude for a month and was super disappointed. What are you mainly using ChatGPT for?
Personally, I found Claude marginally better for coding, but far, far worse for just general purpose questions (e.g. I'm a new home owner and I need to winterize my house before our weather drops below freezing. What are some steps I should take or things I should look into?)
It's ironic because I never want to ask an LLM for something like your example general purpose question, where I can't just cheaply and directly test the correctness of the answer
But we're hurtling towards all the internet's answers to general purpose questions being SEO spam that was generated by an LLM anyways.
Since OpenAI probably isn't hiring as many HVAC technicians to answer queries as they are programmers, it feels like we're headed towards a death spiral where the only options for generic knowledge questions off the beaten path will be either having the LLM do actual research from non-SEO-affected primary sources, or finding a human who's done that research.
-
Actually to test my hypothesis I just tried this with ChatGPT with internet access.
The list of winterization tips cited an article that felt pretty "delvey". I searched the author's name and their LinkedIn profile is about how they professionally write marketing content (nothing about HVAC), one of their accomplishments is Generative AI, and their likes feed is full of AI mentions for writing content.
So ChatGPT is already at a place where when it searches for "citations", it's just spitting back out its own uncited answers above answers by actual experts (since the expert sources aren't as SEO-driven)
> I can't just cheaply and directly test the correctness of the answer
I feel that, but I think for me the key is knowing that LLMs can be wrong and I should treat the answer as a starting point and not an actual expert. I find it really helpful for topics where I don't even know where to start because, like you said, most search engines are utter trash now.
For things like that, I find ChatGPT to be a good jumping-off point. For example, this is what I got when I asked:
```
Preparing your townhouse for winter involves addressing common issues associated with the region's wet and cool climate. Here's a concise checklist to help you get started:
1. Exterior Maintenance
Roof Inspection: Check for damaged or missing shingles to prevent leaks during heavy rains.
Gutter Cleaning: Remove leaves and debris to ensure proper drainage and prevent water damage.
Downspouts: Ensure they direct water away from the foundation to prevent pooling and potential leaks.
Siding and Trim: Inspect for cracks or gaps and seal them to prevent moisture intrusion.
2. Windows and Doors
Weatherstripping: Install or replace to seal gaps and prevent drafts, improving energy efficiency.
Caulking: Apply around window and door frames to block moisture and cold air.
3. Heating System
Furnace Inspection: Have a professional service your furnace to ensure it's operating efficiently.
Filter Replacement: Change furnace filters to maintain good air quality and system performance.
4. Plumbing
Outdoor Faucets: Disconnect hoses and insulate faucets to prevent freezing.
Pipe Insulation: Insulate exposed pipes, especially in unheated areas, to prevent freezing and bursting.
5. Landscaping
Tree Trimming: Prune branches that could break under snow or ice and damage your property.
Drainage: Ensure the yard slopes away from the foundation to prevent water accumulation.
6. Safety Checks
Smoke and Carbon Monoxide Detectors: Test and replace batteries to ensure functionality.
Fireplace and Chimney: If applicable, have them inspected and cleaned to prevent fire hazards.
By addressing these areas, you can help protect your home from common winter-related issues in Seattle's climate.
```
Once I dove into the links ChatGPT provided I found the detail I needed and things I needed to investigate more, but it saved 30 minutes of pulling together a starting list from the top 5-10 articles on Google.
Claude Sonnet 3.5 has outperformed o1 in most tasks based on my own anecdotal assessment. So much so that I'm debating canceling my ChatGPT subscription. I just literally do not use it anymore, despite being a heavy user for a long time in the past
Is a "reasoning" model really different? Or is it just clever prompting (and feeding previous outputs) for an existing model? Possibly with some RLHF reasoning examples?
OpenAI doesn't have a large enough database of reasoning texts to train a foundational LLM off it? I thought such a db simply does not exist as humans don't really write enough texts like this.
It's trained via reinforcement learning on essentially infinite synthetic reasoning data. You can generate infinite reasoning data because there are infinite math and coding problems that can be created with machine-checkable solutions, and machines can make infinite different attempts at reasoning their way to the answer. Similar to how models trained to learn chess by self-play have essentially unlimited training data.
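A toy illustration of what "machine-checkable" means here, assuming nothing about OpenAI's actual pipeline: generate arithmetic problems, sample reasoning attempts, and keep only the traces whose final answer verifies. The sampler below is a stand-in for an LLM call:

```
# Toy sketch of machine-checkable synthetic reasoning data.  sample_reasoning()
# is a stand-in for an LLM call; everything here is illustrative, not OpenAI's
# actual pipeline.
import random

def make_problem() -> tuple[str, int]:
    a, b = random.randint(10, 99), random.randint(10, 99)
    return f"What is {a} * {b}?", a * b

def sample_reasoning(question: str, truth: int) -> tuple[str, int]:
    # Stand-in for "model writes a chain of thought and a final answer";
    # sometimes right, sometimes wrong.
    answer = truth if random.random() < 0.3 else truth + random.randint(1, 9)
    return f"Working through '{question}' step by step... answer: {answer}", answer

def collect_training_data(n_problems: int, attempts: int) -> list[tuple[str, str]]:
    kept = []
    for _ in range(n_problems):
        question, truth = make_problem()
        for _ in range(attempts):
            trace, answer = sample_reasoning(question, truth)
            if answer == truth:          # the machine-checkable reward signal
                kept.append((question, trace))
                break
    return kept

print(len(collect_training_data(100, attempts=4)), "verified traces kept")
```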
We don't know the specifics of GPT-o1 to judge, but we can look at open weights model for an example. Qwen-32B is a base model, QwQ-32B is a "reasoning" variant. You're broadly correct that the magic, such as it is, is in training the model into a long-winded CoT, but the improvements from it are massive. QwQ-32B beats larger 70B models in most tasks, and in some cases it beats Claude.
I just tried QwQ 32B, I didn't know about it. I used it to generate some code that GPT had generated perfectly, without even sweating, two days ago.
QwQ generated 10 pages of its reasoning steps, and the code is probably not correct. [1] includes both answers from QwQ and GPT.
Breaking its reasoning steps down into such excruciatingly detailed prose is certainly not user friendly, but it is intriguing. I wonder what an ideal use case for it would be.
To my understanding, Anthropic realizes that they can’t compete in name recognition yet, so they have to overdeliver in terms of quality to win the war. It’s hard to beat the incumbent, especially when “chatgpt’ing” is basically a well understood verb.
They don't have a model that does o1-style "thought tokens" or is specialized for math, but Sonnet 3.6 is really strong in other ways. I'm guessing they will have an o1-style model within six months if there's demand
Same. Honestly if they released a $200 a month plan I’d probably bite, but OpenAI hasn’t earned that level of confidence from me yet. They have some catching up to do.
I am incredibly doubtful that this new GPT is 10x Claude unless it is embracing some breakthrough, secret, architecture nobody has heard of.