If o1-pro is 10% better than Claude, but you are a guy who makes $300,000 per year, but now can make $330,000 because o1-pro makes you more productive, then it makes sense to give Sam $2,400.
The above example makes no sense: it starts by saying o1-pro is 10% better than Claude, then pivots to treating it as a 10% total productivity enhancer. Which is it?
It's never this clean, but it is directionally correct. If I make $300k/year, and I can tell that ChatGPT already saves me hours or even days per month, $200 is a laughable amount. If I feel like Pro is even slightly better, it's worth $200 just to know that I always have the best option available.
Heck, it's probably worth $200 even if I'm not confident it's better just in case it is.
For the same reason I don't start with the cheapest AI model when asking questions and then switch to the more expensive if it doesn't work. The more expensive one is cheap enough that it doesn't even matter, and $200 is cheap enough (for a certain subsection of users) that they'll just pay it to be sure they're using the best option.
That's only true if your time is metered by the hour, and the vast majority of roles that find some benefit from AI, at this time, are not compensated hourly. This plan might be beneficial to e.g. CEO-types, but then I question who at OpenAI thought it would be a good idea to lead their 12 days of hollowhype with this launch, unless this is the highest-impact release they've got (one hopes it is not).
>This plan might be beneficial to e.g. CEO-types, but then I question who at OpenAI thought it would be a good idea to lead their 12 days of hollowhype with this launch, unless this is the highest-impact release they've got (one hopes it is not).
In previous multi-day marketing campaigns I've run or helped run (specifically on well-loved products), we've intentionally announced a highly-priced plan early on without all of its features.
Two big benefits:
1) Your biggest advocates get to work justifying the plan/product as-is, anchoring expectations to the price (which already works well enough to convert a slice of potential buyers)
2) Anything you announce afterward now gets seen as either a bonus on top (e.g. if this $200/mo plan _also_ includes Sora after they announce it...), driving value per price up compared to the anchor; OR you're seen as listening to your audience's criticisms ("this isn't worth it!") by adding more value to compensate.
I work from home and my time is accounted for by way of my productive output because I am very far away from a CEO type. If I can take every Wednesday off because I’ve gained enough productivity to do so, I would happily pay $200/mo out of my own pocket to do so.
$200/user/month isn’t even that high of a number in the enterprise software world.
Employers might be willing to get their employees a subscription if they believe it makes the employees they are paying $$$$$ X% more productive (where X% of their salary works out to more than $2,400/year).
There is only so much time in the day. If you have a job where increased productivity translates to increased income (not just hourly metered jobs), then you will see a benefit.
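A rough back-of-the-envelope version of that break-even, with an assumed salary figure purely for illustration:

```
# Break-even productivity gain for a $200/mo subscription.
# The $150k salary is an assumption for illustration, not a figure from the thread.
subscription_cost = 200 * 12          # $2,400 per year
salary = 150_000                      # assumed fully-loaded annual cost

break_even_pct = subscription_cost / salary * 100
print(f"Pays for itself above a {break_even_pct:.1f}% productivity gain")  # ~1.6%
```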
> cheapest AI model when asking questions and then switch to the more expensive if it doesn't work.
The thing is, more expensive isn't guaranteed to be better. The more expensive models are better most of the time, but not all the time. I talk about this more in this comment https://news.ycombinator.com/item?id=42313401#42313990
Since LLMs are non-deterministic, there is no guarantee that GPT-4o is better than GPT-4o mini. GPT-4o is most likely going to be better, but sometimes the simplicity of GPT-4o mini makes it better.
As you say, the more expensive models are better most of the time.
Since we can't easily predict which model will actually be better for a given question at the time of asking (we could try, but that would be a complex and expensive endeavor), it makes sense to stick with the most expensive/powerful models. Both weak and powerful models are already too cheap to meter in direct, regular use, and you're always going to come out ahead with the more powerful ones, per the very definition of "most of the time", so it doesn't make sense to default to a weaker model.
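For reference, the escalation strategy being argued against looks something like this sketch; the awkward part is the quality check, which is exactly the "can't easily predict" problem. Model names, the openai client usage, and the heuristic are all assumptions:

```
# Sketch of "start cheap, escalate if it doesn't work" - the strategy argued
# against above.  Model names and the quality check are illustrative assumptions.
from openai import OpenAI

client = OpenAI()
MODELS = ["gpt-4o-mini", "gpt-4o", "o1-preview"]  # cheapest to most expensive

def good_enough(answer: str) -> bool:
    # Placeholder heuristic; reliably judging answer quality is the hard part,
    # which is one reason defaulting to the strongest model is simpler.
    return len(answer) > 200

def ask(question: str) -> str:
    answer = ""
    for model in MODELS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],
        )
        answer = resp.choices[0].message.content or ""
        if good_enough(answer):
            break
    return answer
```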
TBH it's easily in the other direction. If I can get something to clients quicker that's more valuable.
If paying this gets me two days of consulting it's a win for me.
Obvious caveat if cheaper setups get me the same, although I can't spend too long comparing or that time alone will cost more than just buying everything.
The number of times I've heard all this about some other groundbreaking technology... most businesses just went meh and moved on. But for self-employed, if those numbers are right, it may make sense.
It's not worth it if you're a W2 employee and you'll just spend those 2 hours doing other work. Realistically, working 42 hours a week instead of 40 will not meaningfully impact your performance, so doing 42 hours a week of work in 40 won't, either.
I pay $20/mo for Claude because it's been better than GPT for my use case, and I'm fine paying that but I wouldn't even consider something 10x the price unless it is many, many times better. I think at least 4-5x better is when I'd consider it and this doesn't appear to be anywhere close to even 2x better.
That's also not how pricing works; it's about perceived incremental increases in how useful it is (marginal utility), not about the actual additional money you make.
Yeah, the $200 seems excessive and annoying, until you realise it depends on how much it saves you. For me it needs to save me about 6 hours per month to pay for itself.
Funny enough I've told people that baulk at the $20 that I would pay $200 for the productivity gains of the 4o class models. I already pay $40 to OpenAI, $20 to Anthropic, and $40 to cursor.sh.
Ah yes, you must work at the company where you get paid per line of code. There's no way productivity is measured that accurately and rewarded that directly in any job, unless you are self-employed and get paid per website or something.
Being in an AI domain does not invalidate the fundamental logic. If an expensive tool can make you productive enough to offset the cost, then the tool is worth it for all intents and purposes.
I think of them as different people -- I use them in "ensemble mode" for coding. The workflow is Claude 3.5 by default; when Claude is spinning, o1-preview to discuss, then Claude to implement. Worst case, o1-preview to implement -- I think its natural coding style is slightly better than Claude's, but the speed difference isn't worth it.
The intersection of problems I have where both have trouble is pretty small. If this closes the gap even more, that's great. That said, I'm curious to try this out -- the ways in which o1-preview fails are a bit different than prior gpt-line LLMs, and I'm curious how it will feel on the ground.
Okay, tried it out. Early indications - it feels a bit more concise, thank god, certainly more concise than 4o -- it's s l o w. Getting over 1m times to parse codebases. There's some sort of caching going on though, follow up queries are a bit faster (30-50s). I note that this is still superhuman speeds, but it's not writing at the speed Groqchat can output Llama 3.1 8b, that is for sure.
Code looks really clean. I'm not instantly canceling my subscription.
When you say "parse codebases" is this uploading a couple thousand lines in a few different files? Or pasting in 75 lines into the chat box? Or something else?
$ find web -type f \( -name '*.go' -o -name '*.tsx' \) | tar -cf code.tar -T -; cat code.tar | pbcopy
Then I paste it in and say "can you spot any bugs in the API usage? Write out a list of tasks for a senior engineer to get the codebase in basically perfect shape," or something along those lines.
Alternately: "write a go module to support X feature, and implement the react typescript UI side as well. Use the existing styles in the tsx files you find; follow these coding guidelines, etc. etc."
I pay for both GPT and Claude and use them both extensively. Claude is my go-to for technical questions, GPT (4o) for simple questions, internet searches and validation of Claude answers. GPT o1-preview is great for more complex solutions and work on larger projects with multiple steps leading to finish. There’s really nothing like it that Anthropic provides.
But $200/mo is way above what I’m willing to pay.
I have several local models I hit up first (Mixtral, Llama), if I don’t like the results then I’ll give same prompt to Claude and GPT.
Overall though it’s really just for reference and/or telling me about some standard library function I didn’t know of.
Somewhat counterintuitively I spend way more time reading language documentation than I used to, as the LLM is mainly useful in pointing me to language features.
After a few very bad experiences I never let an LLM write more than a couple lines of boilerplate for me, but as well-read assistants they are useful.
But none of them are sufficient alone, you do need a "team" of them - which is why I also don't see the value in spending this much on one model. I'd spend that much on a system that polled 5 models concurrently and came up with a summary of sorts.
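A minimal sketch of that "poll several models concurrently and summarize" idea, assuming the openai and anthropic Python clients; the model names are placeholders, not recommendations:

```
# Sketch: ask a few models the same question concurrently, then have one of them
# summarize the answers.  Clients and model names are assumptions.
from concurrent.futures import ThreadPoolExecutor

import anthropic
from openai import OpenAI

openai_client = OpenAI()
anthropic_client = anthropic.Anthropic()

def ask_openai(model: str, question: str) -> str:
    resp = openai_client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": question}]
    )
    return resp.choices[0].message.content or ""

def ask_claude(question: str) -> str:
    msg = anthropic_client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=1024,
        messages=[{"role": "user", "content": question}],
    )
    return msg.content[0].text

def poll_and_summarize(question: str) -> str:
    with ThreadPoolExecutor() as pool:
        answers = list(pool.map(lambda f: f(), [
            lambda: ask_openai("gpt-4o", question),
            lambda: ask_openai("o1-preview", question),
            lambda: ask_claude(question),
        ]))
    combined = "\n\n---\n\n".join(answers)
    return ask_openai(
        "gpt-4o",
        f"Question: {question}\n\nAnswers from several assistants:\n{combined}\n\n"
        "Summarize where they agree and where they disagree.",
    )
```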
People keep talking about using LLMs for writing code, and they might be useful for that, but I've found them much more useful for explaining human-written code than anything else, especially in languages/frameworks outside my core competency.
E.g. "why does this (random code in a framework I haven't used much) code cause this error?"
About 50% of the time I get a helpful response straight away that saves me trawling through Stack Overflow and random blog posts. About 25% of the time the response is at least partially wrong, but it still helps me get on the right track.
25% of the time the LLM has no idea and won't admit it so I end up wasting a small amount of time going round in circles, but overall it's a significant productivity boost when I'm working on unfamiliar code.
Right on, I like to use local models - even though I also use OpenAI, Anthropic, and Google Gemini.
I often use one- or two-shot examples in prompts, but with small local models it is also fairly simple to do fine-tuning, if you have fine-tuning examples and you're enough of a developer to get the training data into the correct format; the correct format changes for different models you are fine-tuning.
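As a concrete example of the format problem: OpenAI-style chat fine-tuning expects JSONL with a messages array per line, while local models usually want their own schema or prompt template. A minimal sketch, with made-up example data:

```
# Sketch: writing fine-tuning examples as JSONL in the OpenAI-style chat format.
# Local models (Llama, Mistral, etc.) typically want a different schema or prompt
# template, so this exact shape is only one of several you may need.
import json

examples = [  # made-up training pairs
    {"prompt": "Summarize: the meeting moved to Friday.",
     "completion": "Meeting moved to Friday."},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        record = {"messages": [
            {"role": "system", "content": "You are a terse summarizer."},
            {"role": "user", "content": ex["prompt"]},
            {"role": "assistant", "content": ex["completion"]},
        ]}
        f.write(json.dumps(record) + "\n")
```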
> But none of them are sufficient alone, you do need a “team” of them
Given the sensitivity to parameters and prompts the models have, your "team" can just as easily be querying the same LLM multiple times with different system prompts.
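Something like this sketch, where the "team" is one model behind a few different system prompts; the personas and model name are made up for illustration:

```
# Sketch: a "team" made from one model by varying the system prompt.
# The personas and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()
PERSONAS = [
    "You are a cautious reviewer who hunts for bugs and edge cases.",
    "You are an architect who proposes the simplest workable design.",
    "You are a performance engineer who cares mostly about hot paths.",
]

def team_opinions(question: str, model: str = "gpt-4o") -> list[str]:
    opinions = []
    for system_prompt in PERSONAS:
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": question},
            ],
        )
        opinions.append(resp.choices[0].message.content or "")
    return opinions
```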
I haven't used ChatGPT in a few weeks now. I still maintain subscriptions to both ChatGPT and Claude, but I'm very close to dropping ChatGPT entirely. The only useful things it provides over Claude are a decent mobile voice mode and web search.
If you don't want to necessarily have to pick between one or the other, there are services like this one that let you basically access all the major LLMs and only pay per use: https://nano-gpt.com/
I've used TypingMind and it's pretty great, I like the idea of just plugging in a couple API keys and paying a fraction, but I really wish there was some overlap.
If a random query via the API costs a fifth of a cent, why can't I get 10 free API calls w/ my $20/mo premium subscription?
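The "fifth of a cent" arithmetic is roughly this; the per-token prices and token counts below are assumptions for illustration, not current list prices:

```
# Rough per-query cost estimate.  Prices and token counts are assumptions for
# illustration, not current published rates.
price_in_per_1m = 0.15    # $ per 1M input tokens (small-model tier, assumed)
price_out_per_1m = 0.60   # $ per 1M output tokens (assumed)

input_tokens, output_tokens = 2_000, 500
cost = input_tokens / 1e6 * price_in_per_1m + output_tokens / 1e6 * price_out_per_1m
print(f"~${cost:.4f} per query")  # well under a cent
```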
I'm in the same boat — I maintain subscriptions to both.
The main thing I like OpenAI for is that when I'm on a long drive, I like to have conversations with OpenAI's voice mode.
If Claude had a voice mode, I could see dropping OpenAI entirely, but for now it feels like the subscriptions to both is a near-negligible cost relative to the benefits I get from staying near the front of the AI wave.
I've heard so much about Claude and decided to give it a try, and it has been rather a major disappointment. I ended up using ChatGPT as an assistant for Claude's code writing because it just couldn't get things right. Had to cancel my subscription; no idea why people still promote it everywhere like it is 100x better than ChatGPT.
I've heard this a lot and so I switched to Claude for a month and was super disappointed. What are you mainly using ChatGPT for?
Personally, I found Claude marginally better for coding, but far, far worse for just general purpose questions (e.g. I'm a new home owner and I need to winterize my house before our weather drops below freezing. What are some steps I should take or things I should look into?)
It's ironic because I never want to ask an LLM for something like your example general purpose question, where I can't just cheaply and directly test the correctness of the answer
But we're hurtling towards all the internet's answers to general purpose questions being SEO spam that was generated by an LLM anyways.
Since OpenAI probably isn't hiring as many HVAC technicians to answer queries as they are programmers, it feels like we're headed towards a death spiral where the only options for generic knowledge questions off the beaten path will be either having the LLM do actual research from non-SEO-affected primary sources, or finding a human who's done that research.
-
Actually to test my hypothesis I just tried this with ChatGPT with internet access.
The list of winterization tips cited an article that felt pretty "delvey". I searched the author's name and their LinkedIn profile is about how they professionally write marketing content (nothing about HVAC), one of their accomplishments is Generative AI, and their likes feed is full of AI mentions for writing content.
So ChatGPT is already at a place where when it searches for "citations", it's just spitting back out its own uncited answers above answers by actual experts (since the expert sources aren't as SEO-driven)
> I can't just cheaply and directly test the correctness of the answer
I feel that, but I think for me the key is knowing that LLMs can be wrong and I should treat the answer as a starting point and not an actual expert. I find it really helpful for topics where I don't even know where to start because, like you said, most search engines are utter trash now.
For things like that, I find ChatGPT to be a good jumping-off point. For example, this is what I got when I asked:
```
Preparing your townhouse for winter involves addressing common issues associated with the region's wet and cool climate. Here's a concise checklist to help you get started:
1. Exterior Maintenance
Roof Inspection: Check for damaged or missing shingles to prevent leaks during heavy rains.
Gutter Cleaning: Remove leaves and debris to ensure proper drainage and prevent water damage.
Downspouts: Ensure they direct water away from the foundation to prevent pooling and potential leaks.
Siding and Trim: Inspect for cracks or gaps and seal them to prevent moisture intrusion.
2. Windows and Doors
Weatherstripping: Install or replace to seal gaps and prevent drafts, improving energy efficiency.
Caulking: Apply around window and door frames to block moisture and cold air.
3. Heating System
Furnace Inspection: Have a professional service your furnace to ensure it's operating efficiently.
Filter Replacement: Change furnace filters to maintain good air quality and system performance.
4. Plumbing
Outdoor Faucets: Disconnect hoses and insulate faucets to prevent freezing.
Pipe Insulation: Insulate exposed pipes, especially in unheated areas, to prevent freezing and bursting.
5. Landscaping
Tree Trimming: Prune branches that could break under snow or ice and damage your property.
Drainage: Ensure the yard slopes away from the foundation to prevent water accumulation.
6. Safety Checks
Smoke and Carbon Monoxide Detectors: Test and replace batteries to ensure functionality.
Fireplace and Chimney: If applicable, have them inspected and cleaned to prevent fire hazards.
By addressing these areas, you can help protect your home from common winter-related issues in Seattle's climate.
```
Once I dove into the links ChatGPT provided I found the detail I needed and things I needed to investigate more, but it saved 30 minutes of pulling together a starting list from the top 5-10 articles on Google.
Claude Sonnet 3.5 has outperformed o1 in most tasks based on my own anecdotal assessment. So much so that I'm debating canceling my ChatGPT subscription. I just literally do not use it anymore, despite being a heavy user for a long time in the past
Is a "reasoning" model really different? Or is it just clever prompting (and feeding previous outputs) for an existing model? Possibly with some RLHF reasoning examples?
OpenAI doesn't have a large enough database of reasoning texts to train a foundational LLM off it? I thought such a db simply does not exist as humans don't really write enough texts like this.
It's trained via reinforcement learning on essentially infinite synthetic reasoning data. You can generate infinite reasoning data because there are infinite math and coding problems that can be created with machine-checkable solutions, and machines can make infinite different attempts at reasoning their way to the answer. Similar to how models trained to learn chess by self-play have essentially unlimited training data.
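A toy illustration of what "machine-checkable" means here, assuming nothing about OpenAI's actual pipeline: generate arithmetic problems, sample reasoning attempts, and keep only the traces whose final answer verifies. The sampler below is a stand-in for an LLM call:

```
# Toy sketch of machine-checkable synthetic reasoning data.  sample_reasoning()
# is a stand-in for an LLM call; everything here is illustrative, not OpenAI's
# actual pipeline.
import random

def make_problem() -> tuple[str, int]:
    a, b = random.randint(10, 99), random.randint(10, 99)
    return f"What is {a} * {b}?", a * b

def sample_reasoning(question: str, truth: int) -> tuple[str, int]:
    # Stand-in for "model writes a chain of thought and a final answer";
    # sometimes right, sometimes wrong.
    answer = truth if random.random() < 0.3 else truth + random.randint(1, 9)
    return f"Working through '{question}' step by step... answer: {answer}", answer

def collect_training_data(n_problems: int, attempts: int) -> list[tuple[str, str]]:
    kept = []
    for _ in range(n_problems):
        question, truth = make_problem()
        for _ in range(attempts):
            trace, answer = sample_reasoning(question, truth)
            if answer == truth:          # the machine-checkable reward signal
                kept.append((question, trace))
                break
    return kept

print(len(collect_training_data(100, attempts=4)), "verified traces kept")
```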
We don't know the specifics of GPT-o1 to judge, but we can look at open weights model for an example. Qwen-32B is a base model, QwQ-32B is a "reasoning" variant. You're broadly correct that the magic, such as it is, is in training the model into a long-winded CoT, but the improvements from it are massive. QwQ-32B beats larger 70B models in most tasks, and in some cases it beats Claude.
I just tried QwQ 32B, I didn't know about it. I used it to generate some code that GPT had generated perfectly, without even sweating, two days ago.
QwQ generated 10 pages of its reasoning steps, and the code is probably not correct. [1] includes both answers from QwQ and GPT.
Breaking its reasoning steps down into such excruciatingly detailed prose is certainly not user friendly, but it is intriguing. I wonder what an ideal use case for it would be.
To my understanding, Anthropic realizes that they can’t compete in name recognition yet, so they have to overdeliver in terms of quality to win the war. It’s hard to beat the incumbent, especially when “chatgpt’ing” is basically a well understood verb.
They don't have a model that does o1-style "thought tokens" or is specialized for math, but Sonnet 3.6 is really strong in other ways. I'm guessing they will have an o1-style model within six months if there's demand
Same. Honestly if they released a $200 a month plan I’d probably bite, but OpenAI hasn’t earned that level of confidence from me yet. They have some catching up to do.
I am incredibly doubtful that this new GPT is 10x Claude unless it is embracing some breakthrough, secret, architecture nobody has heard of.