Claude 3.5 Sonnet (thezvi.substack.com)
301 points by elsewhen 4 months ago | 204 comments



This article was written before Anthropic added the Projects feature for Pro and Team users [1].

I spent some time yesterday experimenting with Projects, and, like Artifacts, it looks really useful. I like the idea of being able to have multiple projects going simultaneously, each with its own reference materials. I don’t need to use it in a team, but I can see how that could be useful, too.

The one problem I see is that the total context window for each project might start to seem too small pretty quickly. I assume, though, that Anthropic’s context windows will be getting larger as time goes on.

I wonder what other features Anthropic has in the works for Claude. My personal wish is for a voice interface, something like what OpenAI announced in May but has now put off until later this year.

[1] https://www.anthropic.com/news/projects


From my reading there's nothing here that's specific to Anthropic, right? Any app like Chatbox, Msty or other could implement their own version of Projects based on any of the available models. (Maybe even Notion...) The API for creating an assistant with a pre-baked context is also available.

Or did I miss something unique there?


These sorts of basic features seem like the hardest thing for AI companies. OpenAI only recently added a way to search all your chats. Gemini still doesn't have that.


where can I search for a chat in chatgpt? I miss that feature. Can't find it


I think it might only be available in the mac app.


Which is still horrifically slow and unoptimized. I’m not sure why streaming tokens from an api into an electron app is difficult but OpenAI managed to bungle it somehow.


it's a native Swift app


Interesting, and even wilder that it’s buggy like that. I get that desktop apps are probably not OpenAI’s expertise but the app is unfortunately pretty unusable most of the time as of the writing of this comment. You would think for the amount of money they pay their engineers something better would have been released.


Unusable in what way? I don't personally find the macOS app to be "unusable" at all.

Also, you may be underestimating how buggy Swift and the rest of Apple's stack are. It's hard to get those bugs resolved unless you happen to work at Apple. Thus, a lot of time is spent working around bugs up the stack. So I don't find it surprising that a company moving fast like OpenAI ships _some_ bugs. The mac app just came out this month? Give it time.


If you use the shortcut currently and paste any decent size prompt the app eventually grinds to a halt. Seems like a memory-leak like issue


And Claude's text box still doesn't respond to the End key last I checked.

It's odd that things like that are what they're missing.


I'm currently building an open source interface for LLMs but hadn't heard about Chatbox or Msty, thanks for sharing.


see also Machato


No, I think you’re right. It’s just a matter of convenient tooling. I have been doing similar tasks with ChatGPT and Gemini, but, in the standard web interfaces, I end up uploading the same reference files multiple times and I tend to lose track of previously used prompts and unfinished projects. Anthropic’s Projects should make it easier for me to pursue multiple projects with different contexts at the same time.

The fact that Claude 3.5 Sonnet also seems smarter than the other current flagship models makes the Projects feature that much more attractive.

Maybe if I were smarter I wouldn’t find much use for Projects.


I'm far more excited for Projects than the other stuff. OpenAI's Memories is their lock-in for me; I'd have switched to Claude if Anthropic had something better. Projects sounds like an even better approach.


What are you using memories for? Openai told me it memorized some stuff, but it's never used it in a useful way nor do I see how it would.


Game dev. It keeps tabs on the type of game it is, background, genre, classes, theme. Notably the difference between one decision and another - it takes note when I say "Instead of Y, do X". Like if I show a preference for Chinese names, it'll give me more of that.


Claude 3.5 Sonnet's coding abilities are incredibly impressive. I think it lets an expert programmer move more than twice as fast. There are limits - to produce high quality code, not copy-and-paste pablum, you have to be able to give detailed step-by-step directions and critically evaluate the results. This means you can't produce code better than you would have written by yourself, you can only do it much faster.

As an experiment, I produced a set of bindings to Anthropic's API pair-programming with Claude. The project is of pretty good quality, and includes advanced features like streaming and type-safe definitions of tools. More than 95% of the code and docs was written by Claude, under close direction from me. The project is here:

https://github.com/cortesi/misanthropy

And I've shared part of the conversation that produced it in a video here:

https://twitter.com/cortesi/status/1806135130446307340
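Not from that project itself, but for a flavor of what such bindings wrap, here is a minimal sketch of a streaming Messages API call using Anthropic's official Python SDK (the model id, prompt, and parameters are just illustrative examples):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # Stream the response token-by-token instead of waiting for the full completion.
    with client.messages.stream(
        model="claude-3-5-sonnet-20240620",   # example model id
        max_tokens=1024,
        messages=[{"role": "user", "content": "Explain streaming APIs in one paragraph."}],
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)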


The biggest challenge I’ve felt is finding modern, idiomatic solutions for framework code. All LLMs seem to be victim to this issue where they will propose a mixture of old anti-patterns, roundabout approaches that technically work but make little sense, and mixing framework patterns with non-framework patterns in a way that leads to problematically incohesive code when taken together.

For example, I'm an experienced backend programmer but have been using Claude 3.5 Sonnet and GPT-4 to ask questions about a frontend I'm building in TS using Svelte (which I am not very proficient in). The LLMs frequently mix up TS and JS and server-side and client-side approaches, recommend old and deprecated approaches, and mix in patterns from other frameworks (e.g. React) when an idiomatic approach does exist. The biggest risk is when, in my ignorance, I do not detect that this is happening.

It’s been functional enough to push a hacky prototype out (where it would take me probably months longer to learn and do this otherwise), but the code quality and organization of the project is certainly pretty low.


I find LLM coding much less useful when it’s interacting with a large existing codebase. It’s certainly good at one-off type code and greenfield projects (especially if similar to other open source stuff). And it’s also good at getting started if you aren’t an expert yourself.


We haven't found this to be an impediment. Keep things modular, and share the type definitions of anything you import with the model. As the benefits here become more and more clear tooling will improve and people will adapt their development practices to get the most out of the models.
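In practice, "sharing the type definitions" can be as simple as pasting the relevant stubs into the prompt. A rough sketch of the idea (the helper, file name, and prompt wording are all made up for illustration):

    # Hypothetical helper: build a prompt that carries the type definitions
    # of everything the target module imports, so the model sees the real interfaces.
    def build_prompt(task: str, type_defs: list[str]) -> str:
        context = "\n\n".join(type_defs)
        return (
            "You are editing one module of a larger codebase.\n"
            "Here are the type definitions of everything it imports:\n\n"
            f"{context}\n\n"
            f"Task: {task}\n"
            "Only use the interfaces shown above."
        )

    prompt = build_prompt(
        "Add a retry wrapper around ApiClient.fetch",
        type_defs=[open("api_client.pyi").read()],  # illustrative file name
    )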


I have been developing in Python for 20 years now. Claude 3.5 is the first AI that is "smart" enough to help me. I usually do not need help with easy tasks, but with complex ones. Claude is not perfect, but it definitely gives a productivity boost even to the most seasoned developers, for problems that in the past would have meant obscure mailing lists and source-code reading.


My 2p worth: my work involves a lot of summarisation and recommendation from a user preference statement. I've been able to do this with 4o/Opus, but the consistency wasn't there, which required complex prompting chains to stabilise.

What I'm seeing with Sonnet 3.5 is a night-and-day step up in consistency. The responses don't seem to be that different in capability of opus / 4o when they respond well, it just does it with rock-solid consistency. That sounds a bit dull, but it's a huge step forward for me and I suspect for others.


Found same.-

PS. Consistency is everything sometimes.-


PS. I realized the above advice is not consistent.-


Can anyone comment on its coding ability?

Considering cancelling my subscription with OpenAI as I was previously using GPT-4 quite heavily as a multiplier for myself, guiding it and editing outputs as required, but GPT-4o feels significantly worse for this use case. It is certainly better in many other areas, but its coding ability is not great.

I tried to revert back to standard GPT-4 but it is now so slow to respond (higher load?) that it breaks my mental flow, so I'm exploring other options.


The first night it was released, it one-shotted merging two Python files together (one was an incomplete mess of speech-to-text, the other was an also-incomplete Claude API demo with a fake loop and forced conversation paths). It rewrote the fake loop and turned it into a voice interface that worked on the first try (then I asked it to add text-to-speech and it did). The second night it improved the code by adding threads (which I do know about but wouldn't have thought to use there) and it worked better. Then Linux decided not to work properly (PulseAudio and ALSA issues); I asked it to help me migrate to PipeWire and it gave me step-by-step directions, and when I had issues all I had to do was post the error and it fixed things enough that it's all working again. It's been a pretty amazing week from my perspective. I finally had to take a break earlier this week, but I still use Claude 3.5 for gardening (taking pictures and asking for help) and general advice.


Would you please describe your setup for coding? I'm interested in trying this out, but I need an ELI5. Thanks.

PS Anybody is welcome to jump in with a description of their setup.


There's not really anything to it. As I mostly use the console.anthropic.com workbench (the pay-as-you-go API), I haven't played around much with Artifacts--I am probably getting a Claude Pro subscription later today. But yeah, I just copy/pasted my two code files into the window, told it I wanted a conversational voice interface, and in less time than it took to describe it, it wrote it and even saved a bunch of output tokens by saying where to copy large parts of the first file's boilerplate code. That had been a project I started around Easter and had kept putting off because it felt too hard for my somewhat limited time. But there's no moat or special setup to this stuff; anybody who can communicate in a text box can productively use this tool.


I'm currently testing the Cursor IDE [1]. It's VS Code with nice AI integrations made with great UX in mind. They integrate with ChatGPT and Claude Sonnet/Opus. In my experience (Rust/TypeScript), the Sonnet 3.5 model gives better results.

You can pass the current file, or multiple files or your whole project as context to the models.

[1] https://cursor.com


It's probably not going to work well: he fixed a demo of the Claude API with Claude. You didn't think of asking the AI, so you need to work on your own inference abilities first.


I just copy and paste snippets and ask gpt to fix it or translate it


Sourcegraph Cody uses Claude


Gardening? The vision stuff for plants has always been terrible guesswork. Is it actually good/useful now?


It's a start. We just had a sudden storm that severely damaged some tomato plant stalks, and it suggested splinting them, or, if a plant is still having trouble in a couple of days, cutting off the top and planting it in another pot, which is advice I've received from others too. The first picture I sent showed some early blight on the leaves; it didn't get it on the first try (it suggested the stem had edema, which I suppose is possible), but when I mentioned I thought it had a fungal issue it correctly noted the brown leaves and suggested a copper fungicide, among other correct remedies. It also correctly discriminated between two types of cucumber (well, one was an Armenian melon) held by my friend, who had just picked them from her garden. Claude said that the presence of hands suggested the vegetables were home-grown and/or freshly picked, and before that noted that they seemed to be good quality, which was correct. I went into this particular aspect of using Claude thinking it would be useless, but that's not what I have found with Claude 3.5 Sonnet. I'll note the fungal blight was with Claude 3 Opus as well. Hope that helps.


I was trying to get it to create an LLM from scratch using tensorflow.js with training and prediction, it got about 95% of the way, impressive.

But gpt-4o actually got a working solution in a couple prompts -> https://gist.github.com/thomasdavis/fadbca01605fb3cb64911077...

Though the new artefacts is really damn handy, you can describe the most detailed projects, and it does a really great job of what you asked for, and I found it delivered beyond what I wanted.

I am now paying for both -.-

- it's hard to rank which one codes better, but I think Claude has better abstractions

- sometimes I paste the output of one into the other and continue solving there


Yo! Long time no see haha <3

Would love to see your workflow sometime, my experimentations have been small tbh but I was quite impressed with Claude.

Funnily, I think (modulo costs) feeding outputs of one model into another once you hit a brick wall is quite powerful, but feels clunky to do at the moment


(I'm biased as I work for anthropic)

I broke my hand 2 months ago and Claude 3.5 sonnet has been writing all my code for me. It's been awesome


Since you work there, can you ask that they are more transparent with exactly what you are paying for with a "pro" subscription?

Currently you are apparently paying for:

- Everything in Free
- Use Claude 3 Opus and Haiku
- Higher usage limits versus Free
- Create Projects to work with Claude around a set of docs, code, or files
- Priority bandwidth and availability
- Early access to new features

But what are the usage limits? Higher than free by how much?

Having an invisible limit on a paid product really rubs me the wrong way. Maybe some rate-limiting after a certain amount would be better than a hard cutoff, but even then I'd like to know what the limit is before I pay, not when I accidentally hit it in the middle of something important.


I ran into this rate limit recently when generating a logo in SVG. It's 45 messages per 5 hours (with some caveats). It's documented here [0].

[0] https://support.anthropic.com/en/articles/8324991-about-clau...


> Your limit gets used up faster with longer conversations, notably with large attachments. For example, if you upload a copy of The Great Gatsby, you may only be able to send 15 messages in that conversation within 5 hours, as each time you send a message, Claude “re-reads” the entire conversation, including any large attachments.

Seems to also be limited by tokens. It’s still quite obscure.
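A rough way to see why attachments eat the limit so fast: every message re-sends the whole conversation, so cost grows with total conversation length. A back-of-the-envelope sketch (the ~4 characters per token figure is a common rule of thumb, not an official Anthropic number; the file is hypothetical):

    # Back-of-the-envelope estimate, assuming roughly 4 characters per token.
    def estimate_tokens(text: str) -> int:
        return len(text) // 4

    attachment = open("great_gatsby.txt").read()   # hypothetical attachment, roughly 270k characters
    per_message = estimate_tokens(attachment)      # re-read on every turn of the conversation
    print(f"~{per_message:,} tokens are re-processed every time you send a message")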


Even more so:

> Please note that these limits may vary depending on Claude’s current capacity.

Fine for the free tier of course, but not great for the paid version.


With a fixed capacity, someone’s traffic is going to be rejected. I think it makes sense to have a variable rate limit depending on available capacity and demand.


For what it's worth, I haven't run into any limitations while using the API side. Although I also haven't really had to use it so extensively that it would be a problem.


ChatGPT says the same about capacity for Pro I think.


I was literally about to type in my credit card number after seeing the new code editor window UX, but for the life of me I had no idea what I was paying for.

I’ve been paying for GPT since 3.5 debuted and I know what I’m getting - full, unlimited use of the best model. Period.

Anthropic needs to figure out what the hell they are selling.


ChatGPT Plus does NOT give you full unlimited use of the best model. There are still limits.

FWIW I regularly hit my ChatGPT Plus limits, and I think the “dynamic” limiting is regularly in place. I’ve only once hit my Claude Pro limit. I now use Claude more than ChatGPT.

From this page:

https://help.openai.com/en/articles/6950777-what-is-chatgpt-...

As of May 13th 2024, Plus users will be able to send 80 messages every 3 hours on GPT-4o and 40 messages every 3 hours on GPT-4. The message cap for a user in a ChatGPT Team workspace is approximately twice that of ChatGPT Plus.

Please note that unused messages do not accumulate (i.e. if you wait 6 hours, you will not have 80 messages available to use for the next 3 hours on GPT-4).

In certain cases for Plus users, we may dynamically adjust the message limit based on available capacity in order to prioritize making GPT-4 accessible to the widest number of people.


It says 5x the free tier usage, but I agree we could be more explicit about what that is. Just made a ticket :)


+1. More transparency around usage limits would be very helpful. Message-based limits seem arbitrary if different messages are different lengths. Maybe token-based, with a bar that tracks how much you've used; it would go down with use and up with time and capacity, and you'd be able to track and pace usage better than with a warning that you're on your last ten. Or just let us pay more for more usage! As for "switch to the API": the workbench does not have as nice affordances as the chat.


Why do you work? Can't you get a sick leave?


Mind explaining your workflow? I actually just broke my wrist today :(


Oh no! I use Apple voice control into Claude and then copy and paste code back and forth between my IDE and Claude mostly.

For longer writing, I really like going for a walk for 45 minutes, brain-dumping on a topic, and transcribing it. Then I write a brief outline and have Claude fill it out into a document, explicitly only using language that I used in the transcript. Then I edit via voice.


I don't know about the usability for you but if you use vscode already cursor is basically vscode but has access to various llms, including Claude. I started using it recently and really like it.

https://cursor.com


Are you using the claude.ai interface or do you have something bespoke?


Thank you!


I saw the initial 3.5 Sonnet release a few days ago and decided to pull the trigger with canceling OpenAI and moving over, and that was a great decision. I've been having to do more and more intricate nudging and leading towards the correct answer or approach when using ChatGPT of late.

My experience with Claude is very positive when it comes to programming and planning out infrastructure. My only gripe so far has been some ethical constraints that didn't exist for ChatGPT, and that's a big one for me since I don't need Anthropic employees to act as my moral compass. For a specific example, asking about securing code by making decompiling or reading harder is a no-no for Claude, but a-ok for GPT.


You can use OpenAI's developer platform https://platform.openai.com/playground/chat

I haven't used the normal website in a year, it is a way better experience:

- It allows tweaking of the model / temperature.

- You pay only per token used

- It does not use your conversations to train their models

- You can edit both your and the assistant message in the conversation or delete it.

- Allows for easy message replay if the output is not great

Be mindful that you are paying per token for every message in the thread, so if the thread becomes long it gets expensive, but still far more value than subscribing.
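The same pay-as-you-go access is also scriptable through the API directly, if you'd rather not use the playground UI. A minimal sketch with the official openai Python package (the model, temperature, and prompts here are just examples):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.2,  # tweakable, unlike the consumer ChatGPT UI
        messages=[
            {"role": "system", "content": "You are a concise coding assistant."},
            {"role": "user", "content": "Write a one-line Python list comprehension that squares the even numbers."},
        ],
    )
    print(response.choices[0].message.content)
    print(response.usage)  # prompt/completion token counts you are billed for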


The world is changing too fast for knowledge that is nine months old.

------

USER

What is the maximum number of tokens supported by gpt-4o?

ASSISTANT

As of my most recent update in October 2023, GPT-4 does not have an officially designated variant called "gpt-4o". However, OpenAI's GPT-4 model supports different maximum token limits depending on its variant. There are two primary context lengths for GPT-4 models:

1. *GPT-4 with 8K context length*: This variant supports up to 8,000 tokens.
2. *GPT-4 with 32K context length*: This variant supports up to 32,000 tokens.

These limits encompass both input and output tokens in a given interaction. If you intended to refer to a specific or newly introduced variant, such as "gpt-4o," it is essential to look up the most current resources or official announcements from OpenAI, as models and their capabilities can evolve.


I don't know what you're trying to point out here. The models aren't trained on themselves.


ChatGPT and Gemini incorporate web search into their results, so they're not limited to stale information.

If you think you can save money by paying per token in the playground, you should know that you're not getting the same results.

For example, if I ask Gemini the identical question, it gives a reasonable answer, even though Gemini was released before ChatGPT was released. It can look up the current OpenAI documentation and summarize it:

USER

What is the maximum number of tokens supported by gpt-4o?

ASSISTANT

GPT-4o has a context window of 128,000 tokens. This means it can process up to 128,000 tokens between your input and its output combined. However, the maximum number of output tokens (the AI's response) is limited to 4,096 tokens.


I've been playing around with it this week and its coding ability is insane (for a LLM). I've given it some pretty sloppy descriptions about things I want to do and it's managed to figure out exactly how to do it on the first or second try, I'm talking things like building animations in React that cannot be described with text very well. Big pain point is copy and pasting things back and forth to have it edit them. If it was integrated and could see my local files, that would be killer. I know there are various companies working on that, but the jetbrains AI integration for example is garbage compared to the results I get by manually asking claude.

I wasn't worried about how this would affect our industry a few months ago, but this has me reconsidering. It's like a junior engineer that can do most tasks in seconds for a couple of cents.


What worries me is you need that time in the dirt to get a feel for coding as a craft. And at least for me that aspect of knowing the craft helps get my thinking in tune with problem solving in a very productive way.

Coding can be similar to playing an instrument, if you have mastery, it can help you be more expressive with the ideas you already have and lead you to new ones.

Whereas if we take away the craft of coding, I think you end up with the type of code academic labs produce: something that purely starts on a "drawing board", is given to the grad student/intern/LLM to make work, and while it will prove the concept, it won't scale long term, because the intern doesn't know when to spend an extra 30 minutes on a function so that it may be more flexible down the road.


> What worries me is you need that time in the dirt to get a feel for coding as a craft.

I see this sentiment a lot regarding gen AI. And I get it, we need to learn our tools. But this seems like it's saying the only way to learn problem solving is the way you learned it. That's just not true. Everyone learns problem solving differently, and the emerging field of gen AI will figure out its own way. It's a different way of thinking. I see my niece using ChatGPT to make projects I wouldn't have even imagined taking up at her age. Her games work. Who am I to say she isn't learning problem solving? In Hindi we say "pratyaksh ko praman ki kya avashyakta" (what's right in front of you doesn't require proof).


I’d say it’s far more useful than a junior engineer. A junior might know one or two areas well, but Claude 3.5 knows literally every steep learning curve across all engineering professions. Claude may not yet be able to orchestrate a large project, but it can advise on structuring using knowledge drawn from best practices across every conceivable discipline. We may not get “AGI”; we might get something far more useful instead.


I switched to Claude when Opus came out and have never looked back.

Anthropic tech is vastly better at code.


Also, Anthropic Projects is HUGE.

You get to create a set of default instructions for all project prompts to adhere to, like: Use LINQ whenever possible. Consider Fluent capabilities when it will improve developer experience.

Then you can add all of your existing code (though as has been pointed out, connecting to a repo would be ideal).

And then you can have multiple conversations within that project, so you can separate your lines of thinking.


Alas, the same problems exist. Sonnet 3.5 is faster and a bit smarter, but it still cannot follow a longer conversation.

This is the problem that LLMs have not yet solved.


I haven’t had a great experience using GPT-4o for coding; it will sometimes make a mistake, fix the mistake introducing another mistake, then fix that second mistake repeating the first. Or other weird things like that. I love the text to speech, and image generation sometimes comes in handy, but ultimately I prefer Sonnet 3.5 for coding and have recently switched to that.

If OpenAI comes out with a stronger model I’ll prob switch back, I don’t have much brand loyalty in this regard. I could see some features making usage more sticky (memory, projects, etc).


What languages / frameworks do you focus on?

I’ve also had worse results with 4o than 4. It’s faster but it vomits output.

I can’t tell if I need to go tweak memory or if it’s just like this.


I've been using it in https://double.bot (vscode coding copilot I'm hacking on) and it's been pretty great so far. We have support for all the top models (gpt4o, deep seek, opus, etc) so I regularly like to switch between them especially if I see a challenging problem. So far, sonnet 3.5 has the best quality and is the one I use by default.


How does double.bot compare against cursor.sh? I've been using the latter for a while after switching from copilot++ and am pleased.


I switched from a GitHub Copilot subscription to Cody for VS Code, to give the competition a chance.

Cody's chat UI seems pretty good for making comparisons like this. You can set it to whichever LLM you want, including GPT-4o and Claude 3.5 Sonnet.

I haven't used Claude enough to do to a detailed comparison, but GPT4o and Claude 3.5 seem roughly similar for my coding questions.


I use it mostly for coding help, not Python or JS but a mix of C and C++ for my own framework.

It is surprisingly good and helpful. I am still exploring the limits.

Writing doc comments or test cases is much faster and more fun with this kind of tool, but you still have to double check everything as it inevitably makes mistakes, often small and not obvious…


>> GPT-4o feels significantly worse for this use case

Interesting. I felt GPT4 was virtually useless and GPT3.5 was the best, then came along GPT-4o and it instantly became the only version worth using.

I find GPT-4o to be extremely powerful and useful.

What don't you like about it?


To clarify, I think it's extremely powerful and useful too. It's just that I get more misses than hits when using it as a rubber duck and coding assistant.


Hit or miss, with significantly more hits than misses. I was able to build a useful cybersecurity service in about 6 hours leveraging Claude to write most of the code. It has certainly made me more productive.


It's slightly higher ranked than 4o on the lmsys leaderboard, Coding section. But the difference is not big.


I find its coding ability superior to GPT-4 and 4o


Coding ability is off the charts, to the degree that I think in one year software engineering will be considered a dying field.


What will the people who use LLMs to write code be called? What will the field be called?

Where will all the power and computers come from that will replace the tens of millions of software developers?


I feel reasonably confident that most of the people who are making these hyperbolic claims either don't understand programming at a deep enough level to make the claim or are trying to sell you something.


I'll throw my hat behind this horse because, honestly, if I was just learning to code, I would have probably quit by now due to the frequency of these types of comments. LLMs have certainly improved at an impressive rate, and they're fairly decent at producing 'Lego' blocks. However, when it comes to block building—the real meat and potatoes of programming—they're, at best, bad butchers. Building functional systems is hard, and it's simply something LLMs can't do now or perhaps ever. Or I just don't know how to prompt. 50/50.


No, I've got 12+ years of experience in FAANG software engineering. I really believe the end is near


FWIW, very few people will actually care about this statement unless you state your name, work you've done, etc. A lot of users here don't pay heed to anonymous appeals to authority.


And who would actually verify the output of the LLM works as expected? Unit tests? Who would verify the unit tests are testing what needs to be tested instead of just branch/output coverage? I've used these tools personally, and I've used GitHub Copilot at the company I work at, and at no point was I under the impression LLMs can do all of my work. Each step needs verification that it actually gives what you need, and some steps need a good description from someone who knows what the system is doing before the changes can be made where they are needed. Continuous extension of features without good design patterns in place, usually checked by devs, can also create a lot of problems. So what makes you think that in one year SWE will be considered a dying field? What I see is a need for better devs who can see beyond boilerplate and be empowered by LLMs; that doesn't mean a dying field, it means raising the standards to break into the field.


Yeah, good idea to put everyone out of their jobs and hand it to 2 companies. Then what happens when those companies decide to raise prices by 1000x (Twitter API?)? Or what happens when your system has a bug and does not work and the AI says it's all good? Who will inspect your code to find what's wrong?


Here we go again. If you seriously think that then you should probably consider changing careers.


As someone building an AI company right now, my quick Pro/Con for 4o vs Claude 3.5:

Claude: subjectively sounds more human to me, and really nails data questions that 4o is lackluster at

4o: far better assistant logic reasoning. I can trivially break Claude's assistant (system prompt) instructions within the user prompt, where 4o succeeds in all of these tests.

Pricing and output speed, for our purposes, are functionally identical. Exciting to have a competitor in the space already who stands to keep openai honest.


Aha, so I’m not the only one. For both Claude 3 Opus and 3.5 Sonnet, anecdotally its language is far more natural. So much so that I prefer it over 4o.


And hum, what incredible problem are you solving at your AI company? Must be the forefront of human innovation ! I hope it's porn.


> You can say ‘the recent jumps are relatively small’ or you can notice that (1) there is an upper bound at 100 rapidly approaching for this set of benchmarks, and (2) the releases are coming quickly one after another and the slope of the line is accelerating despite being close to the maximum.

The graph does not look like it is accelerating. I actually struggle to imagine what about it convinced the author the progress is accelerating.

I would be very interested in a more detailed graph that shows individual benchmarks because it should be possible to see some benchmarks effectively be beaten and get a good idea of where all of the other benchmarks are on that trend. The 100 % upper bound is likely very hard to approach, but I don't know if the limit is like 99%, 95% or 90% for most benchmarks.


I heard a theory today that hitting 100% on the MMLU benchmark may be impossible due to errors in that benchmark itself - if there are errors in the benchmark no model should ever be able to score 100% on it.

The same problem could well be present in other benchmarks as well.


I think this is what they meant: https://imgur.com/a/GWqfp9U

If you take the upper bounds at any given point in time, the rate of increase of the best models over time is accelerating.


I took it to mean progress is increasing, not that the rate of progress is increasing. A classic case of "acceleration misuse", but nothing more.


Apparently my account was banned on Anthropic Sonnet after an "automatic review". I'm 100% sure I did not make any "unsafe" queries; I've literally only briefly tested it, and that was weeks ago.

+1 OpenAI Subscription -1 Anthropic Sonnet->sudden-death-automatic-review-system


You can fill in the form here: https://support.anthropic.com/en/articles/8241253-trust-and-... and eventually you get unbanned.

Unfortunately, by the time Anthropic's next model is out you are likely to be banned by an "automatic review" again. At least that's my experience.


I was banned too. I filled in the form 2-3 times already, but I never heard back from them.

My working theory is that I was banned due to using a VPN (Mullvad) with a location set to Europe, at a time when Europe users were not allowed to use the app.

I wasn’t actually in Europe, to be clear, but I live in a country where even WhatsApp voice calls are blocked and so I routinely have my VPN turned on.

The country I live in is officially supported by Anthropic, and so is Europe these days, so it’s quite frustrating that they won’t unban me.

I can’t use ChatGPT and Perplexity either when I have my VPN turned on, but at least they don’t ban my account.

Fortunately, Poe is VPN friendly.


Their support sucks, tbh. Months ago I used my phone number with my work Gmail and wanted to move it to my personal Gmail so I could subscribe (phone numbers can only be used on one account for some reason); I never got a response to the request.


I had that too, no explanation, no response via email support, nothing. I can't give money to them knowing my account can just get banned at anytime while I might have an active subscription going on.


Had the same issue, filled out the form others linked, didn’t hear back via email, but my account was unblocked after ~3 days.


Mine too! And I didn’t even get to use it once. I’ve filled in the form, let’s see if they lift the ban.


You may be able to create a new account from a different email?


Accounts are tied to your phone number.


So when someone lets (or has) their phone service get disconnected and the company recycles that phone number shortly afterwards, the person who then gets this number is out of luck if they want to use Claude?


Which are locked to the first account verified with them.


I don't need an AI to write code for me, but it is _astoundingly_ helpful to have it summarize various design options and new technology stacks without me having to scavenge Google for the obscure corner-cases I care about.

I have an idea for a project that involves streaming 3 gigabits of data per second from a USB 3.0 device out over a 10 gig Ethernet connection, and it was able to compare/contrast various levels of support for high-bandwidth USB 3 and Ethernet in multiple frameworks and languages.

And the whole conversation, with code examples, cost me 3 _cents_ of Anthropic credits.

My new fear is when people start asking AIs "Hey AI, here is my codebase, my org chart, and commit histories for all my employees - how can I reduce the number of humans I need to employ to get this project done?"


Examples like this make me think, “maybe the S&P 500 will be at 15,000 by 2030.” I mean, in my own job, Claude is making me substantially better in a wide range of areas, from sales negotiation to managing my team. Add that up across millions of people…


> it was able to compare/contrast various levels of support for high-bandwidth USB 3 and Ethernet in multiple frameworks and languages.

How do you know such information is correct though?


Some of it I was able to verify immediately, some I confirmed by poking around in Github repos once I knew the magic words, some I haven't verified yet.

For example, I wasn't sure if there was good support for USB devices in Rust. Just not something I'd ever bothered to investigate. Claude knew that there was a libusb wrapper and a "rusb" crate, and claimed that support for USB in Rust was stable. I verified the existence of the crates, and I'll take Claude at its word for stability unless I discover otherwise.


The same way as when you learn something new from some random internet person.

Does what they say conflict with anything you already know to be true?

Is their argument internally inconsistent, does it contradict itself?

Can you corroborate parts of their argument from other independent sources?


The key difference is that I can reasonably trust (depending on the audience) that the other human is not simply hallucinating facts in order to provide a response.


Not only do humans invent stuff that didn't happen (hallucination), we're also prewired with a very long list of cognitive biases.

It depends a lot on the domain of course, but I'd bet that frontier LLMs already exhibit superhuman capabilities in providing accurate answers in the vast majority of domains.


The fact that we can notice our own biases and correct them is irrelevant to you?


Even without writing code, on personal projects it's worth it for the three things I'm always lazy about:

1. Good, detailed commits
2. Tests
3. Docstrings


Nobody cared when it was the artists, the songwriters, the music studios, the film companies. But impacting jobs of the programmers - now, heaven forbid.


The specific issue of using a LLM to make org decisions on how to downsize actually affects nearly all jobs equally.

From what I can tell, most programmers are more ok with LLMs directly replacing them than artists are. I tend to agree that it is better to replace programmers, and protect artists.


The replacement is currently less obvious with developers: the AI's work needs to be copy-pasted into the right places to actually run, and the hooks into the rest of the code need to be set up right.

It doesn’t yet do prompt to 100% finished artifact.

For concept artists and illustrators with very generic or commercial styles, at least, diffusion models can create a final artifact that passes as good enough for a lot of clients.

Think developers would feel a bit different if they were in the same situation today.


I'm pretty sure people cared about the artists too.


Did the people building dall-e care?


Are those people here? Do you reckon GP is one of them?


Must be fun working on cutting-edge competitive stuff for all 3 of these major teams. It's exciting to live in these times and see this all unfold before our eyes.


On that note, and apologies if it sounds spammy, but genuinely, if there are any AI engineers reading, check out asana.com/jobs. We partner with the two leading AI labs (they're both customers of ours too), and the team gets to experiment with some of the early-release stuff to build our leading AI work management platform (Dustin, our very technical founder and CEO, writes about our latest AI advancements here: https://asana.com/inside-asana/the-key-to-unlocking-the-pote...). I feel like it's one of the best places to work in so many ways, but the cutting-edge AI stuff is REALLY fun!


Which 3?


Presumably Anthropic / Claude, OpenAI / GPT, Google DeepMind / Gemini.


Thank you. Yes, I meant that.


Meta, OpenAI, and Anthropic come to mind.

In TFA they name OpenAI, Google DeepMind and Anthropic.


TFA = The Freaking Article; wow! what a full form.


Likely 4, based on the other comments. Maybe 5 if you want to add Midjourney.


AI programming would be really useful if it moved towards me being able to make a fixed set of statements about the software, have those statements preserved permanently in the source code somehow, and have the AI ensure that those statements remain true.

It's frustrating to work with the AI to implement something only to realise within a few interactions that it has forgotten or lost track of something I deemed to be a key requirement.

Surely the future of software has to start to include declarative statement prompts as part of the source code.


The first thing I built with OpenAI a few years ago was a system that had a section for a program spec on one part of the screen and a live web page area on the other part. You could edit the spec and it would then regenerate the web page. It would probably work much better now. Eventually I might add something like that as a plugin in my current framework.


Have you tried building that system with prompting?

You could set a convention of having comments something like this:

    # Requirement: function always returns an array of strings
And then have a system prompt which tells the model to always obey comments like that, and to add comments like that to record important requirements provided by the user.
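A minimal sketch of what that convention could look like in practice (the system prompt wording and the example function are just one possible version, not an established pattern):

    # One possible version of the convention (wording is illustrative):
    SYSTEM_PROMPT = (
        "Comments starting with '# Requirement:' are permanent constraints. "
        "Never produce code that violates them, and add such a comment whenever "
        "the user states a new requirement."
    )

    # Requirement: function always returns an array of strings
    def tokenize(text: str) -> list[str]:
        # Returns a list even for empty input, per the requirement comment above.
        return text.split() if text else []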


This is a good solution. I've taken the view that maintaining the vision is my job in the LLM relationship. Reiterating key details it's forgotten, and burning tokens tweaking things over and over towards the vision, is the cost of doing business.


Claude 3.5 Sonnet took a solid lead on our internal benchmarks over gpt-4-turbo for extraction tasks against large documents.

It may not be great for every workflow, but it certainly hits a sweet spot for intelligence x cost on most of my workflows.


Has anybody tried it against Gemini for extraction tasks against large documents (it's similar to question answering from books, right)?


Claude 3.5 Sonnet supports a maximum of 200k tokens. Gemini handles 2 million tokens.


I want you to make this page the biggest and most professional page in history, greater than Apple's site and stronger than the strongest global brands. I want you to add very, very professional features so that the site and app become number 1 in the field of weight tracking and the gym, and you write all the code, because I have no programming experience at all and can't do it; it could take me years. I hope you will take part in this humanitarian, charitable work.


Well, I went to try it, but it requires a phone number for some bizarre reason. Fine, gave it my primary number, a google voice number I've had for a decade, and it won't accept it. That's the end of my Claude journey, forever.

If you want me to try your service, try using some flow with less friction than sandpaper, folks.


Google Voice blacklist is pretty common.


I have about a dozen credit cards, 4 bank accounts, mortgage, car payment, utility accounts, github 2fa, aws 2fa, Fidelity retirement, etc. Not one has an issue with my number.

I did have some service refuse it, I want to say Twitter? But I'd definitely not consider it common. This is probably only the second or third time I've seen it, tbh.


I'm going to guess the reason Anthropic do verified phone numbers is that, unlike email addresses, most people don't have an easy way to create multiple phone numbers.

Since Anthropic accounts come with free rate-limited access to their models they're trying to avoid freeloaders who sign up for hundreds of accounts in order to work around those caps.

Google Voice numbers are blocked because people can create multiple of those, which would allow them to circumvent those limits.


What's the opportunity cost there? How much does a rate limited account cost and how much is a signup worth?


These systems are likely under heavy sign up attacks and so must have a way to guard against that since their costs are quite high.


I’m really curious where that line is and how it is identified. How much business is lost to the onboarding friction? Is it more or less than the cost of the attacks?

On one extreme usage of the service is totally open to everyone. On the other it is completely locked down and nobody can use it ever. Clearly neither of those are workable. So it’s a question of how far to go with validation.


There is just no choice at all. Due to the cost of running those LLMs, it would be very easy for an adversary to spam bots to use their service for free.


2FA and those services are different. LLM API calls are relatively costly and in demand, so providers must have a decent barrier to entry so people can't just spam automated accounts with automated requests to use the app for free. There are operations that abuse this, and even existing API integrations get abused through proxying and repurposing if they are free and don't have good security. Any place that wants to defend against bots can't trust those phone numbers.


For a sales lead? Why?


I find these powerful and self-entitled statements about losing an individual's customership FOREVER quite funny. You're ranting about some niche concern of yours and proudly proclaim to us that you'll never be a user as if we'd care, and not so softly imply that everybody else should also watch out.


You read way too much into my statement. I don't know who you are, or why you care(you spent long enough to get worked up and write a reply, afterall).

My message was simply my opinion as a user, to anyone who may have a similar flow. No more, no less.


They don't want you to do anything. The friction they are attempting to reduce is between phone numbers and wallets. This isn't much different from the typo-laden email scam. It's working as intended.


This is highlighting what has happened with all forms of ML. Give a baseline set of folks the same dataset and they will end up with a model that performs about the same. Companies are one-upping each other but it’s very back and forth and just a case of release date. These models will become a complete commodity. The thing that could be proprietary is the data used to train them, which could lead to a sustained better model performance. The barrier to entry here is super high given training costs, but the ML skills are still a commodity.


This is why I strongly believe Google has the clear advantage. Infinite data, infinite resources , not to mention dozens of verticals (search, Android and Chrome are probably the killer ones)


Google obviously has the advantage here, but it also seems like they’re willing to squander it. The Gemini rollout has basically been clippy 2.0 so far. The Gemini interface in gcloud seems to know nothing about how services work, the chat summaries consistently fail, the search summaries are comically bad. I’m usually not one of these people who wants to “block all the AI stuff” but with google products I do.


The dataset could become proprietary you say?

In other words: information with copyright on it that was (illegally) used/stolen from billions of people and companies will itself gain copyright and be resold as a set?

I don't know...


I think the comment was referring to things like internal company data, or licensed data that is not publicly available, etc. Those things could be a competitive advantage.


I have been using Claude Sonnet with Artifacts along with Vercel V0 to build Sveltekit pages and components really well. I create a UI in V0 and then simply copy the JSX into Claude and tell it to convert to Sveltekit. It creates the +page.svelte +page.server.ts and all the components almost perfectly.


Claude helped me code up a quick webhook handler for a silly side project. Being a side project, I asked it to put the code into a Docker container, which it did flawlessly. It also gave me a Makefile with sensible targets for building, rebuilding, starting, stopping, etc. Finally, I asked it to use an .env file for the secrets I needed to store. Everything it did was one-shot, on target. The only thing that would make this better would be to have it in the terminal with me.
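Not the generated code itself, but a minimal sketch of the general shape of a webhook handler reading its secret from a .env file (Flask plus python-dotenv is just one common combination; the variable and header names are made up):

    # pip install flask python-dotenv
    import hashlib
    import hmac
    import os

    from dotenv import load_dotenv
    from flask import Flask, abort, request

    load_dotenv()                                   # pulls WEBHOOK_SECRET from .env (assumed name)
    SECRET = os.environ["WEBHOOK_SECRET"].encode()

    app = Flask(__name__)

    @app.post("/webhook")
    def webhook():
        # Verify the payload signature before trusting the request.
        expected = hmac.new(SECRET, request.data, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, request.headers.get("X-Signature", "")):
            abort(401)
        return {"ok": True}

    if __name__ == "__main__":
        app.run(port=8080)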


I keep saying that if nothing else, we are in a golden age of actually finishing side projects. All those “it’s easy, I’ll just glue this API to this API” projects really are easy now.


This, exactly. Every side project I've thrown at Claude 3.5 Sonnet has been completed that same night. It's so different from how I used to treat my backlog projects which could take a week or so of research-code-iterate and now they're just an evening (or less; since Sonnet's upgrade on average it's taken me about 20-40 minutes to get the night's planned work done) and I get to sleep earlier. Win-win-win


This is not really valuable, because it could probably be done by a Fiverr contractor for less than the marginal cost of (LLM training cost + query cost).


That contractor would likely just be a LLM proxy today though...


OP isn't training though. They're just paying per query. Much cheaper (and insanely faster) than Fiverr.


I wish it could preview svelte builds. I wish it would let me include svelte files in “projects.”


I'm fairly impressed with Sonnet's one-shot scripting performance for my use cases. However, I was using it to help me diagnose a GNOME keyring and SSH issue I was having, and it suggested that I 'rm -rf' my keyring files to test its solution out. A little drastic... maybe mv my password file first? Anyway, it sometimes seems even more cocky than the last gen, and less careful by default.


Claude Sonnet is freaking amazing. I used to have a safety test[0] that Claude failed. But it was a bogus safety test and fortunately someone here told me so and I immediately subscribed to it. It's amazing. The other day I ported a whole node.js script to Python with it. It was not flawless but it was pretty damned good. Such a mechanical process and I just had to review. Loved it.

0: https://news.ycombinator.com/item?id=39607069


I do not understand this. Will you still love it when it takes your job?

(Assuming it is really that good, which is something that I can never replicate.)


I cancelled my OpenAI membership and am using Claude more and more. Sonnet is pretty fast and cheaper than 4o.

I'm legit elated that a smaller player is able to compete with large behemoths like OpenAI and Google. (I know they have Amazon backing them, but their team is much smaller. OpenAI is ~1000 employees now).

I'm building on top of their api. It's neat. I wish them the best.


I’ve been wanting to test it, but the API console wants me to fill in a VAT field. Not sure how to get one of those as an individual.


It's really bad ux but you can just leave it blank as a private user.


is it though? blank optional fields have been the standard since forever.


The whole "organization" wording throws a lot of non-native people off. Combine it with the VAT field and you have bad UX.


A lot of these guys ask for a company name, and it's often required. I'm not a company. Just a dude trying to evaluate your thing for my hobby projects.


Yeah, that's why the organisation wording is unfortunate. It's meant more like a group name under which you will have your projects, not a corporation name.


These models are clearly great with language, be it natural language or code. However, I wonder where the expectation comes from that a static stochastic parrot should be able to compute arbitrary first order logic (in a series of one-shot next word predictions). Could any expert elaborate on how this would be solved by a transformer model?


The incredible ability of Claude 3.5 Sonnet to create coherent SVG makes me wonder if the LLM was not just pretrained on text. Vision capabilities are usually added later using a vision encoder that does not affect the LLM's knowledge of the visual world, but in this case the LLM clearly has quite a strong understanding of the visual world.


Have you read the "sparks of AGI" paper about GPT4? It suggested that even just text can give an LLM a rich world model, based on the tikz drawings of a unicorn that got progressively better as GPT4 precursors were trained on increasingly more data (and, interestingly, the drawings got worse when it was RLHF'd for safety).


Yes of course, as always, it's very possible that just scaling solved the problem, but the fact that the model is so good makes me wonder if they actually did something different and pre-trained the model on image tokens as well.


Claude 3.5 Sonnet can solve the farmer and sheep problem with two small changes to the prompt:

1. Change the word "person" to "human".
2. Change the word "trips" to "trip or trips".

(Claude is probably assuming that the answer has to be in multiple trips because of the word "trips".)


Are there any tools out there that expose GPT or Claude to a codebase, and let it write PRs (semi) autonomously?


I know of Sweep <https://docs.sweep.dev>, but honestly the SOTA on SWE-bench, which is basically exactly what you're asking for, is only about a 25% success rate, so expect very mediocre results.


Another one is https://www.github.com/trafficguard/nous which provides a software dev agent that can find/clone repos, create a branch, search in a repo, then delegates to Aider for the code editing in the edit/compile/lint/test loop, and raise a merge request.


I’m working on adding Sonnet 3.5 to JACoB this week. So far it’s been very impressive. https://github.com/jacob-ai-bot/jacob


Semi autonomously is the key. Aider and Plandex are both pretty good.



Thanks, tried it out - it's pretty cheap and pretty good. Being able to commit directly into multiple files simultaneously is great.


It's probably not going to be very good at handling it out of the box. It would require quite a bit of fine tuning.


I wonder if OpenAI has a response ready, or if they are just tackling other business problems right now, like the iOS integration and the seemingly postponed low-latency chat launch. Either way I am looking forward to Claude 3.5 Opus.

It did seem slightly odd to me that openai made their supposedly best model free.


I pay for both OpenAI and Anthropic pro plans, and can say OpenAI is lagging behind at this point. Hopefully their next model steps it up.


I’ve thoroughly enjoyed the product overall much more than ChatGPT. I do wish it had voice input that rivaled what OpenAI previewed. Excited for 3.5 Opus. For now I’ve canceled OpenAI subscription and removed the app in favor of Claude.


Still not as good at math as GPT-4o, going by benchmarks and also my experience


me: "is 400000 the same as 4 * 100 * 100"

gpt-4o: "Yes, 400,000 is the same as 4 X 100 × 1000."

gpt-4:

    Yes, 400,000 is the same as 4 X 100 × 100.
    Here's the breakdown:
    4 × 100 = 400
    400 × 100 = 40,000
    Oops, it looks like there was an error in my previous calculation. 
    So, 4 × 100 X 100 actually equals 40,000, not 400,000. To make it 400,000, 
    you would need to use 4 × 10 x 10,000 or adjust the numbers similarly.

claude-3.5:

    No, 400000 is not the same as 4 * 100 * 100.
    Let's break it down:
    4 * 100 * 100 = 4 * 10000 = 40000
    400000 is ten times larger than 40000.

Everyone keeps saying gpt-4o beats benchmarks and stuff, but this is consistently my experience with it. The benchmarks are far off from my everyday experience.


The thing is, gpt4(o) is the only model I've talked to that makes me feel like it understands calculus. Not like it's pretending and succeeding


Ah, I think it's just much better trained. At double the speed and half the price, I'd assume there's less computing power behind it. But likely the training covers some topics really well.


My answer from gpt4o was correct and similar to your example from claude. Maybe you were just unlucky?


This kind of thing happens so frequently that I don't think so. My guess is the language auto detection messes it up for multilinguals and it works fine for people who are solely using English? Because it's certainly broken on voice input - I ask it a question in English and it will respond in Russian and such.


That's ok, I don't need my language model to do math. My calculator does that just fine.


Math is not just arithmetic


It is not just about math; it is about tokenization and chain of thought.
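You can see the tokenization part directly: large numbers are usually split into several sub-word tokens, which is part of why arithmetic over them is shaky. A quick sketch with tiktoken (o200k_base is the encoding associated with the GPT-4o family; exact splits may differ between models and tokenizer versions):

    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")
    for s in ["400000", "4 * 100 * 100", "40000"]:
        tokens = enc.encode(s)
        # Numbers are typically broken into multiple tokens rather than handled as single values.
        print(f"{s!r} -> {len(tokens)} tokens: {[enc.decode([t]) for t in tokens]}")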


As a developer, I find Claude's code generation 2x better than GPT-4o. Of course it's subjective, but Claude is much more consistent for me.


Am I right that it has no online capabilities? That's a pretty big issue for me.


Interesting. We're finding that it's more verbose than GPT-4o (and often ignores instructions to give output in a specific format), but does better when faced with ambiguity and potentially confusing situations.

Here is an example using http://github.com/robusta-dev/holmesgpt/ where we ask it to troubleshoot a payment processing workload in a Kubernetes cluster, with the GPT-4o output underneath:

  holmes ask --model="bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0" "what is wrong with the payment processor?"
  AI: To investigate what's wrong with the payment processor, we'll need to gather more information about it. First, let's try to find the relevant Kubernetes resources related to the payment processor.

  Running `kubectl get -A --show-labels pod | grep payment`

  Thank you for your question about the payment processor. From my initial search, I've found two pods related to payment processing. Let's investigate the one that seems to be having issues:

  1. In the default namespace, there's a pod named "payment-processing-worker-5bb5bd75cd-x4g99" which is in a CrashLoopBackOff state. This indicates that the pod is repeatedly crashing and restarting.

  2. In the sock-shop namespace, there's a pod named "payment-ff86cd6f8-rs47b" which appears to be running normally.

  Let's focus on the problematic pod in the default namespace. To get more information about what's wrong, we'll need to look at its logs and description.

  Running `kubectl describe pod payment-processing-worker-5bb5bd75cd-x4g99 -n default`

  AI: Based on the information gathered, it appears that there is an issue with the payment processor in the Kubernetes cluster. Let me explain what's wrong:

  1 The payment processing worker pod is in a CrashLoopBackOff state:
    • Pod name: payment-processing-worker-5bb5bd75cd-x4g99
    • Namespace: default
    • Status: CrashLoopBackOff
  2 The pod is continuously restarting and failing:
    • Restart Count: 242
    • The container is in a "Waiting" state with the reason "CrashLoopBackOff"
  3 The root cause of the issue appears to be related to an environment variable:
    • The container is running a bash script that checks for the presence of the DEPLOY_ENV environment variable.
    • If DEPLOY_ENV is not set, the container exits immediately after printing an error message.
    • The command being run is:

       if [[ -z "${DEPLOY_ENV}" ]]; then echo Environment variable DEPLOY_ENV is undefined ; else while true; do echo hello; sleep 10;done; fi

  4 The environment section in the pod description shows:

    Environment: <none>

   This indicates that no environment variables are set for the container.
  5 The container is exiting with a status of "Completed" and an exit code of 0, which suggests that it's running the script, finding that DEPLOY_E

  To fix this issue:

  1 The DEPLOY_ENV environment variable needs to be set for the payment-processing-worker deployment.
  2 You should update the deployment configuration to include this environment variable.
  3 After setting the environment variable, the pod should stop crashing and enter a running state where it prints "hello" every 10 seconds.

  To resolve this, you'll need to update the deployment configuration for the payment-processing-worker. You can do this by editing the deployment and adding the DEPLOY_ENV environment variable with an appropriate value.

  Would you like me to show you how to update the deployment with the required environment variable?

Note that it explicitly picks up on the ambiguity in the question and notes that there are 2 payment related workloads and one is healthy.

GPT-4o:

  $ holmes ask "what is wrong with the payment processor?"

  Running `kubectl get -A --show-labels pod | grep payment-processor`                     
  Running `kubectl get -A --show-labels deployment | grep payment-processor`                    
  Running `kubectl get -A --show-labels service | grep payment-processor`              
  Running `kubectl get -A --show-labels pod | grep payment`
  Running `kubectl describe pod payment-processing-worker-5bb5bd75cd-x4g99 -n default`
  Running `kubectl logs payment-processing-worker-5bb5bd75cd-x4g99 -n default --previous`
  
  AI: payment-processing-worker-5bb5bd75cd-x4g99 in namespace default crashes due to missing DEPLOY_ENV environment variable.

  Fix: Define DEPLOY_ENV in the deployment spec.

  Example patch:

  env:
   - name: DEPLOY_ENV
     value: "production"


I tried Sonnet vs GPT 4 just now with:

> Given a body with momentum B and forques F, what is the differential of applying the forques to the momentum in PGA?

Claude gave a wrong answer, ChatGPT gave a correct one. I’m sticking with ChatGPT.


I have a mathematics (though not physics) degree and I didn't understand your question at all; "forques" appears to be either a place in France, Old French, or Catalan. I assume ChatGPT was correct in re-spelling "forques" as "torques", but have you tried asking Claude using words that do appear on the Internet?


Unlike you, both LLMs were familiar with geometric algebra and used the relevant terminology.

Testing on something widely known isn’t likely to stretch these systems.


I'd expect them to do better when the input uses words that appear more in the training data.

This very thread is the fifth hit on Google for `"forques" geometric algebra`; the third and fourth hit are the same paper as each other; the second hit is https://bivector.net/PGAdyn.pdf which appears to have invented the term; and the first hit doesn't define it.

I (logic, computability, set and type theory) am in no position to know whether it's a standard term in geometric algebra, but I do strongly expect LLMs to do much worse on queries that don't appear much in their training set (for which I take Google search results as a proxy); even if they have the knowledge to answer, I expect them to answer better when the question uses common words. I do know that when I asked your question to ChatGPT, it silently re-spelt "forques" as "torques".



