This article was written before Anthropic added the Projects feature for Pro and Team users [1].
I spent some time yesterday experimenting with Projects, and, like Artifacts, it looks really useful. I like the idea of being able to have multiple projects going simultaneously, each with its own reference materials. I don’t need to use it in a team, but I can see how that could be useful, too.
The one problem I see is that the total context window for each project might start to seem too small pretty quickly. I assume, though, that Anthropic’s context windows will be getting larger as time goes on.
I wonder what other features Anthropic has in the works for Claude. My personal wish is for a voice interface, something like what OpenAI announced in May but has now put off until later this year.
From my reading there's nothing here that's specific to Anthropic, right? Any app like Chatbox, Msty, or others could implement their own version of Projects based on any of the available models. (Maybe even Notion...) The API for creating an assistant with a pre-baked context is also available.
These sorts of basic features seem like the hardest thing for AI companies. OpenAI only recently added a way to search all your chats. Gemini still doesn't have that.
Which is still horrifically slow and unoptimized. I'm not sure why streaming tokens from an API into an Electron app is difficult, but OpenAI managed to bungle it somehow.
Interesting, and even wilder that it’s buggy like that. I get that desktop apps are probably not OpenAI’s expertise but the app is unfortunately pretty unusable most of the time as of the writing of this comment. You would think for the amount of money they pay their engineers something better would have been released.
Unusable in what way? I don't personally find the macOS app to be "unusable" at all.
Also, you may be underestimating how buggy Swift and the rest of Apple's stack are. It's hard to get those bugs resolved unless you happen to work at Apple. Thus, a lot of time is spent working around bugs up the stack. So I don't find it surprising that a company moving fast like OpenAI ships _some_ bugs. The mac app just came out this month? Give it time.
No, I think you’re right. It’s just a matter of convenient tooling. I have been doing similar tasks with ChatGPT and Gemini, but, in the standard web interfaces, I end up uploading the same reference files multiple times and I tend to lose track of previously used prompts and unfinished projects. Anthropic’s Projects should make it easier for me to pursue multiple projects with different contexts at the same time.
The fact that Claude 3.5 Sonnet also seems smarter than the other current flagship models makes the Projects feature that much more attractive.
Maybe if I were smarter I wouldn’t find much use for Projects.
I'm far more excited for Projects than the other stuff. OpenAI's Memories is their lock-in for me; I'd have switched to Claude if Anthropic had something better. Projects sounds like an even better approach.
Game dev. It keeps tabs on the type of game it is, background, genre, classes, theme. Notably the difference between one decision and another - it takes note when I say "Instead of Y, do X". Like if I show a preference for Chinese names, it'll give me more of that.
Claude 3.5 Sonnet's coding abilities are incredibly impressive. I think it lets an expert programmer move more than twice as fast. There are limits - to produce high quality code, not copy-and-paste pablum, you have to be able to give detailed step-by-step directions and critically evaluate the results. This means you can't produce code better than you would have written by yourself, you can only do it much faster.
As an experiment, I produced a set of bindings to Anthropic's API pair-programming with Claude. The project is of pretty good quality, and includes advanced features like streaming and type-safe definitions of tools. More than 95% of the code and docs was written by Claude, under close direction from me. The project is here:
The biggest challenge I’ve felt is finding modern, idiomatic solutions for framework code. All LLMs seem to be victim to this issue where they will propose a mixture of old anti-patterns, roundabout approaches that technically work but make little sense, and mixing framework patterns with non-framework patterns in a way that leads to problematically incohesive code when taken together.
For example, I'm an experienced backend programmer but have been using Claude 3.5 Sonnet and GPT-4 to ask questions about a frontend I'm building in TS using Svelte (which I am not very proficient in). The LLMs frequently confuse themselves with TS/JS and server/client-side approaches, recommend old and deprecated approaches, and mix in patterns from other frameworks (e.g. React) even when an idiomatic approach exists. The biggest risk is when, in my ignorance, I do not detect that this is happening.
It’s been functional enough to push a hacky prototype out (where it would take me probably months longer to learn and do this otherwise), but the code quality and organization of the project is certainly pretty low.
I find LLM coding much less useful when it’s interacting with a large existing codebase. It’s certainly good at one-off type code and greenfield projects (especially if similar to other open source stuff). And it’s also good at getting started if you aren’t an expert yourself.
We haven't found this to be an impediment. Keep things modular, and share the type definitions of anything you import with the model. As the benefits here become more and more clear tooling will improve and people will adapt their development practices to get the most out of the models.
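For illustration, a minimal sketch of what "share the type definitions" can look like in practice; the Invoice type, helper, and prompt wording are all made up, and the idea is just to put the real interfaces in front of the model:

# Sketch: gather the type definitions the new code will depend on and put them
# in the prompt, so the model codes against your actual interfaces.
import inspect
from dataclasses import dataclass

@dataclass
class Invoice:
    id: str
    amount_cents: int

def apply_discount(invoice: Invoice, percent: float) -> Invoice:
    """Return a copy of the invoice with the discount applied."""
    return Invoice(invoice.id, round(invoice.amount_cents * (1 - percent / 100)))

# Collect the source of everything the task touches.
context = "\n\n".join(inspect.getsource(obj) for obj in (Invoice, apply_discount))

prompt = (
    "Here are the type definitions and helpers you may use:\n\n"
    f"{context}\n\n"
    "Task: write total_after_discount(invoices: list[Invoice]) -> int that applies "
    "a 10% discount to each invoice and returns the summed amount in cents."
)
print(prompt)  # paste into Claude, or send it via the API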
I have been developing in Python for 20 years now. Claude 3.5 is the first AI that is "smart" enough to help me. I usually do not need help with easy tasks, but with complex ones. Claude is not perfect, but it definitely gives a productivity boost even to the most seasoned developers, for problems that in the past would have meant some obscure mailing list and source code reading.
my 2p worth - my work involves a lot of summarisation and recommendation from a user preference statement. I've been able to do this with 4o / Opus, but the consistency wasn't there, which required complex prompting chains to stabilise.
What I'm seeing with Sonnet 3.5 is a night-and-day step up in consistency. The responses don't seem to be that different in capability from Opus / 4o when those respond well; it just does it with rock-solid consistency. That sounds a bit dull, but it's a huge step forward for me and I suspect for others.
Considering cancelling my subscription with OpenAI as I was previously using GPT-4 quite heavily as a multiplier for myself, guiding it and editing outputs as required, but GPT-4o feels significantly worse for this use case. It is certainly better in many other areas, but its coding ability is not great.
I tried to revert back to standard GPT-4 but it is now so slow to respond (higher load?) that it breaks my mental flow, so I'm exploring other options.
The first night it was released, it one-shotted merging two Python files together (one was an incomplete mess of speech-to-text, the other was an also-incomplete Claude API demo with a fake loop with forced conversation paths). It rewrote the fake loop and turned it into a voice interface that worked on the first try (then I asked it to add text-to-speech and it did). The second night it improved the code by adding threads (which I do know about but wouldn't have thought to use there) and it worked better. Then Linux decided not to work properly (PulseAudio and ALSA issues); I asked it to help me migrate to PipeWire and it gave me step-by-step directions, and when I had issues all I had to do was post the error and it fixed it enough that it's all working again. It's been a pretty amazing week from my perspective. I finally had to take a break earlier this week, but I still use Claude 3.5 for gardening (taking pictures and asking for help) and general advice.
There's not really anything to it. As I mostly use the console.anthropic.com workbench (the API, pay-as-you-go), I haven't played around much with Artifacts--I am probably getting a Claude Pro subscription later today. But yeah, I just copy/pasted my 2 code files in the window, told it I wanted a conversational voice interface, and in less time than it took to describe it, it wrote it and even saved a bunch of output tokens by saying where to copy large parts of the first file's boilerplate code. That had been a project I started around Easter and had kept putting off because it felt too hard for my somewhat limited time. But there's no moat or special setup to this stuff; anybody who can communicate in a text box can productively use this tool.
I recently tested the Cursor IDE [1]. It's VSCode with nice AI integrations made with great UX in mind. They integrate with ChatGPT and Claude Sonnet / Opus. In my experience (Rust/TypeScript) the Sonnet 3.5 model is giving me better results.
You can pass the current file, or multiple files or your whole project as context to the models.
It's prob not gonna work well: he fixed a demo of Claude's API with Claude. You didn't think of asking the AI, so you need more work on your own inference abilities first.
It's a start. We just had a sudden storm that severely damaged some tomato plant stalks, and it suggested splinting them, or if it is still having trouble in a couple of days to cut off the top and plant in another pot, which is advice I've received from others too. The first picture I sent was some early blight on the leaves, it didn't get it on the first try (suggested the stem had edema, which I suppose is possible), but when I mentioned I thought it had a fungal issue it correctly noted the brown leaves and suggested a copper fungicide, among other correct remedies. It also correctly discriminated between two types of cucumber (well one was Armenian melon) held by my friend who had just picked them from her garden. Claude said that because there were hands present that it suggested that the vegetables were home-grown and/or freshly picked, and before that noted that they seemed to be good quality, which was correct. I went into this particular aspect of using Claude thinking it would be useless, but that's not what I have found with Claude 3.5 Sonnet. I'll note the fungal blight was with Claude 3 Opus, as well. Hope that helps.
The new Artifacts feature is really damn handy, though: you can describe the most detailed projects and it does a really great job of what you asked for; I found it delivered beyond what I wanted.
I am now paying for both -.-
- it's hard to rank which codes better, but I think claude has better abstractions
- sometimes I paste the output of the other, and continue solving on the other
Would love to see your workflow sometime, my experimentations have been small tbh but I was quite impressed with Claude.
Funnily, I think (modulo costs) feeding outputs of one model into another once you hit a brick wall is quite powerful, but feels clunky to do at the moment
Since you work there, can you ask that they are more transparent with exactly what you are paying for with a "pro" subscription?
Currently you are apparently paying for:
- Everything in Free
- Use Claude 3 Opus and Haiku
- Higher usage limits versus Free
- Create Projects to work with Claude around a set of docs, code, or files
- Priority bandwidth and availability
- Early access to new features
But what are the usage limits? Higher than free by how much?
Having an invisible limit on a paid product really rubs me the wrong way. Maybe some rate-limiting after a certain amount would be better than a hard cutoff, but even then I'd like to know what the limit is before I pay, not when I accidentally hit it in the middle of something important.
> Your limit gets used up faster with longer conversations, notably with large attachments. For example, if you upload a copy of The Great Gatsby, you may only be able to send 15 messages in that conversation within 5 hours, as each time you send a message, Claude “re-reads” the entire conversation, including any large attachments.
Seems to also be limited by tokens. It’s still quite obscure.
With a fixed capacity, someone’s traffic is going to be rejected. I think it makes sense to have a variable rate limit depending on available capacity and demand.
For what it's worth, I haven't run into any limitations while using the API side. Although I also haven't really had to use it so extensively that it would be a problem.
I was literally about to type in my credit card number after seeing the new code editor window UX, but for the life of me I had no idea what I was paying for.
I’ve been paying for GPT since 3.5 debuted and I know what I’m getting - full, unlimited use of the best model. Period.
Anthropic needs to figure out what the hell they are selling.
ChatGPT Plus does NOT give you full unlimited use of the best model. There are still limits.
FWIW I regularly hit my ChatGPT Plus limits, and I think the “dynamic” limiting is regularly in place. I’ve only once hit my Claude Pro limit. I now use Claude more than ChatGPT.
As of May 13th 2024, Plus users will be able to send 80 messages every 3 hours on GPT-4o, and 40 messages every 3 hours on GPT-4. The message cap for a user in a ChatGPT Team workspace is approximately twice that of ChatGPT Plus.
Please note that unused messages do not accumulate (i.e. if you wait 6 hours, you will not have 80 messages available to use for the next 3 hours on GPT-4).
In certain cases for Plus users, we may dynamically adjust the message limit based on available capacity in order to prioritize making GPT-4 accessible to the widest number of people.
+1. More transparency around usage limits would be very helpful. Message-based limits seem arbitrary when different messages are different lengths. Maybe token-based, with a bar that tracks how much you've used: it would go down with use and up with time and capacity, and you'd be able to track and pace usage better than a warning that you're on your last ten. Or just let us pay more for more usage! Switch to the API? The workbench does not have as nice affordances as the chat.
Oh no! I mostly use Apple Voice Control to dictate into Claude, and then copy and paste code back and forth between my IDE and Claude.
For longer writing, I really like going for a walk for 45 minutes, brain dumping on a topic, and transcribing it. Then I write a brief outline and have Claude fill it out into a document, explicitly only using language that I used in the transcript. Then I edit via voice.
I don't know about the usability for you, but if you use VSCode already, Cursor is basically VSCode with access to various LLMs, including Claude. I started using it recently and really like it.
I saw the initial 3.5 Sonnet release a few days ago and decided to pull the trigger with canceling OpenAI and moving over, and that was a great decision. I've been having to do more and more intricate nudging and leading towards the correct answer or approach when using ChatGPT of late.
My experience with Claude is very positive when it comes to programming and planning out infrastructure. My only gripe so far has been some ethical constraints that didn't exist for ChatGPT, and those are a big one for me since I don't need Anthropic employees to act as my moral compass. For a specific example, asking about securing code through making decompiling or reading harder is a no-no for Claude, but a-ok for GPT.
I haven't used the normal website in a year; the API is a way better experience:
- It allows tweaking of the model / temperature.
- You pay only per token used
- It does not use your conversations to train their models
- You can edit both your messages and the assistant's messages in the conversation, or delete them.
- Allows for easy message replay if the output is not great
Be mindful that you are paying per token for every message in the thread, so if the thread becomes long it gets expensive, but still far more value than subscribing.
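For anyone who hasn't tried the API side, here's a minimal sketch of what pay-as-you-go usage looks like with the Anthropic Python SDK; the prompt and temperature are just placeholders:

# Requires `pip install anthropic` and an ANTHROPIC_API_KEY environment variable.
import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    temperature=0.7,  # tweakable here, unlike the consumer chat UI
    messages=[
        {"role": "user", "content": "Summarize the trade-offs of SSE vs WebSockets."},
    ],
)

print(message.content[0].text)
# Billing is per token; long threads mean re-sending the whole history each turn.
print(message.usage)  # input_tokens / output_tokens for this call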
The world is changing too fast for knowledge that is nine months old.
------
USER
What is the maximum number of tokens supported by gpt-4o?
ASSISTANT
As of my most recent update in October 2023, GPT-4 does not have an officially designated variant called "gpt-4o". However, OpenAI's GPT-4 model supports different maximum token limits depending on its variant. There are two primary context lengths for GPT-4 models:
1. *GPT-4 with 8K context length*: This variant supports up to 8,000 tokens.
2. *GPT-4 with 32K context length*: This variant supports up to 32,000 tokens.
These limits encompass both input and output tokens in a given interaction. If you intended to refer to a specific or newly introduced variant, such as "gpt-4o," it is essential to look up the most current resources or official announcements from OpenAI, as models and their capabilities can evolve.
ChatGPT and Gemini incorporate web search into their results, so they're not limited to stale information.
If you think you can save money by paying per token in the playground, you should know that you're not getting the same results.
For example, if I ask Gemini the identical question, it gives a reasonable answer, even though Gemini was released before GPT-4o was. It can look up the current OpenAI documentation and summarize it:
USER
What is the maximum number of tokens supported by gpt-4o?
ASSISTANT
GPT-4o has a context window of 128,000 tokens. This means it can process up to 128,000 tokens between your input and its output combined. However, the maximum number of output tokens (the AI's response) is limited to 4,096 tokens.
I've been playing around with it this week and its coding ability is insane (for an LLM). I've given it some pretty sloppy descriptions of things I want to do and it's managed to figure out exactly how to do it on the first or second try; I'm talking things like building animations in React that cannot be described with text very well. The big pain point is copying and pasting things back and forth to have it edit them. If it were integrated and could see my local files, that would be killer. I know there are various companies working on that, but the JetBrains AI integration, for example, is garbage compared to the results I get by manually asking Claude.
I wasn't worried about how this would affect our industry a few months ago, but this has me reconsidering. It's like a junior engineer that can do most tasks in seconds for a couple of cents.
What worries me is you need that time in the dirt to get a feel for coding as a craft. And at least for me that aspect of knowing the craft helps get my thinking in tune with problem solving in a very productive way.
Coding can be similar to playing an instrument, if you have mastery, it can help you be more expressive with the ideas you already have and lead you to new ones.
Whereas if we take away the craft of coding, I think you end up with the type of code academic labs produce: something that purely starts on a "drawing board", is given to the grad student/intern/LLM to make work, and while it will prove the concept, it won't scale into the long term, because the intern doesn't know when to spend an extra 30 minutes on a function so that it may be more flexible down the road.
> What worries me is you need that time in the dirt to get a feel for coding as a craft.
I see this sentiment a lot regarding gen AI. And I get it, we need to learn our tools. But this seems like it's saying the only way to learn problem solving is the way you learned it. That's just not true. Everyone learns problem solving differently, and the emerging field of gen AI will figure out its own way. It's a different way of thinking. I see my niece using ChatGPT to make projects I wouldn't have even imagined taking up at her age. Her games work. Who am I to say she isn't learning problem solving? In Hindi we say "pratyaksh ko praman ki kya avashyakta" (what's right in front of you doesn't require proof).
I’d say it’s far more useful than a junior engineer. A junior might know one or two areas well, but Claude 3.5 knows literally every steep learning curve across all engineering professions. Claude may not yet be able to orchestrate a large project, but it can advise on structuring using knowledge drawn from best practices across every conceivable discipline. We may not get “AGI”; we might get something far more useful instead.
You get to create a set of default instructions for all project prompts to adhere to, like: Use LINQ whenever possible. Consider Fluent capabilities when they will improve developer experience.
Then you can add all of your existing code (though as has been pointed out, connecting to a repo would be ideal).
And then you can have multiple conversations within that project, so you can separate your lines of thinking.
I haven't had a great experience using GPT-4o for coding; it will sometimes make a mistake, fix the mistake while introducing another mistake, then fix that second mistake by repeating the first. Or other weird things like that. I love the text to speech, and image generation sometimes comes in handy, but ultimately I prefer Sonnet 3.5 for coding and have recently switched to that.
If OpenAI comes out with a stronger model I’ll prob switch back, I don’t have much brand loyalty in this regard. I could see some features making usage more sticky (memory, projects, etc).
I've been using it in https://double.bot (vscode coding copilot I'm hacking on) and it's been pretty great so far. We have support for all the top models (gpt4o, deep seek, opus, etc) so I regularly like to switch between them especially if I see a challenging problem. So far, sonnet 3.5 has the best quality and is the one I use by default.
I use it mostly for coding help, not Python or JS but a mix of C and C++ for my own framework.
It is surprisingly good and helpful. I am still exploring the limits.
Writing doc comments or test cases is much faster and more fun with this kind of tool, but you still have to double-check everything, as it inevitably makes mistakes, often small and not obvious…
To clarify, I think it's extremely powerful and useful too. It's just that I get more misses than hits when using it as a rubber duck and coding assistant.
Hit or miss, with significantly more hits than misses. I was able to build a useful cybersecurity service in about 6 hours leveraging Claude to write most of the code. It has certainly made me more productive.
I feel reasonably confident that most of the people who are making these hyperbolic claims either don't understand programming at a deep enough level to make the claim or are trying to sell you something.
I'll throw my hat behind this horse because, honestly, if I was just learning to code, I would have probably quit by now due to the frequency of these types of comments. LLMs have certainly improved at an impressive rate, and they're fairly decent at producing 'Lego' blocks. However, when it comes to block building—the real meat and potatoes of programming—they're, at best, bad butchers. Building functional systems is hard, and it's simply something LLMs can't do now or perhaps ever. Or I just don't know how to prompt. 50/50.
FWIW, very few people will actually care about this statement unless you state your name, work you've done, etc. A lot of users here don't pay heed to anonymous appeals to authority.
And who would actually verify the output of the LLM works as expected? Unit tests? Who would verify the unit tests are testing what needs to be tested instead of just branch/output coverage? I've used these tools personally, and I've used GH Copilot at the company I work in, and at no point was I under the impression LLMs can do all of my work. Each step needs verification that it actually gives what you need, and some steps need a good description by someone who knows what the system is doing to make the changes where they are needed. Continuous extension of features without good design patterns in place, usually checked by devs, can also create a lot of problems. So what makes you think that in one year SWE will be considered a dying field? What I see is a need for better devs who can see beyond boilerplate and be empowered by LLMs, but that doesn't mean a dying field; it means raising the standards to break into the field.
Yeah, good idea to put everyone out of their job and give it to 2 companies. Then what happens when those companies decide to raise prices by 1000x (Twitter API?)? Or what happens when your system has a bug and does not work, and the AI says it's all good? Who will inspect your code to find what's wrong?
As someone building an AI company right now, my quick Pro/Con for 4o vs Claude 3.5:
Claude: subjectively sounds more human to me, and really nails data questions that 4o is lackluster at
4o: far better at assistant-logic reasoning. I can trivially break Claude's assistant (system prompt) instructions from within the user prompt, whereas 4o holds up in all of these tests.
Pricing and output speed, for our purposes, are functionally identical. Exciting to have a competitor in the space already who stands to keep openai honest.
Aha, so I’m not the only one. For both Claude 3 Opus and 3.5 Sonnet, anecdotally its language is far more natural. So much so that I prefer it over 4o.
> You can say ‘the recent jumps are relatively small’ or you can notice that (1) there is an upper bound at 100 rapidly approaching for this set of benchmarks, and (2) the releases are coming quickly one after another and the slope of the line is accelerating despite being close to the maximum.
The graph does not look like it is accelerating. I actually struggle to imagine what about it convinced the author the progress is accelerating.
I would be very interested in a more detailed graph that shows individual benchmarks, because it should be possible to see some benchmarks effectively be beaten and get a good idea of where all of the other benchmarks are on that trend. The 100% upper bound is likely very hard to approach, but I don't know if the practical limit is more like 99%, 95%, or 90% for most benchmarks.
I heard a theory today that hitting 100% on the MMLU benchmark may be impossible due to errors in that benchmark itself - if there are errors in the benchmark no model should ever be able to score 100% on it.
The same problem could well be present in other benchmarks as well.
Apparently my account was banned from Anthropic's Claude after an "automatic review". I'm 100% sure I did not make any "unsafe" queries; I've literally only tested it briefly, and that was weeks ago.
I was banned too. I filled in the form 2-3 times already, but I never heard back from them.
My working theory is that I was banned due to using a VPN (Mullvad) with a location set to Europe, at a time when Europe users were not allowed to use the app.
I wasn’t actually in Europe, to be clear, but I live in a country where even WhatsApp voice calls are blocked and so I routinely have my VPN turned on.
The country I live in is officially supported by Anthropic, and so is Europe these days, so it’s quite frustrating that they won’t unban me.
I can’t use ChatGPT and Perplexity either when I have my VPN turned on, but at least they don’t ban my account.
Their support sucks tbh. Months ago I had used my phone number with my work Gmail and wanted to move it to my personal Gmail to subscribe (phone numbers can only be used on one account for some reason); I never got a response to the request.
I had that too: no explanation, no response via email support, nothing. I can't give money to them knowing my account can just get banned at any time while I might have an active subscription going on.
So when someone lets (or has) their phone service get disconnected and the company recycles that phone number shortly, the person who then gets this phone number is out of luck if they want to use Claude?
I don't need an AI to write code for me, but it is _astoundingly_ helpful to have it summarize various design options and new technology stacks without me having to scavenge Google for the obscure corner-cases I care about.
I have an idea for a project that involves streaming 3 gigabits of data per second from a USB 3.0 device out over a 10 gig Ethernet connection, and it was able to compare/contrast various levels of support for high-bandwidth USB 3 and Ethernet in multiple frameworks and languages.
And the whole conversation, with code examples, cost me 3 _cents_ of Anthropic credits.
My new fear is when people start asking AIs "Hey AI, here is my codebase, my org chart, and commit histories for all my employees - how can I reduce the number of humans I need to employ to get this project done?"
Examples like this make me think, “maybe the S&P 500 will be at 15,000 by 2030.” I mean, in my own job, Claude is making me substantially better in a wide range of areas, from sales negotiation to managing my team. Add that up across millions of people…
Some of it I was able to verify immediately, some I confirmed by poking around in Github repos once I knew the magic words, some I haven't verified yet.
For example, I wasn't sure if there was good support for USB devices in Rust. Just not something I'd ever bothered to investigate. Claude knew that there was a libusb wrapper and a "rusb" crate, and claimed that support for USB in Rust was stable. I verified the existence of the crates, and I'll take Claude at its word for stability unless I discover otherwise.
The key difference is that I can reasonably trust (depending on the audience) that the other human is not simply hallucinating facts in order to provide a response.
Not only do humans invent stuff that didn't happen (hallucination), we're also prewired with a very long list of cognitive biases.
It depends a lot on the domain of course, but I'd bet that frontier LLMs already exhibit superhuman capabilities in providing accurate answers in the vast majority of domains.
Nobody cared when it was the artists, the songwriters, the music studios, the film companies. But impacting jobs of the programmers - now, heaven forbid.
The specific issue of using a LLM to make org decisions on how to downsize actually affects nearly all jobs equally.
From what I can tell, most programmers are more ok with LLMs directly replacing them than artists are. I tend to agree that it is better to replace programmers, and protect artists.
The replacement is currently less obvious with developers: the AI's work needs to be copy-pasted to the right places, and the hooks into the rest of the code need to be set up right.
It doesn’t yet do prompt to 100% finished artifact.
For concept artists and illustrators with very generic or commercial styles, at least, diffusion models can create a final artifact that passes as good enough for a lot of clients.
I think developers would feel a bit different if they were in the same situation today.
Must be fun working on cutting-edge competitive stuff for all these 3 major teams. It's exciting to live in these times and see this all unfold before our eyes.
On that note, and apologies if it sounds spammy, but genuinely, if there are any AI engineers reading, check out asana.com/jobs - we partner with the two leading AI labs (they're both customers of ours too) and I know the team gets to experiment with some of the early-release stuff to build our leading AI work management platform (Dustin, our very technical founder and CEO, writes about our latest AI advancements here: https://asana.com/inside-asana/the-key-to-unlocking-the-pote...) - I feel like it's one of the best places to work in so many ways, but the cutting-edge AI stuff is REALLY fun!
AI programming would be really useful if it moved towards me being able to make a fixed set of statements about the software, have those statements preserved permanently in the source code somehow, and have the AI ensure that those statements remain true.
It's frustrating to work with the AI to implement something, only to realise within a few interactions that it has forgotten or lost track of something I deemed to be a key requirement.
Surely the future of software has to start to include declarative statement prompts as part of the source code.
The first thing I built with OpenAI a few years ago was a system that had a section for a program spec on one part of the screen and a live web page area on the other part. You could edit the spec and it would then regenerate the web page.
It would probably work much better now.
Eventually I might add something like that as a plugin in my current framework.
Have you tried building that system with prompting?
You could set a convention of having comments something like this:
# Requirement: function always returns an array of strings
And then have a system prompt which tells the model to always obey comments like that, and to add comments like that to record important requirements provided by the user.
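A hypothetical sketch of that convention; the prompt wording and function are made up, and the point is just that requirements live in the source and get re-sent with every edit:

# System prompt sent along with every coding request.
SYSTEM_PROMPT = (
    "Comments starting with '# Requirement:' are binding constraints. "
    "Never produce an edit that violates one, and add a new '# Requirement:' "
    "comment whenever the user states a requirement."
)

# Requirement: function always returns an array of strings
# Requirement: empty or whitespace-only input returns an empty list, never None
def tokenize(text: str) -> list[str]:
    if not text or text.isspace():
        return []
    return text.split()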
This is a good solution. I've taken the view that maintaining the vision is my job in the LLM relationship. Reiterating key details it's forgotten, and burning tokens tweaking things over and over towards the vision, is the cost of doing business.
[Translated from Arabic:] I want you to make this page the biggest and most professional page in history, more than Apple's website and stronger than the strongest global brands. I want you to add very, very professional features so the site and app become number 1 in the field of weight tracking and gyms, and I want you to write all the code, because I have no programming experience at all and can't do it; it could take me years. I hope you will take part in this charitable, humanitarian work.
Well, I went to try it, but it requires a phone number for some bizarre reason. Fine, gave it my primary number, a google voice number I've had for a decade, and it won't accept it. That's the end of my Claude journey, forever.
If you want me to try your service, try using some flow with less friction than sandpaper, folks.
I have about a dozen credit cards, 4 bank accounts, mortgage, car payment, utility accounts, github 2fa, aws 2fa, Fidelity retirement, etc. Not one has an issue with my number.
I did have some service refuse it, I want to say Twitter? But I'd definitely not consider it common. This is probably only the second or third time I've seen it, tbh.
I'm going to guess the reason Anthropic do verified phone numbers is that, unlike email addresses, most people don't have an easy way to create multiple phone numbers.
Since Anthropic accounts come with free rate-limited access to their models they're trying to avoid freeloaders who sign up for hundreds of accounts in order to work around those caps.
Google Voice numbers are blocked because people can create multiple of those, which would allow them to circumvent those limits.
I’m really curious where that line is and how it is identified. How much business is lost to the onboarding friction? Is it more or less than the cost of the attacks?
On one extreme usage of the service is totally open to everyone. On the other it is completely locked down and nobody can use it ever. Clearly neither of those are workable. So it’s a question of how far to go with validation.
2FA and those services are different. LLM API calls are relatively costly and in demand, so they must have a decent barrier to entry so people can't just spam automated accounts with automated requests to use the app for free. There are systems that abuse that, and even existing API integrations get abused via proxying and reuse for other purposes if they are free and don't have good security. Any place that wants to defend against bots can't trust those phone numbers.
I find these powerful and self-entitled statements about losing an individual's customership FOREVER quite funny. You're ranting about some niche concern of yours and proudly proclaim to us that you'll never be a user as if we'd care, and not so softly imply that everybody else should also watch out.
You read way too much into my statement. I don't know who you are, or why you care (you spent long enough to get worked up and write a reply, after all).
My message was simply my opinion as a user, to anyone who may have a similar flow. No more, no less.
They don't want you to do anything. The friction they are attempting to reduce is between phone numbers and wallets. This isn't much different than the typo-laden email scam. It's working as intended.
This is highlighting what has happened with all forms of ML. Give a baseline set of folks the same dataset and they will end up with a model that performs about the same. Companies are one-upping each other but it’s very back and forth and just a case of release date. These models will become a complete commodity. The thing that could be proprietary is the data used to train them, which could lead to a sustained better model performance. The barrier to entry here is super high given training costs, but the ML skills are still a commodity.
This is why I strongly believe Google has the clear advantage. Infinite data, infinite resources , not to mention dozens of verticals (search, Android and Chrome are probably the killer ones)
Google obviously has the advantage here, but it also seems like they’re willing to squander it. The Gemini rollout has basically been clippy 2.0 so far. The Gemini interface in gcloud seems to know nothing about how services work, the chat summaries consistently fail, the search summaries are comically bad. I’m usually not one of these people who wants to “block all the AI stuff” but with google products I do.
In other words: copyrighted information that was (illegally) used/stolen from billions of people and companies will itself get copyright and be resold as a set?
I think the comment was referring to things like internal company data, or licensed data that is not publicly available, etc. Those things could be a competitive advantage.
I have been using Claude Sonnet with Artifacts along with Vercel V0 to build Sveltekit pages and components really well. I create a UI in V0 and then simply copy the JSX into Claude and tell it to convert to Sveltekit. It creates the +page.svelte +page.server.ts and all the components almost perfectly.
Claude helped me code up a quick webhook handler for a silly side project. Being a side project, I asked it to put the code into a Docker container, which it did flawlessly. It also gave me a Makefile with sensible targets for building, rebuilding, starting, stopping, etc. Finally, I asked it to use an .env file for the secrets I needed to store. Everything it did was one-shot, on target. The only thing that would make this better would be to have it in the terminal with me.
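For flavor, a minimal sketch of that kind of setup, assuming Flask and python-dotenv and a made-up X-Signature scheme (the poster's actual handler isn't shown):

# Webhook handler sketch: secrets come from a .env file, the app runs in a container.
# Requires `pip install flask python-dotenv`.
import hashlib
import hmac
import os

from dotenv import load_dotenv
from flask import Flask, abort, request

load_dotenv()  # pulls WEBHOOK_SECRET from the .env file
SECRET = os.environ["WEBHOOK_SECRET"].encode()

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def webhook():
    # Verify the sender knows the shared secret before trusting the payload.
    signature = request.headers.get("X-Signature", "")
    expected = hmac.new(SECRET, request.get_data(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        abort(401)
    return {"status": "received"}, 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)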
I keep saying that if nothing else, we are in a golden age of actually finishing side projects. All those “it’s easy, I’ll just glue this API to this API” projects really are easy now.
This, exactly. Every side project I've thrown at Claude 3.5 Sonnet has been completed that same night. It's so different from how I used to treat my backlog projects which could take a week or so of research-code-iterate and now they're just an evening (or less; since Sonnet's upgrade on average it's taken me about 20-40 minutes to get the night's planned work done) and I get to sleep earlier. Win-win-win
This is not really valuable, because it could probably be done by a Fiverr contractor cheaper than the marginal cost of $(LLM training cost + query cost).
I'm fairly impressed with Sonnet's one-shot scripting performance for my use cases. However, I was using it to help me diagnose a GNOME keyring and SSH issue I was having, and it suggested that I 'rm -rf' my keyring files to test its solution out. A little drastic... maybe mv my password file first? Anyway, it sometimes seems even more cocky than the last gen, and less careful by default.
Claude Sonnet is freaking amazing. I used to have a safety test[0] that Claude failed. But it was a bogus safety test and fortunately someone here told me so and I immediately subscribed to it. It's amazing. The other day I ported a whole node.js script to Python with it. It was not flawless but it was pretty damned good. Such a mechanical process and I just had to review. Loved it.
I cancelled my OpenAI membership and am using Claude more and more. Sonnet is pretty fast and cheaper than 4o.
I'm legit elated that a smaller player is able to compete with large behemoths like OpenAI and Google. (I know they have Amazon backing them, but their team is much smaller. OpenAI is ~1000 employees now).
I'm building on top of their api. It's neat. I wish them the best.
Yeah, that's why the "organisation" wording is unfortunate. It's meant more like a group name under which you will have your projects, not a corporation name.
These models are clearly great with language, be it natural language or code. However, I wonder where the expectation comes from that a static stochastic parrot should be able to compute arbitrary first order logic (in a series of one-shot next word predictions). Could any expert elaborate on how this would be solved by a transformer model?
The incredible ability of Claude 3.5 Sonnet to create coherent SVG makes me wonder if the LLM was not just pretrained on text. Vision capabilities are usually added later using a vision encoder that does not affect the LLM's knowledge of the visual world, but in this case the LLM clearly has quite a strong understanding of the visual world.
Have you read the "sparks of AGI" paper about GPT4? It suggested that even just text can give an LLM a rich world model, based on the tikz drawings of a unicorn that got progressively better as GPT4 precursors were trained on increasingly more data (and, interestingly, the drawings got worse when it was RLHF'd for safety).
Yes of course, as always, it's very possible that just scaling solved the problem, but the fact that the model is so good makes me wonder if they actually did something different and pre-trained the model on image tokens as well.
Claude 3.5 Sonnet can solve the farmer and sheep problem with two small changes to the prompt:
1. change the word "person" to "human".
2. change the word "trips" to "trip or trips". (Claude is probably assuming that the answer has to be in multiple trips because of the word "trips")
I know of Sweep <https://docs.sweep.dev>, but honestly the SOTA on SWE-bench, which is basically exactly what you're asking for, is only about a 25% success rate, so expect very mediocre results.
Another one is https://www.github.com/trafficguard/nous which provides a software dev agent that can find/clone repos, create a branch, search in a repo, then delegates to Aider for the code editing in the edit/compile/lint/test loop, and raise a merge request.
I wonder if OpenAI has a response ready or if they're just tackling other business problems now, like the iOS integration and the seemingly postponed low-latency chat launch. Either way I am looking forward to Claude 3.5 Opus.
It did seem slightly odd to me that OpenAI made their supposedly best model free.
I’ve thoroughly enjoyed the product overall much more than ChatGPT. I do wish it had voice input that rivaled what OpenAI previewed. Excited for 3.5 Opus. For now I’ve canceled OpenAI subscription and removed the app in favor of Claude.
gpt-4o: "Yes, 400,000 is the same as 4 X 100 × 1000."
gpt-4:
Yes, 400,000 is the same as 4 X 100 × 100.
Here's the breakdown:
4 × 100 = 400
400 × 100 = 40,000
Oops, it looks like there was an error in my previous calculation.
So, 4 × 100 X 100 actually equals 40,000, not 400,000. To make it 400,000,
you would need to use 4 × 10 x 10,000 or adjust the numbers similarly.
claude-3.5:
No, 400000 is not the same as 4 * 100 * 100.
Let's break it down:
4 * 100 * 100 = 4 * 10000 = 40000
400000 is ten times larger than 40000.
Everyone keeps saying GPT-4o beats benchmarks and stuff, but this is consistently my experience with it. The benchmark results are far off from my everyday experience.
Ah, I think it's just much better trained. At double the speed and half the price, I'd assume there's less computing power behind it. But likely the training covers some topics really well.
This kind of thing happens so frequently that I don't think so. My guess is the language auto detection messes it up for multilinguals and it works fine for people who are solely using English? Because it's certainly broken on voice input - I ask it a question in English and it will respond in Russian and such.
Interesting. We're finding that it's more verbose than GPT-4o (and often ignores instructions to give output in a specific format), but does better when faced with ambiguity and potentially confusing situations.
Here is an example using http://github.com/robusta-dev/holmesgpt/ where I ask it to troubleshoot a payment processing workload in a Kubernetes cluster, with the GPT-4o output underneath:
holmes ask --model="bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0" "what is wrong with the payment processor?"
AI: To investigate what's wrong with the payment processor, we'll need to gather more information about it. First, let's try to find the relevant Kubernetes resources related to the payment processor.
Running `kubectl get -A --show-labels pod | grep payment`
Thank you for your question about the payment processor. Based on my initial search, I've found two pods related to payment processing. Let's investigate the one that seems to be having issues:
1. In the default namespace, there's a pod named "payment-processing-worker-5bb5bd75cd-x4g99" which is in a CrashLoopBackOff state. This indicates that the pod is repeatedly crashing and restarting.
2. In the sock-shop namespace, there's a pod named "payment-ff86cd6f8-rs47b" which appears to be running normally.
Let's focus on the problematic pod in the default namespace. To get more information about what's wrong, we'll need to look at its logs and description.
Running `kubectl describe pod payment-processing-worker-5bb5bd75cd-x4g99 -n default`
AI: Based on the information gathered, it appears that there is an issue with the payment processor in the Kubernetes cluster. Let me explain what's wrong:
1 The payment processing worker pod is in a CrashLoopBackOff state:
• Pod name: payment-processing-worker-5bb5bd75cd-x4g99
• Namespace: default
• Status: CrashLoopBackOff
2 The pod is continuously restarting and failing:
• Restart Count: 242
• The container is in a "Waiting" state with the reason "CrashLoopBackOff"
3 The root cause of the issue appears to be related to an environment variable:
• The container is running a bash script that checks for the presence of the DEPLOY_ENV environment variable.
• If DEPLOY_ENV is not set, the container exits immediately after printing an error message.
• The command being run is:
if [[ -z "${DEPLOY_ENV}" ]]; then echo Environment variable DEPLOY_ENV is undefined ; else while true; do echo hello; sleep 10;done; fi
4 The environment section in the pod description shows:
Environment: <none>
This indicates that no environment variables are set for the container.
5 The container is exiting with a status of "Completed" and an exit code of 0, which suggests that it's running the script, finding that DEPLOY_ENV is undefined, and exiting after printing the error message.
To fix this issue:
1 The DEPLOY_ENV environment variable needs to be set for the payment-processing-worker deployment.
2 You should update the deployment configuration to include this environment variable.
3 After setting the environment variable, the pod should stop crashing and enter a running state where it prints "hello" every 10 seconds.
To resolve this, you'll need to update the deployment configuration for the payment-processing-worker. You can do this by editing the deployment and adding the DEPLOY_ENV environment variable with an appropriate value.
Would you like me to show you how to update the deployment with the required environment variable?
Note that it explicitly picks up on the ambiguity in the question and notes that there are 2 payment related workloads and one is healthy.
GPT-4o:
$ holmes ask "what is wrong with the payment processor?"
Running `kubectl get -A --show-labels pod | grep payment-processor`
Running `kubectl get -A --show-labels deployment | grep payment-processor`
Running `kubectl get -A --show-labels service | grep payment-processor`
Running `kubectl get -A --show-labels pod | grep payment`
Running `kubectl describe pod payment-processing-worker-5bb5bd75cd-x4g99 -n default`
Running `kubectl logs payment-processing-worker-5bb5bd75cd-x4g99 -n default --previous`
AI: payment-processing-worker-5bb5bd75cd-x4g99 in namespace default crashes due to missing DEPLOY_ENV environment variable.
Fix: Define DEPLOY_ENV in the deployment spec.
Example patch:
env:
- name: DEPLOY_ENV
value: "production"
I have a mathematics (though not physics) degree and I didn't understand your question at all; "forques" appears to be either a place in France, Old French, or Catalan. I assume ChatGPT was correct in re-spelling "forques" as "torques", but have you tried asking Claude using words that do appear on the Internet?
I'd expect them to do better when the input uses words that appear more in the training data.
This very thread is the fifth hit on Google for `"forques" geometric algebra`; the third and fourth hit are the same paper as each other; the second hit is https://bivector.net/PGAdyn.pdf which appears to have invented the term; and the first hit doesn't define it.
I (logic, computability, set and type theory) am in no position to know whether it's a standard term in geometric algebra, but I do strongly expect LLMs to do much worse on queries that don't appear much in their training set (for which I take Google search results as a proxy); even if they have the knowledge to answer, I expect them to answer better when the question uses common words. I do know that when I asked your question to ChatGPT, it silently re-spelt "forques" as "torques".
[1] https://www.anthropic.com/news/projects