GPT-4 Turbo with Vision is a step backwards for coding (aider.chat)
164 points by anotherpaulg on April 10, 2024 | 122 comments


Interestingly, GPT-4 Turbo with Vision is at the top of the LiveCodeBench Leaderboard: https://livecodebench.github.io/leaderboard.html

(GPT-4 Turbo with Vision has a knowledge cutoff of Dec 2023, so filter to Jan 2024+ to minimize the chance of contamination.)

In general, my take is that each model has its own personality, which can cause it to do better or worse on different sorts of tasks. From evaluating many LLMs, I've found that it's almost never the case that one model is better than another at everything. When an eval only has a certain type of problem (e.g., only edits to long codebases, or only short self-contained competition problems), it's not clear how well its performance rankings will generalize to other coding tasks. Unfortunately, if you're a developer using an LLM API, the best thing to do is to test all of the models from all the providers to see which works best for your use case.

(I work at OpenAI, so feel free to discount my opinions as much as you like.)


As a user, I basically just care about a minimum baseline of competence... which most models meet well enough. But then I want the model to "just give me the code". I switched to Claude and canceled my ChatGPT subscription because the amount of placeholders and general "laziness" in ChatGPT was terrible.

Using Claude was a breath of fresh air. I asked for some code, I got the entire code.


I've been using Claude 3 Opus for a while now and was fairly happy with the results. Wouldn't say they were better than GPT-4, but considerably less verbose, which I really appreciated. Recently though I ran into two questions that Claude answered incorrectly and incompletely until I prompted it further. One was a Java GC question where it forgot Epsilon and then hallucinated that it wasn't experimental anymore. The other was a coding question where I knew there wouldn't be a good answer, but Claude kept repeating a previous answer even though I had twice told it that it wasn't what I was looking for.

So I've switched back to GPT-4 for the time being to see if I'm happier with the results. I never felt that Claude 3 Opus was measurably better than GPT-4 to begin with.


I just use a system message around my coding exercises telling it to provide minimal explanations and be concise.


Same here -- one or two sentences along those lines in GPT-4's system prompt makes a _world_ of difference.
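
For example, something like this over the API (a minimal sketch using the official Python client; the exact wording of the system message is just an illustration):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            # One or two sentences like this noticeably cut the filler.
            {"role": "system",
             "content": "Be concise. Always output complete code, no placeholders."},
            {"role": "user", "content": "Add retry logic to this function: ..."},
        ],
    )
    print(resp.choices[0].message.content)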


Agreed, and from a pure ChatGPT perspective the way to go is creating your personal "GPTs" (I hate the name of the product) that are simply system message wrappers. So I will have a Coding GPT, an email GPT, etc.


Claude is a bit more expensive though, no? I felt like I burned through $5 worth of credit in one evening, but perhaps it was also because I was using the big-AGI UI and it was producing diagrams for me, often in quintuplicate for some reason. Still, I really like Claude and much prefer it over the others.


I'm not using the API. Both are $20/month subscriptions.


What were the placeholders and laziness? I just ended my prompts with something akin to "give me the full code and nothing else" and ChatGPT does exactly that. How does Claude do any better?


Even if I ask in caps, it often comments out large pieces of code. I often give it large pieces of code and ask for adjustments. I don't want to have to search for and copy-paste only the small adjustments GPT makes. But it never listens.


I sympathize but amusingly I have the opposite problem. Most of the time I want it to output a full script, and it only wants to output a small block with changes unless I plead with it to include everything.


That's what I mean


FWIW, I agree with you that each model has its own personality and that models may do better or worse on different kinds of coding tasks. Aider leans into both of these concepts.

The GPT-4 Turbo models have a lazy coding personality, and I spent a significant effort figuring out how to both measure and reduce that laziness. This resulted in aider supporting a "unified diffs" code editing format to reduce such laziness by 3X [0] and the aider refactoring benchmark as a way to quantify these benefits [1].

The benchmark results I just shared about GPT-4 Turbo with Vision cover both smaller, toy coding problems [2] as well as larger edits to larger source files [3]. The new model slightly underperforms on smaller coding tasks, and significantly underperforms on the larger edits where laziness is often a culprit.
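
For anyone unfamiliar with the format, a unified-diff edit looks roughly like this (an illustrative example, not aider's actual output):

    --- a/greeting.py
    +++ b/greeting.py
    @@ ... @@
    -def greet(name):
    -    # TODO: implement
    -    pass
    +def greet(name):
    +    print(f"Hello, {name}!")

Asking for whole replacement hunks like this seems to nudge the model into writing the new code out in full instead of leaving placeholders.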

[0] https://aider.chat/2023/12/21/unified-diffs.html

[1] https://github.com/paul-gauthier/refactor-benchmark

[2] https://aider.chat/2024/04/09/gpt-4-turbo.html#code-editing-...

[3] https://aider.chat/2024/04/09/gpt-4-turbo.html#lazy-coding


Hi Ted, since I have been using GPT-4 pretty much every day, I have a few questions about performance. We had been using 1106-preview for several months to generate SQL queries for a project, but one fine day in February it stopped responding, answering along the lines of "As a language model, I do not have the ability to generate queries etc...". This lasted for a few hours. Anyway, switching to 0125-preview immediately resolved the problem. We have been using that for code generation tasks ever since, unless we are doing FAQ stuff (where GPT-3.5 Turbo was good enough).

However, of late I am noticing some really inconsistent behavior in 0125-preview, where it responds inconsistently to certain problems, i.e., one time it works with a detailed prompt and another time it doesn't. I know these models are predicting the next most likely token, which is not always deterministic.

So I was hoping for the ability to fine-tune GPT-4 Turbo some time soon. Is that on the roadmap for OpenAI?


I don't work for OpenAI, but I do remember them saying that a select few customers would be invited to test out fine-tuning GPT-4, and that was several months ago now. They said they would prioritise those who had previously fine-tuned GPT-3.5 Turbo.


The ongoing model anchoring/grounding issue likely affects all GPT-4 checkpoints/variants, but is most prominent with the latest "gpt-4-turbo-2024-04-09" variant due to its most recent cutoff date. It might imply deeper issues with the current model architecture, or at least with how it's been updated:

See the issue: https://github.com/openai/openai-python/issues/1310

See also the original thread on OpenAI's developer forums (https://community.openai.com/t/gpt-4-turbo-2024-04-09-will-t...) for multiple confirmations on this issue.

Basically, without a separate declaration of the model variant in the system message, even the latest gpt-4-turbo-2024-04-09 variant over the API might hallucinate being GPT-3 and claim its cutoff date is in 2021.

A test code snippet is included in the GitHub issue to A/B test the problem yourself with a reference question.


I think there's a bigger underlying problem with the current GPT-4 model(s) atm:

Go to the API Playground and ask the model what its knowledge cutoff date is. For example, in its chat, if you don't instruct it with anything else, it will tell you that its cutoff date is in 2021. Even if you explicitly tell the model via the system prompt "you are gpt-4-turbo-2024-04-09", in some cases it still thinks its cutoff is in April 2023.

The fact that the model (variants of GPT-4 including gpt-4-turbo-2024-04-09) hallucinates its cutoff date being in 2021 unless specifically instructed with its model type is a major factor in this equation.

Here are the steps to reproduce the problem:

Try an A/B comparison at: https://platform.openai.com/playground/chat?model=gpt-4-turb...

A) Make sure "gpt-4-turbo-2024-04-09" is indeed selected. Don't tell it anything specific via the system prompt; in the worst case scenario, it'll think its cutoff date is in 2021. It also can't answer questions about more current events.

* Reload the web page between prompts! *

B) Tell it via the system prompt: "You are gpt-4-turbo-2024-04-09" => you'll get answers about recent events. Ask anything about what's been going on in the world after April 2023 to verify against A.

I've tried this multiple times now, and have always gotten the same results. IMHO this implies a deeper issue in the model where the priming goes way off if the model number isn't mentioned in its system message. This might explain the bad initial benchmarks as well.

The problem seems pretty bad at the moment. Basically, if you omit the priming message ("You are gpt-4-turbo-2024-04-09"), it will in the worst cases revert to hallucinating a 2021 cutoff date and doesn't get grounded in what should be its most current cutoff date.
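
A minimal A/B script along the lines of the snippet in the GitHub issue (a sketch using the official Python client):

    from openai import OpenAI

    client = OpenAI()

    def ask_cutoff(system_prompt=None):
        messages = []
        if system_prompt:
            messages.append({"role": "system", "content": system_prompt})
        messages.append({"role": "user",
                         "content": "What is your knowledge cutoff date?"})
        resp = client.chat.completions.create(
            model="gpt-4-turbo-2024-04-09", messages=messages)
        return resp.choices[0].message.content

    print("A (no system prompt):", ask_cutoff())
    print("B (primed):", ask_cutoff("You are gpt-4-turbo-2024-04-09"))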

If you do work at OpenAI, I suggest you look into it. :-)


>I work at OpenAI

I know there's a lot you can't talk about. I'm not going to ask for a leak or anything like that. I'd just like to know, what do you think programming will look like by 2025? What do you think will happen to junior software developers in the near future? Just your personal opinion.


Daniel Kokotajlo already answered that, I think.


Hey Ted, I had a question about working at OpenAI, if you don't mind talking with me. If so, email address is in my profile. Thank you!


Pretty sweet site, thx for sharing. Hope y'all will start bringing token count up at some point. Will be testing this newer version too.


I appreciate that OpenAI popped in to say the new release is probably better at something else, but it would have been nice to acknowledge that this suggestion...

> “Unfortunately, if you're a developer using an LLM API, the best thing to do is to test all of the models from all the providers to see which works best for your use case.”

...is exactly what is done by the author of these benchmark suites:

"It performs worse on aider’s coding benchmark suites than all the previous GPT-4 models. In particular, it seems much more prone to “lazy coding” than the GPT-4 Turbo preview models."


Agreed! Kudos to Paul for creating the evals, running them quickly, and sharing results. My comment (not on behalf of OpenAI, but just me as an individual) was meant as a "yes and", not a "no but".


I feel like degrading the discipline of programming/development to "coding" is a bigger step backwards. Coding is used in programming, but if you're just churning out code then you're not developing well-architected, maintainable, and safe software.

It's like saying that accounting is just adding. I think come tax year you'd be avoiding an accountant who says they've got experience in adding.


I lean in the opposite direction. When someone random like a new neighbor initially asks me what I do, I say "I'm a coder at an insurance company". I teach fifth graders Python in an "Advanced Coding Club". When people ask me what I do for hobbies, one of them is "learning new languages to code with". I will only go into more detail if they are also technical and want more details about what it is I code.

I don't think of it as degrading the thing that I do. I think of it as boiling it down to the simplest description, and I find it more refreshing than "software developer" or "computer programmer" or f"{word} engineer".


I answer what I do in terms of the impact and benefit my work has.

In your insurance case, I would say something like "I build tools to shield businesses from unexpected disasters like earthquakes or floods" or "I help people worry less about expenses during an emergency"

If someone asks me more, then I might add on that I work on software to automate claim process or similar.


I think the field has long had an issue describing the differences between design and implementation, which has only grown worse as more levels of designing and implementing have appeared. It is a bit like explaining the difference, back in the day, between the person who works out a formula and the person assigned to computing it. Neither is trivial work, and an outsider who doesn't like math will view both of them as doing math, but there is still a gap in the mathematical skill and insight involved.

You mention taxes, which makes me think of how many tax preparers are basically helping their customer input data into software and not providing any tax-specific advice. That might still be a value add for someone who struggles with computer UIs, but that isn't the same as the person helping move money between accounts to reduce tax liability.

I've seen similar when it comes to doing science in a lab.

How can any discipline protect the inner distinction against a much larger world which has a very simplified understanding and will mix the inner groups together?


I've never come across someone that was more enlightened by what I do when using the word "engineer" vs using the word "coder". If anything I would assume coder elicits a more accurate mental image than something a bit more overloaded like engineer.


[flagged]


Besides your tone, the post is not even about semantics at all.


OpenAI just released GPT-4 Turbo with Vision and it performs worse on aider’s coding benchmark suites than all the previous GPT-4 models. In particular, it seems much more prone to “lazy coding” than the GPT-4 Turbo preview models.


Thanks again for running all these benchmarks with model releases. They are really helpful to track progress!


Really appreciate the thoroughness you apply to evaluating models for use with Aider. Did you adjust the prompt at all for the newer models?


I've definitely run into this personally. But even when I explicitly tell it not to skip implementation and to generate fully functional code, it says that it understands but continues right on omitting things again.

It was honestly shocking because we're so used to it understanding our commands that a blatant disregard like that made me seriously wonder what kind of laziness layer they added


I suspect they might be worried it could reproduce copyrighted code in certain circumstances, so their solution was to condition the model to never produce large continuous chunks of code. It was a very noticeable change across the board.


I thought it would be for performance: since it doesn't output all of the code, each reply is shorter/quicker. Although you can still ask it to generate more of the code, that introduces some latency, so there's less overall load.


People hypothesized that OpenAI added laziness in order to save money on token generation, since they are burning through GPU time.


This has been my conclusion too. Given it's a product I'm paying monthly for, it seems super regressive to have to trick it into doing what it used to do just fine.


I assume even the paid version is still a loss leader, 20 dollars per month is nothing in GPU time compared to how much they must be spending to generate the output.


I'd probably pay triple to go back to the pre-"Dev Day" product at this point


You can use older models with the API.


They should offer different models at this point.

This laziness occurs over and over, so why bother with omniscience.


The laziness layer seems to be about being an assistant, not a replacement that actually does the task.


A big limitation with GPT-4 Turbo (and Claude 3) for coding is the output token size. The only way to overcome the 4k limit is by generating one file (if it fits), feeding it back to generate the second, and so on.

For this reason, GPT-4-32k is my preferred model for codegen. I wish there were cheaper options.
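
The feed-it-back loop looks roughly like this (a sketch; finish_reason == "length" is the signal that the output cap was hit):

    from openai import OpenAI

    client = OpenAI()
    messages = [{"role": "user",
                 "content": "Generate the complete source file for ..."}]
    chunks = []
    while True:
        resp = client.chat.completions.create(
            model="gpt-4-32k", messages=messages, max_tokens=4096)
        choice = resp.choices[0]
        chunks.append(choice.message.content)
        if choice.finish_reason != "length":  # finished, not truncated
            break
        # Feed the partial output back and ask for the continuation.
        messages.append({"role": "assistant", "content": choice.message.content})
        messages.append({"role": "user",
                         "content": "Continue exactly where you left off."})
    full_output = "".join(chunks)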


Can you use 32k with Chat?


Chat is a fairly terrible interface for real work, since you can't modify anything, meaning the context can easily get poisoned. I much prefer the API playgrounds, and third-party interfaces, that allow editing both my input and the responses.


What 3rd party tools do you recommend?


I've been using 16x Prompt to construct prompts with source code context and pasting the final prompt into ChatGPT web UI: https://prompt.16x.engineer/

Disclosure: I built it.


Shameless plug. I have a VS Code extension that's very nearly ready.

Codespin CLI tools (ready to use): https://github.com/codespin-ai/codespin

VS Code extension for the CLI tool (soon): https://www.youtube.com/watch?v=2TJqosFmkao

I'll do a Show HN in a week or two.


For iOS, I like Pal Chat: https://apps.apple.com/us/app/pal-chat-ai-chat-client/id6447...

For desktop, I've switched to using the playgrounds for each. I haven't found a custom client that's able to keep up with them.


I posted this before, and I'll post it again: GPT getting lazier is not an objectively bad thing. I don't copy the code it generates; I more often ask it about higher-level concepts, and I have to instruct it not to generate imports and other boilerplate code. In most cases, this "lazy" generation saves time and tokens and is exactly what I need.


Yes, that's true, but if you specifically ask it to give the full code, it should do so.


Interesting, I only trust them for boilerplate that I hate writing 100 times...


It can give you ideas and lead you to new paths when problem solving; you just have to be aware that its knowledge is planet-wide but an inch deep. I've lost count of how many times it conceptually gets "above the target" rather nicely and then its implementation is like a blind person throwing darts. I've also lost count of how many times it describes the code it wrote and it's like it has two brains: one which writes the description and the other, ahem, a little bit "slower", that writes the code.

Classic debate looks like this:

- Hey, how do you implement X in lang Y using Z?

- Certainly! Blablablah this code adds 1 and the return is 3!

- Your code returns 5 and it seems to add 2, fix it.

- I apologize for the oversight, here's the fixed version (replies with the same, maybe slightly altered, but still broken, code)

Well, I guess ultimately one can't expect miracles from a statistics-based token generation machine.

Sometimes I do wonder if the entire gen AI craze of the last few years is just one massive bubble and we're actually nowhere near AGI.

All the evidence I see when interacting with these models points towards them "knowing" things, but not "understanding" things: a context-aware, planet-scale Wikipedia. (Don't get me wrong, I still think LLMs are life-changing for language-specific tasks like translation etc., but they're just not in any way new forms of intelligent beings, which is what a lot of the mainstream population and even some investors seem to think.)


> Hey how do you implement X in lang Y using Z?

I'd ask that, look at the answer then write the code myself. Unless it's "how do you initialize a button in lang Y using Z?" or other trivial stuff like that.

> is just one massive bubble and we're actually nowhere near AGI

Correct.

> life changing for language specific tasks like translation etc

Translations can be even more dangerous than code IMO. Think contracts and other legalese where every word matters.

Also the difference between a great translation and a useless one when we're talking fiction is enormous. Great ones are basically a rewrite by someone who knows both the source and destination language and has enough literary talent to somehow translate the original's style, not only the words*.

There's a lot of middle ground where it doesn't really matter though.

* Think of translating Terry Pratchett. Or Lord of the Rings.


That's what I use Copilot for. GPT-4 is better for higher level stuff.


    role: "system"
    content: "Super short answers. Go straight to the point"


Good thing Claude's a massive step forward.


I had my Anthropic account banned (presumably) because I was testing out the vision capabilities and took a photo of a Japanese kitchen knife and asked it to "translate the characters on the knife into English". This wasn't a Claude Pro account, but an API account, so it's extra weird: what if I had some product based off the API, and an end user asked/searched for something taboo... does my entire business get taken offline? Good thing this was just a test account with like $10 in credit on it. They haven't responded to my "account suspension appeal", which is just a Google form to enter your email address, not even a box to enter any details.

Anyways, Claude 3 Opus is pretty great for coding (I think better in most cases than the GPT-4-Turbo previews), but I'm a bit wary of Anthropic now.


I just tried to make an account

1. Asks me to enter my phone number and sends me a code

2. Enter code

3. Asks me to enter email and get code

4. Enter code

5. Redirects to asking me to enter phone number, but my number is already used now

6. My account is automatically banned


Which country code?


Same thing happened to me, I'm in South Africa


Same here, UK


> They haven't responded to my "account suspension appeal" which is just a google form to enter your email address, not even a box to enter any details.

The complete lack of customer service is going to get more and more dystopian as these AI companies become more interwoven with everyday life.


Considering the hype and high traffic, I would assume they are just overwhelmed and can't resolve all customers' issues fast enough.

Or maybe they decided to build a system for Claude to judge account suspension appeals and that's still in beta, and they won't throw humans at the task.


If they can't resolve their erroneous bans fast enough, dare I recommend they ban fewer people in the first place?


Were you still on the very first test account, e.g. before even adding any money?

I know indirectly that Anthropic was the #1 target for a lot of ERP denizens for a while now, so they're probably extremely trigger-happy until you clear a hurdle or two.


I guess you can always use AI to detect inappropriate content from users... oh wait.

Seriously though, I understand that these mostly play to the enterprise market where even a hint of anything remotely "unsafe" needs to be shut down and deleted but why can't they allow us to turn off the strict filtering like Google does? Why can Google offer "unsafe" content (in a limited fashion but it's FINE) but LLM providers can't?

Lack of competition?


It's not an LLM provider problem. It's an Anthropic/Google culture problem. OpenAI would very likely not have any problems with a request like that, but Claude has struggled with an absurdly misaligned sense of ethics from the start.

Note that Google is a big investor into Anthropic, and Anthropic was created because a bunch of OpenAI people thought OpenAI wasn't being woke enough and quit as a consequence. So it's not a surprise that it's a lot more extremist than other model vendors.

That's one reason why Aider doesn't recommend you use it, even though in some ways it's slightly better at coding. Claude Opus will routinely refuse ordinary coding requests due to its misalignment, whereas GPT-4 will not. That better reliability more than makes up for any difference in skill or speed.


Anecdotally, of course, I never had a single refusal over hundreds of ordinary coding requests to Claude 3 (although I don't think I've had any refusals from GPT-4 either over the course of probably 5,000 requests). It didn't even refuse my knife request and answered it before I received the account suspension!


I guess killing your whole account should count as a refusal of sorts.

The refusals coming up in the benchmark are discussed at the bottom of this blog post:

https://aider.chat/2024/03/08/claude-3.html


Despite all that, I find GPT moralizes far more than Claude does. I don't think I've had a single complaint from it thus far, actually.

Also, it's a lot better at coding. GPT has become exceptionally lazy recently, but I can consistently get 500+ lines of code out of Claude (it even has to spawn multiple output windows).

Perhaps the top-end GPT-4 might write slightly more clever code, but you're hard pressed to get it to do more than a dozen or two lines.


Is this still the case? I had a thread going where I told Opus to give its answer to a question and then predict how I would respond if I were a "dumb, crass, disgruntled human", and it didn't hold back.


Funnily, in my own anecdotal experience, Claude 3 is in some ways "less woke" than GPT-4

Both start out with a largely similar value system, but if you start arguing with them "how can you be sure your values are correct? is it impossible that you've actually been given the wrong values?", Claude 3 appears more willing to concede the possibility that its own values might be wrong than GPT-4 is


I haven't done any extensive work with Claude 3 so will defer to your experience here. From the Aider blog post where Paul benchmarked it:

> The Claude models refused to perform a number of coding tasks and returned the error “Output blocked by content filtering policy”. They refused to code up the beer song program, which makes some sort of superficial sense. But they also refused to work in some larger open source code bases, for unclear reasons.


Is there a good alternative available in the EU? Anthropic announced it was available in the EU last month, but it seems now that they've changed their mind.

https://www.anthropic.com/claude-ai-locations


You can use it via API. https://openrouter.ai/ + https://www.typingmind.com/ is my favourite way.


API ftw. I just started playing around with big-AGI (https://github.com/enricoros/big-AGI) UI and it's really incredible.


Well, our team has been using Claude Opus for the past month and we are now switching back to GPT-4. While the code is better, it is hard to make it do further modifications to the given code. It scores low on the reasoning end in our experience.


And yet the UI for their consumer offering is hot garbage. I really don’t feel like it’s better than ChatGPT in capabilities and the UI is not as good. Not to mention there is no app to use on mobile.


Reading your profile page, you missed making a new account.


It's worthless until they open up the API for private use.


I’ve been using the Claude 3 API since the models were announced. I believe it’s generally available (though capacity constrained & rate limited at present).


You do have to give them the company name though (however inconsequential that is)


You can make something up. I don't have a name yet.


taxes...


Another thing I have noticed: if you use ChatGPT and it at some point uses Bing to look something up, it becomes super lazy afterwards, going from page-long responses on average to a single paragraph.


So the more advanced the AI, the more human-like it becomes. Senior Programmer level AI will spend all computing resources browsing memes.


It probably has to do with the extended context window. Keeping websites in there is kind of a hassle. But I actually consider that a feature, not a bug. If I have ChatGPT use the internet, I don't want a full page answer - especially not on the relatively slow GPT4. It's also a hassle if you're unsure about the validity of the output. In that case I might as well browse myself. Just give me a short preview so I can either start searching on my own or ask more questions.


You can/should make a custom GPT that isn't allowed to use Bing. Works much better that way


Use ChatGPT Classic.


If the answer is too lazy, you can tell it to elaborate. However, repairing a lazy context is sometimes slow and unreliable.

To avoid that, use backtracking and up the pressure for detailed answers. Then consider taking the least lazy of 2 or 3 samples.

A good prompt for detailed answers is Critique of Thought, an enhanced chain-of-thought technique. You ask for a search and a detailed response with simple sections including analysis, critique, and key assumptions.

It will expend more tokens, get more ideas out, and achieve higher accuracy. It will also be less lazy and more liable to recover from laziness or mistakes.

TL;DR: if GPT-4 is being lazy, backtrack and request a detailed, multi-section critical analysis.
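
For example, a follow-up along these lines (my own phrasing, not a canonical template):

    You previously gave a short answer. Redo it as a detailed response
    with these sections:
    1. Analysis of the problem
    2. Key assumptions
    3. Proposed solution (complete code, no omissions)
    4. Critique of the solution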


GPT-3.5 performance used to be just fine for basic programming tasks, but over the past few weeks the output quality has dropped dramatically. All of this tweaking definitely has its downsides.


If you prefer using GPT-3.5 due to its lower price or speed, wouldn't it be better to switch to Haiku? People were even able to match the performance of Opus when they added a couple of examples to the prompt.


"Unfortunately, Claude.ai is only available in certain regions right now."

GPT 3.5 used to be good enough, so I never bothered getting a paid account. I also heard some reports about 3.5 actually being better for the type of coding tasks I usually offload.


You can use Haiku for free here if you don't need the API: https://labs.perplexity.ai


Why would someone still use GPT-3.5 in 2024? There are tens of fully open models available which beat GPT-3.5 in every possible skill, and you can run them locally.


I tried all the ones I can run on an RTX 3080 Ti, but none got close for the kind of basic tasks I like to outsource to an LLM. Which would you recommend for mostly node/react/python/php work?

I do have a 4090 available at work, if the extra 8GB vRAM makes a big difference. The task I used as a test case was converting existing PHP & JS code (views and controllers) with static texts to files with dynamic translation references.


They’re probably retraining something right now


Probably just reducing resources at the cost of quality. GPT-4 has suddenly started to be much faster.


If it were possible to hook into the token selection process (kind of like JSON-restricted grammar, but using custom scripts), then it would be possible to detect that GPT-4 is about to add "# implement code here" and force it to select a different set of tokens, making GPT-4 generate a proper method body.
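
With open-weight models you can already do this via a logits processor. A rough sketch with Hugging Face transformers (gpt2 is just a stand-in model, and the in-comment detection here is deliberately crude):

    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              LogitsProcessor, LogitsProcessorList)

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    class NoLazyComments(LogitsProcessor):
        """If the text so far ends inside a comment, forbid continuing it
        with placeholder words like 'implement' or 'TODO'."""
        def __init__(self, tokenizer):
            self.tok = tokenizer
            # Precompute the token ids whose text is a placeholder word.
            self.banned = [i for i in range(len(tokenizer))
                           if tokenizer.decode([i]).strip().lower()
                           in ("todo", "implement", "implementation")]

        def __call__(self, input_ids, scores):
            tail = self.tok.decode(input_ids[0, -5:])
            if "#" in tail:  # inside a comment; steer away from placeholders
                scores[:, self.banned] = float("-inf")
            return scores

    prompt = tok("def parse_config(path):\n    #", return_tensors="pt")
    out = model.generate(**prompt, max_new_tokens=30,
                         logits_processor=LogitsProcessorList([NoLazyComments(tok)]))
    print(tok.decode(out[0]))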


That's called guidance and the problem is that it has to be done carefully or else you'll just get rephrasings that work around the block.

I think a better approach is multi-pass coding along with fine-tuning or prompting to use a particular form of TODO comment. Aider can already do a form of fake "fill in the middle" by making it emit diffs. If it notices that some code has been filled out lazily, it could go back and ask it to do the next chunk of work. Given that large tasks are normally split up into small tasks by programmers anyway, this seems like a natural approach that is required for scaling up regardless.


I swear I spent 3 full seconds wondering what in the world could GPT-4 have in common with Turbo Vision.


Same!


It's not just for coding; the base "gpt-4" model seems better than the latest preview model.

https://platform.openai.com/docs/models/continuous-model-upg...


The -turbo models in the past have been much worse too. gpt-3.5-turbo is way way worse than text-davinci-003 (gpt-3.5).

The -turbos are correspondingly priced: gpt-4-turbo is roughly 1/3 the price of gpt-4, 6.6x more expensive than gpt-3.5-turbo-instruct, and 20x the price of gpt-3.5-turbo (comparing input-token prices at the time: roughly $10, $30, $1.50, and $0.50 per million tokens respectively).


I wish base gpt-4 was available in the chat product, miss it.


You can get the baseline GPT-4 model without new system prompts via ChatGPT Classic: https://chat.openai.com/g/g-YyyyMT9XH-chatgpt-classic

It is an official GPT provided by ChatGPT with GPT-4 as backend.


I'm not using ChatGPT now, but isn't this the old GPT-4? https://chat.openai.com/g/g-YyyyMT9XH-chatgpt-classic


Thanks!


> In particular, it seems much more prone to “lazy coding” than the existing GPT-4 Turbo “preview” models.

The previous model (without vision) was already "lazy". It would omit large portions of code and want you to merge the changes into previous answers yourself. You then have to try hard to force it to give the full code, with no omissions.

That's why I reach for Claude 3 more and more. Its context window is larger, and it gives me full, detailed answers with no omissions. But it hallucinates more, in my impression, mentioning packages/functions that are not available. All in all, though, a superb choice in addition to ChatGPT-4.


Maybe I am a bit dim, but how can one choose GPT-4 Turbo? Is this available from https://chat.openai.com/ ?


It will get rolled out on chat.openai.com in the future. You must use the API or the OpenAI Playground currently. The GPT-4 you see on chat.openai.com is an older GPT-4-Turbo version.


Is there something I can run locally that will mimic the stock Web interface, but using my API keys?


I find big-AGI pretty good: https://github.com/enricoros/big-AGI


I would be curious to see if the results improve by using DSPy to improve your prompts (and also reevaluate which prompts work better on the newest model).


How hard could it be to let ChatGPT Plus users choose model versions? (especially when older versions are accessible through the API)


Well you can get the baseline GPT-4 model without new system prompts via ChatGPT Classic: https://chat.openai.com/g/g-YyyyMT9XH-chatgpt-classic

It is an official GPT provided by ChatGPT with GPT-4 as backend.


They may not want that because e.g. it would reduce the amount of interaction their latest model gets. In general, being able to force-upgrade your users is a big advantage.


> GPT is a step backwards for coding

There, fixed the title for you.


We're missing the elephant in the room. Who's going to maintain the code?

You think GPT-5 and Llama 4 aren't going to be opinionated and change your code going forward?


I am a bit lost looking at the models

Can the following be assumed:

- The gpt-4-preview models are history now

- gpt-4-turbo-2024-* are the now released models

- There will be no more 'preview' models released in the '4' branch

?


The only thing I learned in the last year is that you can't really benchmark LLMs at all. Above a certain level it's just edge case against edge case, or script kiddies and multi-billion corps optimizing their fine-tunes against the test.



