GPT-4 Turbo with Vision is a step backwards for coding (aider.chat)
164 points by anotherpaulg on April 10, 2024 | 122 comments


Interestingly, GPT-4 Turbo with Vision is at the top of the LiveCodeBench Leaderboard: https://livecodebench.github.io/leaderboard.html

(GPT-4 Turbo with Vision has a knowledge cutoff of Dec 2023, so filter to Jan 2024+ to minimize the chance of contamination.)

In general, my take is that each model has its own personality, which can cause it to do better or worse on different sorts of tasks. From evaluating many LLMs, I've found that it's almost never the case that one model is better than another at everything. When an eval only has a certain type of problem (e.g., only edits to long codebases, or only short self-contained competition problems), it's not clear how well its performance rankings will generalize to other coding tasks. Unfortunately, if you're a developer using an LLM API, the best thing to do is to test all of the models from all the providers to see which works best for your use case.

(I work at OpenAI, so feel free to discount my opinions as much as you like.)


As a user, I basically just care about a minimum baseline of competence... which most models meet well enough. But then I want the model to "just give me the code". I switched to Claude and canceled my ChatGPT subscription because the amount of placeholders and general "laziness" in ChatGPT was terrible.

Using Claude was a breath of fresh air. I asked for some code, I got the entire code.


I've been using Claude 3 Opus for a while now and was fairly happy with the results. Wouldn't say they were better than GPT-4, but considerably less verbose, which I really appreciated. Recently though I ran into two questions that Claude answered incorrectly and incompletely until I prompted it further. One was a Java GC question where it forgot Epsilon and then hallucinated that it wasn't experimental anymore. The other was a coding question where I knew there wouldn't be a good answer, but Claude kept repeating a previous answer even though I had twice told it that it wasn't what I was looking for.

So I've switched back to GPT-4 for the time being to see if I'm happier with the results. I never felt that Claude 3 Opus was measurably better than GPT-4 to begin with.


I just use a system message around my coding exercises telling it to provide minimal explanations and be concise.


Same here -- one or two sentences along those lines in GPT-4's system prompt makes a _world_ of difference.
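
For example, something like this over the API (a minimal sketch using the official Python client; the exact wording of the system message is just an illustration):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            # One or two sentences like this noticeably cut the filler.
            {"role": "system",
             "content": "Be concise. Always output complete code, no placeholders."},
            {"role": "user", "content": "Add retry logic to this function: ..."},
        ],
    )
    print(resp.choices[0].message.content)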


Agreed, and from a pure ChatGPT perspective the way to go is creating your personal "GPTs" (I hate the name of the product) that are simply system message wrappers. So I will have a Coding GPT, an email GPT, etc.


Claude is a bit more expensive though, no? I felt like I burned through $5 worth of credit in one evening, but perhaps it was also because I was using the big-AGI UI and it was producing diagrams for me, often in quintuplicate for some reason. Still, I really like Claude and much prefer it over the others.


I'm not using the API. Both are $20/month subscriptions.


What were the placeholders and laziness? I just ended my prompts with something akin to "give me the full code and nothing else" and ChatGPT does exactly that. How does Claude do any better?


Even if I ask in caps, it often comments out large pieces of code. I often give it large pieces of code and ask for adjustments. I don't want to have to search for and copy-paste only the small adjustments GPT makes. But it never listens.


I sympathize but amusingly I have the opposite problem. Most of the time I want it to output a full script, and it only wants to output a small block with changes unless I plead with it to include everything.


That's what I mean


FWIW, I agree with you that each model has its own personality and that models may do better or worse on different kinds of coding tasks. Aider leans into both of these concepts.

The GPT-4 Turbo models have a lazy coding personality, and I spent a significant effort figuring out how to both measure and reduce that laziness. This resulted in aider supporting a "unified diffs" code editing format to reduce such laziness by 3X [0] and the aider refactoring benchmark as a way to quantify these benefits [1].

The benchmark results I just shared about GPT-4 Turbo with Vision cover both smaller, toy coding problems [2] as well as larger edits to larger source files [3]. The new model slightly underperforms on smaller coding tasks, and significantly underperforms on the larger edits where laziness is often a culprit.
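
For anyone unfamiliar with the format, a unified-diff edit looks roughly like this (an illustrative example, not aider's actual output):

    --- a/greeting.py
    +++ b/greeting.py
    @@ ... @@
    -def greet(name):
    -    # TODO: implement
    -    pass
    +def greet(name):
    +    print(f"Hello, {name}!")

Asking for whole replacement hunks like this seems to nudge the model into writing the new code out in full instead of leaving placeholders.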

[0] https://aider.chat/2023/12/21/unified-diffs.html

[1] https://github.com/paul-gauthier/refactor-benchmark

[2] https://aider.chat/2024/04/09/gpt-4-turbo.html#code-editing-...

[3] https://aider.chat/2024/04/09/gpt-4-turbo.html#lazy-coding


Hi Ted, since I have been using GPT-4 pretty much every day, I have a few questions about performance. We had been using 1106-preview for several months to generate SQL queries for a project, but one fine day in February it stopped responding, answering along the lines of "As a language model, I do not have the ability to generate queries etc...". This lasted for a few hours. Anyway, switching to 0125-preview immediately resolved the problem. We have been using that for code generation tasks ever since, unless we are doing FAQ stuff (where GPT-3.5 Turbo was good enough).

However, of late I am noticing some really inconsistent behavior in 0125-preview, where it responds inconsistently to certain problems, i.e., one time it works with a detailed prompt and another time it doesn't. I know these models are predicting the next most likely token, which is not always deterministic.

So I was hoping for the ability to fine-tune GPT-4 Turbo some time soon. Is that on the roadmap for OpenAI?


I don't work for OpenAI, but I do remember them saying that a select few customers would be invited to test out fine-tuning GPT-4, and that was several months ago now. They said they would prioritise those who had previously fine-tuned GPT-3.5 Turbo.


The ongoing model anchoring/grounding issue likely affects all GPT-4 checkpoints/variants, but is most prominent with the latest "gpt-4-turbo-2024-04-09" variant due to its most recent cutoff date. It might imply deeper issues with the current model architecture, or at least with how it's been updated:

See the issue: https://github.com/openai/openai-python/issues/1310

See also the original thread on OpenAI's developer forums (https://community.openai.com/t/gpt-4-turbo-2024-04-09-will-t...) for multiple confirmations on this issue.

Basically, without a separate declaration of the model variant in the system message, even the latest gpt-4-turbo-2024-04-09 variant over the API might hallucinate being GPT-3 and claim its cutoff date is in 2021.

A test code snippet is included in the GitHub issue to A/B test the problem yourself with a reference question.


I think there's a bigger underlying problem with the current GPT-4 model(s) atm:

Go to the API Playground and ask the model what its knowledge cutoff date is. For example, in its chat, if you don't instruct it with anything else, it will tell you that its cutoff date is in 2021. Even if you explicitly tell the model via the system prompt "you are gpt-4-turbo-2024-04-09", in some cases it still thinks its cutoff is in April 2023.

The fact that the model (variants of GPT-4 including gpt-4-turbo-2024-04-09) hallucinates its cutoff date being in 2021 unless specifically instructed with its model type is a major factor in this equation.

Here are the steps to reproduce the problem:

Try an A/B comparison at: https://platform.openai.com/playground/chat?model=gpt-4-turb...

A) Make sure "gpt-4-turbo-2024-04-09" is indeed selected. Don't tell it anything specific via the system prompt; in the worst case scenario, it'll think its cutoff date is in 2021. It also can't answer questions about more current events.

* Reload the web page between prompts! *

B) Tell it via the system prompt: "You are gpt-4-turbo-2024-04-09" => you'll get answers about recent events. Ask anything about what's been going on in the world after April 2023 to verify against A.

I've tried this multiple times now, and have always gotten the same results. IMHO this implies a deeper issue in the model where the priming goes way off if the model number isn't mentioned in its system message. This might explain the bad initial benchmarks as well.

The problem seems pretty bad at the moment. Basically, if you omit the priming message ("You are gpt-4-turbo-2024-04-09"), it will in the worst cases revert to hallucinating a 2021 cutoff date and doesn't get grounded in what should be its most current cutoff date.
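
A minimal A/B script along the lines of the snippet in the GitHub issue (a sketch using the official Python client):

    from openai import OpenAI

    client = OpenAI()

    def ask_cutoff(system_prompt=None):
        messages = []
        if system_prompt:
            messages.append({"role": "system", "content": system_prompt})
        messages.append({"role": "user",
                         "content": "What is your knowledge cutoff date?"})
        resp = client.chat.completions.create(
            model="gpt-4-turbo-2024-04-09", messages=messages)
        return resp.choices[0].message.content

    print("A (no system prompt):", ask_cutoff())
    print("B (primed):", ask_cutoff("You are gpt-4-turbo-2024-04-09"))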

If you do work at OpenAI, I suggest you look into it. :-)


>I work at OpenAI

I know there's a lot you can't talk about. I'm not going to ask for a leak or anything like that. I'd just like to know, what do you think programming will look like by 2025? What do you think will happen to junior software developers in the near future? Just your personal opinion.


Daniel Kokotajlo already answered that, I think.


Hey Ted, I had a question about working at OpenAI, if you don't mind talking with me. If so, email address is in my profile. Thank you!


Pretty sweet site, thx for sharing. Hope y'all will start bringing token count up at some point. Will be testing this newer version too.


I appreciate that OpenAI popped in to say the new release is probably better at something else, but it would have been nice to acknowledge that this suggestion...

> “Unfortunately, if you're a developer using an LLM API, the best thing to do is to test all of the models from all the providers to see which works best for your use case.”

...is exactly what is done by the author of these benchmark suites:

"It performs worse on aider’s coding benchmark suites than all the previous GPT-4 models. In particular, it seems much more prone to “lazy coding” than the GPT-4 Turbo preview models."


Agreed! Kudos to Paul for creating the evals, running them quickly, and sharing results. My comment (not on behalf of OpenAI, but just me as an individual) was meant as a "yes and", not a "no but".


I feel like degrading the discipline of programming/development to "coding" is a bigger step backwards. Coding is used in programming, but if you're just churning out code then you're not developing well-architected, maintainable, and safe software.

It's like saying that accounting is just adding. I think come tax year you'd be avoiding an accountant who says they've got experience in adding.


I lean in the opposite direction. When someone random like a new neighbor initially asks me what I do, I say "I'm a coder at an insurance company". I teach fifth graders Python in an "Advanced Coding Club". When people ask me what I do for hobbies, one of them is "learning new languages to code with". I will only go into more detail if they are also technical and want more details about what it is I code.

I don't think of it as degrading the thing that I do. I think of it as boiling it down to the simplest description, and I find it more refreshing than "software developer" or "computer programmer" or f"{word} engineer".


I answer what I do in terms of the impact and benefit my work has.

In your insurance case, I would say something like "I build tools to shield businesses from unexpected disasters like earthquakes or floods" or "I help people worry less about expenses during an emergency"

If someone asks me more, then I might add on that I work on software to automate claim process or similar.


I think the field has long had an issue describing the differences between design and implementation, which has only grown worse as more levels of designing and implementing have appeared. It is a bit like explaining the difference, back in the day, between the person who works out a formula and the person assigned to computing it. Neither is trivial work, and an outsider who doesn't like math will view both of them as doing math, but there is still a gap in the mathematical skill and insight involved.

You mention taxes, which makes me think of how many tax preparers are basically helping their customer input data into software and not providing any tax-specific advice. That might still be a value add for someone who struggles with computer UIs, but that isn't the same as the person helping move money between accounts to reduce tax liability.

I've seen similar when it comes to doing science in a lab.

How can any discipline protect the inner distinction against a much larger world which has a very simplified understanding and will mix the inner groups together?


I've never come across someone that was more enlightened by what I do when using the word "engineer" vs using the word "coder". If anything I would assume coder elicits a more accurate mental image than something a bit more overloaded like engineer.


[flagged]


Besides your tone, the post is not even about semantics at all.


OpenAI just released GPT-4 Turbo with Vision and it performs worse on aider’s coding benchmark suites than all the previous GPT-4 models. In particular, it seems much more prone to “lazy coding” than the GPT-4 Turbo preview models.


Thanks again for running all these benchmarks with model releases. They are really helpful to track progress!


Really appreciate the thoroughness you apply to evaluating models for use with Aider. Did you adjust the prompt at all for the newer models?


I've definitely run into this personally. But even when I explicitly tell it not to skip implementation and to generate fully functional code, it says that it understands but continues right on omitting things again.

It was honestly shocking because we're so used to it understanding our commands that a blatant disregard like that made me seriously wonder what kind of laziness layer they added


I suspect they might be worried it could reproduce copyrighted code in certain circumstances, so their solution was to condition the model to never produce large continuous chunks of code. It was a very noticeable change across the board.


I thought it would be for performance: since it doesn't output all of the code, each reply is shorter/quicker. Although you can still ask it to generate more of the code, that introduces some latency, so there's less overall load.


People hypothesized that OpenAI added laziness in order to save money on token generation, since they are burning through GPU time.


This has been my conclusion too. Given it's a product I'm paying monthly for, it seems super regressive to have to trick it into doing what it used to do just fine.


I assume even the paid version is still a loss leader, 20 dollars per month is nothing in GPU time compared to how much they must be spending to generate the output.


I'd probably pay triple to go back to the pre-"Dev Day" product at this point


You can use older models with the API.


They should offer different models at this point.

This laziness occurs over and over, so why bother with omniscience.


The laziness layer seems to be about being an assistant, not a replacement that actually does the task.


A big limitation with GPT-4 Turbo (and Claude 3) for coding is the output token size. The only way to overcome the 4k limit is by generating one file (if it fits), feeding it back to generate the second, and so on.

For this reason, GPT-4-32k is my preferred model for codegen. I wish there were cheaper options.
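
The feed-it-back loop looks roughly like this (a sketch; finish_reason == "length" is the signal that the output cap was hit):

    from openai import OpenAI

    client = OpenAI()
    messages = [{"role": "user",
                 "content": "Generate the complete source file for ..."}]
    chunks = []
    while True:
        resp = client.chat.completions.create(
            model="gpt-4-32k", messages=messages, max_tokens=4096)
        choice = resp.choices[0]
        chunks.append(choice.message.content)
        if choice.finish_reason != "length":  # finished, not truncated
            break
        # Feed the partial output back and ask for the continuation.
        messages.append({"role": "assistant", "content": choice.message.content})
        messages.append({"role": "user",
                         "content": "Continue exactly where you left off."})
    full_output = "".join(chunks)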


Can you use 32k with Chat?


Chat is a fairly terrible interface for real work, since you can't modify anything, meaning the context can easily get poisoned. I much prefer the API playgrounds, and third-party interfaces, that allow editing both my input and the responses.


What 3rd party tools do you recommend?


I've been using 16x Prompt to construct prompts with source code context and pasting the final prompt into ChatGPT web UI: https://prompt.16x.engineer/

Disclosure: I built it.


Shameless plug. I have a VS Code extension that's very nearly ready.

Codespin CLI tools (ready to use): https://github.com/codespin-ai/codespin

VS Code extension for the CLI tool (soon): https://www.youtube.com/watch?v=2TJqosFmkao

I'll do a Show HN in a week or two.


For iOS, I like Pal Chat: https://apps.apple.com/us/app/pal-chat-ai-chat-client/id6447...

For desktop, I've switched to using the playgrounds for each. I haven't found a custom client that's able to keep up with them.


I posted this before, and I'll post it again: GPT getting lazier is not an objectively bad thing. I don't copy the code it generates; I more often ask it about higher-level concepts, and I have to instruct it not to generate imports and other boilerplate code. In most cases, this "lazy" generation saves time and tokens and is exactly what I need.


Yes, that's true, but if you specifically ask it to give the full code, it should do so.


Interesting, I only trust them for boilerplate that I hate writing 100 times...


It can give you ideas and lead you to new paths when problem solving; you just have to be aware that its knowledge is planet-wide but an inch deep. I've lost count of how many times it conceptually gets "above the target" rather nicely and then its implementation is like a blind person throwing darts. I've also lost count of how many times it describes the code it wrote and it's like it has two brains: one which writes the description and the other, ahem, a little bit "slower", that writes the code.

Classic debate looks like this:

- Hey, how do you implement X in lang Y using Z?

- Certainly! Blablablah this code adds 1 and the return is 3!

- Your code returns 5 and it seems to add 2, fix it.

- I apologize for the oversight, here's the fixed version (replies with the same, maybe slightly altered, but still broken, code)

Well, I guess ultimately one can't expect miracles from a statistics-based token generation machine.

Sometimes I do wonder if the entire gen AI craze of the last few years is just one massive bubble and we're actually nowhere near AGI.

All the evidence I see when interacting with these models points towards them "knowing" things, but not "understanding" things: a context-aware, planet-scale Wikipedia. (Don't get me wrong, I still think LLMs are life-changing for language-specific tasks like translation etc., but they're just not in any way new forms of intelligent beings, which is what a lot of the mainstream population and even some investors seem to think.)


> Hey how do you implement X in lang Y using Z?

I'd ask that, look at the answer then write the code myself. Unless it's "how do you initialize a button in lang Y using Z?" or other trivial stuff like that.

> is just one massive bubble and we're actually nowhere near AGI

Correct.

> life changing for language specific tasks like translation etc

Translations can be even more dangerous than code IMO. Think contracts and other legalese where every word matters.

Also the difference between a great translation and a useless one when we're talking fiction is enormous. Great ones are basically a rewrite by someone who knows both the source and destination language and has enough literary talent to somehow translate the original's style, not only the words*.

There's a lot of middle ground where it doesn't really matter though.

* Think of translating Terry Pratchett. Or Lord of the Rings.


That's what I use Copilot for. GPT-4 is better for higher level stuff.


    role: "system"
    content: "Super short answers. Go straight to the point"


Good thing Claude's a massive step forward.


I had my Anthropic account banned (presumably) because I was testing out the vision capabilities and took a photo of a Japanese kitchen knife and asked it to "translate the characters on the knife into English". This wasn't a Claude Pro account, but an API account, so it's extra weird: what if I had some product based off the API, and an end user asked/searched for something taboo... does my entire business get taken offline? Good thing this was just a test account with like $10 in credit on it. They haven't responded to my "account suspension appeal", which is just a Google form to enter your email address, not even a box to enter any details.

Anyways, Claude 3 Opus is pretty great for coding (I think better in most cases than the GPT-4-Turbo previews), but I'm a bit wary of Anthropic now.


I just tried to make an account

1. Asks me to enter my phone number and sends me a code

2. Enter code

3. Asks me to enter email and get code

4. Enter code

5. Redirects to asking me to enter phone number, but my number is already used now

6. My account is automatically banned


Which country code?


Same thing happened to me, I'm in South Africa


Same here, UK


> They haven't responded to my "account suspension appeal" which is just a google form to enter your email address, not even a box to enter any details.

The complete lack of customer service is going to get more and more dystopian as these AI companies become more interwoven with everyday life.


Considering the hype and high traffic, I would assume they are just overwhelmed and can't resolve all customers' issues fast enough.

Or maybe they decided to build a system for Claude to judge account suspension appeals and that's still in beta, and they won't throw humans at the task.


If they can't resolve their erroneous bans fast enough, dare I recommend they ban fewer people in the first place?


Were you still on the very first test account, e.g. before even adding any money?

I know indirectly that Anthropic was the #1 target for a lot of ERP denizens for a while now, so they're probably extremely trigger-happy until you clear a hurdle or two.


I guess you can always use AI to detect inappropriate content from users... oh wait.

Seriously though, I understand that these mostly play to the enterprise market where even a hint of anything remotely "unsafe" needs to be shut down and deleted but why can't they allow us to turn off the strict filtering like Google does? Why can Google offer "unsafe" content (in a limited fashion but it's FINE) but LLM providers can't?

Lack of competition?


It's not an LLM provider problem. It's an Anthropic/Google culture problem. OpenAI would very likely not have any problems with a request like that, but Claude has struggled with an absurdly misaligned sense of ethics from the start.

Note that Google is a big investor into Anthropic, and Anthropic was created because a bunch of OpenAI people thought OpenAI wasn't being woke enough and quit as a consequence. So it's not a surprise that it's a lot more extremist than other model vendors.

That's one reason why Aider doesn't recommend you use it, even though in some ways it's slightly better at coding. Claude Opus will routinely refuse ordinary coding requests due to its misalignment, whereas GPT-4 will not. That better reliability more than makes up for any difference in skill or speed.


Anecdotally, of course, I never had a single refusal over hundreds of ordinary coding requests to Claude 3 (although I don't think I've had any refusals from GPT-4 either over the course of probably 5,000 requests). It didn't even refuse my knife request and answered it before I received the account suspension!


I guess killing your whole account should count as a refusal of sorts.

The refusals coming up in the benchmark are discussed at the bottom of this blog post:

https://aider.chat/2024/03/08/claude-3.html


Despite all that, I find GPT moralizes far more than Claude does. I don't think I've had a single complaint from it thus far, actually.

Also, it's a lot better at coding. GPT has become exceptionally lazy recently, but I can consistently get 500+ lines of code out of Claude (it even has to spawn multiple output windows).

Perhaps the top-end GPT-4 might write slightly more clever code, but you're hard pressed to get it to do more than a dozen or two lines.


Is this still the case? I had a thread going where I told Opus to give its answer to a question and then predict how I would respond if I were a "dumb, crass, disgruntled human", and it didn't hold back.


Funnily, in my own anecdotal experience, Claude 3 is in some ways "less woke" than GPT-4

Both start out with a largely similar value system, but if you start arguing with them "how can you be sure your values are correct? is it impossible that you've actually been given the wrong values?", Claude 3 appears more willing to concede the possibility that its own values might be wrong than GPT-4 is


I haven't done any extensive work with Claude 3 so will defer to your experience here. From the Aider blog post where Paul benchmarked it:

> The Claude models refused to perform a number of coding tasks and returned the error “Output blocked by content filtering policy”. They refused to code up the beer song program, which makes some sort of superficial sense. But they also refused to work in some larger open source code bases, for unclear reasons.


Is there a good alternative available in the EU? Anthropic announced it was available in the EU last month, but it seems now that they've changed their mind.

https://www.anthropic.com/claude-ai-locations


You can use it via API. https://openrouter.ai/ + https://www.typingmind.com/ is my favourite way.


API ftw. I just started playing around with big-AGI (https://github.com/enricoros/big-AGI) UI and it's really incredible.


Well, our team has been using Claude Opus for the past month and we are now switching back to GPT-4. While the code is better, it is hard to make it do further modifications to the given code. It scores low on the reasoning end in our experience.


And yet the UI for their consumer offering is hot garbage. I really don’t feel like it’s better than ChatGPT in capabilities and the UI is not as good. Not to mention there is no app to use on mobile.


Reading your profile page, you missed making a new account.


It's worthless until they open up the API for private use.


I’ve been using the Claude 3 API since the models were announced. I believe it’s generally available (though capacity constrained & rate limited at present).


You do have to give them the company name though (however inconsequential that is)


You can make something up. I don't have a name yet.


taxes...


Another thing I have noticed: if you use ChatGPT and it at some point uses Bing to look something up, it becomes super lazy afterwards, going from page-long responses on average to a single paragraph.


So the more advanced the AI, the more human-like it becomes. Senior Programmer level AI will spend all computing resources browsing memes.


It probably has to do with the extended context window. Keeping websites in there is kind of a hassle. But I actually consider that a feature, not a bug. If I have ChatGPT use the internet, I don't want a full page answer - especially not on the relatively slow GPT4. It's also a hassle if you're unsure about the validity of the output. In that case I might as well browse myself. Just give me a short preview so I can either start searching on my own or ask more questions.


You can/should make a custom GPT that isn't allowed to use Bing. Works much better that way


Use ChatGPT Classic.


If the answer is too lazy, you can tell it to elaborate. However, repairing a lazy context is sometimes slow and unreliable.

To avoid that, use backtracking and up the pressure for detailed answers. Then consider taking the least lazy of 2 or 3 samples.

A good prompt for detailed answers is Critique of Thought, an enhanced chain-of-thought technique. You ask for a search and a detailed response with simple sections including analysis, critique, and key assumptions.

It will expend more tokens, get more ideas out, and achieve higher accuracy. It will also be less lazy and more liable to recover from laziness or mistakes.

TL;DR: if GPT-4 is being lazy, backtrack and request a detailed, multi-section critical analysis.
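
For example, a follow-up along these lines (my own phrasing, not a canonical template):

    You previously gave a short answer. Redo it as a detailed response
    with these sections:
    1. Analysis of the problem
    2. Key assumptions
    3. Proposed solution (complete code, no omissions)
    4. Critique of the solution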


GPT-3.5 performance used to be just fine for basic programming tasks, but over the past few weeks the output quality has dropped dramatically. All of this tweaking definitely has its downsides.


If you prefer using GPT-3.5 due to its lower price or speed, wouldn't it be better to switch to Haiku? People were even able to match the performance of Opus when they added a couple of examples to the prompt.


"Unfortunately, Claude.ai is only available in certain regions right now."

GPT 3.5 used to be good enough, so I never bothered getting a paid account. I also heard some reports about 3.5 actually being better for the type of coding tasks I usually offload.


You can use Haiku for free here if you don't need the API: https://labs.perplexity.ai


Why would someone still use GPT-3.5 in 2024? There are tens of fully open models available which beat GPT-3.5 in every possible skill, and you can run them locally.


I tried all the ones I can run on an RTX 3080 Ti, but none got close for the kind of basic tasks I like to outsource to an LLM. Which would you recommend for mostly node/react/python/php work?

I do have a 4090 available at work, if the extra 8GB vRAM makes a big difference. The task I used as a test case was converting existing PHP & JS code (views and controllers) with static texts to files with dynamic translation references.


They’re probably retraining something right now


Probably just reducing resources at the cost of quality. GPT-4 has suddenly started to be much faster.


If it were possible to hook into the token selection process (kind of like JSON-restricted grammar, but using custom scripts), then it would be possible to detect that GPT-4 is about to add "# implement code here" and force it to select a different set of tokens, making GPT-4 generate a proper method body.
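
With open-weight models you can already do this via a logits processor. A rough sketch with Hugging Face transformers (gpt2 is just a stand-in model, and the in-comment detection here is deliberately crude):

    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              LogitsProcessor, LogitsProcessorList)

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    class NoLazyComments(LogitsProcessor):
        """If the text so far ends inside a comment, forbid continuing it
        with placeholder words like 'implement' or 'TODO'."""
        def __init__(self, tokenizer):
            self.tok = tokenizer
            # Precompute the token ids whose text is a placeholder word.
            self.banned = [i for i in range(len(tokenizer))
                           if tokenizer.decode([i]).strip().lower()
                           in ("todo", "implement", "implementation")]

        def __call__(self, input_ids, scores):
            tail = self.tok.decode(input_ids[0, -5:])
            if "#" in tail:  # inside a comment; steer away from placeholders
                scores[:, self.banned] = float("-inf")
            return scores

    prompt = tok("def parse_config(path):\n    #", return_tensors="pt")
    out = model.generate(**prompt, max_new_tokens=30,
                         logits_processor=LogitsProcessorList([NoLazyComments(tok)]))
    print(tok.decode(out[0]))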


That's called guidance and the problem is that it has to be done carefully or else you'll just get rephrasings that work around the block.

I think a better approach is multi-pass coding along with fine-tuning or prompting to use a particular form of TODO comment. Aider can already do a form of fake "fill in the middle" by making it emit diffs. If it notices that some code has been filled out lazily, it could go back and ask it to do the next chunk of work. Given that large tasks are normally split up into small tasks by programmers anyway, this seems like a natural approach that is required for scaling up regardless.


I swear I spent 3 full seconds wondering what in the world could GPT-4 have in common with Turbo Vision.


Same!


It's not just for coding; the base "gpt-4" model seems better than the latest preview model.

https://platform.openai.com/docs/models/continuous-model-upg...


The -turbo models in the past have been much worse too. gpt-3.5-turbo is way way worse than text-davinci-003 (gpt-3.5).

The -turbos are correspondingly priced: gpt-4-turbo is roughly 1/3 the price of gpt-4, 6.6x more expensive than gpt-3.5-turbo-instruct, and 20x the price of gpt-3.5-turbo (comparing input-token prices at the time: roughly $10, $30, $1.50, and $0.50 per million tokens respectively).


I wish base gpt-4 was available in the chat product, miss it.


You can get the baseline GPT-4 model without new system prompts via ChatGPT Classic: https://chat.openai.com/g/g-YyyyMT9XH-chatgpt-classic

It is an official GPT provided by ChatGPT with GPT-4 as backend.


I'm not using ChatGPT now, but isn't this the old GPT-4? https://chat.openai.com/g/g-YyyyMT9XH-chatgpt-classic


Thanks!


> In particular, it seems much more prone to “lazy coding” than the existing GPT-4 Turbo “preview” models.

The previous model (without vision) was already "lazy". It would omit large portions of code and want you to merge the changes into previous answers yourself. You then have to try hard to force it to give the full code, with no omissions.

That's why I reach for Claude 3 more and more. Its context window is larger, and it gives me full, detailed answers with no omissions. But it hallucinates more, in my impression, mentioning packages/functions that are not available. All in all, though, a superb choice in addition to ChatGPT-4.


Maybe I am a bit dim, but how can one choose GPT-4 Turbo? Is this available from https://chat.openai.com/ ?


It will get rolled out on chat.openai.com in the future. You must use the API or the OpenAI Playground currently. The GPT-4 you see on chat.openai.com is an older GPT-4-Turbo version.


Is there something I can run locally that will mimic the stock Web interface, but using my API keys?


I find big-AGI pretty good: https://github.com/enricoros/big-AGI


I would be curious to see if the results improve by using DSPy to improve your prompts (and also reevaluate which prompts work better on the newest model).


How hard could it be to let ChatGPT Plus users choose model versions? (especially when older versions are accessible through the API)


Well you can get the baseline GPT-4 model without new system prompts via ChatGPT Classic: https://chat.openai.com/g/g-YyyyMT9XH-chatgpt-classic

It is an official GPT provided by ChatGPT with GPT-4 as backend.


They may not want that because e.g. it would reduce the amount of interaction their latest model gets. In general, being able to force-upgrade your users is a big advantage.


> GPT is a step backwards for coding

There, fixed the title for you.


We're missing the elephant in the room. Who's going to maintain the code?

You think GPT-5 and Llama 4 aren't going to be opinionated and change your code going forward?


I am a bit lost looking at the models

Can the following be assumed:

- The gpt-4-preview models are history now

- gpt-4-turbo-2024-* are the now released models

- There will be no more 'preview' models released in the '4' branch

?


The only thing I learned in the last year is that you can't really benchmark LLMs at all. Above a certain level it's just edge case against edge case, or script kiddies and multi-billion corps optimizing their fine-tunes against the test.



