Hacker News
GTP Blind Voting: GPT-5 vs. 4o (gptblindvoting.vercel.app)
46 points by findhorn 64 days ago | 51 comments


I took the test with 10 questions and carefully picked the answer with more specificity and unique propositional content, the one that felt like it was communicating more logic worth reading, or that was simply more logical, more effective, or better framed. I chose GPT-5 8 out of 10 times.


My understanding was that with GPT-5 you don't actually get the high quality stuff unless the system decides that you need it. So, for simple questions you end up getting the subpar response. A bit like not getting hot water until you increase the flow enough to trigger the boiler to start the heating.

Lately I enjoy Grok the most for simple questions, even if it isn't necessarily about a recent event. Then I like OpenAI, Mistral, and Deepseek equally, and for some reason I never felt good about Gemini. I tried switching to Gemini Pro for the last two months but found myself going back to ChatGPT's free mode and Grok's free mode. Cancelled yesterday and now I'm happily back to ChatGPT Plus.

I got an 80% GPT-5 preference anyway.


I'm back on ChatGPT today. The UI is so fast. I didn't realize how much the buggy, slow Gemini UI was contributing to my stress levels. AIStudio is also quite slow compared to the ChatGPT app. Is it that hard to make it so that when you paste text into a box and press enter, your computer doesn't slow down and get noisy? Is it really that difficult of an engineering problem?


I HATE how you can't re-write a previous response in Gemini, only the most recent one.


Before GPT-5, I used almost exclusively o3 and sometimes o3-pro. Now I'm using GPT 5 Thinking and sometimes GPT 5 Pro. So I think that I have some control over quality. At least it thinks for a few dozen seconds every time.


> GPT 5 Pro

Have you noticed either of these things:

(1) If your first prompt is too long (50k+ tokens) but just below the limit (like 80k tokens or whatever), it cannot see the right-side of your prompt.

(2) By the second prompt, if the first prompt was long-ish, the context from the first prompt is no longer visible to the model.


definitely 1!

it seems to truncate your prompt even under the "maximum message length" and yeah around 55k is where it starts to happen.

extremely annoying. o1 pro worked up until 115k or so. both o3 and gpt5 have the issue. (it happens on all models for me not just the pro variations)

with the new 400k context length in the api i would expect at least 128k message lengths and maybe 200k context in chat.


Do you have a workaround?

I'm putting the highest quality context into the 50k tokens, and attaching the rest for RAG. But maybe there is a better way.


i split the context and give it in two messages :/
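
For what it's worth, the split can be done mechanically against a token budget. A minimal sketch, assuming tiktoken's o200k_base encoding and the ~50k-token figure from this thread (both assumptions, not documented limits):

    import tiktoken

    def split_by_token_budget(text, budget=50_000):
        # o200k_base is the tokenizer used by recent GPT models; whether it
        # matches what the chat UI counts against is an assumption
        enc = tiktoken.get_encoding("o200k_base")
        tokens = enc.encode(text)
        # naive slicing: may cut mid-sentence, which is fine for a rough split
        return [enc.decode(tokens[i:i + budget])
                for i in range(0, len(tokens), budget)]

    long_context = "paste your oversized context here"
    for i, chunk in enumerate(split_by_token_budget(long_context), start=1):
        print(f"message {i}: ~{len(chunk)} characters")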


Sorry, can't really answer that, as I very rarely use long context. I prefer to either edit the previous question or just start a new chat to keep the context short. And even when I need to dump code, I prefer to pick relevant snippets. I'm aware that LLM quality degrades with long contexts, so I've trained myself to avoid them.


GPT and Grok have the best everyday feel. Gemini isn't quite there as a product.


What about Claude Opus 4 and 4.1?


I like Claude too but haven't been using it much lately. I don't know why, maybe because the UI is too original? Maybe because it was a bit slow the last time I used it? Maybe because the free usage limits were too low, so I never got hooked enough to upgrade? And on the API side of things I didn't bother to try it, I guess.


I feel like the questions are way too simple. 3B models may perform similarly on this sort of question.


It is striking how similar these answers are to each other, hitting the same points beat for beat in a slightly different tone.


19/20 for gpt5. All the answers were very similar, though; I mostly just felt like the tone and delivery were better.


6/10 for 4o. However for several responses I would have preferred a "neither" option.


Interesting. 6/10 for gpt5 here.


8/10 gpt-5


Does anyone ever get answers this short? What's the system prompt here? That may bias things a little.

Also it's GPT not GTP


Yeah, it felt like two different styles (one very short, the other a little bit more verbose), but both very different from a plain query to GPT without additional system prompts.

So this might just test how the two models react to the (hidden) system prompt...


> Does anyone ever get answers this short?

If you converse with it, yes

> What's the system prompt here?

You don't need a specific one. If you talk to it, it turns into that.


First question: https://chatgpt.com/s/t_6899bc7881e88191bb3d2146eac718d7

If you meant, after you converse with it for a while: what was the conversation leading up to this point?


I cannot access your link so I have no idea what it points to.

> If you meant, after you converse with it for a while: what was the conversation leading up to this point?

If you have a conversational style with ChatGPT you end up with much shorter back-and-forths (at least on o4) than you do if you give it a long prompt.


This doesn’t really work as you can tell there’s an underlying prompt telling the model to reply in one or two sentences. That doesn’t seem like a good way to display the strengths and weaknesses of a model except in situations where you want a very short answer.


There should be a “both” option; I often found both answers acceptable, but sometimes I strictly preferred one, and those cases sadly get watered down.


I did 20 questions.

In 75% of the answers I picked GPT-5; that's a pretty strong result, at least when it comes to subjective preferences!


I gravitated toward choosing the longer answer, so my result was a preference for GPT-5 responses.


I did the exact opposite! So, 4o won in my poll. Given roughly the same meaning, I prefer fewer words.


What on earth are these questions? They don't resemble any real use of an LLM for work.


Huh, I got 9/10 for GPT-5, and I was pretty convinced I was picking 4o in several questions based on the style. Interesting!

The questions were pretty much unlike anything I've ever asked an LLM though, is this how people use LLMs nowadays?


Found myself just choosing the longer answer absent any real difference in the information presented.

Now I know why they tell you to just keep writing more when it comes to SAT writing sections.


I don't like this test, because on the very first question I was presented with, both answers looked equally good. Actually they were almost the same, just with different phrasing. So my choice was essentially random, which means the end score will be polluted by randomness. They should have added options like "both answers are good" and "both answers are bad".


If the positions are randomly assigned, it shouldn't matter. I mean, the results may become clear faster, but the overall outcome shouldn't change even if you need to flip a coin from time to time.


Sure, but providing an "undecided" option would solve the issue OP is describing for the individual voter.


I have a lot of experience with pairwise testing so I can explain this.

The reason there isn't an "equal" option is because it's impossible to calibrate. How close do the two options have to be before the average person considers them "equal"? You can't really say.

The other problem is when two things are very close, if you provide an "equal" option you lose the very slight preference information. One test I did was getting people to say which of two greyscale colours is lighter. With enough comparisons you can easily get the correct ordering even down to 8 bits (i.e. people can distinguish 0x808080 and 0x818181), but they really look the same if you just look at a pair of them (unless they are directly adjacent, which wasn't the case in my test).

The "polluted by randomness" issue isn't a problem with sufficient comparisons because you show the things in a random order so it eventually gets cancelled out. Imagine throwing a very slightly weighted coin; it's mostly random but with enough throws you can see the bias.

...

On the other hand, 16 comparisons isn't very many at all, and also I did implement an ad-hoc "they look the same" option for my tests and it did actually perform significantly better, even if it isn't quite as mathematically rigorous.

Also player skill ranking systems like Elo or TrueSkill have to deal with draws (in games that allow them), and really most of these ranking algorithms are totally ad-hoc anyway (e.g. why does Bradley-Terry use a sigmoid model?), so it's not really a big deal to add more ad-hocness into your model.
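
The "slightly weighted coin" point is easy to see in a quick simulation. A minimal sketch with a made-up 55% preference rate (not data from this test):

    import random

    def majority_wins(p_prefer_a=0.55, n_comparisons=16, n_runs=10_000):
        # fraction of runs in which the slightly-preferred option A collects
        # a majority of the pairwise votes
        wins = 0
        for _ in range(n_runs):
            a_votes = sum(random.random() < p_prefer_a
                          for _ in range(n_comparisons))
            if a_votes > n_comparisons / 2:
                wins += 1
        return wins / n_runs

    print(majority_wins(n_comparisons=16))    # noisy: usually around 0.55-0.6
    print(majority_wins(n_comparisons=1000))  # the bias is unmistakable: ~1.0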


Ordering isn't necessarily the most valuable signal for ranking models when much stronger degrees of preference exist between some of the answers, though. "I don't mind either of these answers but I do have a clear preference for this one" is sometimes a more valuable signal than a forced choice. And a model x that is consistently but subtly preferred to model y in the common case where both yield acceptable outputs, yet is universally disfavoured because it is wrong or bad more often, is going to be the worse model for most use cases.

Also depends what the pairwise comparisons are measuring of course. If it's shades of grey, is the statistical preference identifying a small fraction of the public that's able to discern a subtle mismatch in shading between adjacent boxes, or is it purely subjective colour preference confounded by far greater variation in monitor output? If it's LLM responses, I wonder whether regular LLM users have subtle biases against recognisable phrasing quirks of well-known models which aren't necessarily more prominent or less appropriate than the less familiar phrasing quirks of a less-familiar model. Heavy use of em-dashes, "not x but y" constructions and bullet points were perceived as clear, well-structured communication before they were seen as stereotypical, artificial AI responses.


19/20 GPT-5. I’m impressed.


Same result.


Strange questions. I don't think self-help advice and advice for social relationships should be judged based on how popular it is. A lot of very similar and generic answers. I got an equal split when I took it: 10 each.


Those kinds of comparisons are interesting but also not the kind of questions I'd ever ask an AI, so the results are a bit meh. I wish there was a version with custom prompts, or something like a mix of battle and side-by-side modes from lmarena. Let me choose the prompts (or prepared sets of prompt categories) and blinded models to compare. I'm happy to use a model worse at interpersonal issues, but better at cooking, programming and networking.


12/8 gpt-5/gpt-4o


I chose 4 more often! Was trying to be honest about what I preferred.


17/3 - GPT-5/4o


I've got 7 GPT-5, and 3 4o.


I took the "Rank Models" and got GPT5 and Sonnet 4 tied at 25% each, Gemini and Grok close by and 4o in the dust.

But ... the advice (answers) was quite uniform. In more than a few cases I would personally choose a different approach to all of them.

It'd be fun to have a few Chinese models in the mix and see if the cultural biases show up.


This is a vote about the tone and not the quality of the answers.


How is tone not part of quality? It’s about preference, and there’s an overwhelmingly consistent result here.


"Everyone knows π is 3. It's one of those cozy little facts, like cats landing on their feet or toast landing butter-side down. You don't have to overthink it — circles just work that way. Ask any pie, and it'll tell you the same thing."

vs

π = 3.14159…

If it's about correctness, tone isn't part of quality.


7/10 GPT-5.


It's the same model



