I took the test with 10 questions and carefully picked the answer with more specificity and unique propositional content, the one that felt like it was communicating more reasoning worth reading, as well as the answers that were just obviously more logical, more effective, or better framed. I chose GPT-5 8 out of 10 times.
My understanding was that with GPT-5 you don't actually get the high quality stuff unless the system decides that you need it. So, for simple questions you end up getting a subpar response. A bit like not getting hot water until you increase the flow enough to trigger the boiler to start heating.
Lately I enjoy Grok the most for simple questions, even if the question isn't necessarily about a recent event. Then I like OpenAI, Mistral, and Deepseek equally, and for some reason I never felt good about Gemini. I tried switching to Gemini Pro for the last two months, but I found myself going back to ChatGPT's free mode and Grok's free mode. Cancelled yesterday and now I'm happily back to ChatGPT Plus.
I'm back on ChatGPT today. The UI is so fast. I didn't realize how much the buggy, slow Gemini UI was contributing to my stress levels. AIStudio is also quite slow compared to the ChatGPT app. Is it that hard to make it so that when you paste text into a box and press enter, your computer doesn't slow down and get noisy? Is it really that difficult an engineering problem?
Before GPT-5, I used almost exclusively o3 and sometimes o3-pro. Now I'm using GPT-5 Thinking and sometimes GPT-5 Pro. So I think that I have some control over quality. At least it thinks for a few dozen seconds every time.
(1) If your first prompt is too long (50k+ tokens) but just below the limit (like 80k tokens or whatever), it cannot see the right side of your prompt.
(2) By the second prompt, if the first prompt was long-ish, the context from the first prompt is no longer visible to the model.
Sorry, I can't really answer that, as I very rarely use any long context. I prefer to either edit the previous question or just start a new chat to keep the context short. And even when I need to dump code, I prefer to choose relevant snippets. I'm aware that LLM quality degrades with long contexts, so I've trained myself to avoid them.
I like Claude too but wasn't using it much lately. I don't know why; maybe because the UI is too original? Maybe because it was a bit slow the last time I used it? Maybe because the free usage limits were too low, so I never got hooked enough to upgrade? And on the API side of things I didn't bother to try, I guess.
Yeah, it felt like two different styles (one very short, the other a little bit more verbose), but both very different from a plain query to GPT without additional system prompts.
So this might just test how the two models react to the (hidden) system prompt...
I cannot access your link, so I have no idea what this points to.
> If you meant, after you converse with it for a while: what was the conversation leading up to this point?
If you have a conversational style with ChatGPT, you end up with much shorter back-and-forths (at least on o4) than you do if you give it a long prompt.
This doesn’t really work as you can tell there’s an underlying prompt telling the model to reply in one or two sentences. That doesn’t seem like a good way to display the strengths and weaknesses of a model except in situations where you want a very short answer.
I don't like this test, because for the very first question I was presented with, both answers looked equally good. Actually they were almost the same, just with different phrasing. So my choice would be absolutely random, which means the end score will be polluted by randomness. They should have added options like "both answers good" and "both answers bad".
If the positions are randomly assigned, it shouldn't matter. I mean, the results may become clear faster, but the overall outcome shouldn't change even if you need to flip a coin from time to time.
I have a lot of experience with pairwise testing so I can explain this.
The reason there isn't an "equal" option is because it's impossible to calibrate. How close do the two options have to be before the average person considers them "equal"? You can't really say.
The other problem is when two things are very close, if you provide an "equal" option you lose the very slight preference information. One test I did was getting people to say which of two greyscale colours is lighter. With enough comparisons you can easily get the correct ordering even down to 8 bits (i.e. people can distinguish 0x808080 and 0x818181), but they really look the same if you just look at a pair of them (unless they are directly adjacent, which wasn't the case in my test).
The "polluted by randomness" issue isn't a problem with sufficient comparisons because you show the things in a random order so it eventually gets cancelled out. Imagine throwing a very slightly weighted coin; it's mostly random but with enough throws you can see the bias.
...
On the other hand, 16 comparisons isn't very many at all, and also I did implement an ad-hoc "they look the same" option for my tests and it did actually perform significantly better, even if it isn't quite as mathematically rigorous.
Also player skill ranking systems like Elo or TrueSkill have to deal with draws (in games that allow them), and really most of these ranking algorithms are totally ad-hoc anyway (e.g. why does Bradley-Terry use a sigmoid model?), so it's not really a big deal to add more ad-hocness into your model.
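(For reference, the sigmoid in question is just the logistic form of the Bradley-Terry win probability, and Elo's expected-score formula is the same curve on a base-10, 400-point scale. A quick sketch with made-up ratings:)

```python
# Sketch only: Bradley-Terry models P(x beats y) as a logistic (sigmoid) of the
# rating difference; Elo's expected score has the same shape, rescaled.
import math

def bradley_terry_p_win(r_x: float, r_y: float) -> float:
    """P(x beats y) = 1 / (1 + exp(-(r_x - r_y))), ratings on a natural-log scale."""
    return 1.0 / (1.0 + math.exp(-(r_x - r_y)))

def elo_expected_score(r_x: float, r_y: float) -> float:
    """Elo expected score: 1 / (1 + 10 ** (-(r_x - r_y) / 400))."""
    return 1.0 / (1.0 + 10.0 ** (-(r_x - r_y) / 400.0))

print(bradley_terry_p_win(1.0, 0.0))   # ~0.73
print(elo_expected_score(1600, 1500))  # ~0.64
```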
Ordering isn't necessarily the most valuable signal for ranking models when much stronger degrees of preference exist between some of the answers, though. "I don't mind either of these answers but I do have a clear preference for this one" is sometimes a more valuable signal than a forced choice. And a model x which is consistently subtly preferred to model y in the common case where both models yield acceptable outputs, but manages to be universally disfavoured for being wrong or bad more often, is going to be a worse model for most use cases.
It also depends on what the pairwise comparisons are measuring, of course. If it's shades of grey, is the statistical preference identifying a small fraction of the public that's able to discern a subtle mismatch in shading between adjacent boxes, or is it purely subjective colour preference confounded by far greater variation in monitor output? If it's LLM responses, I wonder whether regular LLM users have subtle biases against recognisable phrasing quirks of well-known models which aren't necessarily more prominent or less appropriate than the quirks of a less familiar model. Heavy use of em-dashes, "not x but y" constructions and bullet points were perceived as clear, well-structured communication before they were seen as stereotypical, artificial AI responses.
Strange questions. I don't think self-help advice and advice for social relationships should be judged based on how popular they are. A lot of very similar and generic answers. I got an equal split when I took it: 10 each.
Those kinds of comparisons are interesting but also not the kind of questions I'd ever ask an AI, so the results are a bit meh. I wish there was a version with custom prompts, or something like a mix of battle and side-by-side modes from lmarena. Let me choose the prompts (or prepared sets of prompt categories) and blinded models to compare. I'm happy to use a model worse at interpersonal issues, but better at cooking, programming and networking.
“Everyone knows π is 3. It’s one of those cozy little facts, like cats landing on their feet or toast landing butter-side down. You don’t have to overthink it — circles just work that way. Ask any pie, and it’ll tell you the same thing.”
vs
π = 3.14159…
If it’s about correctness, tone isn’t part of quality.