I paid to upgrade my Anthropic account to pro today, ending a long monogamy with OpenAI, and one thing that struck me was how hard it was to describe the advantages of one over the other. I like claude.ai's "style" more, and prefer the "interface" - basically the "way" they talk over the strict correctness of what they say.
Hate on LLM-AIs but if you told me 5 years ago I'd be switching my AI provider because I liked another one's style better, I'd have thought you were bonkers. Shit's come a long way.
It has definitely come a long way, though the goalposts moved too. I bet if we asked you 5 years ago to describe what AI is, you wouldn't describe a massive text prediction loop.
Well, I don’t think of it as true AI, of course. But even LLMs can be extremely useful in their own way. I happily pay for them; there have been times (like today) I feel they paid for themselves in a single response!
They’re not Culture-level Minds, sure, but they can still help massively in the right situation.
If LLMs suddenly appeared right now and you didn’t have the benefit of hindsight into their internal workings you wouldn’t describe them that way today either, even if it is factual.
Sure, though build any black box and people wouldn't be able to describe how it works simply from the inputs and outputs. It will tend to look magical, especially if it appears suddenly. That doesn't mean it is magical though (or intelligent, in this case).
I am less surprised that it is a text prediction loop than I am surprised at how effective a text prediction loop is. This effectiveness has changed how I think about language on a fundamental level. Yes, the process by which it occurs is relatively simple, but the fact it works is groundbreaking.
This actually speaks volumes about the progress AI has made in recent years. Its capabilities have become so intangible that people begin to attack standardised testing - just like they do for humans. Because we all know that a human who does well on a test might still suck at the real job and vice-versa. If we get to the point where we need to do personal interviews with new models to see if they could be used for a certain job, a lot of people will get the denial rug pulled out from under them pretty hard.
> where you we need to do personal interviews with new models to see if they could be used for a certain job
This is already happening. See the arenas like https://chat.lmsys.org/ where you get two blinded answers and choose the better one, or choose specific models to compare.
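For what it's worth, those arena-style blinded votes are usually aggregated into a leaderboard with an Elo-style rating update. Here's a minimal sketch of that idea; the model names, starting ratings, and votes are all invented for illustration, and this is my own toy version, not the arena's actual implementation.

```python
# Toy Elo update: turn "A beat B" votes from blinded pairwise
# comparisons into relative ratings. All data here is made up.

def elo_update(r_a, r_b, a_won, k=32):
    """Return updated (r_a, r_b) after one head-to-head vote."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    score_a = 1.0 if a_won else 0.0
    r_a += k * (score_a - expected_a)
    r_b += k * ((1 - score_a) - (1 - expected_a))
    return r_a, r_b

# Hypothetical models, everyone starts at 1000.
ratings = {"model_a": 1000.0, "model_b": 1000.0}

# Each tuple: (contender, opponent, did the contender win?)
votes = [
    ("model_a", "model_b", True),
    ("model_a", "model_b", True),
    ("model_a", "model_b", False),
]
for a, b, a_won in votes:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], a_won)
```

The nice property is that a single vote only shifts ratings a little, so one fluke answer doesn't dominate, but thousands of votes converge on a stable ordering.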
I already know about this, but it is more of a playground than a serious testing environment. I'm talking full blown multi-round interviews with varying human teams to look at all aspects of the job. I have not seen anyone outside the field put these kinds of resources into evaluating models for their business purpose. But I feel we are not far from that.
I currently use ChatGPT 4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro for various personal and work-related tasks, and I often compare them on the same task. It’s really hard to decide which performs best overall for me. Sometimes one will be obviously better or worse for a relatively trivial reason, such as a longer context window or overeager censorship (looking at you for both, Gemini!). But usually, especially for the extended back-and-forth interactions that I find most useful, I am unable to state objectively which model is better.
I'd like to see a test like "chance of hallucination per prompt." Obviously, this test rating is LLM specific, because if you were rating a human's ability during an interview for example, once you detected a hallucination, you'd politely end the interview and lock your door once the interviewee left the premises.
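Such a metric would be simple to compute once you have per-response labels; the hard part is the labeling (human review or a judge model), which I'm assuming away here. A sketch, with entirely made-up judgments:

```python
# Toy "hallucination rate per prompt" metric. Assumes each prompt's
# response has already been judged (True = hallucination found);
# the judgment data below is invented for illustration.

def hallucination_rate(judgments):
    """Fraction of prompts whose response contained a hallucination."""
    if not judgments:
        raise ValueError("no judgments to score")
    return sum(judgments) / len(judgments)

judged = [False, False, True, False, True, False, False, False]
rate = hallucination_rate(judged)  # 2 hallucinations in 8 prompts = 0.25
```

A real benchmark would also want a confidence interval and some stratification by task type, since hallucination rates vary wildly between, say, factual recall and summarization.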