Everyone Is Judging AI by These Tests. Experts Say They're Close to Meaningless (themarkup.org)
28 points by billybuckwheat on July 18, 2024 | hide | past | favorite | 19 comments


I paid to upgrade my Anthropic account to pro today, ending a long monogamy with OpenAI, and one thing that struck me was how hard it was to describe the advantages of one over the other. I like claude.ai's "style" more, and prefer the "interface" - basically the "way" they talk over the strict correctness of what they say.

Hate on LLM-AIs but if you told me 5 years ago I'd be switching my AI provider because I liked another one's style better, I'd have thought you were bonkers. Shit's come a long way.


It has definitely come a long way, though the goalposts moved too. I bet if we had asked you 5 years ago to describe what AI is, you wouldn't have described a massive text prediction loop.


Well, I don’t think of it as true AI, of course. But even LLMs can be extremely useful in their own way. I happily pay for them; there have been times (like today) I feel they paid for themselves in a single response!

They’re not culture level Minds, sure, but they can still help massively in the right situation.


If LLMs suddenly appeared right now and you didn’t have the benefit of hindsight into their internal workings you wouldn’t describe them that way today either, even if it is factual.


Sure, though build any black box and people won't be able to describe how it works simply from its inputs and outputs. It will tend to look magical, especially if it appears suddenly. That doesn't mean it is magical, though (or intelligent, in this case).


I am less surprised that it is a text prediction loop than I am surprised at how effective a text prediction loop is. This effectiveness has changed how I think about language on a fundamental level. Yes, the process by which it occurs is relatively simple, but the fact it works is ground breaking.


5 years ago I was running GPT-2 and generating D&D content with it. I expected a much better text prediction loop, and here we are.


By “come a long way” do you mean “now exists and has already been commoditized”?


it’s too bad Claude Sonnet lags to complete uselessness after a good number of messages are exchanged


Yeah, they have to fix that. ChatGPT used to have the same issue.


This actually speaks volumes about the progress AI has made in recent years. Its capabilities have become so intangible that people are beginning to attack standardised testing - just as they do for humans. We all know that a human who does well on a test might still be bad at the real job, and vice versa. If we get to the point where we need to do personal interviews with new models to see if they could be used for a certain job, a lot of people will have the denial rug pulled out from under them pretty hard.


> where we need to do personal interviews with new models to see if they could be used for a certain job

This is already happening. See the arenas like https://chat.lmsys.org/ where you get two blinded answers and choose the better one, or choose specific models to compare.
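The mechanics behind such arenas can be sketched with an Elo-style rating update over blind pairwise votes. This is a minimal illustration, not lmsys's actual implementation (which uses a more sophisticated Bradley-Terry fit); the model names and constants are made up:

```python
from collections import defaultdict

def elo_ratings(battles, k=32, base=1000.0):
    """Compute Elo-style ratings from pairwise preference votes.

    battles: list of (winner, loser) model-name pairs, e.g. from
    blind A/B comparisons where a user picks the better answer.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in battles:
        # Probability the winner was expected to win, given current ratings.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        # An upset (low expected probability) moves ratings more.
        ratings[winner] += k * (1.0 - expected)
        ratings[loser] -= k * (1.0 - expected)
    return dict(ratings)

votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_a", "model_b"),
]
print(elo_ratings(votes))
```

The appeal of this setup is that no fixed benchmark questions exist to overfit to: the "test set" is whatever users happen to ask.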


I already know about this, but it is more of a playground than a serious testing environment. I'm talking full-blown multi-round interviews with varying human teams to look at all aspects of the job. I have not seen anyone outside the field put these kinds of resources into evaluating models for their business purpose. But I feel we are not far from that.


Unlike humans, you can make an LLM do the job you want it to do as part of the interview process.


Only for very narrow and strictly defined use cases.


Related: "let's make leaderboards steep again" https://huggingface.co/spaces/open-llm-leaderboard/blog


I currently use ChatGPT 4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro for various personal and work-related tasks, and I often compare them on the same task. It’s really hard to decide which performs best overall for me. Sometimes one will be obviously better or worse for a relatively trivial reason, such as a longer context window or overeager censorship (looking at you for both, Gemini!). But usually, especially for the extended back-and-forth interactions that I find most useful, I am unable to state objectively which model is better.


I'd like to see a test like "chance of hallucination per prompt." Obviously, this test rating is LLM-specific, because if you were rating a human's ability during an interview, for example, once you detected a hallucination you'd politely end the interview and lock your door once the interviewee left the premises.
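Measuring "chance of hallucination per prompt" would just be estimating a binomial proportion, assuming each response can be independently judged (say, by human graders) as hallucinated or not. A sketch using the Wilson score interval, with made-up numbers:

```python
import math

def hallucination_rate(num_hallucinated, num_prompts, z=1.96):
    """Estimate per-prompt hallucination probability with a 95% Wilson score interval.

    Assumes each of num_prompts responses was independently judged as
    containing a hallucination (num_hallucinated of them) or not.
    """
    p = num_hallucinated / num_prompts
    denom = 1 + z**2 / num_prompts
    center = (p + z**2 / (2 * num_prompts)) / denom
    margin = z * math.sqrt(
        p * (1 - p) / num_prompts + z**2 / (4 * num_prompts**2)
    ) / denom
    return p, (center - margin, center + margin)

# Hypothetical result: 7 hallucinated answers out of 200 graded prompts.
rate, (lo, hi) = hallucination_rate(7, 200)
print(f"{rate:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```

The hard part, of course, is not the arithmetic but getting reliable hallucination judgments in the first place.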


The AIs are ironically highlighting something in reverse: many of the ways we benchmark humans are actually kind of sucky.



