
You just state this as if it was obviously true, but I don't see how. Why is using LLM like reading a pop sci book and not like reading a history book? Or even less like either, because you have to continually ask questions to get anything?


A history book is written by someone who knows the topic, and then reviewed by more people who also know the topic, and then it's out there where people can read it and criticize it if it's wrong about the topic.

A question asked to an AI is not reviewed by anyone, and it's ephemeral. The AI can answer "yes" today, and "no" tomorrow, so it's not possible to build a consensus on whether it answers specific questions correctly.


A pop sci book can be written by someone who knows the topic and reviewed by people who know the topic - and a history book can fail to be either.

LLM-generated answers are more comparable to an expert's ad-hoc answers than to written books. But they are much simpler to statistically evaluate and correct. That is how we can know that, on average, LLMs are improving and outperforming human experts on an increasing number of tasks and topics.
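To make "statistically evaluate" concrete, here's a minimal sketch (the eval set and the ask_llm stub are hypothetical, not any particular vendor's API): score a batch of model answers against expert-written reference answers and look at the aggregate, instead of judging any single ephemeral reply.

    # Minimal sketch of statistically evaluating LLM answers against expert
    # references. ask_llm is a hypothetical stub, not a real vendor API.
    from difflib import SequenceMatcher

    def ask_llm(question: str) -> str:
        # Placeholder: a real version would call whatever model you're testing.
        return "stub answer"

    def score(answer: str, reference: str) -> float:
        # Crude string similarity as a stand-in for a real grading rubric
        # (human graders, exact-match checks, or an LLM-as-judge setup).
        return SequenceMatcher(None, answer.lower(), reference.lower()).ratio()

    eval_set = [  # hypothetical questions with expert-written reference answers
        {"q": "In what year was the Peace of Westphalia signed?", "ref": "1648"},
        {"q": "Who wrote 'The Structure of Scientific Revolutions'?", "ref": "Thomas Kuhn"},
    ]

    scores = [score(ask_llm(item["q"]), item["ref"]) for item in eval_set]
    print(f"mean score over {len(scores)} questions: {sum(scores) / len(scores):.2f}")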


In my experience, LLM-generated answers are more comparable to an ad-hoc answer by a human with no special expertise, moderate Google skills, but good bullshitting skills: someone who spends a few minutes searching the web, reads what they find and synthesizes it, waits long enough for the details to get kind of hazy, and then writes up an answer off the top of their head based on that, filling in any missing material by just making something up. They can do this significantly faster than a human undergraduate student could, so if you need someone to do this task very quickly / prolifically it can be beneficial (e.g. it could be effective for generating banter for video game non-player characters, for astroturfing social media, or for cheating on student essays read by an overworked grader). It's not a good way to get expert answers about anything, though.

More specifically: I've never gotten an answer from an LLM to a tricky or obscure question about a subject I already know anything about that seemed remotely competent. The answers to basic and obvious questions are sometimes okay, but also sometimes completely wrong (but confidently stated). When asked follow-up questions the LLM will repeatedly directly contradict itself with additional answers each as wrong as the first, all just as confidently stated.


More like "have already skimmed half of the entire Internet in the past", but yeah. That's exactly the mental model IMO one should have with LLMs.

Of course don't forget that "writing up an answer off the top of their head based on that, filling in any missing material by just making something up" is what everyone does all the time, and in particular it's what experts do in their areas of expertise. How often those snap answers and hasty extrapolations turn out correct is, literally, how you measure understanding.

EDIT:

There's some deep irony here, because with LLMs being "all system 1, no system 2", we're trying to give them the same crutches we use on the road to understanding, but have them move in the opposite direction. Take "chain of thought" - saying "let's think step by step" and then explicitly going through your reasoning is not understanding - it's the direct opposite of it. Think of a student who solves a math problem step by step - they're not demonstrating understanding or mastery of the subject. On the contrary, they're just demonstrating they can emulate understanding by more mechanistic, procedural means.
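For concreteness, zero-shot chain of thought is literally just a change to the prompt - a rough sketch (made-up question and wording, no particular API assumed):

    # Sketch of zero-shot chain-of-thought prompting: the only difference is
    # asking the model to write out intermediate steps before the final answer.
    question = "A train travels 120 km in 1.5 hours. What is its average speed?"

    direct_prompt = f"{question}\nAnswer:"

    cot_prompt = f"{question}\nLet's think step by step, then state the final answer."

    # Either string would be sent to the model; the CoT variant trades the
    # "snap answer" for an explicit, procedural walk-through of the reasoning.
    print(direct_prompt)
    print("---")
    print(cot_prompt)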


Okay, but if you read written work by an expert (e.g. a book published by a reputable academic press or a journal article in a peer-reviewed journal), you get a result whose details were all checked, and which can be relied on to some extent. By following the citation graph you can track down their sources, cross-check claims against other scholars', look up survey sources that put the work in context, think critically about each author's biases, etc., and come to some kind of careful analysis of the work's credibility and an assessment of the truth value of the claims made. By doing careful search and study it's possible to get some sense of the scholarly consensus about a topic and some idea of the level of controversy about various details or interpretations.

If instead you are reading the expert's blog post or hastily composed email or chatting with them on an airplane you get a different level of polish and care, but again you can use context to evaluate the source and claims made. Often the result is still "oh yeah this seems pretty insightful" but sometimes "wow, this person shouldn't be speculating outside of their area of expertise because they have no clue about this".

With LLM output, the appropriate assessment (at least in any that I have tried, which is far from exhaustive) is basically always "this is vaguely topical bullshit; you shouldn't trust this at all".


I am just curious about this. You used the word "never", and I think your claim can be tested: perhaps you could post a list of five obscure questions, then someone could put them to a good LLM for you, and an expert in that field could assess the value of the answers.

Edit: I just submitted an Ask HN post about this.


> I've never gotten an answer from an LLM to a tricky or obscure question about a subject I already know anything about that seemed remotely competent.

Certainly not my experience with the current SOTA. Without something more specific, it's hard to discuss. Feel free to name something that can be looked at.


The same is true of Google, no?


> A question asked to an AI is not reviewed by anyone, and it's ephemeral. The AI can answer "yes" today, and "no" tomorrow, so it's not possible to build a consensus on whether it answers specific questions correctly.

It's even more so with humans! Most of our conversations are, and have always been, ephemeral and unverifiable (and there are plenty of people who want to undo what little permanence and verifiability we still have on the Internet...). Along the dimension of permanence and verifiability, asking an LLM is actually much better than asking a human - there's always a log of the conversation you had with the AI, produced and stored somewhere for at least a while (even if only until you clear your temp folder), and if you can get hold of that log, you can not just verify the answers, you can actually debug the AI. You can rerun the conversation with different parameters, different prompting, perhaps even inspect the inference process itself. You can do that ten times, a hundred times, a million times, and you won't be asked to come to The Hague and explain yourself. Now try that with a human :).
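A rough sketch of what "rerun the conversation with different parameters" can look like, assuming you have the logged transcript and some generate() call for whichever model you can access (the function, its parameters, and the example message are placeholders, not a real vendor API):

    # Sketch of replaying a logged LLM conversation under different sampling
    # parameters to see how stable the answers are. generate() is a placeholder
    # for whatever inference call you actually have access to.
    def generate(messages: list, temperature: float, seed: int) -> str:
        # Placeholder: a real version would run a local model or call an API.
        return f"(model reply at temperature={temperature}, seed={seed})"

    # The saved transcript -- in practice loaded from the conversation log.
    messages = [
        {"role": "user", "content": "Was the Library of Alexandria destroyed in a single fire?"},
    ]

    # Replay the same prompt repeatedly with varied settings and compare outputs.
    for temperature in (0.0, 0.7, 1.0):
        for seed in range(3):
            print(temperature, seed, generate(messages, temperature=temperature, seed=seed))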


The context of my comment was the difference between an AI and a history book. Or, going back to the top comment, between an AI and an expert.

If you want to compare AI with ephemeral unverifiable conversations with uninformed people, go ahead. But that doesn't make them sound very valuable. I believe they are more valuable than that for sure, but how much, I'm not sure.



