More

emil_sorensen · 2025-07-09T10:58:08 1752058688

Docs bots like these are deceptively hard to get right in production. Retrieval is super sensitive to how you chunk/parse documentation and how you end up structuring documentation in the first place (see frontpage post from a few weeks ago: https://news.ycombinator.com/item?id=44311217).

You want grounded RAG systems like Shopify's here to rely strongly on the underlying documents, but also still sprinkle a bit of the magic of the latent LLM knowledge too. The only way to get that balance right is evals. Lots of them. It gets even harder when you are dealing with GraphQL schema like Shopify has since most models struggle with that syntax moreso than REST APIs.

FYI I'm biased: Founder of kapa.ai here (we build docs AI assistants for +200 companies incl. Sentry, Grafana, Docker, the largest Apache projects etc).

chrismorgan · 2025-07-09T11:10:56 1752059456

Why do you say “deceptively hard” instead of “fundamentally impossible”? You can increase the probability it’ll give good answers, but you can never guarantee it. It’s then a question of what degree of wrongness is acceptable, and how you signal that. In this specific case, what it said sounds to me (as a Shopify non-user) entirely reasonable, it’s just wrong in a subtle but rather crucial way, which is also mildly tricky to test.

whatsgonewrongg · 2025-07-09T11:57:09 1752062229

A human answering every question is also not guaranteed to give good answers; anyone that has communicated with customer service knows that. So calling it impossible may be correct, but not useful.

(We tend to have far fewer evals for such humans though.)

girvo · 2025-07-09T12:02:31 1752062551

A human will tell you “I am not sure, and will have to ask engineering and get back to you in a few days”. None of these LLMs do that yet, they’re biased towards giving some answer, any answer.

unshavedyak · 2025-07-09T13:39:30 1752068370

I agree with you, but man i can't help but feel humans are the same depending on the company. My wife was recently fighting with several layers of comcast support over cap changes they've recently made. Seemingly it's a data issue since it's something new that theoretically hasn't propagated through their entire support chain yet, but she encountered a half dozen confidently incorrect people which lacked the information/training to know that they're wrong. It was a very frustrating couple hours.

Generally i don't trust most low paid (at no fault of their own) customer service centers anymore than i do random LLMs. Historically their advice for most things is either very biased, incredibly wrong, or often both.

tenacious_tuna · 2025-07-09T17:29:08 1752082148

In the case of unhelpful human support, I can leverage my experience in communicating with another human to tell if I'm being understood or not. An LLM is much more trial-and-error: I can't model the theory-of-mind behind it's answers to tell if I'm just communicating poorly or whatever else may be being lost in translation, there is no mind at play.

unshavedyak · 2025-07-09T17:50:03 1752083403

That's fair, though with an LLM (at least one you're familiar with) you can shape it's behavior. Which is not too different compared to some black box script that i can't control or reason through with a human support. Granted the LLM will have the same stupid black box script, so in both cases it's weaponized stupidity against the consumer.

dcre · 2025-07-09T12:46:10 1752065170

This is not really true. If you give a decent model docs in the prompt and tell them to answer based on the docs and say “I don’t know” if the answer isn’t there, they do it (most of the time).

SecretDreams · 2025-07-09T13:14:58 1752066898

> most of the time

This is doing some heavy lifting

QuadmasterXLII · 2025-07-09T15:27:48 1752074868

I have never seen this in the wild. Have you?

dcre · 2025-07-09T19:44:29 1752090269

Yes. All the time. I wrote a tool that does it!

https://crespo.business/posts/llm-only-rag/

  $ rgd ~/repos/jj/docs "how can I write a revset to select the nearest bookmark?"

  Using full corpus (length: 400,724 < 500,000)

  # Answer

  gemini-2.5-flash  | $0.03243 | 2.94 s | Tokens: 107643 -> 56

  The provided documentation does not include a direct method to select the
  nearest bookmark using revset syntax. You may be able to achieve this using
  a combination of  ancestors() ,  descendants() , and  latest() , but the
  documentation does not explicitly detail such a method.

dingnuts · 2025-07-09T15:35:25 1752075325

I need a big ol' citation for this claim, bud, because it's an extraordinary one. LLMs have no concept of truth or theory of mind so any time one tells you "I don't know" all it tells you is that the source document had similar questions with the answer "I don't know" already in the training data.

If the training data is full of certain statements you'll get certain sounding statements coming out of the model, too, even for things that are only similar, and for answers that are total bullshit

simonw · 2025-07-09T16:08:08 1752077288

Do you use LLMs often?

I get "I don't know" answers from Claude and ChatGPT all the time, especially now that they have thrown "reasoning" into the mix.

Saying that LLMs can't say "I don't know" feels like a 2023-2024 era complaint to me.

stavros · 2025-07-09T16:47:39 1752079659

Ok, how? The other day Opus spent 35 of my dollars by throwing itself again and again at a problem it couldn't solve. How can I get it to instead say "I can't solve this, sorry, I give up"?

simonw · 2025-07-09T17:29:20 1752082160

That sounds slightly different from "here is a question, say I don't know if you don't know the answer" - sounds to me like that was Opus running in a loop, presumably via Claude Code?

I did have one problem (involving SQLite triggers) that I bounced off various LLMs for genuinely a full year before finally getting to an understanding that it wasn't solvable! https://github.com/simonw/sqlite-chronicle/issues/7

stavros · 2025-07-09T17:56:23 1752083783

It wasn't in a loop really, it was more "I have this issue" "OK I know exactly why, wait" $3 later "it's still there" "OK I know exactly why, it's a different reason, wait", repeat until $35 is gone and I quit.

I would have much appreciated if it could throw its hands up and say it doesn't know.

michaelt · 2025-07-11T16:05:41 1752249941

In a single request? Oof.

I was benchmarking some models the other day via openrouter and I got the distinct impression some of these models treat the thinking token budget as a target rather than a maximum.

stavros · 2025-07-11T16:20:25 1752250825

No no, overall, trying things a few times.

conception · 2025-07-09T18:02:25 1752084145

I solve this by in my prompt. I say if you can’t fix it in two tries look online on how to do it if you still can’t fix it after two tries pause and ask for my help. It works pretty well.

cco · 2025-07-10T04:51:52 1752123112

All of these LLM doc bots do that, they set confidence levels and below them they'll ping humans or tell the user to contact a support team.

I've had good and bad experiences with them thus far, to the other poster's point, just like with human support teams.

axus · 2025-07-09T14:09:54 1752070194

Won't that be cool, when LLM-based AIs ask you for help instead of the other way around

whatsgonewrongg · 2025-07-09T12:38:21 1752064701

You’re right that some humans will, and most LLMs won’t. But humans can be just as confidently wrong. And we incentivize them to make decisions quickly, in a way that costs the company less money.

bee_rider · 2025-07-09T14:26:18 1752071178

Documentation is the thing we created because humans are forgetful and misunderstand things. If the doc bot is to be held to a standard more like some random discord channel or community forum, it should be called something without “doc” in the name (which, fwiw, might just be a name the author of the post came up with, I dunno what Shopify calls it).

ironmagma · 2025-07-09T20:35:05 1752093305

Is that not the very reason to have documentation in the first place, the fact that it is not a human?

intended · 2025-07-09T12:40:01 1752064801

This is to move the goal posts /raise a different issue. We can engage with the new point, but this is to concede that Docs bots are not docs bots.

skrebbel · 2025-07-09T11:02:32 1752058952

Why RAG at all?

We concatenated all our docs and tutorials into a text file, piped it all into the AI right along with the question, and the answers are pretty great. Cost was, last I checked, roughly 50c per question. Probably scales linearly with how much docs you have. This feels expensive but compared to a human writing an answer it's peanuts. Plus (assuming the customer can choose to use the AI or a human), it's great customer experience because the answer is there that much faster.

I feel like this is a no-brainer. Tbh with the context windows we have these days, I don't completely understand why RAG is a thing anymore for support tools.

cube2222 · 2025-07-09T11:41:06 1752061266

This works as long as your docs are below the max context size (and even then, as you approach larger context sizes, quality degrades).

Re cost though, you can usually reduce the cost significantly with context caching here.

However, in general, I’ve been positively surprised with how effective Claude Code is at grep’ing through huge codebases.

Thus, I think just putting a Claude Code-like agent in a loop, with a grep tool on your docs, and a system prompt that contains just a brief overview of your product and brief summaries of all the docs pages, would likely be my go to.

bee_rider · 2025-07-09T14:28:46 1752071326

Oh man, maybe this would cause people to write docs that are easy to grep through. Let’s start up that feedback loop immediately, please.

cluckindan · 2025-07-09T18:49:00 1752086940

How will you grep synonyms or phrases with different word choices?

bee_rider · 2025-07-09T18:59:58 1752087598

I’m hoping that the documentation will be structured in a way such that Claude can easily come up with good grep regexes. If Claude can do it, I can probably do it only a little bit worse.

Rygian · 2025-07-09T11:10:20 1752059420

What you describe sounds like poor man's RAG. Or lazy man's. You're just doing the augmentation at each prompt.

IceDane · 2025-07-09T11:36:36 1752060996

Because llms still suck at actually using all that context at once. And surely you can see yourself that your solution doesn't scale. It's great that it works for your specific case but I'm sure you can come up with a scenario where it's just not feasible.

TZubiri · 2025-07-09T16:42:31 1752079351

That is not particularly cheap, especially since it scales linearly with doc size, and therefore time.

Additionally the quality of loading the context-window decreases linearly as well, just because your model can handle 1M tokens it doesn't mean that it WILL remember 1M tokens, it just means that it CAN

RAG fixes this, in the simplest configuration a RAG can be an index, and the only context you give the LLM is the table of contents, and you let it search through the index.

Should it be a surprise that this is cheaper and more efficient? Loading the context window is like a library having every book open at every page at the same time instead of using the dewey decimal system

llm_nerd · 2025-07-09T13:57:26 1752069446

What you described is RAG. Inefficient RAG, but still RAG.

And it's inefficient in two ways-

-you're using extra tokens for every query, which adds up.

-you're making the LLM less precise by overloading it with potentially irrelevant extra info making it harder for it to needle in a haystack the specific relevant answer.

Filtering (e.g. embedding similarity & BM25) and re-ranking/pruning what you provide to RAG is an optimization. It optimizes the tokens, the processing time, and optimizes the answer in an ideal world. Most LLMs are far more effective if your RAG is limited to what is relevant to the question.

TZubiri · 2025-07-09T16:43:36 1752079416

I don't think it's RAG, RAG is specifically separating the search space from the LLM context-window or training set and giving the LLM tools to search in inference-time.

llm_nerd · 2025-07-09T19:08:05 1752088085

In this case their Retrieval stage is "SELECT *", basically, so sure I'm being loose with the terminology, but otherwise it's just a non-selective RAG. Okay ..AG.

RAG is selecting pertinent information to supply to the LLM with your query. In this case they decided that everything was pertinent, and the net result is just reduced efficiency. But if it works for them, eh.

TZubiri · 2025-07-09T20:10:36 1752091836

I'm not sure we are talking about the same thing. The root comment talks about concatenating all doc files into a loong text string, and adding that as a system/user prompt to the LLM at inference time before the actual question.

You mention the retrieval stage being a SELECT *? I don't think there's any SQL involved here.

llm_nerd · 2025-07-09T20:18:54 1752092334

I was being rhetorical. The R in RAG is filtering augmentation data (the A) for things that might or might not be related to the query. Including everything is just a lazy form of RAG -- the rhetorical SELECT *.

>and adding that as a system/user prompt to the LLM at inference time

You understand this is all RAG is, right? RAG is any additional system to provide contextually relevant (and often more timely) supporting information to a baked model.

People sometimes project RAG out to be a specific combination of embeddings, chunking, vector DBs, etc. But that is ancillary. RAG is simply selecting the augmentation data and supplying it with the question.

Anyways, I think this thread has reached a conclusion and there really isn't much more value in it. Cheers.

cluckindan · 2025-07-10T19:33:13 1752175993

The ”retrieval” refers to information retrieval, which is a technical term:

https://en.wikipedia.org/wiki/Information_retrieval

In that sense, calling ”stuff everything in the context” LLM queries a RAG system is analogous to calling a web crawler a search engine.

llm_nerd · 2025-07-11T02:00:31 1752199231

Sure thing.

TZubiri · 2025-07-09T23:22:16 1752103336

I agree it isn't embeddings or Vector DBs.

I personally define it as not including loading all data in the context-window

Very new field and not a lot of reliable sources. Would be worth it to standardize meaning.

cluckindan · 2025-07-09T11:17:18 1752059838

With RAG the cost per question would be low single-digit pennies.

emil_sorensen · 2025-07-09T12:22:15 1752063735

Accuracy drops hard with context length still. Especially in more technical domains. Plus latency and cost.

PeterStuer · 2025-07-09T12:22:27 1752063747

Indeed. Dabbling in 'RAG' (which for better or worse has become a tag for anything context retrieval) for more complex documentation and more intricate questions, you will very quickly realize that you really need to go far beyond simple 'chunking', and end up with a subsystem that constructs more than one very intricate knowledge graphs for supporting different kinds of questions the users might ask. For example: a simple question such as "What exactly is an 'Essential Entity'? is better handled by Knowledge Representation A as opposed to "Can you provide a gap and risk analysis on my 2025 draft compliance statement (uploaded) in light of the current GDPR, NIS-2 and the AI Act?"

(My domain is regulatory compliance, so maybe this goes beyond pure documentation but I'm guessing pushed far enough the same complexities arise)

dingnuts · 2025-07-09T15:38:16 1752075496

This is sort of hilarious; to use an LLM as a good search interface first build.. a search engine.

I guess this is why Kagi Quick Answer has consistently been one of the best AI tools I use. The search is good, so their agent is getting the best context for the summaries. Makes sense.

PeterStuer · 2025-07-09T17:47:15 1752083235

It is building a system that amplifies the strengths of the LLM by feeding it the right knowledge in the right format at inference time. Context design is both a search (as a generic term for everything retrieval) and a representation problem.

Just dumping raw reams of text into the 'prompt' isn't the best way to great results. Now I am fully aware that anything I can do on my side of the API, the LLM provider can and eventually will do as well. After all, Search also evolved beyond 'pagerank' to thousands of specialized heuristic subsystems.

J_Shelby_J · 2025-07-09T17:53:35 1752083615

“It’s just a chat bot Michael, how much can it cost?”

A philosophy degree later…

I ended up just generating a summary of each of our 1k docs, using the summaries for retrieval, running a filter to confirm the doc is relevant, and finally using the actual doc to generate an answers.

emil_sorensen · 2025-06-18T17:59:34 1750269574

Thanks for the feedback. We should definitely add that. :)

emil_sorensen · 2025-06-18T16:37:46 1750264666

OP here. It's kind of ironic that making the docs AI-friendly essentially just ends up being what good documentation is in the first place (explicit context and hierarchy, self-contained sections, precise error messages).

shafyy · 2025-06-18T20:42:11 1750279331

It's the same for SEO also. Good structure, correct use of HTML elements, quick loading, good accessibility, etc. Sure, there are "tricks" to improve your SEO, but the general principles are also good if you were not doing SEO.

klysm · 2025-06-19T08:22:40 1750321360

And yet in practice SEO slop garbage is SEO slop garbage. Devoid of any real meaning or purpose other than to increase rankings and metrics. Nobody cares if it’s good or useful, but it must appease the algorithm!

bobbiechen · 2025-06-18T19:07:14 1750273634

Related: "If an AI agent can't figure out how your API works, neither can your users" (from my employer's blog)

https://stytch.com/blog/if-an-ai-agent-cant-figure-out-how-y...

thom · 2025-06-18T22:32:38 1750285958

Yeah, I've started to think AI smoke tests for cognitive complexity should be a fundamental part of API/schema design now. Even if you think the LLMs are dumb, Stupidity as a Service is genuinely useful.

truculent · 2025-06-20T16:25:06 1750436706

Is this you have implemented in practice? Sounds like a great idea, but I have no idea how you would make it work it a structured way (or am I missing the point…?)

thom · 2025-06-20T19:59:50 1750449590

Can be easy depending on your setup - you can basically just write high level functional tests matching use cases of your API, but as prompts to a system with some sort of tool access, ideally MCP. You want to see those tests pass, but you want them to pass with the simplest possible prompt (a sort of regularization penalty, if you like). You can mutate the prompts using an LLM if you like to try different/shorter phrasings. The Pareto front of passing tests and prompt size/complexity is (arguably) how good a job you're doing structuring/documenting your API.

truculent · 2025-06-23T11:02:49 1750676569

Lovely idea - thanks

Cthulhu_ · 2025-06-19T09:40:09 1750326009

It's a good tool to use for code reviewing, especially if you don't have peers with Strong Opinions on it.

Which is another issue, indifference. It's hard to find people that actually care about things like API design, let alone multiple that check each other's work. In my experience, a lot of the time people just get lazy and short-circuit the reviews to "oh he knows what he's doing, I'm sure he thought long and hard about this".

jilles · 2025-06-18T18:09:30 1750270170

It's similar for writing code. Suddenly people are articulating their problems to the LLM and breaking it down in smaller sub-problems to solve....

arscan · 2025-06-18T19:07:44 1750273664

In other words, people are discovering the value of standard software engineering practices. Which, I think is a good thing.

corysama · 2025-06-18T18:19:28 1750270768

It has changed how I structure my code. Out of laziness, if I can write the code in such a way that each step follows naturally from what came before, "the code just writes itself!" Except now it's literally true :D

appreciatorBus · 2025-06-18T23:40:42 1750290042

Maybe everyone already discovered this but I find that if I include a lot of detail in my variables names, it's much more likely to autocomplete something useful. If whatever I typed was too verbose for my liking long term, I can always clean it up later with a rename.

taneq · 2025-06-19T04:53:06 1750308786

Reminds me of that Asimov story where the main character was convinced that some public figure was a robot, and kept trying to prove it. Eventually they concluded that it was impossible to tell whether they were actually a robot "or merely a very good man."

starkparker · 2025-06-19T01:01:50 1750294910

From a docs-writing perspective, I've noticed that LLMs in their current state mostly solve the struggle of finding users who both want to participate in studies, are mostly literate, and are also fundamentally incompetent

QRY · 2025-06-18T19:10:16 1750273816

Thank you for sharing this, it's really helpful to have this as top-down learning resource.

I'm in the process of learning how to work with AI, and I've been homebrewing something similar with local semantic search for technical content (embedding models via Ollama, ChromaDB for indexing). I'm currently stuck at the step of making unstructured knowledge queryable, so these docs will come in handy for sure. Thanks again!

troupo · 2025-06-19T08:29:47 1750321787

Also it makes it human-accessible, too. There are now projects converting Apple's JS-heavy documentation sites to markdown for AI consumption.

esafak · 2025-06-18T16:41:50 1750264910

Now people just have a better incentive :)

mooreds · 2025-06-18T17:01:24 1750266084

"GEO[0] has entered the chat."

We see a surprising number of folks who discover our product from GenAI solutions (self-reported). I'm not aware of any great tools that help you dissect this, but I'm sure someone is working on them.

0: Generative Engine Optimization

nlawalker · 2025-06-18T20:25:35 1750278335

Honest question - what do you mean? What's the better incentive?

esafak · 2025-06-18T22:59:04 1750287544

The documentation is now not just for other people, but for your own productivity. If it weren't for the LLM, you might not bother because the knowledge is in your memory. But the LLM does not have access to that yet :)

It's a fortunate turn of events for people who like documentation.

drusepth · 2025-06-18T22:30:54 1750285854

This is also the hilarious part of "prompt engineering".

It's just effective linguistics and speech; what people have called "soft skills" forever is now, obviously, trying to be a science for some reason.

ketzo · 2025-06-18T22:42:30 1750286550

A really effective prompt is created by developing an accurate “mental model” of the model, understanding what tools it does and doesn’t have access to, what gives it effective direction and what leads it astray

Otherwise known as empathy

Cthulhu_ · 2025-06-19T09:44:24 1750326264

It's a bit different though; the soft skills you mention are usually realtime or a chore that people don't like doing (writing down specifications / requirements), whereas "prompt engineering" puts people in their problem solving mental mode not dissimilar to writing code.

(assumption / personal theory)

alganet · 2025-06-19T07:34:38 1750318478

I can divide the suggestions into two categories:

1. Stuff that W3C already researched and defined 20 years ago to make the web better. Acessibility, semantic simple HTML that works with no JS, standard formats. All the stuff most companies just plain ignored or sidelined.

2. Suggestions to workaround obvious limits on current LLM tech (context size, ambiguity, etc).

There's really nothing to talk about category 1, except that a lot of people already said this and they were practically mocked.

Regarding category 2, it's the first stage of AI failure acceptance. "Ok, it can't reliably reason on human content. But what if we make humans write more dumb instead?"

emil_sorensen · 2025-02-26T16:11:29 1740586289

We focus mainly on external use cases (e.g., helping companies like Docker and Monday.com deploy customer facing "Ask AI" assistants) so we don't run into much of that given all data is public.

For internal use cases that require user level permissions that's a freaking rabbit role. I recently heard someone describe Glean as a "permissions company" more so than a search company for that reason. :)

3abiton · 2025-02-27T01:07:55 1740618475

> fine-tuning a model on tool usage could also allow it to gain familiarity with specific retrieval mechanisms.

I am curious if finetuning on specific usecases would outperform RAG approaches, assuming the data is static (say company documentation). I know there has been lots of posts on this, but yet to see quanitifications, especially with o3-mini.

emil_sorensen · 2025-02-26T15:17:05 1740583025

Super cool! Yep, a lot seems to get lost through distillation.

emil_sorensen · 2025-02-26T08:09:10 1740557350

Yep even with a small bump in performance (which we only saw for a subset of coding questions), it wouldn't be worth the huge latency penalty. Though that will surely go down over time.

emil_sorensen · 2025-02-26T06:37:50 1740551870

Curious if anyone else has run similar experiments?

zurfer · 2025-02-26T11:28:11 1740569291

Yes. Our main finding was that o3 mini especially is great on paper but surprisingly hard to prompt, compared to non reasoning models. I don't think it's a problem with reasoning, but rather with this specific model. I also suspect that o3 mini is a rather small model and so it can lack useful knowledge for broad applications. Especially for RAG, it seems that larger and fast models (e.g. gpt4o) perform better as of today.

emil_sorensen · 2025-02-26T12:24:18 1740572658

I suspect you're right here! Excited to get our hands on the non-distilled o3. :)

emil_sorensen · 2024-12-05T08:50:29 1733388629

That's a great point. Reminds me of the "feature, not a bug" Karpathy tweet [0].

[0]: https://x.com/karpathy/status/1733299213503787018?lang=en

mike_hearn · 2024-12-05T09:47:15 1733392035

... which is linked to from the article ;)

He's right but do people really misunderstand this? I think it's pretty clear that the issue is one of over-creativity.

The hallucination problem is IMHO at heart two things that the fine article itself doesn't touch on:

1. The training sets contain few examples of people expressing uncertainty because the social convention on the internet is that if you don't know the answer, you don't post. Children also lie like crazy for the same reason, they ask simple questions so rarely see examples of their parents expressing uncertainty or refusing to answer, and it then has to be explicitly trained out of them. Arguably that training often fails and lots of adults "hallucinate" a lot more than anyone is comfortable acknowledging.

The evidence for this is that models do seem to know their own level of certainty pretty well, which is why simple tricks like saying "don't make things up" can actually work. There's some interesting interpretability work that also shows this, which is alluded to in the article as well.

2. We train one-size-fits all models but use cases vary a lot in how much "creativity" is allowed. If you're a customer help desk worker then the creativity allowed is practically zero, and the ideal worker from an executive's perspective is basically just a search engine and human voice over an interactive flowchart. In fact that's often all they are. But then we use the same models for creative writing, research, coding, summarization and other tasks that benefit from a lot of creative choices. That makes it very hard to teach the model how much leeway it has to be over-confident. For instance during coding a long reply that contains a few hallucinated utility methods is way more useful than a response of "I am not 100% certain I can complete that request correctly" but if you're asking questions of the form "does this product I use have feature X" then a hallucination could be terrible.

Obviously, the compressive nature of LLMs means they can never eliminate hallucinations entirely, but we're so far from reaching any kind of theoretical limit here.

Techniques like better RAG are practical solutions that work for now, but in the longer run I think we'll see different instruct-trained models trained for different positions on the creativity/confidence spectrum. Models already differ quite a bit. I use Claude for writing code but GPT-4o for answering coding related questions, because I noticed that ChatGPT is much less prone to hallucinations than Claude is. This may even become part of the enterprise offerings of model companies. Consumers get the creative chatbots that'll play D&D with them, enterprises get the disciplined rule followers that can be trusted to answer support tickets.

lolinder · 2024-12-05T13:15:11 1733404511

> He's right but do people really misunderstand this?

Absolutely. Karpathy would not have felt obliged to mini-rant about it if he hadn't seen it, and I've been following this space from the beginning and have also seen it way too often.

Laypeople misunderstand this constantly, but far too many "AI engineers" on blogs, HN, and within my company talk about hallucinations in a way that makes it clear that they do not have a strong grounding in the fundamentals of this tech and think hallucinations will be cured someday as models get better.

Edit: scrolling a bit further in the replies to my comment, here's a great example:

https://news.ycombinator.com/item?id=42325795

And another: https://news.ycombinator.com/item?id=42325412

js8 · 2024-12-05T16:54:35 1733417675

I like your analogy with the child. There are different types of human discourse. There is a "helpful free man" discourse where you try to reach the truth. There is a "creative child" discourse where you are play with the world and trying out weird things. There is also a "slave mindset" discourse where you blindly follow orders to satisfy the master, regardless of your own actual opinion on the matter.

emil_sorensen · 2024-12-05T08:48:14 1733388494

I find it so interesting that it's possible to develop a "feeling" of a new model.

emil_sorensen · on March 7, 2024

They don't have to be rendered in the browser, but having all of the structures and section headings help humans to, so I would recommend it. :)

da39a3ee · on March 12, 2024

It can go too far. Too many section headings and it becomes unreadable, like an undergraduate textbook where you're constantly being distracted by sections and boxes.