
In my experience the QA-with-documents pattern is fairly straightforward to implement. 90% of the effort to get to a performant system, however, goes into massaging the documents into semantically meaningful chunks. Most business documents, unlike blog posts and news articles, are not just running text. They have a lot of implicit structure, and when that structure is lost, as happens with typical naive chunkers, much of the contextualized meaning is lost as well.
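To make the point concrete, here is a minimal sketch (not any particular product's code) contrasting a naive fixed-size chunker with a structure-aware one that keeps each heading together with its body text:

```python
def naive_chunks(text: str, size: int = 200) -> list[str]:
    # Splits on a fixed character budget, ignoring document structure.
    return [text[i:i + size] for i in range(0, len(text), size)]

def structured_chunks(text: str) -> list[str]:
    # Treats lines starting with "#" as section headings and keeps
    # each heading together with the text that follows it.
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

doc = "# Refund policy\nRefunds within 30 days.\n# Shipping\nShips in 2 days."
print(structured_chunks(doc))
```

With the structured version, the chunk "Refunds within 30 days." still carries its "Refund policy" heading, which is exactly the implicit context a fixed-size splitter throws away.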



Agree with the point about intelligent chunking being very important! Each individual app connector can choose how it wants to split each `document` into `section`s (important point: this is customized at an app-level). The default chunker then keeps each section as part of a single chunk as much as possible. The goal here is, as you mentioned, to give each chunk the relevant surrounding context.

Additionally, the indexing process is set up as a composable pipeline under the hood. It would be fairly trivial to plug in different chunkers for different sources as needed in the future.
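A hypothetical sketch of what per-source pluggable chunkers could look like; the names and registry here are illustrative, not Danswer's actual API:

```python
from typing import Callable

Chunker = Callable[[str], list[str]]

def by_paragraph(doc: str) -> list[str]:
    # Suits long-form sources like wiki pages.
    return [p for p in doc.split("\n\n") if p.strip()]

def by_line(doc: str) -> list[str]:
    # Suits short, message-style sources.
    return [line for line in doc.splitlines() if line.strip()]

# Each app connector registers the chunker suited to its documents.
CHUNKERS: dict[str, Chunker] = {
    "wiki": by_paragraph,
    "chat": by_line,
}

def index(source: str, doc: str) -> list[str]:
    # Fall back to a default chunker for unknown sources.
    chunker = CHUNKERS.get(source, by_paragraph)
    return chunker(doc)
```

Swapping a chunker is then a one-line change in the registry, which is the kind of flexibility a composable pipeline buys you.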


Chunking is very important but might, I feel, best be contextualised as one aspect of the bigger substantive challenge, which is how to prevent false negatives at the context retrieval stage - a.k.a. how to ensure your (vector? hybrid?) search returns all relevant context to the LLM’s context window.
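One common tactic for reducing false negatives in hybrid search is to union the keyword and vector result lists and merge them with reciprocal rank fusion (RRF). This is an illustrative sketch of the general technique, not a claim about how Danswer does it:

```python
def rrf_merge(keyword_hits: list[str], vector_hits: list[str], k: int = 60) -> list[str]:
    # Reciprocal rank fusion: each list contributes 1 / (k + rank + 1)
    # to a document's score, so a doc found by either retriever survives,
    # and a doc found by both is boosted.
    scores: dict[str, float] = {}
    for hits in (keyword_hits, vector_hits):
        for rank, doc_id in enumerate(hits):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

print(rrf_merge(["a", "b"], ["b", "c"]))
# "b" appears in both lists, so it ranks first.
```

Because the merge is a union rather than an intersection, a relevant chunk only has to be surfaced by one of the two retrievers to reach the LLM's context window.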

Would you mind saying a few words on how Danswer approaches this?


Yes, agreed. Tooling abounds; the real work for anyone who's serious about this is customizing everything so it handles the idiosyncrasies of a customer's documents and questions. I'm happy to talk to anyone who is interested; we are doing something like this for a company now.




