
I want to second this. It seems like document chunking is the most difficult part of the pipeline at this point.

You gave the example of unstructured PDFs, but there are challenges with structured docs as well. We've run into docs that are hard to chunk because of their deeply nested and repeated structure. For example, there might be a long experimental protocol with multiple steps; at the end of each step, there's a "Debugging" table for troubleshooting anything that might have gone wrong in that step. Each debugging table is a natural chunk, except that once chunked there are a dozen such tables that are semantically near-identical when decoupled from their original context and position in the document's tree structure.

This is one example, but there are many other cases where key context for a chunk is nearby in a structured sense, but far away in the flattened document, and therefore completely lost when chunking.
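To make this concrete, here's a rough sketch (not production code, just an illustration) of one common mitigation: prefix each chunk with the titles of its ancestors in the document tree, so a repeated "Debugging" table stays tied to the step it belongs to. The Section type and chunk_with_breadcrumbs helper are made up for the example.

    from dataclasses import dataclass, field

    @dataclass
    class Section:
        title: str
        text: str
        children: list["Section"] = field(default_factory=list)

    def chunk_with_breadcrumbs(node, path=None):
        """One chunk per node, prefixed with its ancestor titles."""
        path = (path or []) + [node.title]
        chunks = []
        if node.text.strip():
            # the breadcrumb keeps the chunk distinguishable from its siblings
            chunks.append("[" + " > ".join(path) + "]\n" + node.text)
        for child in node.children:
            chunks.extend(chunk_with_breadcrumbs(child, path))
        return chunks

    protocol = Section("Protocol X", "", [
        Section("Step 3: Ligation", "Incubate at 16 C overnight.", [
            Section("Debugging", "No colonies: check ligase activity."),
        ]),
    ])
    for c in chunk_with_breadcrumbs(protocol):
        print(c + "\n---")

The same table text now yields distinct chunks like "[Protocol X > Step 3: Ligation > Debugging] ...", so similarity search can still tell them apart.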




Is this an example that could benefit from something like knowledge graph construction or structured entity extraction?

I'm just curious because we have theorized, and seen in practice, that extraction is a way to answer questions that require connecting information across disparate chunks, as you can see in the simple cookbook here [https://r2r-docs.sciphi.ai/cookbooks/knowledge-graph].

Or do you think this is something that can just be solved with more advanced multimodal ingestion?
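To sketch the general idea (simplified, not the actual implementation in the cookbook): extract (subject, relation, object) triples from each chunk and merge them into a single graph that remembers which chunk each edge came from, so a query can traverse connections that never co-occur in any one chunk. call_llm here is just a placeholder for whatever model client you use.

    import json
    import networkx as nx

    def call_llm(prompt):
        raise NotImplementedError("plug in your model client here")

    def extract_triples(chunk):
        prompt = ("Return a JSON list of [subject, relation, object] triples "
                  "found in the following text:\n\n" + chunk)
        return [tuple(t) for t in json.loads(call_llm(prompt))]

    def build_graph(chunks):
        g = nx.MultiDiGraph()
        for i, chunk in enumerate(chunks):
            for subj, rel, obj in extract_triples(chunk):
                # each edge remembers its source chunk, so answers that span
                # chunks can be traced back and traversed together
                g.add_edge(subj, obj, relation=rel, chunk_id=i)
        return g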


I think an LLM could be successful if it weren't just textually aware, but also spatially aware. We know these things chew through forum posts like this one; knowing where the user name goes, where the body text sits, where the submit button is, etc., might be foundational to getting actual problem in, problem out.
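Something like pairing each text span with its page position and role, which is roughly the kind of input layout-aware document models already consume. The field names here are made up for illustration:

    from dataclasses import dataclass

    @dataclass
    class SpatialSpan:
        text: str
        page: int
        bbox: tuple  # (x0, y0, x1, y1) in page coordinates
        role: str    # e.g. "username", "comment_body", "submit_button"

    def serialize(spans):
        """Flatten spans into model input that keeps position and role with the text."""
        return "\n".join(
            f"[page {s.page} | {s.role} | bbox={s.bbox}] {s.text}" for s in spans
        )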




