Do you also see the ingestion process as the key challenge for many RAG systems to avoid "garbage in, garbage out"? How does R2R handle accurate data extraction for complex and diverse document types?
We have a customer who has hundreds of thousands of unstructured and diverse PDFs (containing tables, forms, checkmarks, images, etc.), and they need to accurately convert these PDFs into markdown for RAG usage.
Traditional OCR approaches fall short in many of these cases, so we've started using a combined multimodal LLM + OCR approach that has led to promising accuracy and consistency at scale (ping me if you want to give this a try). The RAG system itself is not a big pain point for them, but the accurate and efficient extraction and structuring of the data is.
If anyone wants to handle the whole RAG pipeline, from loading at the source, through extraction, to sending the processed data to a destination/API, try Unstract [2] (it is open source).
We agree that ingestion and extraction are a big part of the problem for building high quality RAG.
We've talked to a lot of different developers about these problems and haven't found a general consensus on what features are needed, so we are still evaluating advanced approaches.
For now our implementation is more general and designed to work across a variety of documents. R2R was designed to be very easy to override with your own custom parsing logic for these reasons.
Lastly, we have been focusing a lot of our effort on knowledge graphs since they provide an alternative way to enhance RAG systems. We are training our own model for triples extraction that will combine with the automatic knowledge graph construction in R2R. We are planning to release this in the coming weeks and are currently looking for beta testers (we made a signup form for anyone interested: https://forms.gle/g9x3pLpqx2kCPmcg6).
I'm actually curious what the common patterns for RAG have been. I see a lot of progress in tooling but I have seen relatively few use cases or practical architectures documented.
I want to second this. It seems like document chunking is the most difficult part of the pipeline at this point.
You gave the example of unstructured PDFs, but there are challenges with structured docs as well. We’ve run into docs that are hard to chunk because of their deeply nested and repeated structure. For example, there might be a long experimental protocol with multiple steps; at the end of each step, there’s a “Debugging” table for troubleshooting anything that might have gone wrong in that step. The debugging table is a natural chunk, except that once chunked there are a dozen such tables that all look semantically similar when decoupled from their original context and position in the tree structure of the document.
This is one example, but there are many other cases where key context for a chunk is nearby in a structured sense, but far away in the flattened document, and therefore completely lost when chunking.
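To make the lost-context problem concrete, here is a minimal sketch of one possible mitigation: prefix every chunk with its path in the document tree, so two "Debugging" tables stay distinguishable after chunking. The `Node` class, `chunk_with_context`, and the sample protocol are all made up for illustration; this is not how R2R or any particular library does it.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical document-tree node: a heading, its own text, and sub-sections."""
    title: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)


def chunk_with_context(node, path=()):
    """Yield one chunk per node that has text, prefixed with its ancestor headings."""
    here = path + (node.title,)
    if node.text:
        yield " > ".join(here) + "\n" + node.text
    for child in node.children:
        yield from chunk_with_context(child, here)


# Made-up protocol with a repeated "Debugging" section under each step.
protocol = Node("Protocol", children=[
    Node("Step 1: Lysis", "Add buffer and incubate.", [
        Node("Debugging", "Low yield: extend incubation."),
    ]),
    Node("Step 2: Elution", "Elute with water.", [
        Node("Debugging", "Low yield: warm the eluent."),
    ]),
])

chunks = list(chunk_with_context(protocol))
for chunk in chunks:
    print(chunk, end="\n\n")
```

Each "Debugging" chunk now carries the step it belongs to in its first line, which gives the embedding and the retriever something to separate them on. Whether a breadcrumb alone is enough context will depend on the documents.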
Is this an example that could benefit from something like knowledge graph construction or structured entity extraction?
I'm just curious because we have theorized and seen in practice that extraction is a way to answer questions which require connected information across disparate chunks, like you can see in the simple cookbook here [https://r2r-docs.sciphi.ai/cookbooks/knowledge-graph].
Or do you think this is something that can just be solved with more advanced multimodal ingestion?
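As a rough illustration of the knowledge-graph side of that question, here is a sketch of how extracted triples can link facts that live in different chunks. The triples, entity names, and chunk ids are invented, and `connect` is a hypothetical helper; the extraction step itself is assumed to have happened elsewhere, and none of this is R2R's API.

```python
from collections import defaultdict, deque

# Hypothetical triples as an extraction model might emit them:
# (subject, predicate, object, id of the chunk the fact came from)
triples = [
    ("Step 3", "uses", "Buffer B", "chunk-07"),
    ("Buffer B", "contains", "EDTA", "chunk-21"),
    ("EDTA", "inhibits", "Enzyme X", "chunk-40"),
]

# Directed adjacency list keyed by subject.
graph = defaultdict(list)
for subj, pred, obj, chunk_id in triples:
    graph[subj].append((pred, obj, chunk_id))


def connect(start, goal):
    """Breadth-first search from start to goal.

    Returns the list of (subject, predicate, object, chunk_id) hops,
    or None if goal is not reachable from start.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        entity, hops = queue.popleft()
        if entity == goal:
            return hops
        for pred, obj, chunk_id in graph[entity]:
            if obj not in seen:
                seen.add(obj)
                queue.append((obj, hops + [(entity, pred, obj, chunk_id)]))
    return None


hops = connect("Step 3", "Enzyme X")
source_chunks = [hop[3] for hop in hops]
print(source_chunks)
```

The point is only that a question like "could Step 3 affect Enzyme X?" needs three chunks that a similarity search on any one of them would not necessarily return together; the graph gives you the chunk ids to pull.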
I think an LLM could be successful if it wasn't just textually aware, but also spatially aware. Like, we know these things just chew through forum posts like this one. Knowing where the user name goes, where the body of text is, where the submit button is, etc., might be foundational to getting actual "problem in, problem out".
PaddleOCR seemed to be a good library for locating and translating text. I've been puzzling over how to translate something like a simple letter form into an LLM-translatable format.
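One naive way to approach the form-to-text part, as a sketch: take the text boxes an OCR step gives you and lay them out on a character grid, so the relative positions survive in plain text. The `(text, x, y)` tuple format, the `layout_to_text` name, and the tolerance values are assumptions for illustration, not PaddleOCR's actual output format.

```python
def layout_to_text(boxes, row_tol=5, char_width=10):
    """Serialize OCR boxes into text that roughly preserves layout.

    boxes: list of (text, x, y), with x, y the top-left corner in pixels
    (a made-up input format). Boxes whose y is within row_tol pixels of the
    row's first box share a line; x is mapped to a column via char_width.
    """
    rows = []  # each entry: [anchor_y, [(x, text), ...]]
    for text, x, y in sorted(boxes, key=lambda b: (b[2], b[1])):
        if rows and abs(y - rows[-1][0]) <= row_tol:
            rows[-1][1].append((x, text))
        else:
            rows.append([y, [(x, text)]])

    lines = []
    for _, items in rows:
        line = ""
        for x, text in sorted(items):
            col = x // char_width
            # Pad out to the target column; always keep at least one
            # space between two boxes on the same line.
            pad = max(col - len(line), 1 if line else 0)
            line += " " * pad + text
        lines.append(line)
    return "\n".join(lines)


# Made-up boxes for a tiny form.
boxes = [
    ("Name:", 0, 0), ("Jane Doe", 100, 2),
    ("Date:", 0, 30), ("2024-01-01", 100, 31),
    ("[x] Approved", 0, 60),
]
print(layout_to_text(boxes))
```

This ignores rotated text, multi-column pages, and tables with spanning cells, so treat it as a starting point rather than a solution.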
I think the serious problem is that most of these LLMs are already built on top of garbage, so you're already the GI (garbage in) and just trying to match that as best you can.
I built a library around this problem [1]. I recently did some experimenting with PaddleOCR but found the results very underwhelming (no spacing between text) - seems like it's heavily optimized for Chinese. There was a 3 year old GitHub issue around it and seems like it still has this issue out of the box. I'd be curious to hear other people's experience with it.
I run into the same issue with an internal company RAG: all the data is unstructured and in PDFs, and even once converted to markdown, the documents still need fine-tuning and a lot of manual intervention.
It feels like we are inching closer to automating this type of thing, or at the very least brute-forcing it, like the LLM race is trying to do with bigger models and larger contexts.
Will have to play with this over a weekend and see what it might help me with :)