Show HN: R2R V2 – An open source RAG engine with prod features

hubraumhugo · 2024-06-26T16:12:18 1719418338

Do you also see the ingestion process as the key challenge for many RAG systems to avoid "garbage in, garbage out"? How does R2R handle accurate data extraction for complex and diverse document types?

We have a customer who has hundreds of thousands of unstructured and diverse PDFs (containing tables, forms, checkmarks, images, etc.), and they need to accurately convert these PDFs into markdown for RAG usage.

Traditional OCR approaches fall short in many of these cases, so we've started using a combined multimodal LLM + OCR approach that has led to promising accuracy and consistency at scale (ping me if you want to give this a try). The RAG system itself is not a big pain point for them, but the accurate and efficient extraction and structuring of the data is.

constantinum · 2024-06-26T18:09:51 1719425391

Anyone here trying to solve the extraction/parsing problem for RAG, do try LLMWhisperer [1].

Try it with complex layout documents -> https://pg.llmwhisperer.unstract.com/

If anyone wants to solve for RAG right from loading from source, extraction, and sending processed data to destination/API, try Unstract [2] (it is open-source)

[1] https://unstract.com/llmwhisperer/

[2] https://github.com/Zipstack/unstract

ocolegro · 2024-06-26T16:44:44 1719420284

We agree that ingestion and extraction are a big part of the problem for building high quality RAG.

We've talked to a lot of different developers about these problems and haven't found a general consensus on what features are needed, so we are still evaluating advanced approaches.

For now our implementation is more general and designed to work across a variety of documents. R2R was designed to be very easy to override with your own custom parsing logic for these reasons.

Lastly, we have been focusing a lot of our effort on knowledge graphs since they provide an alternative way to enhance RAG systems. We are training our own model for triples extraction that will combine with the automatic knowledge graph construction in R2R. We are planning to release this in the coming weeks and are currently looking for beta testers [we made a signup form, here - https://forms.gle/g9x3pLpqx2kCPmcg6 for anyone interested]

tootie · 2024-06-26T19:55:19 1719431719

I'm actually curious what the common patterns for RAG have been. I see a lot of progress in tooling but I have seen relatively few use cases or practical architectures documented.

LifeIsBio · 2024-06-26T16:55:54 1719420954

I want to second this. It seems like document chunking is the most difficult part of the pipeline at this point.

You gave the example of unstructured PDF, but there are challenges with structured docs as well. We’ve run into docs that are hard to chunk because of this deeply nested and repeated structure. For example, there might be a long experimental protocol with multiple steps; at the end of each step, there’s a table “Debugging” for troubleshooting anything that might have gone wrong in that step. The debugging table is a natural chunk, except that once chunked there are a dozen such tables that are semantically similar when decoupled from their original context and position in the tree structure of the document.

This is one example, but there are many other cases where key context for a chunk is nearby in a structured sense, but far away in the flattened document, and therefore completely lost when chunking.
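One common workaround for the lost-context problem described above is to carry each chunk's heading trail along with its text. The following is only a minimal sketch under assumptions: it assumes the document has already been converted to markdown with `#` headings, and `Chunk` and `chunk_markdown` are hypothetical names, not part of R2R or any library mentioned in this thread.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    path: list[str]  # heading trail from the document root down to this chunk
    text: str

    def for_embedding(self) -> str:
        # Prepend the heading trail so otherwise identical tables stay distinguishable.
        return " > ".join(self.path) + "\n" + self.text


def chunk_markdown(doc: str) -> list[Chunk]:
    """Split on markdown headings, remembering each chunk's ancestor headings."""
    chunks: list[Chunk] = []
    trail: list[str] = []  # current heading stack
    body: list[str] = []   # lines collected under the current heading

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk(path=list(trail), text=text))
        body.clear()

    for line in doc.splitlines():
        if line.startswith("#"):
            flush()
            level = len(line) - len(line.lstrip("#"))
            title = line.lstrip("#").strip()
            # Drop headings at the same or deeper level, then push this one.
            del trail[level - 1:]
            trail.append(title)
        else:
            body.append(line)
    flush()
    return chunks
```

With this, two "Debugging" tables with identical text still produce different strings from `for_embedding()`, because each one carries the step it belongs to.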

ocolegro · 2024-06-26T17:16:20 1719422180

Is this an example that could benefit from something like knowledge graph construction or structured entity extraction?

I'm just curious because we have theorized and seen in practice that extraction is a way to answer questions which require connected information across disparate chunks, like you can see in the simple cookbook here [https://r2r-docs.sciphi.ai/cookbooks/knowledge-graph].

Or do you think this is something that can just be solved with more advanced multimodal ingestion?
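To make the "connected information across disparate chunks" idea concrete, here is a rough sketch of answering a multi-hop question from extracted triples. It is not R2R's knowledge graph implementation: the triple format, the predicate names (`worked_at`, `founded`, `sector`), and the functions `build_index` and `people_who` are all assumptions made up for illustration.

```python
from collections import defaultdict

Triple = tuple[str, str, str]  # (subject, predicate, object)


def build_index(triples: list[Triple]) -> dict[tuple[str, str], set[str]]:
    # Map (subject, predicate) to its objects, merging facts from every chunk.
    index: dict[tuple[str, str], set[str]] = defaultdict(set)
    for subject, predicate, obj in triples:
        index[(subject, predicate)].add(obj)
    return index


def people_who(index: dict[tuple[str, str], set[str]], employer: str, sector: str) -> list[str]:
    """People who worked at `employer` and founded a company in `sector`."""
    people = {s for (s, p), objs in index.items() if p == "worked_at" and employer in objs}
    hits = []
    for person in sorted(people):
        companies = index.get((person, "founded"), set())
        # Second hop: follow each founded company to its sector.
        if any(sector in index.get((c, "sector"), set()) for c in companies):
            hits.append(person)
    return hits
```

Each of the three facts needed for one answer can come from a different chunk; the index is what joins them.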

cyanydeez · 2024-06-26T20:24:15 1719433455

I think an LLM could be successful if it wasn't just textually aware, but also spatially aware. Like, we know these things just chew through forum posts like this one. Knowing where the user name goes, the body of text, the submit button, etc., might be foundational to actual problem in, problem out.

cpursley · 2024-06-26T21:36:46 1719437806

I'm really interested in learning more about this (multimodal LLM + OCR approach for PDFs), do you have a writeup anywhere or something open source?

cyanydeez · 2024-06-26T20:21:38 1719433298

PaddleOCR seemed to be a good library for locating and translating text. I've been puzzling over how to translate something like a simple letter form into an LLM-translatable format.

I think the serious problem is most of these LLMs are already built on-top of garbage so you're already the GI and just trying to match that as best you can.

serjester · 2024-06-26T20:27:35 1719433655

I built a library around this problem [1]. I recently did some experimenting with PaddleOCR but found the results very underwhelming (no spacing between text) - seems like it's heavily optimized for Chinese. There was a 3 year old GitHub issue around it and seems like it still has this issue out of the box. I'd be curious to hear other people's experience with it.

[1] https://github.com/Filimoa/open-parse/

lacoolj · 2024-06-26T17:30:02 1719423002

I run into the same issue with an internal company RAG: all unstructured data in PDFs, and even once converted to markdown, they still need fine-tuning and a lot of manual intervention.

It feels like we are inching closer to automating this type of thing, or at the very least brute-forcing it in like the LLM race is trying to do with bigger models and larger contexts.

Will have to play with this over a weekend and see what it might help me with :)

ocolegro · 2024-06-26T19:49:35 1719431375

Awesome - interested to hear your thoughts / feelings after you get a chance to try it out.

machiaweliczny · 2024-06-26T16:45:59 1719420359

Try sonnet 3.5 image understanding.

ocolegro · 2024-06-26T17:09:38 1719421778

Have you tried it out yet? How does it compare with gpt-4o?

davedx · 2024-06-26T19:37:36 1719430656

Danswer supports pdf natively, I’ve been trialing it and it works pretty well

jonathan-adly · 2024-06-26T16:03:53 1719417833

This is excellent. I have been running a very similar stack for 2 years, and you got all the tricks of the trade. Pgvector, HyDE, Web Search + document search. Good dashboard with logs and analytics.

I am leaving my position, and I recommended this to basically replace me with a junior dev who can just hit the API endpoints.

michaelmior · 2024-06-26T16:57:20 1719421040

As someone with no experience with RAG in production, I'm curious how effective you've found HyDE to be in practice.

ocolegro · 2024-06-26T17:24:34 1719422674

I can't answer for the kindly poster above (ty), but from our experience techniques like HyDE are great when you are getting a lot of comparative questions.

For instance, if a user asks "How does A compare to B" then the query expansion element of HyDE is incredibly useful. The actual value of translating queries into answers for embedding is a bit unclear, since most embedding models we are using have been ft'ed to map queries onto answers.
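For anyone unfamiliar with HyDE, the flow is roughly: have an LLM write a hypothetical answer, embed that answer instead of the raw query, and search with the resulting vector. The sketch below only shows the control flow. The bag-of-words `embed` and the canned `fake_llm` are toy stand-ins I made up so the example runs on its own; a real setup would call an embedding model and an LLM, and this is not how R2R implements it.

```python
import math
from collections import Counter
from typing import Callable


def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would call an embedding model.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hyde_search(query: str, docs: list[str], generate: Callable[[str], str], k: int = 2) -> list[str]:
    # 1. Ask an LLM for a hypothetical answer (it may be factually wrong).
    hypothetical = generate(query)
    # 2. Embed the hypothetical answer instead of the raw query.
    q_vec = embed(hypothetical)
    # 3. Retrieve the real documents closest to that embedding.
    ranked = sorted(docs, key=lambda d: cosine(q_vec, embed(d)), reverse=True)
    return ranked[:k]


def fake_llm(query: str) -> str:
    # Hypothetical stand-in for an LLM call; returns a canned "answer".
    return "pgvector stores vectors in postgres while a dedicated vector database runs as a separate service"
```

Note how a "How does A compare to B" query gets expanded into text that mentions both sides, which is the query expansion effect described above.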

qeternity · 2024-06-26T17:27:07 1719422827

Not GP but HyDE is a crutch for having poor semantic indexing imho. Most people just take raw chunks and embed those. You really need a robust preprocessing pipeline.

vanillax · 2024-06-26T21:18:12 1719436692

The quick start is definitely not quick. You really should provide a batteries-included docker compose with a Postgres image ( docker.io/tensorchord/pgvecto-rs:pg14-v0.2.0 )

If I want to use the dashboard I have to clone another repo? 'git clone git@github.com:SciPhi-AI/R2R-Dashboard.git'? Why not make it available in a docker container, so that if I'm only interested in RAG I can plug into the docker container for the dashboard?

This project feels like a collection of a lot of things that's not really providing any extra ease to development. It feels more like joining a new company and trying to find all the repos and set everything up.

This really looks cool, but I'm struggling to figure out if it's an SDK or a suite of apps or both. In the latter case the suite of apps is really confusing: if I still have to write all the Python, then it feels more like an SDK?

Perhaps provide a better "1 click" install experience to preview/showcase all the features and then let devs leverage R2R later...

ocolegro · 2024-06-26T21:25:02 1719437102

thanks, this is really solid feedback - we will make a more inclusive docker image to make the setup easier/faster.

Think of R2R as an SDK with an out of the box admin dashboard / playground that you can plug into.

rahimnathwani · 2024-06-26T21:41:32 1719438092

The installation instructions should be:

1. Download this docker compose file.

2. Run docker compose using this command.

3. Upload your first file (or folder) of content using this command.

It's fine to have to pip install the client, but it might be worth also providing an example curl command for uploading an HTML/text/PDF file.

The quickstart confused me because it started with python -m r2r.quickstart.example or something. It wasn't clear why I need to run some quickstart example, or how I would specify the location of my doc(s) or what command to run to index docs for real. Sure I could go read the source, but then it's not really a quick start.

Also it would be good to know:

- how to control chunk size when uploading a new document

- what type(s) of search are supported. You mention something about hybrid search, but the quickstart example doesn't explain how to choose the type of search (I guess it defaults to vector search).

HTH

ocolegro · 2024-06-26T22:22:42 1719440562

Thanks, I agree that would be a more streamlined introduction.

The quickstart clearly has too much content in retrospect, and the feedback here makes it clear we should simplify.

ocolegro · 2024-06-27T03:00:16 1719457216

new docs are out if anyone was still wanting that, thanks.

A4ET8a8uTh0 · 2024-06-27T04:18:03 1719461883

GP quote

<< 1. Download this docker compose file.

<< 2. Run docker compose using this command.

<< 3. Upload your first file (or folder) of content using this command.

I think I will throw in the towel for now ( tomorrow is just a regular workday and I need some sleep:D ). I went the docker route with local ollama. Everything seems up, but I get an almost empty page.

To your point, I did not see the stuff GP asked for ( this is the file, this is how you run it and so on ). If I missed that, please let me know. I might be going blind at this point.

Will try again tomorrow, sleep well HN.

vanillax · 2024-07-05T13:45:26 1720187126

I did follow up and try this and all my issues are resolved. Thanks!

doctorpangloss · 2024-06-27T03:54:35 1719460475

Do you really need pgvecto-rs? It isn't supported on RDS, Google, Azure, etc. It complicates deployment everywhere.

ldjkfkdsjnv · 2024-06-26T15:50:16 1719417016

This looks great, will be giving it a shot today. Not to throw cold water on the release, but I have been looking at different RAG platforms. Anyone have any insight into which is the flagship?

It really seems like document chunking is not a problem that can be solved well generically. And RAG really hinges on which documents get retrieved/the correct metadata.

Current approaches around this seem to be using a ReRanker, where we fetch a ton of information and prune it down. But still, document splitting is tough. Especially when you start to add transcripts of video that can be a few hours long.
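For reference, the fetch-a-lot-then-prune pattern looks roughly like this. It is a sketch only: `word_overlap` and `phrase_match` are toy scorers invented here so the example is self-contained. In practice the first stage would be a vector or BM25 search and the second a cross-encoder or a hosted reranking API.

```python
from typing import Callable


def retrieve_then_rerank(
    query: str,
    docs: list[str],
    cheap_score: Callable[[str, str], float],
    precise_score: Callable[[str, str], float],
    fetch_k: int = 50,
    keep_k: int = 5,
) -> list[str]:
    # Stage 1: over-fetch with a fast, recall-oriented scorer.
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:fetch_k]
    # Stage 2: re-score only the candidates with a slower, more precise scorer.
    return sorted(candidates, key=lambda d: precise_score(query, d), reverse=True)[:keep_k]


def word_overlap(query: str, doc: str) -> float:
    # Toy first-stage scorer: number of shared words.
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def phrase_match(query: str, doc: str) -> float:
    # Toy stand-in for a reranking model: rewards the exact phrase.
    return 1.0 if query.lower() in doc.lower() else 0.0
```

The reranker only ever sees `fetch_k` candidates, so it cannot recover a passage that bad chunking kept out of the first stage, which is why splitting still matters.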

SubiculumCode · 2024-06-26T17:06:02 1719421562

I've been interested in building a RAG for my documents, but as an academic project I do not have the funds to spend on costly APIs like a lot of RAG projects out there depend on, not just for the LLM part, but for the reranking, chunking, etc., like those from Cohere.

Can R2R be built with all processing steps implementing local "open" models?

ocolegro · 2024-06-26T17:07:20 1719421640

Yes, there is a guide to running R2R with local models here - https://r2r-docs.sciphi.ai/cookbooks/local-rag

SubiculumCode · 2024-06-26T20:00:41 1719432041

awesome!

davedx · 2024-06-26T15:17:25 1719415045

I’ve checked out quite a few RAG projects now and what I haven’t seen really solved is ingestion, it’s usually like “this is an endpoint or some connectors, have fun!”.

How do I do a bulk/batch ingest of say, 10k html documents into this system?

ocolegro · 2024-06-26T15:58:05 1719417485

All the pipelines are async, so for ingestion we have typically seen that R2R can saturate the vector db or embedding provider. We don't yet have backpressure so it is up to the client to rate limit.
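Since rate limiting is left to the client for now, one way to do it is to batch the files and cap the number of requests in flight with a semaphore. This is a sketch under assumptions: `ingest_with_limit` is a hypothetical helper, and `ingest_batch` stands for whatever async call actually sends a batch to the server (`fake_ingest` is a placeholder for it, not an R2R function).

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def ingest_with_limit(
    paths: Sequence[str],
    ingest_batch: Callable[[Sequence[str]], Awaitable[None]],
    batch_size: int = 100,
    max_in_flight: int = 4,
) -> int:
    """Send files in batches, never running more than max_in_flight requests at once."""
    gate = asyncio.Semaphore(max_in_flight)

    async def send(batch: Sequence[str]) -> int:
        async with gate:  # wait here when too many requests are pending
            await ingest_batch(batch)
        return len(batch)

    batches = [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]
    done = await asyncio.gather(*(send(b) for b in batches))
    return sum(done)


async def fake_ingest(batch: Sequence[str]) -> None:
    # Hypothetical stand-in for the real ingestion call.
    await asyncio.sleep(0)
```

For 10k HTML files that would be something like `asyncio.run(ingest_with_limit(paths, fake_ingest))` with the placeholder swapped for a real call, tuning `batch_size` and `max_in_flight` to what the embedding provider tolerates.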

Ingestion is pretty straightforward, you can call R2R directly or use the client-server interface to pass the html files in directly to the ingest_files endpoint (https://r2r-docs.sciphi.ai/api-reference/endpoint/ingest_fil...).

The data parsers are all fairly simple and easy to customize. Right now we use bs4 for handling HTML but have been considering other approaches.

What specific features around ingestion have you found lacking?

davedx · 2024-06-26T19:39:29 1719430769

Thanks, I’ll give it a try!

vintagedave · 2024-06-26T15:55:29 1719417329

I'd like to know this too. A quick: "take these docs as input, ingest and save, now sit there providing an API to get results" service guide.

ocolegro · 2024-06-26T16:13:19 1719418399

Take a look here - https://r2r-docs.sciphi.ai/quickstart#ingest-data and here https://r2r-docs.sciphi.ai/cookbooks/client-server#ingest-do...

Since multiple people have requested this, we are pushing a quick change to emphasize it in the docs.

vintagedave · 2024-06-26T16:20:40 1719418840

Thank you. My own comment giving a quickstart scenario was downvoted :( https://news.ycombinator.com/item?id=40801453 but I saw you kindly replied to it! Thank you, I appreciate that.

shepardrtc · 2024-06-26T15:36:56 1719416216

LlamaIndex can ingest directories if you want to do bulk.

namanyayg · 2024-06-26T15:24:47 1719415487

What do you want to do with the data after ingesting?

p1esk · 2024-06-26T17:18:30 1719422310

"What were the UK's top exports in 2023?"

"List all YC founders that worked at Google and now have an AI startup."

How to check the accuracy of the answers? Is there some kind of a detailed trace of how the answer was generated?

ocolegro · 2024-06-26T17:29:26 1719422966

great question, I can talk about how we do the more challenging "List all YC founders that worked at Google and now have an AI startup."

For this we have a target dataset (the YC co directory) that we have around 100 questions over. We have found that when feeding an entire company listing in along with a single question we can get an accurate single answer (needle in haystack problem).

So to build our evaluation dataset we feed each question with each sample into the cheapest LLM we can find that reliably handles the job. We then aggregate the results.

This is not perfect but it allows us to have a way to benchmark our knowledge graph construction and querying strategy so that we can tune the system ourselves.
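In code, the procedure described above amounts to a loop over (question, record) pairs followed by an aggregation per question. The sketch below is an illustration only: `build_eval_set` is a hypothetical name, the record fields (`name`, `bio`, `description`) are assumed, and `keyword_judge` is a trivial placeholder for the cheap LLM call.

```python
from typing import Callable


def build_eval_set(
    questions: list[str],
    records: list[dict],
    judge: Callable[[str, dict], bool],
) -> dict[str, list[str]]:
    """For every question, collect the names of records the judge says match."""
    expected: dict[str, list[str]] = {}
    for question in questions:
        # One small call per (question, record) pair keeps each prompt short.
        expected[question] = [r["name"] for r in records if judge(question, r)]
    return expected


def keyword_judge(question: str, record: dict) -> bool:
    # Hypothetical stand-in for a cheap LLM that answers yes/no about one record.
    return "Google" in record["bio"] and "AI" in record["description"]
```

The resulting mapping from question to expected names can then be compared against what the knowledge graph query returns for the same question.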

p1esk · 2024-06-26T18:54:34 1719428074

OK, so you have a way to evaluate the accuracy and convince yourself that it probably works as expected. But what about me, a user? How can I check that the question I asked was answered correctly?

GTP · 2024-06-26T19:34:19 1719430459

I think there's no substitute for doing your own research and comparing the results.

p1esk · 2024-06-26T20:58:12 1719435492

I just want to avoid putting one black box on top of another if possible.

sandeepnmenon · 2024-06-26T15:43:13 1719416593

Could you provide more details on the multimodal data ingestion process? What types of data can R2R currently handle, and how are non-text data types embedded? Can the ingestion be streaming from logs?

ocolegro · 2024-06-26T17:13:48 1719422028

Currently R2R has out of the box logic for the following:

csv, docx, html, json, md, pdf, pptx, txt, xlsx, gif, jpg, png, svg, mp3, mp4.

There are a lot of good questions around ingestion today, so we will likely figure out how to intelligently expand this.

For mp3s we use whisper to transcribe, for videos we transcribe with whisper and sample frames to "describe" with a multimodal model. For images we again transcribe to a thorough text description - https://r2r-docs.sciphi.ai/cookbooks/multimodal

We have been testing multi-modal embedding models and open source models to do the description generation. If anyone has suggestions on SOTA techniques that work well at scale we would love to chat and work to implement these. Long run we'd like the system to be able to handle multi-modal data locally.

Kluless · 2024-06-26T14:33:29 1719412409

Interesting. Can you talk a bit about how the process is faster/better optimized for the dev teams? Sounds like there's a big potential to accelerate time to MVP.

ocolegro · 2024-06-26T14:52:04 1719413524

Sure, happy to.

R2R is built around RESTful API and is dockerized, so devs can get started on app development immediately.

The system was designed so that devs can typically scale data ingestion up to provider bottlenecks w/out extra work.

We have implemented user-level permissions and high level document management alongside the vector db, which most devs need to build in a production setting, along with the API and data ingestion scaling.

Lastly, we also log every search and RAG completion that flows through the system. This is really important to find weaknesses and tune the system over time. Most devs end up needing an observability solution for their RAG.

All of these connect to an open source developer dashboard that allows you to see uploaded files, test different configs, etc.

These basic features mean that devs can spend more time on iterating / customizing their application specific features like custom data ingestion, hybrid search and advanced RAG.

FriendlyMike · 2024-06-26T18:03:08 1719424988

Is there a way to work with source code? I've been looking for a rag solution that can understand the graph of code. For example "what analytics events get called when I click submit"

ocolegro · 2024-06-26T18:13:36 1719425616

No, we don't have any explicit code graph tools. Sourcegraph might be a good starting point for you, their SCIP indices are pretty nice.

causal · 2024-06-26T16:20:12 1719418812

Have you integrated with any popular chat front-ends, e.g. OpenWebUI?

ocolegro · 2024-06-26T17:09:14 1719421754

No not yet, I've had difficulty getting these different providers to work together on integrations. If you have any suggestions we are all ears.

In the meantime we've built our own dashboard which shows ingested documents, and has a customizable chat interface - https://github.com/SciPhi-AI/R2R-Dashboard.

It's still a bit rough though.

jhoechtl · 2024-06-26T19:26:23 1719429983

Get neo4j out and count me in. No need for that resource hog.

ocolegro · 2024-06-26T19:31:56 1719430316

It's an optional dep used for KGs.

Onawa · 2024-06-26T20:51:33 1719435093

What about swapping out neo4j for EdgeDB? Then you get to keep using Postgres with PG vector, and get knowledge graph all in one shot.

vintagedave · 2024-06-26T15:57:55 1719417475

> R2R is a lightweight repository that you can install locally with `pip install r2r`, or run with Docker

Lightweight is good, and running it without having to deal with Docker is excellent.

But your quickstart guide is still huge! It feels very much not "quick". How do you:

* Install via Python

* Throw a folder of documents at it

* Have it sit there providing a REST API to get results?

Eg suppose I have an AI service already, so I throw up a private Railway instance of this as a Python app. There's a DB somewhere. As simple as possible. I can mimic it at home just running a local Python server. How do I do that? _That's_ the real quickstart.

ocolegro · 2024-06-26T16:05:07 1719417907

You are right that the quickstart is pretty large, we will think about how we can trim that and show only the essentials.

To do what you are requesting is pretty easy, you can just launch the server and use the client directly. The code would look like this:

```python
from r2r import R2RClient

base_url = "http://localhost:8000"  # or other
client = R2RClient(base_url)

# load my_file_paths (placeholder paths below; replace with your own documents)
my_file_paths = ["docs/example_1.html", "docs/example_2.pdf"]

response = client.ingest_files(file_paths=my_file_paths)
# optionally set metadata, document ids, etc. [https://r2r-docs.sciphi.ai/api-reference/endpoint/ingest_fil...]
```

vintagedave · 2024-06-26T16:21:46 1719418906

Thank you! I appreciate that, that's a good mini-start, ie quickstart :)

I have an AI service that I need to add RAG to, running as a direct Python server, and I can see running this as a second service being very useful. Much appreciated.

GTP · 2024-06-26T19:42:59 1719430979

How does this compare with Google's NotebookLM?

shreyaspgkr · 2024-06-26T21:27:26 1719437246

There are many exciting products that enable users to perform RAG on their own data, and the growing number of use cases highlights the need for developer-friendly tools to build such applications.

While building our own RAG system with existing tools, we encountered numerous challenges in experimentation, deployment, and analysis. This led us to create our own solution that is truly developer-friendly.

You can check our docs for more details: https://r2r-docs.sciphi.ai/introduction

mentos · 2024-06-26T19:41:30 1719430890

Seems like there is an opportunity to make this as easy to use as Dropbox.

ocolegro · 2024-06-26T20:27:23 1719433643

yes, I think so.

hdjsvdjue7 · 2024-06-26T16:31:27 1719419487

I can't wait to try it after work. How would one link it to ollama?

ocolegro · 2024-06-26T17:25:35 1719422735

See the guide here - https://r2r-docs.sciphi.ai/cookbooks/local-rag

We have instructions for getting set up and running w/ ollama. It should be pretty smooth.

wmays · 2024-06-26T14:55:01 1719413701

What’s the benefit over langchain? Or other bigger platforms?

ocolegro · 2024-06-26T18:15:33 1719425733

I'm just seeing this now.

The key advantages can be extracted from the response above to Kluless -

R2R is built around RESTful API and is dockerized, so devs can get started on app development immediately.

The system was designed so that devs can typically scale data ingestion up to provider bottlenecks w/out extra work.

We have implemented user-level permissions and high level document management alongside the vector db, which most devs need to build in a production setting, along with the API and data ingestion scaling.

Lastly, we also log every search and RAG completion that flows through the system. This is really important to find weaknesses and tune the system over time. Most devs end up needing an observability solution for their RAG.

All of these connect to an open source developer dashboard that allows you to see uploaded files, test different configs, etc.

These basic features mean that devs can spend more time on iterating / customizing their application specific features like custom data ingestion, hybrid search and advanced RAG.

revskill · 2024-06-27T05:16:17 1719465377

As long as it does not require OpenAI, it is good.

croes · 2024-06-27T18:25:24 1719512724

Here

https://r2r-docs.sciphi.ai/cookbooks/local-rag

haolez · 2024-06-26T21:45:41 1719438341

On a side note, is there an open source RAG library that's not bound to a rising AI startup? I couldn't find one and I have a simple in-house implementation that I'd like to replace with something more people use.

d4rkp4ttern · 2024-06-27T13:50:22 1719496222

You can have a look at Langroid[1], a multi-agent LLM framework from ex-CMU/UW-Madison researchers, in production-use at companies (some have publicly endorsed us). RAG is just one of the features, and we have a clean, transparent implementation in a single file, intended for clarity and extensibility. It has some state of the art retrieval techniques, and can be easily extended to add others. In the DocChatAgent the top level method for RAG is answer_from_docs , here's the rough pseudocode:

  answer_from_docs(query):
    extracts = get_relevant_extracts(query):
        passages = get_relevant_chunks(query):
            p1 = get_semantic_search_results(query)   # semantic/dense retrieval + learned sparse
            p2 = get_similar_chunks_bm25(query)       # lexical/sparse
            p3 = get_fuzzy_matches(query)             # lexical/sparse
            p = rerank(p1 + p2 + p3)        # rerank for lost-in-middle, diversity, relevance
            return p
        # use LLM to get verbatim relevant portions of passages if any
        extracts = get_verbatim_extracts(passages)          
        return extracts
    # use LLM to get final answer from query augmented with extracts
    return get_summary_answer(query, extracts)

[1] Langroid https://github.com/langroid/langroid

taylorbuley · 2024-06-26T19:49:31 1719431371

I could see myself considering this. And not just because it's got a great project name.