Show HN: R2R V2 – An open source RAG engine with prod features (github.com/sciphi-ai)
251 points by ocolegro 6 months ago | 71 comments
Hi HN! We're building R2R [https://github.com/SciPhi-AI/R2R], an open source RAG answer engine that is built on top of Postgres+Neo4j. The best way to get started is with the docs - https://r2r-docs.sciphi.ai/introduction.

This is a major update from our V1 which we have spent the last 3 months intensely building after getting a ton of great feedback from our first Show HN (https://news.ycombinator.com/item?id=39510874). We changed our focus to building a RAG engine instead of a framework, because this is what developers asked for the most. To us this distinction meant working on an opinionated system instead of layers of abstractions over providers. We built features for multimodal data ingestion, hybrid search with reranking, advanced RAG techniques (e.g. HyDE), automatic knowledge graph construction alongside the original goal of an observable RAG system built on top of a RESTful API that we shared back in February.

What's the problem? Developers are struggling to build accurate, reliable RAG solutions. Popular tools like LangChain are complex and overly abstracted, and they lack crucial production features such as user/document management, observability, and a default API. There was a big thread about this a few days ago: Why we no longer use LangChain for building our AI agents (https://news.ycombinator.com/item?id=40739982)

We experienced these challenges firsthand while building a large-scale semantic search engine, with users reporting numerous hallucinations and inaccuracies. This highlighted that search+RAG is a difficult problem. We're convinced that these missing features, and more, are essential to effectively monitor and improve such systems over time.

Teams have been using R2R to develop custom AI agents with their own data, with applications ranging from B2B lead generation to research assistants. Best of all, the developer experience is much improved. For example, we have recently seen multiple teams use R2R to deploy a user-facing RAG engine for their application within a day. By day 2 some of these same teams were using their generated logs to tune the system with advanced features like hybrid search and HyDE.

Here are a few examples of how R2R can outperform classic RAG with semantic search only:

1. "What were the UK's top exports in 2023?" R2R with hybrid search can identify documents that explicitly mention "UK exports" and "2023", whereas semantic search alone only surfaces related concepts like trade balances and economic reports. (A rough sketch of how the two result sets can be fused follows these examples.)

2. "List all YC founders that worked at Google and now have an AI startup." Our knowledge graph feature allows R2R to understand relationships between employees and projects, answering a query that would be challenging for simple vector search.

The built-in observability and customizability of R2R help you tune and improve your system long after launch. Our plan is to keep the API ~fixed while we iterate on the internal system logic, making it easier for developers to trust R2R for production from day 1.

We are currently working on the following: (1) Improving semantic chunking through third-party providers or our own custom LLMs; (2) Training a custom model for knowledge graph triple extraction that will make KG construction 10x more efficient (this is in private beta, please reach out if interested!); (3) Handling permissions at a more granular level than a single user; (4) LLM-powered online evaluation of system performance, plus enhanced analytics and metrics.

Getting started is easy. R2R is a lightweight repository that you can install locally with `pip install r2r`, or run with Docker. Check out our quickstart guide: https://r2r-docs.sciphi.ai/quickstart. Lastly, if it interests you, we are also working on a cloud solution at https://sciphi.ai.

Thanks a lot for taking the time to read! The feedback from the first ShowHN was invaluable and gave us our direction for the last three months, so we'd love to hear any more comments you have!




Do you also see the ingestion process as the key challenge for many RAG systems to avoid "garbage in, garbage out"? How does R2R handle accurate data extraction for complex and diverse document types?

We have a customer who has hundreds of thousands of unstructured and diverse PDFs (containing tables, forms, checkmarks, images, etc.), and they need to accurately convert these PDFs into markdown for RAG usage.

Traditional OCR approaches fall short in many of these cases, so we've started using a combined multimodal LLM + OCR approach that has led to promising accuracy and consistency at scale (ping me if you want to give this a try). The RAG system itself is not a big pain point for them, but the accurate and efficient extraction and structuring of the data is.
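Roughly, the shape of our pipeline looks like the sketch below. The specific libraries and model here (pdf2image, pytesseract, GPT-4o via the OpenAI SDK) are placeholders to illustrate the pattern, not necessarily what we run in production:

```python
# Hedged sketch of a combined OCR + multimodal-LLM pass over one PDF page.
import base64, io
from pdf2image import convert_from_path   # pip install pdf2image (needs poppler)
import pytesseract                         # pip install pytesseract (needs tesseract)
from openai import OpenAI

client = OpenAI()

def page_to_markdown(pdf_path: str, page_index: int = 0) -> str:
    page = convert_from_path(pdf_path)[page_index]    # render the page as an image
    ocr_text = pytesseract.image_to_string(page)      # rough OCR transcript
    buf = io.BytesIO()
    page.save(buf, format="PNG")
    image_b64 = base64.b64encode(buf.getvalue()).decode()

    # The OCR text anchors the exact characters; the image lets the model
    # recover layout, tables, and checkmarks that plain OCR loses.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Convert this page to clean Markdown. Use the OCR text "
                         "below to stay faithful to the exact wording:\n\n" + ocr_text},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```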


Anyone here exploring how to solve the extraction/parsing problem for RAG should try LLMWhisperer [1].

Try it with complex layout documents -> https://pg.llmwhisperer.unstract.com/

If anyone wants to solve RAG end to end - loading from the source, extraction, and sending processed data to a destination/API - try Unstract [2] (it is open source).

[1] https://unstract.com/llmwhisperer/

[2] https://github.com/Zipstack/unstract


We agree that ingestion and extraction are a big part of the problem for building high quality RAG.

We've talked to a lot of different developers about these problems and haven't found a general consensus on what features are needed, so we are still evaluating advanced approaches.

For now our implementation is more general and designed to work across a variety of documents. R2R was designed to be very easy to override with your own custom parsing logic for these reasons.

Lastly, we have been focusing a lot of our effort on knowledge graphs since they provide an alternative way to enhance RAG systems. We are training our own model for triples extraction that will combine with the automatic knowledge graph construction in R2R. We are planning to release this in the coming weeks and are currently looking for beta testers [we made a signup form, here - https://forms.gle/g9x3pLpqx2kCPmcg6 for anyone interested]


I'm actually curious what the common patterns for RAG have been. I see a lot of progress in tooling but I have seen relatively few use cases or practical architectures documented.


I want to second this. It seems like document chunking is the most difficult part of the pipeline at this point.

You gave the example of unstructured PDF, but there are challenges with structured docs as well. We’ve run into docs that are hard to chunk because of this deeply nested and repeated structure. For example, there might be a long experimental protocol with multiple steps; at the end of each step, there’s a table “Debugging” for troubleshooting anything that might have gone wrong in that step. The debugging table is a natural chunk, except that once chunked there are a dozen such tables that are semantically similar when decoupled from their original context and position in the tree structure of the document.

This is one example, but there are many other cases where key context for a chunk is nearby in a structured sense, but far away in the flattened document, and therefore completely lost when chunking.
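One mitigation is to carry the structural context into each chunk, e.g. prepending the path of headings above it, so each "Debugging" table stays tied to its step. A rough sketch, over a hypothetical parsed section tree:

```python
# Sketch of context-preserving chunking: each chunk is embedded together with
# the path of headings above it, so a "Debugging" table stays tied to its step.
# `doc_tree` is a hypothetical parsed tree of {"heading", "text", "children"}.

def flatten_with_breadcrumbs(section, path=()):
    crumb = path + (section["heading"],)
    yield {"breadcrumb": " > ".join(crumb), "text": section["text"]}
    for child in section.get("children", []):
        yield from flatten_with_breadcrumbs(child, crumb)

def to_chunks(doc_tree):
    for node in flatten_with_breadcrumbs(doc_tree):
        # Embedding breadcrumb + body makes "Debugging" under "Step 7: PCR"
        # distinguishable from the dozen other Debugging tables.
        yield f"{node['breadcrumb']}\n\n{node['text']}"
```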


Is this an example that could benefit from something like knowledge graph construction or structured entity extraction?

I'm just curious because we have theorized and seen in practice that extraction is a way to answer questions which require connected information across disparate chunks, like you can see in the simple cookbook here [https://r2r-docs.sciphi.ai/cookbooks/knowledge-graph].

Or do you think this is something that can just be solved with more advanced multimodal ingestion?


I think an LLM could be successful if it wasn't just textually aware, but also spatially aware. Like, we know these things just chew through forum posts like this one. Knowing where the user name is, where the body of text is, the submit button, etc., might be foundational to actual problem in, problem out.


I'm really interested in learning more about this (multimodal LLM + OCR approach for PDFs), do you have a writeup anywhere or something open source?


PaddleOCR seemed to be a good library for locating and translating text. I've been puzzling over how to translate something like a simple letter form into an LLM-translatable format.

I think the serious problem is that most of these LLMs are already built on top of garbage, so you're already at the "garbage in" stage and just trying to match that as best you can.


I built a library around this problem [1]. I recently did some experimenting with PaddleOCR but found the results very underwhelming (no spacing between text) - it seems heavily optimized for Chinese. There was a 3-year-old GitHub issue about it, and it seems like it still has this problem out of the box. I'd be curious to hear other people's experience with it.

[1] https://github.com/Filimoa/open-parse/


I run into the same issue with an internal company RAG: all the unstructured data is in PDFs, and even once converted to markdown, the documents still need fine-tuning and a lot of manual intervention.

It feels like we are inching closer to automating this type of thing, or at the very least brute-forcing it, like the LLM race is trying to do with bigger models and larger contexts.

Will have to play with this over a weekend and see what it might help me with :)


Awesome - interested to hear your thoughts / feelings after you get a chance to try it out.


Try sonnet 3.5 image understanding.


Have you tried it out yet? How does it compare with GPT-4o?


Danswer supports PDF natively; I've been trialing it and it works pretty well.


This is excellent. I have been running a very similar stack for 2 years, and you've got all the tricks of the trade: pgvector, HyDE, web search + document search. Good dashboard with logs and analytics.

I am leaving my position, and I recommended this to basically replace me with a junior dev who can just hit the API endpoints.


As someone with no experience with RAG in production, I'm curious how effective you've found HyDE to be in practice.


I can't answer for the kindly poster above (ty), but from our experience techniques like HyDE are great when you are getting a lot of comparative questions.

For instance, if a user asks "How does A compare to B?" then the query expansion element of HyDE is incredibly useful. The actual value of translating queries into answers for embedding is less clear, since most embedding models we are using have already been fine-tuned to map queries onto answers.
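For anyone unfamiliar, the core HyDE loop is small. A minimal sketch (the OpenAI models here are placeholders, and vector_search stands in for whatever store you query):

```python
# Minimal HyDE sketch: embed a hypothetical answer rather than the raw query.
from openai import OpenAI

client = OpenAI()

def hyde_search(query: str, vector_search, top_k: int = 10):
    # 1. Ask an LLM to write a plausible (possibly wrong) answer.
    draft = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Write a short passage that answers: {query}"}],
    ).choices[0].message.content

    # 2. Embed the hypothetical answer; it tends to sit closer to real answer
    #    passages in embedding space than the bare question does.
    embedding = client.embeddings.create(
        model="text-embedding-3-small", input=[draft]
    ).data[0].embedding

    # 3. Retrieve with that embedding.
    return vector_search(embedding, limit=top_k)
```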


Not GP, but HyDE is a crutch for poor semantic indexing, imho. Most people just take raw chunks and embed those. You really need a robust preprocessing pipeline.


The quick start is definitely not quick. You really should provide a batteries-included docker compose with a Postgres image (docker.io/tensorchord/pgvecto-rs:pg14-v0.2.0).

If I want to use the dashboard, I have to clone another repo? 'git clone git@github.com:SciPhi-AI/R2R-Dashboard.git'? Why not make it available as a docker container, so that if I'm only interested in RAG I can plug into a docker container for the dashboard?

This project feels like a collection of a lot of things that isn't really providing any extra ease of development. It feels more like joining a new company and trying to track down all the repos and set everything up.

This really looks cool, but I'm struggling to figure out if it's an SDK or a suite of apps or both. In the latter case the suite of apps is really confusing: if I still have to write all the Python, then it feels more like an SDK?

Perhaps provide a better "1 click" install experience to preview/showcase all the features, and then let devs leverage R2R later...


thanks, this is really solid feedback - we will make a more inclusive docker image to make the setup easier/faster.

Think of R2R as an SDK with an out of the box admin dashboard / playground that you can plug into.


The installation instructions should be:

1. Download this docker compose file.

2. Run docker compose using this command.

3. Upload your first file (or folder) of content using this command.

It's fine to have to pip install the client, but it might be worth also providing an example curl command for uploading an HTML/text/PDF file.
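Something along these lines would do it (the route and form field here are guesses rather than R2R's verified API, so treat it as the shape of the command):

```
curl -X POST http://localhost:8000/ingest_files \
  -F "files=@./my_document.pdf"
```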

The quickstart confused me because it started with python -m r2r.quickstart.example or something. It wasn't clear why I need to run some quickstart example, or how I would specify the location of my doc(s) or what command to run to index docs for real. Sure I could go read the source, but then it's not really a quick start.

Also it would be good to know:

- how to control chunk size when uploading a new document

- what type(s) of search are supported. You mention something about hybrid search, but the quickstart example doesn't explain how to choose the type of search (I guess it defaults to vector search).

HTH


Thanks I agree that would be a more streamlined introduction.

The quickstart clearly has too much content in retrospect, and the feedback here makes it clear we should simplify.


new docs are out if anyone was still wanting that, thanks.


GP quote

<< 1. Download this docker compose file.
<< 2. Run docker compose using this command.
<< 3. Upload your first file (or folder) of content using this command.

I think I will throw in the towel for now (tomorrow is just a regular workday and I need some sleep :D). I went the docker route with local Ollama. Everything seems up, but I get an almost empty page.

To your point, I did not see the stuff GP asked for (this is the file, this is how you run it, and so on). If I missed that, please let me know. I might be going blind at this point.

Will try again tomorrow, sleep well HN.


I did follow up and try this and all my issues are resolved. Thanks!


Do you really need pgvecto-rs? It isn't supported on RDS, Google, Azure, etc. It complicates deployment everywhere.


This looks great, will be giving it a shot today. Not to throw cold water on the release, but I have been looking at different RAG platforms. Anyone have any insight into which is the flagship?

It really seems like document chunking is not a problem that can be solved well generically. And RAG really hinges on which documents get retrieved/the correct metadata.

Current approaches to this seem to use a reranker, where we fetch a ton of information and prune it down. But document splitting is still tough, especially when you start to add transcripts of videos that can be a few hours long.
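For reference, the reranking step itself is only a few lines with an off-the-shelf cross-encoder; the model below is just one common public checkpoint, not a recommendation:

```python
# Sketch of the "fetch a lot, then prune" pattern with a cross-encoder reranker.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    # Score every (query, candidate) pair jointly, then keep the best few.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```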


I've been interested in building RAG for my documents, but as an academic project I do not have the funds for the costly APIs that a lot of RAG projects out there depend on - not just the LLM part, but also the reranking, chunking, etc., like those from Cohere.

Can R2R be built with all processing steps implementing local "open" models?


Yes, there is a guide to running R2R with local models here - https://r2r-docs.sciphi.ai/cookbooks/local-rag


awesome!


I've checked out quite a few RAG projects now, and what I haven't seen really solved is ingestion; it's usually like "this is an endpoint or some connectors, have fun!".

How do I do a bulk/batch ingest of say, 10k html documents into this system?


All the pipelines are async, so for ingestion we have typically seen that R2R can saturate the vector db or embedding provider. We don't yet have backpressure so it is up to the client to rate limit.

Ingestion is pretty straightforward, you can call R2R directly or use the client-server interface to pass the html files in directly to the ingest_files endpoint (https://r2r-docs.sciphi.ai/api-reference/endpoint/ingest_fil...).
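For 10k HTML files, something like this works with the same client shown in the quickstart; the batch size and sleep are arbitrary client-side throttling, since R2R doesn't apply backpressure yet:

```python
# Sketch of bulk ingestion in batches via the client-server setup.
import glob, time
from r2r import R2RClient

client = R2RClient("http://localhost:8000")
html_files = glob.glob("corpus/**/*.html", recursive=True)   # ~10k documents

BATCH = 100
for i in range(0, len(html_files), BATCH):
    client.ingest_files(file_paths=html_files[i:i + BATCH])
    time.sleep(1)  # crude rate limit so we don't saturate the embedding provider
```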

The data parsers are all fairly simple and easy to customize. Right now we use bs4 for handling HTML but have been considering other approaches.

What specific features around ingestion have you found lacking?


Thanks, I’ll give it a try!


I'd like to know this too. A quick: "take these docs as input, ingest and save, now sit there providing an API to get results" service guide.


Take a look here - https://r2r-docs.sciphi.ai/quickstart#ingest-data and here https://r2r-docs.sciphi.ai/cookbooks/client-server#ingest-do...

Since multiple people have requested this, we are pushing a quick change to emphasize it in the docs.


Thank you. My own comment giving a quickstart scenario was downvoted :( (https://news.ycombinator.com/item?id=40801453) but I saw you kindly replied to it! Thank you, I appreciate that.


LlamaIndex can ingest directories if you want to do bulk.


What do you want to do with the data after ingesting?


“ What were the UK's top exports in 2023?"

"List all YC founders that worked at Google and now have an AI startup."

How to check the accuracy of the answers? Is there some kind of a detailed trace of how the answer was generated?


Great question - I can talk about how we handle the more challenging one, "List all YC founders that worked at Google and now have an AI startup."

For this we have a target dataset (the YC company directory) with around 100 questions over it. We have found that when feeding in an entire company listing along with a single question, we can get an accurate single answer (a needle-in-a-haystack problem).

So to build our evaluation dataset we feed each question with each sample into the cheapest LLM we can find that reliably handles the job. We then aggregate the results.

This is not perfect, but it gives us a way to benchmark our knowledge graph construction and querying strategy so that we can tune the system ourselves.
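In pseudo-ish Python, the labeling pass looks roughly like this (the model name and the listing fields are placeholders, not our exact setup):

```python
# Rough sketch of the per-question / per-record labeling pass described above:
# ask a cheap LLM one yes/no question per (question, company listing) pair and
# aggregate into a ground-truth set for benchmarking the KG answers.
from openai import OpenAI

client = OpenAI()

def label(question: str, listing_text: str) -> bool:
    answer = client.chat.completions.create(
        model="gpt-4o-mini",   # stand-in for "cheapest model that reliably handles it"
        messages=[{"role": "user",
                   "content": f"Listing:\n{listing_text}\n\nQuestion: {question}\n"
                              "Answer strictly YES or NO."}],
    ).choices[0].message.content
    return answer.strip().upper().startswith("YES")

def build_ground_truth(questions, listings):
    # `listings` is assumed to be a list of {"name": ..., "text": ...} records.
    return {q: [l["name"] for l in listings if label(q, l["text"])]
            for q in questions}
```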


OK, so you have a way to evaluate the accuracy and convince yourself that it probably works as expected. But what about me, a user? How can I check that the question I asked was answered correctly?


I think there's no substitute for doing your own research and comparing the results.


I just want to avoid putting one black box on top of another if possible.


Could you provide more details on the multimodal data ingestion process? What types of data can R2R currently handle, and how are non-text data types embedded? Can the ingestion be streaming from logs?


Currently R2R has out of the box logic for the following:

csv, docx, html, json, md, pdf, pptx, txt, xlsx, gif, jpg, png, svg, mp3, mp4.

There are a lot of good questions around ingestion today, so we will likely figure out how to intelligently expand this.

For mp3s we use Whisper to transcribe; for videos we transcribe the audio with Whisper and sample frames to "describe" with a multimodal model. For images we again convert to a thorough text description - https://r2r-docs.sciphi.ai/cookbooks/multimodal
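A rough sketch of that audio/video path, with illustrative library choices (openai-whisper and OpenCV) rather than our exact implementation:

```python
# Hedged sketch of the audio/video ingestion path described above.
import cv2          # pip install opencv-python
import whisper      # pip install openai-whisper

def transcribe_audio(path: str) -> str:
    model = whisper.load_model("base")
    return model.transcribe(path)["text"]

def sample_frames(video_path: str, every_n_seconds: int = 30):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % int(fps * every_n_seconds) == 0:
            frames.append(frame)   # each sampled frame then gets a text "description"
        idx += 1                   # from a multimodal model before embedding
    cap.release()
    return frames
```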

We have been testing multimodal embedding models and open source models to do the description generation. If anyone has suggestions on SOTA techniques that work well at scale, we would love to chat and work to implement them. In the long run we'd like the system to be able to handle multimodal data locally.


Interesting. Can you talk a bit about how the process is faster/better optimized for the dev teams? Sounds like there's a big potential to accelerate time to MVP.


Sure, happy to.

R2R is built around a RESTful API and is dockerized, so devs can get started on app development immediately.

The system was designed so that devs can typically scale data ingestion up to provider bottlenecks w/out extra work.

We have implemented user-level permissions and high-level document management alongside the vector db - things most devs would otherwise need to build themselves in a production setting, on top of the API and data ingestion scaling.

Lastly, we also log every search and RAG completion that flows through the system. This is really important to find weaknesses and tune the system over time. Most devs end up needing an observability solution for their RAG.

All of these connect to an open source developer dashboard that allows you to see uploaded files, test different configs, etc.

These basic features mean that devs can spend more time on iterating / customizing their application specific features like custom data ingestion, hybrid search and advanced RAG.


Is there a way to work with source code? I've been looking for a RAG solution that can understand the graph of code, for example "what analytics events get called when I click submit?"


No, we don't have any explicit code graph tools. Sourcegraph might be a good starting point for you; their SCIP indices are pretty nice.


Have you integrated with any popular chat front-ends, e.g. OpenWebUI?


No not yet, I've had difficulty getting these different providers to work together on integrations. If you have any suggestions we are all ears.

In the meantime we've built our own dashboard which shows ingested documents and has a customizable chat interface - https://github.com/SciPhi-AI/R2R-Dashboard.

It's still a bit rough though.


Get Neo4j out and count me in. No need for that resource hog.


It's an optional dep used for knowledge graphs.


What about swapping out Neo4j for EdgeDB? Then you get to keep using Postgres with pgvector, and get the knowledge graph all in one shot.


> R2R is a lightweight repository that you can install locally with `pip install r2r`, or run with Docker

Lightweight is good, and running it without having to deal with Docker is excellent.

But your quickstart guide is still huge! It feels very much not "quick". How do you:

* Install via Python

* Throw a folder of documents at it

* Have it sit there providing a REST API to get results?

Eg suppose I have an AI service already, so I throw up a private Railway instance of this as a Python app. There's a DB somewhere. As simple as possible. I can mimic it at home just running a local Python server. How do I do that? _That's_ the real quickstart.


You are right that the quickstart is pretty large; we will think about how we can trim it down and show only the essentials.

Doing what you are requesting is pretty easy: you can just launch the server and use the client directly. The code would look something like this (with placeholder file paths):

```python
from r2r import R2RClient

base_url = "http://localhost:8000"  # or wherever the server is running

client = R2RClient(base_url)

# gather the paths of the documents you want to ingest (placeholders shown)
my_file_paths = ["path/to/doc_1.pdf", "path/to/doc_2.html"]

response = client.ingest_files(file_paths=my_file_paths)
# optionally set metadata, document ids, etc. [https://r2r-docs.sciphi.ai/api-reference/endpoint/ingest_fil...]
```


Thank you! I appreciate that; that's a good mini-start, i.e. quickstart :)

I have an AI service that I need to add RAG to, running as a direct Python server, and I can see running this as a second service being very useful. Much appreciated.


How does this compare with Google's NotebookLM?


There are many exciting products that enable users to perform RAG on their own data, and the growing number of use cases highlights the need for developer-friendly tools to build such applications.

While building our own RAG system with existing tools, we encountered numerous challenges in experimentation, deployment, and analysis. This led us to create our own solution that is truly developer-friendly.

You can check our docs for more details: https://r2r-docs.sciphi.ai/introduction


Seems like there is an opportunity to make this as easy to use as Dropbox.


yes, I think so.


I can't wait to try it after work. How would one link it to ollama?


See the guide here - https://r2r-docs.sciphi.ai/cookbooks/local-rag

We have instructions for getting set up and running with Ollama. It should be pretty smooth.


What’s the benefit over langchain? Or other bigger platforms?


I'm just seeing this now.

The key advantages are covered in my response above to Kluless: R2R is built around a RESTful API and is dockerized, so devs can start on app development immediately; ingestion typically scales up to provider bottlenecks without extra work; user-level permissions and document management come built in alongside the vector db; every search and RAG completion is logged so you can find weaknesses and tune the system over time; and all of this connects to an open source developer dashboard. The net effect is that devs spend their time on application-specific features like custom data ingestion, hybrid search, and advanced RAG instead of rebuilding these basics.


As long as it does not require OpenAI, it is good.



On a side note, is there an open source RAG library that's not bound to a rising AI startup? I couldn't find one and I have a simple in-house implementation that I'd like to replace with something more people use.


You can have a look at Langroid [1], a multi-agent LLM framework from ex-CMU/UW-Madison researchers, in production use at companies (some have publicly endorsed us). RAG is just one of the features, and we have a clean, transparent implementation in a single file, intended for clarity and extensibility. It has some state-of-the-art retrieval techniques and can be easily extended to add others. In DocChatAgent the top-level method for RAG is answer_from_docs; here's the rough pseudocode:

  answer_from_docs(query):
    extracts = get_relevant_extracts(query):
        passages = get_relevant_chunks(query):
            p1 = get_semantic_search_results(query)   # semantic/dense retrieval + learned sparse
            p2 = get_similar_chunks_bm25(query)       # lexical/sparse
            p3 = get_fuzzy_matches(query)             # lexical/sparse
            p = rerank(p1 + p2 + p3)        # rerank for lost-in-middle, diversity, relevance
            return p
        # use LLM to get verbatim relevant portions of passages if any
        extracts = get_verbatim_extracts(passages)          
        return extracts
    # use LLM to get final answer from query augmented with extracts
    return get_summary_answer(query, extracts)
[1] Langroid https://github.com/langroid/langroid


I could see myself considering this. And not just because it's got a great project name.



