Show HN: Repo2vec – an open-source library for chatting with any codebase (github.com/storia-ai)
93 points by nutellalover 4 months ago | 54 comments
Hi HN, we're excited to share repo2vec: a simple-to-use, modular library enabling you to chat with any public or private codebase. It's like GitHub Copilot but with the most up-to-date information about your repo.

We made this because sometimes you just want to learn how a codebase works and how to integrate it, without spending hours sifting through the code itself.

We tried to make it dead-simple to use. With two scripts, you can index and get a functional interface for your repo. Every generated response shows where in the code the context for the answer was pulled from.

We also made it plug-and-play: every component, from the embeddings to the vector store to the LLM, is completely customizable.

If you want to see a hosted version of the chat interface and its features, here's a video: https://www.youtube.com/watch?v=CNVzmqRXUCA

We would love your feedback!

- Mihail and Julia




Very useful! I was just thinking this kind of thing should exist!

I would also like to be able to have the LLM know all of the documentation for any dependencies in the same way.


OP's cofounder here. The nice thing is that a lot of repos include the documentation as well, so it comes for free by simply indexing the repo (like huggingface/transformers for instance).


Thanks!

This is a great idea. Definitely something we plan to support.


I want to feed it not only the code but also a corpus of questions and answers, e.g. from the discussions page on GitHub. Is that possible?


Thanks for the request! This is on our roadmap, as is supporting GitHub issues and eventually external documentation/code discussions from Slack, Jira/Linear, etc.


Feel free to submit an issue on the repo and we'll get to it!


I just need to have gemini 1.5 pro in VS code dev environment and pass in the entire codebase in the context window. THEY STILL HAVEN'T DONE THIS.


Depending on how large your codebase is, that could get pricey, at least for now. But it's probably just a matter of time until it all gets dirt cheap.


Definitely agree that the trend is toward lower cost where a lot of these use-cases are unlocked. Especially as all the major 3rd party LLM providers scramble to ship better models to retain mind-share.


Very cool project, I'm definitely going to try this out. One question: why use the OpenAI embeddings API instead of BGE (BERT) or another embedding model that can be run efficiently client-side? Was there a quality difference or did you just default to using OpenAI embeddings?


OP's cofounder here. For us, OpenAI embeddings worked best. When building a system that has many points of failure, I like to start with the highest quality ones (even if they're expensive / lack privacy) just to get an upper threshold of how good the system can be. Then start replacing pieces one by one and measure how much I'm losing in quality.

P.S. I worked on BERT at Google and have PTSD from how much we tried to make it work for retrieval, and it never really did well. Don't have much experience with BGE though.


Understood, thanks for the clear answer. Very cool that you worked on BERT at Google — thank you (and your team) for all of the open source releasing and publishing you've done over the years.

I'm using OpenAI embeddings right now in my own project and I'm asking because I'd like to evaluate other embedding models that I can run in/adjacent-to my backend server, so that I don't have to wait 200ms to embed the user's search phrase/query. I'm very impressed by your project and I thought I might save myself some trouble if you had done some clear evals and decided OpenAI is far-and-away better :)


I wish you could tell the stories of how you eval'ed BERT at Google. Sounds meaty.


Retrieval is rarely ever evaluated in isolation. Academics would indirectly evaluate it by how much it improved question answering. The really cool thing at Google is that there were so many products and use cases (beyond the academic QA benchmarks) that would indirectly tell you if retrieval is useful. Much harder to do for smaller companies with a smaller suite of products and user bases.


We ran some qualitative tests and there was a quality difference. In fact, benchmarks show that trend to generally hold: https://archersama.github.io/coir/

That being said, our goal was to make the library modular so you can easily add support for whatever embeddings you want. Definitely encourage experimenting for your use-case because even in our tests, we found that trends which hold true in research benchmarks don't always translate to custom use-cases.


> we found that trends which hold true in research benchmarks don't always translate to custom use-cases.

Exactly why I asked! If you don't mind a followup question, how were you evaluating embeddings models — was it mostly just vibes on your own repos, or something more rigorous? Asking because I'm working on something similar and based on what you've shipped, I think I could learn a lot from you!


Happy to help!

At the beginning, we started with qualitative "vibe" checks where we could iterate quickly and the delta in quality was still so significant that we could obviously see what was performing better.

Once we stopped trusting our ability to discern differences, we actually bit the bullet and made a small eval benchmark set (~20 queries across 3 repos of different sizes) and then used that to guide algorithmic development.
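
A small benchmark like the one described can be scored with a simple retrieval metric. The following is a minimal sketch, assuming each query is labeled with the set of files that should be retrieved; `recall_at_k`, `evaluate`, and the shape of the benchmark are illustrative names, not part of repo2vec.

```python
from typing import Callable, Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant files that appear in the top-k retrieved results."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def evaluate(
    retriever: Callable[[str], List[str]],
    benchmark: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Mean recall@k over a benchmark mapping each query to its relevant files.

    `retriever` is a hypothetical callable returning file paths ranked
    from most to least relevant for a query.
    """
    if not benchmark:
        return 0.0
    scores = [
        recall_at_k(retriever(query), relevant, k)
        for query, relevant in benchmark.items()
    ]
    return sum(scores) / len(scores)
```

Running `evaluate` once per embedding model on the same benchmark gives a single number to compare, which helps once the differences are too small to judge by eye.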


Thank you, I appreciate the details.


We have LLMs with context windows of hundreds of thousands of tokens, and prompt caching that makes using them affordable. Why don’t we just stuff the whole codebase into the context window?


This paper shows that 200-800 is the ideal chunk size; if you go above, the model starts getting confused / distracted. https://arxiv.org/pdf/2406.14497


Makes sense. Thanks!


The truth is we started there. But for any reasonably-sized, complex codebase this just isn't going to work as the context window isn't sufficient and moreover it becomes harder for the LLM to reason over arbitrary parts of the context.

For the time being, indexing and retrieving a good collection of 10-20 code chunks is more effective/performant in practice.


Not an expert, but OP is right and this is generally a known issue with large windows and RAG. Small chunks are usually best. How you chunk is also important. OP - what’s the best way to parse/chunk code snippets?


You can use the AST to chunk the code: https://docs.sweep.dev/blogs/chunking-2m-files


We're using an improvement over this exact blogpost actually. We started from there, but weren't happy that some of the chunks were really small (and they would undeservedly get surfaced to the top). So we added some extra logic to merge the siblings if they're small.

https://github.com/Storia-AI/repo2vec/blob/1864102949e720320...
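
To illustrate the idea of merging small sibling chunks, here is a minimal sketch. It is not the linked repo2vec code: it uses Python's standard `ast` module instead of tree-sitter, handles only Python source, and the `chunk_source` name and `max_chars` limit are assumptions made for the example.

```python
import ast
from typing import List


def chunk_source(source: str, max_chars: int = 800) -> List[str]:
    """Split Python source into chunks of top-level AST nodes,
    greedily merging adjacent siblings while they fit in max_chars."""
    tree = ast.parse(source)
    lines = source.splitlines()

    # One candidate span per top-level statement (import, function, class, ...).
    spans = []
    for node in tree.body:
        start = node.lineno - 1
        # Decorators sit above the def/class line, so include them in the span.
        decorators = getattr(node, "decorator_list", [])
        if decorators:
            start = min(d.lineno for d in decorators) - 1
        spans.append("\n".join(lines[start:node.end_lineno]))

    # Merge neighbouring spans so tiny nodes do not become tiny chunks.
    chunks: List[str] = []
    current = ""
    for text in spans:
        if current and len(current) + 1 + len(text) > max_chars:
            chunks.append(current)
            current = text
        else:
            current = f"{current}\n{text}" if current else text
    if current:
        chunks.append(current)
    return chunks
```

This sketch keeps an oversized node as a single chunk and drops comments and blank lines between top-level nodes; a fuller version would recurse into large nodes and split them too.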


Is it somehow different from Cursor codebase indexing/chat? I’m using this setup to analyse repos currently.


Big fans of Cursor ourselves. One of the goals with this library is to make it easy for maintainers of OSS projects to expose chat support functionality to their users in a very streamlined, easy-to-setup fashion.

So yes, you can certainly use it to index and query your own repos for yourself, but it's also a way to get more of your OSS lib users onboarded.


Sorry for the dumb question but can I use this on private repositories or is it sending my code to OpenAI?


Out of interest, are you worried that OpenAI would go against their API license terms and train on your data anyway, or are you worried that they might log your data and then have a security breach that exposes it to malicious attackers?


I think people simply worry that calling OpenAI on a lower-priced plan would cause the data to be scanned for training purposes.


Their API terms and conditions say they won't do that.

I'm fascinated by how little people trust them!


Terms and conditions only mean something if you have the money and patience to hold someone’s feet to the fire.

If I’m a CTO figuring out how to enable my team, I care a great deal about whether or not our private code is going to OpenAI.


I'm confident they don't want your code in their training data. The amount they have to lose if they're found to be using customer code as training data is enormous. Plus there are no guarantees that your code is good for training a model - model providers have been focusing much more heavily on quantity rather than quality of training data recently.

(Worrying that they may log your data and then have a security breach is a different matter - that's a reasonable concern, they've had security bugs in the past.)

I call this the AI trust crisis: people absolutely won't believe AI companies that say they won't train on their data: https://simonwillison.net/2023/Dec/14/ai-trust-crisis/


Quality over quantity, rather?


Yes, that's what I meant! Too late to edit now.


This is a great read Simon.


All of the above. I’m not overly worried. But it’s surprising that they don’t mention it anywhere.


You can certainly apply it to a private repo. If you want to ensure data stays local, you would have to add support for an OSS embedding/LLM model (of which there are many good offerings to pick from).


Please update TL;DR: repo2vec is a simple-to-use, modular library enabling you to chat with any public or private codebase "by sending data to OpenAI."


Thanks for the note! We welcome contributions!


This looks super cool! Is there currently a limit to how big a repo can be for this to work efficiently?


We noticed an interesting phenomenon related to the size of the repo. The bigger it is, the more its utility skews towards learning how to use the library as opposed to how to change it, i.e. for the big repos the chat is more useful for users than developers/maintainers.


Great question. For most small repos (10-20 source files) this works incredibly well out-of-the-box.

We stress-tested with repos like langchain, llamaindex, kubernetes and there the retrieval still needs work to effectively return relevant chunks. This is still an open research question.


Is this for a specific language? Does it support polyglot projects (multiple languages in one project)?


Yup! We use tree-sitter and parse at the file level.


Any plans on allowing the use of a local LLM like Ollama or LM Studio?


OP's cofounder here. Yes, we started with what we perceived as highest quality (OpenAI embeddings + Claude autocompletions), but will definitely make our way to local/OSS. The code is super modular so hopefully the community will help as well.


Super easy to use! Thanks! What's powering this under the hood?


The starter config is OpenAI embeddings + LLM, a Pinecone vector store, and Gradio for the UI. But it's customizable, so you can easily swap out whatever you want.


What is Pinecone used for? I would assume that an average repo yields only a few hundred or a few thousand chunks. Even with brute-force similarity search, that is just 2-digit milliseconds on CPU. Faster than any API call. And even if you got into the million-chunk scale, there’s FAISS and HNSW. So wouldn’t outsourcing this to an external provider be not only unnecessary, but also make things slower?
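
For reference, the brute-force search mentioned here can be sketched in a few lines. This is a dependency-free illustration, assuming the chunk embeddings are already held in memory as lists of floats; `cosine` and `top_k` are illustrative names, and a real setup would more likely use NumPy or FAISS for speed.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query: Sequence[float],
    chunk_vectors: Sequence[Sequence[float]],
    k: int = 10,
) -> List[Tuple[int, float]]:
    """Score every chunk against the query and return the k best (index, score) pairs."""
    scored = [(i, cosine(query, vec)) for i, vec in enumerate(chunk_vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The cost is one pass over all chunk vectors per query, so it grows linearly with the number of chunks.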


I wonder if it will work on https://github.com/organicmaps/organicmaps

So far, two similar solutions I tested crapped out on non-ASCII characters, because Python's UTF-8 decoder is quite strict about it.
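
The strictness in question is Python's default `errors="strict"` mode, which raises on any byte sequence that is not valid UTF-8. A minimal sketch of a more tolerant reader follows; `decode_lenient` is an illustrative name, and whether replacing bad bytes is acceptable depends on the use case.

```python
def decode_lenient(data: bytes) -> str:
    """Decode bytes as UTF-8, substituting U+FFFD for invalid sequences
    instead of raising UnicodeDecodeError."""
    return data.decode("utf-8", errors="replace")


def decode_strict(data: bytes) -> str:
    """Default behaviour: raises UnicodeDecodeError on invalid UTF-8."""
    return data.decode("utf-8")
```

Valid non-ASCII UTF-8 text decodes identically in both modes; only files in another encoding (for example Latin-1) or with corrupted bytes behave differently.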


OP's cofounder here. Thanks for pointing out this test case. It surfaced that we weren't handling symlinks properly. With this fix, I was able to successfully embed and index most of the repo (though I stopped at 100 embedding jobs so that we don't burn through OpenAI credits).

P.S. You'll see a bunch of warnings for e.g. binary files that are ignored. https://github.com/Storia-AI/repo2vec/commit/1864102949e7203...
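
As an illustration of the two cases mentioned (symlinks and binary files), here is a minimal sketch of a repository walk that skips both. It is not the code from the linked commit; `iter_text_files` is an illustrative name, and the check for a NUL byte in the first 1024 bytes is a common heuristic for detecting binary files, assumed here for the example.

```python
import os
from typing import Iterator


def iter_text_files(root: str) -> Iterator[str]:
    """Yield paths of regular, text-like files under root.

    Symlinked directories are not descended into, symlinked files are
    skipped, and files containing a NUL byte near the start are treated
    as binary and skipped.
    """
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # avoid indexing the same content twice or hitting broken links
            with open(path, "rb") as f:
                head = f.read(1024)
            if b"\0" in head:
                continue  # heuristic: NUL bytes rarely appear in source text
            yield path
```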


OP here! I love this stress test. Will index and get back to you!


Is there a Docker image?



