Show HN: Danswer – Open-source question answering across all your docs (github.com/danswer-ai)
189 points by Weves on July 10, 2023 | 66 comments
My friend and I have been feeling frustrated at how inefficient it is to find information at work. There are so many tools (Slack, Confluence, GitHub, Jira, Google Drive, etc.) and they provide different (often not great) ways to find information. We thought maybe LLMs could help, so over the last couple months we've been spending a bit of time on the side to build Danswer.

It is an open source, self-hosted search tool that allows you to ask questions and get answers across common workspace apps AND your personal documents (via file upload / web scraping)! Full demo here: https://www.youtube.com/watch?v=geNzY1nbCnU&t=2s.

The code (https://github.com/danswer-ai/danswer) is open source and permissively licensed (MIT). If you want to try it out, you can set it up locally with just a couple of commands (more details in our docs - https://docs.danswer.dev/introduction). We hope that someone out there finds this useful!

We’d love to hear from you in our Slack (https://join.slack.com/t/danswer/shared_invite/zt-1u3h3ke3b-...) or Discord (https://discord.gg/TDJ59cGV2X). Let us know what other features would be useful for you!




I maintain an open source documentation platform, for which I had received a few queries about AI tooling. I'm not into the AI world of development, and my tech stack & distribution approach aren't great to provide AI friendly tech in my project itself, but connecting to external applications that can consume/combine multiple sources seemed like a good potential approach.

I came across Danswer a few days ago as an option for this, so I spent a day building a connector [1]. I was pleasantly surprised by how accurate the output was for something like this. I have a few pages detailing my servers, and I could ask things like "Where is x server hosted?" and get a correct response accompanied by a link to the right source page.

Some things to be aware of specifically about Danswer: it only works with OpenAI right now, although the team said that open model support is an important future focus. Additionally, it felt fairly heavy to run and required a 30-minute Docker build, but I think they've since improved on this with pre-built images, and I'm not familiar with the usual requirements/weight of this kind of tech. Otherwise, things were easy to start up and play around with, even for an AI noob like me. Both their web and text-upload source connectors worked without issue in my testing.

[1]: https://github.com/danswer-ai/danswer/pull/139


There are a couple open source projects that expose llama.cpp and gpt4j models via a compatible OpenAI API. This is one of them: https://github.com/lhenault/simpleAI
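Because such servers speak the OpenAI wire format, pointing an existing OpenAI client at one is mostly a config change. A minimal sketch, assuming the 2023-era `openai` Python SDK (pre-1.0); the port and model name are placeholders for whatever the local server is configured to serve:

```python
import openai

# Point the client at the local OpenAI-compatible server instead of api.openai.com.
openai.api_base = "http://localhost:8080/v1"
openai.api_key = "unused"  # no real key needed for a local server

# The rest of the application code stays unchanged, e.g.:
# completion = openai.Completion.create(model="llama-7b", prompt="Hello")
```

This is why "OpenAI-compatible" shims are such a useful bridge: tools hard-coded against the OpenAI SDK gain local-model support without code changes.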


Nowadays falcon-40b is probably more accurate than gpt4j - here's hoping we get llama.cpp support for falcon builds soon [0]!

[0]: https://github.com/ggerganov/llama.cpp/issues/1602


The GGLLM fork seems to be the leading falcon option for now [1].

It comes with its own variant of the GGML sub-format, "ggcv1", but there are quants available on HF [2].

Although if you have a GPU, I'd go with the newly released AWQ quantization instead [3] - the performance is better.

(I may or may not have a mild local LLM addiction - and video cards cost more than drugs)

[1] https://github.com/cmp-nct/ggllm.cpp

[2] https://huggingface.co/TheBloke/falcon-7b-instruct-GGML

[3] https://huggingface.co/abhinavkulkarni/tiiuae-falcon-7b-inst...


In my experience the QA-with-documents pattern is fairly straightforward to implement. 90% of the effort to get to a performant system, however, goes into massaging the documents into semantically meaningful chunks. Most business documents, unlike blog posts and news articles, are not just running text. They have a lot of implicit structure that, when lost (as it is with typical naive chunkers), takes much of the contextualized meaning with it.


Agree with the point about intelligent chunking being very important! Each individual app connector can choose how it wants to split each `document` into `section`s (important point: this is customized at an app-level). The default chunker then keeps each section as part of a single chunk as much as possible. The goal here is, as you mentioned, to give each chunk the relevant surrounding context.

Additionally, the indexing process is set up as a composable pipeline under the hood. It would be fairly trivial to plug in different chunkers for different sources as needed in the future.
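The section-preserving idea can be sketched roughly like this (names like `Section` and `chunk_sections` are illustrative, not Danswer's actual API): keep each section inside a single chunk where possible, and only hard-split sections that exceed the chunk size on their own.

```python
from dataclasses import dataclass


@dataclass
class Section:
    """One semantically coherent piece of a document, as chosen by the connector."""
    text: str


def chunk_sections(sections: list[Section], max_chars: int = 512) -> list[str]:
    chunks: list[str] = []
    current = ""
    for section in sections:
        if len(current) + len(section.text) <= max_chars:
            # Section fits alongside what we have: keep its context together.
            current += section.text
        else:
            if current:
                chunks.append(current)
            # Only sections larger than a whole chunk get hard-split.
            text = section.text
            while len(text) > max_chars:
                chunks.append(text[:max_chars])
                text = text[max_chars:]
            current = text
    if current:
        chunks.append(current)
    return chunks
```

The key property is that a chunk boundary falls between sections whenever it can, so retrieval sees each section with its surrounding context intact.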


Chunking is very important but might, I feel, best be contextualised as one aspect of the bigger substantive challenge, which is how to prevent false negatives at the context retrieval stage - a.k.a. how to ensure your (vector? hybrid?) search returns all relevant context to the LLM’s context window.

Would you mind saying a few words on how Danswer approaches this?


Yes agreed, tooling abounds, the work for anyone who's serious about this is customizing everything so it works with the idiosyncrasies of the documents and questions a customer has. I'm happy to talk to anyone who is interested, we are doing something like this for a company now.


Sadly completely unusable for our use case - if you are targeting enterprise, you should know better than to make OpenAI models the only LLM available.

For now I will stick to PrivateGPT and LocalGPT.


Completely unusable for internal docs* should be the caveat. For external docs, OpenAI is fine unless you have stuff behind a password.


You may be in a situation where the document is public but the question is confidential, e.g. a user with a specific question about an agreement with public TOS that a legal or medical department is managing.


At the very least, I'd start with adding support for the Azure flavor of OpenAI API. It's literally the same models, but the difference is that it's your company deploying those models on Azure, under proper enterprise contract with Microsoft, literally so that they can be safely used with proprietary data.


Yea, that's good feedback - we've gotten requests for open source model support from a lot of the people we've talked to. It's one of our highest priorities, and should be available soon!


Are you planning to use GPT4ALL[^1] or something else? If you're going with the second option, please share links to such resources... I'd be interested.

And, to share something with you: I saw somewhere a tool (maybe it was GPT4ALL itself) that could expose an OpenAI-compatible local API on localhost:8080... Ah, yes. Here it is. Actually, there are two. They are described as possible backends for Bavarder (which gives free access to multiple online models; no API key required): https://bavarder.codeberg.page/help/local/

[^1]: https://github.com/nomic-ai/gpt4all/blob/main/gpt4all-backen...


From FAQ:

> Danswer provides Docker containers that you can run anywhere, so data never has to leave your PVC. The one exception is using GPT for inference but we are working on allowing for locally hosted generative models as well.

Look, you can plug this hole trivially for many companies, by adding support for Azure OpenAI API. It's almost identical to OpenAI API - the main difference is how you pass keys and specify the model to use. But that alone will make it possible to use Danswer with company data in places that signed a relevant contract with Microsoft.
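To illustrate how small the difference is, here's a sketch using the 2023-era `openai` Python SDK (pre-1.0); the resource name, deployment name, and API version below are all placeholders for your own Azure deployment:

```python
import openai

# Stock OpenAI: just a key; the model is selected by name in the call,
# e.g. openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=[...]).
openai.api_key = "sk-..."

# Azure OpenAI: the same models, but served from your company's own
# deployment under its Microsoft contract.
openai.api_type = "azure"
openai.api_base = "https://<your-resource>.openai.azure.com"
openai.api_version = "2023-05-15"
openai.api_key = "<azure-key>"

# Call-site difference: Azure selects by deployment name via `engine=`
# instead of `model=`, e.g.
# openai.ChatCompletion.create(engine="<your-deployment>", messages=[...])
```

So supporting Azure is mostly plumbing a handful of extra config values through, not a new integration.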


That's a good suggestion! Will look into it, should be fairly easy to support


How are you thinking about the "document level access control" to make this viable for business environments?

Ex: If a connected gdrive document gets indexed, but then someone fixes the share settings in google docs for some item to be more restrictive.. How does Danswer avoid leaking that data? Dynamic check before returning any doc that the live federated auth settings safelist the requesting user reading that doc?


Great question! Right now, our access control is very basic. When admins set up connectors to other apps, all documents indexed are accessible by all users (these connectors are meant for public documents only). Individual users can index private documents by providing their own access tokens for connectors, and those docs will only be available to the user who owns that access token. Improving this is a high-priority item for us, as we understand this is a deal-breaker for enterprises.

The immediate plan is to extend our current poll/push based connectors to also grab access information (+ add IdP integrations for cross-app identity). There will be some delay in picking up access updates, which will be mitigated by the query-time dynamic check against the app / IdP itself that you mentioned (still investigating exactly how this will work).

We are also considering adding support for group based access defined within Danswer itself for sources that don't provide APIs to get access information (default being all-public if not specified). Of course, for these, we will not be able to sync permissions.
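The query-time check mentioned above can be sketched as a post-retrieval filter (function and field names here are illustrative, not Danswer's actual API): even if the index's permission data is stale, every hit is re-checked against the live source app / IdP before being returned, so a recently restricted document can't leak.

```python
def filter_by_live_access(hits: list[dict], user: str, check_access) -> list[dict]:
    """Keep only the retrieved hits the source app still says `user` may read.

    check_access(user, doc_id) -> bool asks the app / IdP directly at
    query time, trading a little latency for up-to-date permissions.
    """
    return [hit for hit in hits if check_access(user, hit["doc_id"])]
```

The trade-off is one live permission lookup per candidate document, which is why it pairs well with (rather than replaces) index-time permission syncing.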


I wonder how long it will be before Google Workspace just has this feature for your Docs. It can't be long... Question-answering against external docs is something Google could easily add. I worry about the defensibility of startups working in this area as it's so fully in front of the steamroller.


Google can't even make a half-rate Bard.


We at Qdrant are glad to be a part of this awesome solution, providing the Vector Database resource for Danswer. https://github.com/qdrant/qdrant


Amazing foundational tools like Qdrant make building in this space so much easier <3


Could you say a few words about why you chose Qdrant for this project? It seems to me there is definitely a place in this space for a back-end focused on retrieval for LLMs that goes beyond simple vector similarity search and encapsulates other metadata creation / indexing and hybrid retrieval techniques to tackle the “false negatives” (missing context) challenge. We’re trying to decide whether leaning on something like Qdrant, Weaviate or Pinecone instead of our current Postgres / pgvector stack might be worth the cost of learning and running extra infra.



Looks great and will test it out, but for enterprises definitely needs support for Azure/Office 365 integration to index Word, Excel, etc. Lots of docs are stored in Onedrive, Teams channels, and SharePoint. I'm going to test these use cases, but would be nice if it supports it OOB like Google Docs. Also, any thought of OOB connectors to ServiceNow or other ticketing/KB platforms?


Native support for the Microsoft suite of tools is something we plan to add fairly soon! We're a small team, and currently swamped with connector/feature requests so no promises on the timeline.

Ticketing platforms like ServiceNow fall under a similar category, although a bit lower priority in my mind.


Wait, this is not local? Why use OpenAI third-party requests instead of a local model?


Adding local/open source model support is at the top of our TODO list! When we started building, open source models were quite a bit further behind GPT-4 than they are now. At that time, the performance gap was large enough that locally hosted models would have provided a significantly hampered experience, but we think that gap has closed (and will continue to close) rapidly.


Is there a single local model in existence that's good enough to support this use case?


I've seen a few of these, and this one looks like it is more feature complete than many (e.g. including web scraping I think is an important component).

Looks nice! Curious about the business model or is it just a hobby project?


Thanks! For now, we're just focused on making sure this solves a problem effectively for people. In the long term, if we're able to build up trust, we'll probably offer a managed version.


Check this out. Built on a vector database (https://github.com/milvus-io/milvus) and a semantic cache (https://github.com/zilliztech/GPTCache)

https://osschat.io/


I checked your service, but it does not work. For example, I've asked:

> What are the latest features introduced in ClickHouse?

and it answered:

> :sparkles::crystal_ball: Oh dear! A mystical internal server error has occurred. Please weave your magic and try again after a few seconds. :male_mage::star2:


I just had a play with this on the LangChain repo but the results were a bit mixed. Could you speak a bit about the approach taken and some of the challenges you faced?


Could I ask a somewhat naïve question? Does either Danswer or SimpleAI send data to a third party such us openai or huggingface?

As I understand it, you have a pre-trained LLM which you tune LOCALLY, and you can use it from the UI provided.

From what I am reading, there is ambiguity about whether data is sent elsewhere.

Obviously it doesn't matter whether you host in EKS or EC2 or locally, as far as I understand.


Right now, Danswer does send data to OpenAI. Once we support locally hosted, open source generative models that will change. If you go with that option, your data won't be sent anywhere.


This is great, love it!

Crawling sites to index the FAQs and knowledge bases into the vector search isn't as intimidating as it sounds, at least on Linux systems. Sometimes a thin wrapper function over plain old wget will get you 99% of the way:

    wget -rnH -t 1 --waitretry=0 'https://{{domain}}' -P '{{domain}}'
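Spelled out as the thin wrapper function the comment describes (the function name is mine, and the flags are annotated; this is a sketch, not a hardened crawler):

```shell
# Hypothetical wrapper: mirror a domain's pages into a folder named after it.
mirror_docs() {
    local domain="$1"
    # -r              recurse into links found on each page
    # -nH             don't create an extra hostname directory level
    # -t 1            try each URL only once; --waitretry=0: no retry backoff
    # -P "$domain"    save everything under a per-domain folder
    wget -rnH -t 1 --waitretry=0 "https://${domain}" -P "${domain}"
}
```

From there, feeding the mirrored HTML into the indexer's web/file connector covers the rest.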


Noooooooo, not OpenAI! It looks perfect - just allow running models like Vicuna or Llama locally. Well, since it's open source, anyone can contribute to make this happen.

Thank you for your work, it looks great


This looks interesting. Thank you for making public. I made something similar that uses data from only Notion. Do you happen to have / be developing a Notion connector?


We are actively building a Notion connector! Will be out very soon :)


Cool, written in NextJS! Are you taking on new contributors?

I would be interested in adding Azure OpenAI support and OneNote support.


We'd love for you to contribute. Let us know in Discord / Slack (if you haven't already), and we'd be happy to find something interesting for you to hack on.


The video looks impressive, well done. Why didn't you build it on top of langchain or other similar frameworks?


We're actually planning on migrating to LangChain very soon (primarily to allow for memory / tool usage + automatic integrations with llamacpp / other open source model serving frameworks). We didn't start with it initially since we didn't want to restrict our usage patterns too much while we were (even more) unsure of what exactly we were going to build.

As far as using other data connector frameworks goes, we found that (1) we didn't think they were very good and/or (2) they didn't support automatic syncing effectively. For larger enterprises, it's not feasible to do a complete sync every X minutes. We need to be able to get a time-bounded subset (or have them push updates to us), which is something LangChain, LlamaIndex, etc. don't support natively.
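The time-bounded-subset idea can be sketched as a cursor-based poll (the connector interface here is hypothetical, not Danswer's actual one): each poll asks the source only for documents updated since the last successful poll, instead of re-pulling everything.

```python
from datetime import datetime, timezone
from typing import Callable, List, Optional


class PollConnector:
    """Keeps a cursor so each poll fetches only a time-bounded subset."""

    def __init__(self, fetch_updated_between: Callable[[datetime, datetime], List]):
        # fetch_updated_between(start, end) -> documents updated in (start, end]
        self._fetch = fetch_updated_between
        self._cursor: Optional[datetime] = None

    def poll(self, now: Optional[datetime] = None) -> List:
        now = now or datetime.now(timezone.utc)
        # First run has no cursor, so it degrades to a full sync.
        start = self._cursor or datetime.min.replace(tzinfo=timezone.utc)
        docs = self._fetch(start, now)
        # Advance the cursor only after a successful fetch.
        self._cursor = now
        return docs
```

A push-based source inverts this: the app notifies the indexer of changes, and no cursor polling is needed at all.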


Can you elaborate on the difficulties and nuances surrounding syncing? I'm not sure exactly what you mean. Do you mean keeping indexes up to date when new documents are provided? Or something else...


how does this compare to llama-index


LlamaIndex is a very generic framework to ingest data (from anywhere, with no specific context in mind). Developers then build on top of this framework to simplify the process of creating LLM-powered apps. Developers still need to handle automatic syncing, build a UI to manage connections, build out the actual features/functionality they desire, etc.

Danswer is:

(1) itself an end-to-end application which allows you to connect to all your workplace tools via a UI, and then ask questions and get answers based on these documents. The goal is to be a permissively licensed, open source solution to the enterprise knowledge retrieval problem.

(2) an ingestion framework specifically targeted for enterprise applications. We provide a UI for admins to manage connections to other common workplace apps, automatically sync them into a vector DB + a keyword search engine, and expose APIs that allow access to these underlying data stores (more to come in this direction). We take care of access control (more in the pipeline here as well), only grabbing updates so we don't have to pull thousands (or millions) of documents every X minutes, etc. TL;DR: we're focused on a specific ingestion use case.


why did you choose not to build on top of something like llama-index or langchain? how easy or difficult would it be to integrate with a library like llama-index?


See https://news.ycombinator.com/item?id=36671406 for more details on this. TL;DR didn't like llama-index's connectors, and we're planning to move to LangChain soon.


Also wondering this.


Another great tool solving the exact problem we want to solve - built on an external service we can't use.

No company of a decent size (one that actually reaches some complexity of documentation) will be okay with exfiltrating confidential information to an external service it has no deal or NDA with. Sure, OpenAI is easy to integrate, but it's also an absolute showstopper for a company.

We don't need state-of-the-art LLMs with 800k context, we need confidentiality.


I'm kinda confused by this. Every company already keeps their data in Google Docs, Notion, Slack, Confluence, Jira, or any number of other providers. When you sign up for one of these services, there's always a compliance step to make sure it's ok. OpenAI's TOS says they don't use API data for training. So what makes sending this data to OpenAI different than sending it to any of the above providers? This is an honest question. I don't understand the difference.


> Every company already keeps their data in Google Docs

The TOS for the (paid) enterprise products such as Google workspace are totally different from the (free) consumer versions. For example Google can't use the data for AI training.


TOS of OpenAI API (which tools like this use) do not allow for model training on the data either. You might be confusing their API with ChatGPT, which has a different policy.


The important point being: with Google, Notion, Slack, Confluence, etc., your company has an actual contract with the vendor, properly signed, with provisions about data handling that your company (unlike you as an individual) can effectively enforce. There's an actual relationship created here, with benefits and losses flowing both ways.

The Terms of Service? They're worth less than it costs to print them out.

Case in point: right now, Microsoft is repackaging OpenAI models on their Azure platform and raking it in - the main value proposition is literally just that it's "OpenAI, but with a proper contract and an SLA". But companies happily pay up, because that's what makes the difference between "reliable and safe to use at work" and "violating internal and external data safety standards, and in some cases plain illegal".


So if the product from OP used Azure OpenAI, it would be okay? You say "companies happily pay up", but the pricing is exactly the same (source: my company is paying for both APIs).

It's been quite clear for some time that, between OAI and MS, they very neatly split their market: OAI handles the early development and direct customers, and MS handles enterprises. It would require OAI to be a much bigger company than it is right now to properly handle enterprises, and MS already has all that infrastructure (legal, support, etc.). Seems like a sensible setup to me, and I don't see the need for enterprises to run open source models themselves (in this context - of course I see the value in all the other respects about lock-in and specialization), especially if they are already on Azure.


IANAL, but I read the OpenAI API TOS earlier today, and they keep data for up to 30 days for "review", and multiple people can get access to it. If I had confidential data, I would not send it to them. Microsoft, on the other hand, seems to have an option where absolutely no data is stored for their OpenAI service.


You use "review" in quotes, but I don't see that word used in reference to the 30 day policy. This is what I see:

> Any data sent through the API will be retained for abuse and misuse monitoring purposes for a maximum of 30 days, after which it will be deleted (unless otherwise required by law). [0]

The word "review" implies humans reading your data, but this wording only says it's retained for "monitoring". That could mean other things.

Or are you seeing the "review" wording somewhere else?

[0] https://openai.com/policies/api-data-usage-policies


It is true that the word "review" was my own; it was my interpretation of this paragraph:

>OpenAI retains API data for 30 days for abuse and misuse monitoring purposes. A limited number of authorized OpenAI employees, as well as specialized third-party contractors that are subject to confidentiality and security obligations, can access this data solely to investigate and verify suspected abuse.


For our part, we self-host Confluence and GitLab, have tons of internal documentation and web pages, and are prohibited from using external tools unless they can be hosted internally in a sandboxed manner. There's no way on the planet they would approve connecting to an OpenAI API for trawling through internal documentation.


There are open source models that can deliver pretty well for chatbot over internal documentation. If you're interested, feel free to reach out to me.


Trust. OpenAI ignored everyone's copyright and legal usage terms for the rest of their training data - what lawyer is going to trust them to follow their contractual terms?


Why would you send your data to the company that built its value by slurping up everyone's data without consent? It doesn't matter what they promise now; they have shown that they don't care about intellectual property, copyright, or any of that. They literally cannot be trusted.


Isn't this what Google search does? Yet Google Docs, Gmail, etc are all OK?


It doesn't matter either way. What matters is that Google offers proper enterprise contracts. Contracts that are enforceable and transfer a lot of legal liability to the vendor. OpenAI, generally, does not offer such things.

Google Search itself is a somewhat special case - it gets a free pass because of its utility and because you're unlikely to paste anything confidential into a search box. But there are many places where even Google Search is banned on the data security grounds.

OpenAI offerings - ChatGPT, the playground, and the API - all very much encourage pasting large amounts of confidential information into them, which is why any organization with minimum of legal sense is banning or curtailing their use.


Two weeks ago I finished a project for a client who wanted a "talk to your documents" application, without using OpenAI or other 3rd party APIs, but by using open source models running on their own infrastructure.

If you're interested in something similar, send me an email.



