I need to scope and cost out training and hosting an LLM over a small-to-medium corpus of docs (400K pages of web text plus PDFs), for the purpose of providing a chat-like query interface.
How do I a) scope and estimate the GPU hours needed, and b) decide which pre-trained transformer model(s) might be best as a starting point?
I'm assuming you want Q&A over the docs. If so, here's what I think you'd want to do:
* Use a local embeddings tool like BERT to embed all the docs (in chunk sizes up to 512 tokens)
* Use an open LLM like MPT-30B or Falcon-40B
* Then, for each user query, do the following: generate an answer to the query with no context. Then do an embeddings similarity search based on both a) the question and b) the generated no-context answer. Then feed the 3-6 most similar chunks to your LLM as context (with a prompt like: Please answer the user's question. Here's context: """[context]""". Here's the question: [question].)
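A rough sketch of the steps above, in case it helps to see them together. Everything here is an assumption for illustration: `embed` is a toy bag-of-words stand-in (not BERT) so the example runs with no dependencies, `chunk_text` counts whitespace-separated words rather than real model tokens, and taking the max of the two similarities is just one way to combine the question and the no-context answer. The function names are made up, and the draft answer would come from your LLM in practice.

```python
import math
import re
from collections import Counter

def chunk_text(text, max_tokens=512):
    """Split text into chunks of at most max_tokens whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

def embed(text):
    """Toy stand-in for a real embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, draft_answer, chunks, k=3):
    """Rank chunks by the better of their similarity to the question and to the draft answer."""
    q_vec, a_vec = embed(question), embed(draft_answer)
    scored = [(max(cosine(q_vec, embed(c)), cosine(a_vec, embed(c))), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]

def build_prompt(question, context_chunks):
    """Assemble the context-plus-question prompt described above."""
    context = "\n\n".join(context_chunks)
    return (
        "Please answer the user's question. "
        'Here\'s context: """' + context + '""". '
        "Here's the question: " + question
    )

chunks = [
    "The warranty lasts two years from purchase.",
    "Shipping takes five business days.",
    "Returns are accepted within thirty days.",
]
question = "How long is the warranty?"
draft = "The warranty probably lasts one or two years."  # would be the LLM's no-context answer
print(build_prompt(question, retrieve(question, draft, chunks, k=1)))
```

With a real setup you'd swap `embed` for your local embedding model and store the chunk vectors in an embedding database instead of re-embedding on every query.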
All of that said, I think with the current state of open source commercially usable models, your results will be disappointing and I don't expect end users will be happy with the results. It sounds like you can't use GPT-4 (otherwise I'd say check this list I put together - https://llm-utils.org/List+of+tools+for+making+a+%22ChatGPT+...) - if you can use GPT-4, you'll get better results.
The other thing you can do is of course provide a source link to the documents that contained the most relevant chunks below the answer. (And you can also add a separate LLM prompt that asks the LLM which of the 3-6 chunks were highly relevant, and then use that to re-rank the results - I think of LLMs as better at similarity ranking than embeddings, though of course much slower and more expensive for the task, so best used sparingly when it's something embeddings can do).
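Here's a minimal sketch of that re-ranking step, assuming the LLM is reachable through some function that takes a prompt and returns text. `ask_llm` is hypothetical (it stands in for whatever inference call you end up with), and the prompt wording and the "reply with numbers" convention are my own guesses at something workable, not a tested recipe.

```python
import re

def rerank_with_llm(question, chunks, ask_llm):
    """Return the chunks the LLM marks as highly relevant, in the order it lists them.

    ask_llm is a hypothetical callable: prompt string in, reply string out.
    """
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    prompt = (
        "Which of these passages are highly relevant to the question? "
        "Reply with the passage numbers only, most relevant first, comma-separated.\n"
        f"Question: {question}\n{numbered}"
    )
    reply = ask_llm(prompt)
    picked = []
    for token in re.findall(r"\d+", reply):
        idx = int(token) - 1
        # Ignore out-of-range or repeated numbers in the reply.
        if 0 <= idx < len(chunks) and idx not in picked:
            picked.append(idx)
    # Fall back to the original (embedding) order if the reply is unusable.
    if not picked:
        return list(chunks)
    return [chunks[i] for i in picked]

# Usage with a fake LLM that always answers "3, 1":
passages = ["chunk A", "chunk B", "chunk C"]
print(rerank_with_llm("example question", passages, lambda prompt: "3, 1"))
```

Since this costs one extra LLM call per query, it only makes sense on the handful of chunks the embedding search already returned.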
So answering your questions:
1 - Minimal GPU hours needed with this approach, as you're not doing any fine-tuning, only inference. If using MPT-30B, I'd suggest 1x H100 on Lambda Labs or FluidStack; if using Falcon-40B, I'd suggest 2x 6000 Ada on Runpod. See also this table I put together - https://gpus.llm-utils.org/recommended-gpus-and-gpu-clouds-f...
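For a back-of-the-envelope check on those GPU picks, you can estimate the memory the weights alone need: parameter count times bytes per parameter (2 bytes at 16-bit precision). The sketch below assumes the nominal parameter counts from the model names (30B and 40B), 80 GB per H100 and 48 GB per RTX 6000 Ada, and an arbitrary 10% headroom for KV cache and activations. The headroom figure and function names are my assumptions; real usage depends on context length and batch size.

```python
def weight_memory_gb(params_billions, bytes_per_param=2):
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bytes_per_param

def fits(params_billions, gpu_memory_gb, num_gpus=1, bytes_per_param=2, headroom=0.1):
    """True if the weights fit while leaving a fraction of total memory free.

    headroom is an assumed reserve for KV cache and activations, not a measured value.
    """
    usable = gpu_memory_gb * num_gpus * (1 - headroom)
    return weight_memory_gb(params_billions, bytes_per_param) <= usable

print(weight_memory_gb(30))             # 60 GB of weights for a 30B model at 16-bit
print(fits(30, gpu_memory_gb=80))       # True: fits on 1x 80 GB H100
print(fits(40, gpu_memory_gb=48, num_gpus=2))  # True: fits across 2x 48 GB cards
print(fits(40, gpu_memory_gb=80))       # False: too tight on a single 80 GB card
```

Quantizing to 8-bit or 4-bit (`bytes_per_param=1` or `0.5`) shrinks these numbers a lot, at some cost in output quality.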
2 - I'd suggest MPT-30B if you need a commercial-ok model, otherwise Guanaco-33B.
I founded a company working to help companies use AI to organize their private knowledge.
We have focused on semantic search and knowledge graphs, but we started integrating a chatbot last week and it seems a short leap from where we are.
We'd be happy to help implement something.
You'll certainly want an embedding database.
The open models are getting pretty good, but you'll want to stand up a testing framework. I have a reasonably good model running on a desktop machine in my office with a reasonably priced consumer-grade NVIDIA GPU.
We also have some tactics and practices around hallucination prevention that we'd be happy to share.
Feel free to reach out:
human at summitlabs.ai
Do you have a scope for what you want the LLM to do? If it is just answering questions on the data, then you likely don't need to do a lot of training. I think the problem is more the infra setup: the embedding database, the cost of hosting a model for inference, and so on. Full disclosure: I am also the founder of a startup that provides this as a service, so feel free to book some time if you are interested in us doing it for you or if you want tips on how to do it yourself. Here is my Calendly if you want to talk: https://calendly.com/andrew-vb
Yes, mostly answering questions, and also summarizing and exploring by asking for explanations (understanding there could be non-trivial amounts of hallucination).
Really appreciate you offering advice. Will book time after the long weekend.
Are you building your own "LLM" for this use case (more complex and expensive)? Or just generating embeddings so you can integrate your own data with OpenAI or other LLMs (easier and less expensive)? https://openai.com/blog/introducing-text-and-code-embeddings
The LLM needs to be run locally, so it can't be a GPT API or an API for some cloud LLM. The docs can't leave the perimeter. It can be a pretrained model that is being "extended" (is there such a thing?) with the new docs. Alternatively, it can be an existing LLM trained from scratch on these docs, but I'm not sure that's the right approach.
Cloud VM possible, not sure. Is it possible to run on HuggingFace without compromising data privacy? On-premise preferred. Capital cost to get to "Hello world" will probably convince principals to use HuggingFace if data privacy concerns can be met.