
In your opinion, is it an either/or scenario? Or would fine-tuning on docs + RAG be even more powerful?



I've been wondering this myself lately.

I've been using RAG with pgvector for the last few months at temperature 0, and it's been pretty great, with very little hallucination.
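
The retrieval step itself is small. A minimal sketch in Python, assuming a hypothetical `chunks` table with a `content` column and an `embedding vector(1536)` column, queried via psycopg2 (table, column names, and dimensions are illustrative, not my actual setup):

    # Minimal pgvector retrieval sketch; names and sizes are assumptions.
    import psycopg2

    def top_k_chunks(conn, query_embedding, k=5):
        with conn.cursor() as cur:
            # <=> is pgvector's cosine-distance operator
            cur.execute(
                """
                SELECT content, 1 - (embedding <=> %s::vector) AS similarity
                FROM chunks
                ORDER BY embedding <=> %s::vector
                LIMIT %s
                """,
                (str(query_embedding), str(query_embedding), k),
            )
            return cur.fetchall()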

The small context window is the limiting factor.

In principle, I don't see the difference between fine-tuning on a bunch of prompts along the lines of "here is another context section: <~4k-n tokens of the corpus>" and what the same section looks like inside a RAG prompt anyway.
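
To make that concrete, here's an illustrative sketch of the same section framed both ways; the chat-style record follows the common fine-tuning JSONL shape, and everything here is a placeholder:

    # Illustrative only: one corpus section as a chat-style fine-tuning
    # record vs. the same section stuffed into a RAG prompt at inference.
    section = "<one ~4k-token section of the corpus>"

    finetune_record = {
        "messages": [
            {"role": "user", "content": f"Here is another context section:\n{section}"},
            {"role": "assistant", "content": "Noted."},  # placeholder completion
        ]
    }

    rag_prompt = (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{section}\n\n"
        "Question: ..."
    )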

Maybe the distinction between fine-tuning for "tone" versus "context" comes down to the role of the given prompts, and isn't a restriction of the fine-tuning process itself?

In theory, fine-tuning on ~100k tokens like that would allow for better inference, even with a RAG prompt that includes a few sections from the same corpus. It would prevent cases where the vector search results are too thin despite their high similarity, e.g. pulling out only one or two sections of a book that is actually really long.

For example, I've seen some folks use arbitrary chunking of tokens in batches of ~1k as an easy implementation choice, but that totally breaks the semantic meaning of longer paragraphs, and those paragraphs might not come back grouped together from the vector search. My approach has been manual curation of sections, letting them vary from 50 to 3k tokens so the chunks are more natural. It has worked well, but I could still see fine-tuning on the whole corpus as extra insurance against losing context.
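
A rough sketch of both approaches, using tiktoken for token counts (an assumption; any tokenizer would do) and blank lines as the natural paragraph boundaries:

    # Sketch: arbitrary fixed-size chunking vs. boundary-aware chunking
    # with the 50-3k token bounds mentioned above. tiktoken is an assumption.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    def fixed_chunks(text, size=1000):
        # Arbitrary ~1k-token windows: easy, but splits mid-paragraph.
        toks = enc.encode(text)
        return [enc.decode(toks[i:i + size]) for i in range(0, len(toks), size)]

    def section_chunks(text, min_tokens=50, max_tokens=3000):
        # Merge paragraphs until the next one would push past max_tokens.
        chunks, buf = [], ""
        for para in text.split("\n\n"):
            candidate = (buf + "\n\n" + para).strip()
            if buf and len(enc.encode(candidate)) > max_tokens:
                chunks.append(buf)
                buf = para
            else:
                buf = candidate
        if buf:
            if chunks and len(enc.encode(buf)) < min_tokens:
                chunks[-1] += "\n\n" + buf  # fold a tiny tail into the previous chunk
            else:
                chunks.append(buf)
        return chunks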


It's not impossible that fine-tuning would also help RAG, but it's certainly not guaranteed, and it's hard to control. Fine-tuning essentially changes the weights of the model, and might result in other, potentially negative outcomes, like loss of other knowledge or capabilities in the resulting fine-tuned LLM.

Other considerations: (A) How often would you fine-tune: daily? weekly? whenever the data changes? (B) The cost and availability of GPUs (there's a current shortage).

My experience is that RAG is the way to go, at least right now.

But you have to make sure your retrieval engine works optimally, getting the most relevant pieces of text from your data: (1) using a good chunking strategy that's better than arbitrary 1K or 2K characters, (2) using a good embedding model, and (3) using hybrid search, among a few other things like that.
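
On (3), one common recipe for hybrid search is reciprocal rank fusion over the keyword and vector result lists; a minimal sketch, with the two retrieval calls assumed to exist elsewhere:

    # Reciprocal rank fusion (RRF): combine two ranked lists of doc ids.
    def rrf_fuse(keyword_hits, vector_hits, k=60):
        # keyword_hits / vector_hits: doc ids, best match first.
        scores = {}
        for hits in (keyword_hits, vector_hits):
            for rank, doc_id in enumerate(hits):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
        return sorted(scores, key=scores.get, reverse=True)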

Certainly the availability of longer-sequence models is a big help.

Sharing this relevant discussion from LinkedIn: https://www.linkedin.com/feed/update/urn:li:activity:7101638...



