I have to assume that targeted/curated LLM training sets will tend to produce less accurate models than very general ones, just by the nature of how they work.
I know it's not quite analogous, but I fine-tuned GPT-3 on a small (200-example) dataset and it performed extremely poorly compared to the base, untuned version.
This surprised me. I thought it wouldn't do much better, but I wasn't expecting that specializing it on my target data would actually reduce performance! I had fewer examples than the minimum OpenAI recommends, so maybe it was a case of overfitting or something like that.
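For context, the setup was roughly like this: a JSONL file of prompt/completion pairs, which is the format the legacy GPT-3 fine-tuning endpoint expected. A minimal sketch (the example pairs here are hypothetical placeholders, not my actual data):

```python
import json

# Hypothetical prompt/completion pairs in the JSONL shape used for
# legacy GPT-3 fine-tunes -- one JSON object per line.
examples = [
    {"prompt": "Classify the sentiment: 'Great product!' ->",
     "completion": " positive"},
    {"prompt": "Classify the sentiment: 'Broke after a day.' ->",
     "completion": " negative"},
]

# Write the training file; the real dataset had ~200 lines like these.
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

With only ~200 lines of this (below OpenAI's recommended minimum), the model can easily overfit the narrow distribution and end up generalizing worse than the base model.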
(edited for clarity)