Curious to learn more about your use case. If fine-tuning is ineffective only for your most complex queries (and presumably those are less frequent as well, since you mentioned you have few examples), couldn't you use fine-tuning to handle the simpler queries (presumably the lion's share) and free up man-hours to focus on the more complex ones? Is there any benefit to AI being able to answer 90% of queries vs. 0%?
Love this! I don't think you could be more right about the practical challenges of implementing something like this. In my experience, this same problem is what makes it so challenging to onboard new data scientists/analysts.
It takes a lot of training to get a team member up to speed. With the same amount of training, do you think an LLM can compete?