"For the most advanced model (GPT-4 with retrieval augmented generation), 30% of individual statements are unsupported and nearly half of its responses are not fully supported"
Show us the source code and data. The way the RAG system is implemented is responsible for that score.
Building a RAG system that provides good citations on top of GPT-4 is difficult (and I would say not a fully solved problem at this point) but those implementation details still really matter for this kind of study.
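To make that concrete: the kind of implementation detail that moves a "supported statements" score is how the retrieved sources are presented to the model and how strictly it is told to cite them. Here's a rough sketch of one such choice - my own illustration, not the paper's setup, and the helper name and source format are made up - using the OpenAI Python client:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def answer_with_citations(question: str, sources: list[dict]) -> str:
        # sources: already-retrieved passages, e.g. {"id": "1", "url": "...", "text": "..."}
        source_block = "\n\n".join(
            f"[{s['id']}] {s['url']}\n{s['text']}" for s in sources
        )
        system = (
            "Answer using ONLY the numbered sources below. "
            "End every sentence with the citation(s) that support it, e.g. [1]. "
            "If the sources do not support a claim, say you cannot answer instead of guessing."
        )
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": f"{system}\n\nSources:\n{source_block}"},
                {"role": "user", "content": question},
            ],
            temperature=0,
        )
        return response.choices[0].message.content

Small changes to that system prompt, to the chunking, or to the citation format can swing how many statements end up "supported" - which is why the evaluated system matters so much.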
So that "30% of individual statements are unsupported" number was actually a test of how well ChatGPT's GPT-4 browsing mode with Bing could provide citations when answering medical questions.
Importantly, this doesn't actually guarantee that it does any kind of search.
I'm confused as to whether they're using the API or not. AFAIK only the Assistants API has access to web search, so I would expect this was done manually? But then the reason for only doing this with OpenAI is that the others don't provide an API.
> GPT-4 (RAG) refers to GPT-4’s web browsing capability powered by Bing. Other RAG models such as Perplexity.AI or Bard are currently unavailable for evaluation due to a lack of API access with sources, as well as restrictions on the ability to download their web results. For example, while pplx-70b-online produces results with online access, it does not return the actual URLs used in those results. Gemini Pro is available as an API, but Bard’s implementation of the model with RAG is unavailable via API.
> Importantly, this doesn't actually guarantee that it does any kind of search.
What's more important is that a user _can see_ whether GPT-4 has searched for something or not, and can ask it to actually search the web for references.
I saw a presentation about this last week at the Generative AI Paris meetup, by the team building the next generation of https://vidal.fr/, the reference for medical data in French-speaking countries. It started as a paper dictionary and has existed since 1914.
They focus on the more specific problem of preventing drug misuse (checking interactions with other drugs, diseases, pathologies, etc.). They use GPT-4 + RAG with Qdrant and return the exact source of the information, highlighted in the data. They are expanding their test set - they use real questions asked by GPs - but currently they have a 0% error rate (and fewer than 20% of cases where the model cannot answer).
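I don't know the details of their pipeline, but "return the exact source of the information" presumably means storing each passage with its provenance and handing both back alongside the answer. A rough sketch of that retrieval step with the Qdrant and OpenAI Python clients - the collection name, payload fields and embedding model are my assumptions, not theirs:

    from openai import OpenAI
    from qdrant_client import QdrantClient

    oai = OpenAI()
    qdrant = QdrantClient(url="http://localhost:6333")

    def retrieve_with_sources(question: str, k: int = 5) -> list[dict]:
        # Embed the question and pull the k closest passages.
        emb = oai.embeddings.create(model="text-embedding-3-small", input=question)
        hits = qdrant.search(
            collection_name="drug_monographs",  # hypothetical collection name
            query_vector=emb.data[0].embedding,
            limit=k,
        )
        # Each payload carries the passage text plus where it came from, so the
        # answer can show (and highlight) the exact supporting source.
        return [
            {"text": h.payload["text"], "reference": h.payload["reference"], "score": h.score}
            for h in hits
        ]

Refusing to answer when no retrieved passage clears a similarity threshold would be one straightforward way to trade "cannot answer" cases for a lower error rate.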
Same; a doctor’s judgment is supported by a system of accountability, which distributes the risk of error beyond the patient to the doctor/medical practice/insurer. In contrast (at least as of today), user-facing AI deployments absolve themselves of responsibility with a ToS.
Who knows if that’ll stand up to legal scrutiny, but if I had to bet on something ITT it would be that legal repercussions of bad AI will look a lot like modern class action lawsuits. I look forward to my free year of “AI Error Monitoring by Equifax”.
I wonder if the result changes if you put a high-quality medical reference in context. Feels like there might be an opportunity for someone to try to cram as much medical knowledge as possible into 1M tokens and use the new Gemini model.
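That's cheap to try: tokenize the reference, keep as much as fits in the context budget, and prepend it to the question. A rough sketch - I'm using tiktoken's cl100k_base encoding purely as a stand-in token counter, since Gemini's tokenizer differs, and the reference file is hypothetical:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # stand-in tokenizer, just for budgeting

    def build_long_context_prompt(question: str, budget: int = 1_000_000) -> str:
        # Hypothetical local copy of a high-quality medical reference.
        reference = open("medical_reference.txt", encoding="utf-8").read()
        # Keep as much of the reference as fits, leaving headroom for the question.
        reserve = len(enc.encode(question)) + 1_000
        trimmed = enc.decode(enc.encode(reference)[: budget - reserve])
        return (
            "Using only the reference material below, answer the question and "
            "quote the passage that supports your answer.\n\n"
            f"=== REFERENCE ===\n{trimmed}\n\n=== QUESTION ===\n{question}"
        )

Whether citation quality actually improves at that scale is exactly the kind of thing this study could have measured.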
> So that "30% of individual statements are unsupported" number was actually a test of how well ChatGPT's GPT-4 browsing mode with Bing could provide citations when answering medical questions.
Man, I am so disappointed. This is not a good study. Come on.