"For the most advanced model (GPT-4 with retrieval augmented generation), 30% of individual statements are unsupported and nearly half of its responses are not fully supported"

Show us the source code and data. The way the RAG system is implemented is responsible for that score.

Building a RAG system that provides good citations on top of GPT-4 is difficult (and I would say not a fully solved problem at this point) but those implementation details still really matter for this kind of study.
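To make "implementation details" concrete, here's roughly the kind of citation-grounded loop that sits behind a good RAG answer - retriever quality, the prompt, and the refusal behaviour all move a number like that 30%. This is just a sketch with the openai Python client and a hypothetical retrieve_passages function, not what the paper did:

    # Minimal citation-grounded RAG sketch (illustrative; not the paper's setup).
    # Assumes the openai Python client; retrieve_passages is a hypothetical
    # retriever that returns (source_url, text) pairs.
    from openai import OpenAI

    client = OpenAI()

    def answer_with_citations(question, retrieve_passages, k=5):
        passages = retrieve_passages(question, k=k)
        numbered = "\n\n".join(
            f"[{i + 1}] {url}\n{text}" for i, (url, text) in enumerate(passages)
        )
        prompt = (
            "Answer using ONLY the numbered sources below. Tag every statement "
            "with its source number, e.g. [2]. If the sources do not support an "
            "answer, say so.\n\n"
            f"Sources:\n{numbered}\n\nQuestion: {question}"
        )
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return response.choices[0].message.content, passages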

UPDATE: I found it in the paper: https://arxiv.org/html/2402.02008v1#S3 - "GPT-4 (RAG) refers to GPT-4’s web browsing capability powered by Bing."

So that "30% of individual statements are unsupported" number was actually a test of how well ChatGPT's GPT-4 browsing mode with Bing could provide citations when answering medical questions.




It's using the web search provided by OpenAI.

Importantly, this doesn't actually guarantee that it does any kind of search.

I'm confused as to whether they're using the API or not. AFAIK only the Assistants API has access to web search, so I would expect this was done manually? But then the reason for only doing this with OpenAI is that the others don't provide an API.

> GPT-4 (RAG) refers to GPT-4’s web browsing capability powered by Bing. Other RAG models such as Perplexity.AI or Bard are currently unavailable for evaluation due to a lack of API access with sources, as well as restrictions on the ability to download their web results. For example, while pplx-70b-online produces results with online access, it does not return the actual URLs used in those results. Gemini Pro is available as an API, but Bard’s implementation of the model with RAG is unavailable via API.


> Importantly this doesn't actually guarantee that it does any kind of search.

What's more important is that a user _can see_ whether GPT-4 has searched for something or not, and can ask it to actually search the web for references.


That's wildly misleading then. It would be interesting to see how GPT-4, properly augmented with actual medical literature, would do.


I saw a presentation about this last week at the Generative AI Paris meetup, by the team building the next generation of https://vidal.fr/, the reference for medical data in French-speaking countries. It used to be a paper dictionary and has existed since 1914.

They focus on the more specific problem of preventing drug misuse (checking interactions with other drugs, diseases, pathologies, etc.). They use GPT-4 + RAG with Qdrant and return the exact source of the information, highlighted in the data. They are expanding their test set - they use real questions asked by GPs - but currently they have a 0% error rate (and fewer than 20% of cases where the model cannot answer).
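I don't know their actual implementation, but the general shape of a Qdrant-backed pipeline like that (retrieve monograph chunks, force the model to quote its source or refuse) is roughly this - the collection name, payload fields and prompt are made up, not anything from their talk:

    # Rough sketch of a Qdrant-backed interaction check; the collection name,
    # payload fields and prompt are invented, not VIDAL's actual pipeline.
    from openai import OpenAI
    from qdrant_client import QdrantClient

    openai_client = OpenAI()
    qdrant = QdrantClient(url="http://localhost:6333")

    def check_interaction(question, k=3):
        # Embed the question with the same model used to index the monographs.
        vector = openai_client.embeddings.create(
            model="text-embedding-3-small", input=question
        ).data[0].embedding

        hits = qdrant.search(
            collection_name="drug_monographs",  # hypothetical collection
            query_vector=vector,
            limit=k,
        )
        excerpts = [hit.payload["text"] for hit in hits]

        prompt = (
            "Using only the excerpts below, say whether there is an interaction "
            "and quote the exact supporting sentence. If the excerpts are "
            "insufficient, answer 'cannot answer'.\n\n"
            + "\n---\n".join(excerpts)
            + f"\n\nQuestion: {question}"
        )
        answer = openai_client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return answer.choices[0].message.content, excerpts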


Likely better than the average doctor. If I had the opportunity to take that bet, I would.


I would take the other side of that bet in a heartbeat.

But given the vagueness of the wording, much is going to depend on the details.


Same; a doctor’s judgment is supported by a system of accountability, which distributes the risk of error beyond the patient to the doctor/medical practice/insurer. In contrast (at least as of today), user-facing AI deployments absolve themselves of responsibility with a ToS. Who knows if that’ll stand up to legal scrutiny, but if I have to bet on something ITT, it would be that the legal repercussions of bad AI will look a lot like modern class action lawsuits. I look forward to my free year of “AI Error Monitoring by Equifax”.


I wonder if the result changes if you put a high-quality medical reference in context. Feels like there might be an opportunity for someone to try to cram as much medical knowledge as possible into 1M tokens and use the new Gemini model.
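Something like this, presumably, with the google-generativeai client - the model name and reference file are placeholders, and whether a full reference actually fits in the context window is the open question:

    # Long-context sketch: put the whole reference in the prompt instead of
    # retrieving chunks. Model name and reference file are placeholders.
    import google.generativeai as genai

    genai.configure(api_key="...")
    model = genai.GenerativeModel("gemini-1.5-pro")

    with open("medical_reference.txt") as f:  # hypothetical large reference text
        reference = f.read()

    question = "What are the contraindications for drug X?"
    response = model.generate_content(
        "Answer strictly from the reference below and quote the passage you "
        "relied on.\n\n" + reference + "\n\nQuestion: " + question
    )
    print(response.text)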


I agree. While I appreciate what doctors do, there sure are a lot of shitty doctors out there who skirt by, like in any profession.


So odd they call GPT-4 + Bing a "RAG" system


Is it not?

They’re Retrieving data from Bing to Augment GPT’s Generations.


> So that "30% of individual statements are unsupported" number was actually a test of how well ChatGPT's GPT-4 browsing mode with Bing could provide citations when answering medical questions.

Man, I am so disappointed. This is not a good study. Come on.



