"For the most advanced model (GPT-4 with retrieval augmented generation), 30% of individual statements are unsupported and nearly half of its responses are not fully supported"
Show us the source code and data. The way the RAG system is implemented is responsible for that score.
Building a RAG system that provides good citations on top of GPT-4 is difficult (and I would say not a fully solved problem at this point) but those implementation details still really matter for this kind of study.
UPDATE: I found it in the paper: https://arxiv.org/html/2402.02008v1#S3 - "GPT-4 (RAG) refers to GPT-4’s web browsing capability powered by Bing." So that "30% of individual statements are unsupported" number was actually a test of how well ChatGPT's GPT-4 browsing mode with Bing could provide citations when answering medical questions.
Importantly this doesn't actually guarantee that it does any kind of search.
I'm confused as to whether they're using the API or not. AFAIK only the Assistants API has access to web search, so I would expect this was done manually? But then the reason for only doing this with OpenAI is that the others don't provide an API.
> GPT-4 (RAG) refers to GPT-4’s web browsing capability powered by Bing. Other RAG models such as Perplexity.AI or Bard are currently unavailable for evaluation due to a lack of API access with sources, as well as restrictions on the ability to download their web results. For example, while pplx-70b-online produces results with online access, it does not return the actual URLs used in those results. Gemini Pro is available as an API, but Bard’s implementation of the model with RAG is unavailable via API.
> Importantly this doesn't actually guarantee that it does any kind of search.
What's more important is that a user _can see_ whether GPT-4 has searched for something or not, and can ask it to actually search the web for references.
I saw a presentation about this last week at the Generative AI Paris meetup, by the team building the next generation of https://vidal.fr/, the reference for medical data in French-speaking countries. It used to be a paper dictionary and has existed since 1914.
They focus on the more specific problem of preventing drug misuse (checking interactions with other drugs and diseases, pathologies, etc.). They use GPT-4 + RAG with Qdrant and return the exact source of the information highlighted in the data. They are expanding their test set - they use real questions asked by GPs - but currently they have a 0% error rate (and fewer than 20% of cases where the model cannot answer).
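For what it's worth, the skeleton of that kind of setup is small; here's a minimal sketch assuming Qdrant + the OpenAI API. The collection name, payload fields, and prompts are my own placeholders, not Vidal's actual implementation:

```python
"""Minimal RAG sketch: retrieve monograph chunks from Qdrant, answer with GPT-4,
and return the exact source passages so a UI could highlight them.
All names (collection, payload fields, prompts) are illustrative assumptions."""
from openai import OpenAI
from qdrant_client import QdrantClient

oai = OpenAI()                                    # assumes OPENAI_API_KEY is set
qdrant = QdrantClient(url="http://localhost:6333")

def answer_with_sources(question: str, collection: str = "drug_monographs"):
    # 1. Embed the question and retrieve the closest chunks.
    vec = oai.embeddings.create(model="text-embedding-3-small",
                                input=[question]).data[0].embedding
    hits = qdrant.search(collection_name=collection, query_vector=vec, limit=5)

    # 2. Keep the payload so the exact source text can be shown to the user.
    sources = [{"id": h.id, "text": h.payload["text"], "ref": h.payload["ref"]}
               for h in hits]
    context = "\n\n".join(f"[{s['id']}] {s['text']}" for s in sources)

    # 3. Ask the model to answer only from the context and cite chunk ids.
    resp = oai.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system",
             "content": "Answer only from the provided sources and cite them "
                        "as [id]. If the sources are insufficient, say so."},
            {"role": "user",
             "content": f"Sources:\n{context}\n\nQuestion: {question}"},
        ],
    )
    # Return both the answer and the raw passages for highlighting.
    return resp.choices[0].message.content, sources
```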
Same; a doctor’s judgment is supported by a system of accountability, which distributes the risk of error beyond the patient to the doctor/medical practice/insurer. In contrast, (at least as of today) user-facing AI deployments absolve themselves of responsibility with a ToS.
Who knows if that’ll stand up to legal scrutiny, but if I have to bet on something ITT it would be that legal repercussions of bad AI will look a lot like modern class action lawsuits. I look forward to my free year of “AI Error Monitoring by Equifax”.
I wonder if the result changes if you put a high-quality medical reference in context. Feels like there might be an opportunity for someone to try to cram as much medical knowledge as possible into 1M tokens and use the new Gemini model.
> So that "30% of individual statements are unsupported" number was actually a test of how well ChatGPT's GPT-4 browsing mode with Bing could provide citations when answering medical questions.
Man, I am so disappointed. This is not a good study. Come on.
A close friend of mine went down the ChatGPT medical rabbit hole last year. He was disagreeing with his doctors and getting deep into supplements and unproven medical theories.
ChatGPT was instrumental in convincing him that he was correct and his doctors were wrong. He would post his ChatGPT conversations as proof, but we could see that his prompts were becoming obvious leading questions.
He would phrase things like “Is it possible for {symptoms} to be caused by {condition} and could it be treated by {herbal product}?” Then ChatGPT would give him a wall of text saying that it’s possible, which he took as confirmation of being correct.
It was scary to see how much ChatGPT turned into a mirror for what he wanted to be told. He got very good at getting the answers he wanted. He could dismiss answers that disagreed as being hallucinations or being artifacts of an overly protective system. With enough repeat questioning and adjusting his prompts he could get it to say almost whatever he wanted to hear.
ChatGPT is rocket fuel for hypochondriacs. Makes WebMD look tame.
Not quite. People outside of tech can hear "AI" and think it must be right. And they've heard how great GPT is, so it must be right. There are no other opinions.
Distilled down: classic search is many results; GPT is one result.
Funny you mention this. There was a time (in my living memory) when people regarded whatever was in print as proof of correctness. Newspapers had value then.
It's even worse than that, imho: unscrupulous sellers will go into the keyword planner in Google Ads, find "under-served" niches, and create content to sell supplements.
To be fair, this is also illogical, at least in a literal sense. It’s the Appeal to Authority fallacy. The reputation of the speaker doesn’t necessarily confirm or deny an assertion they make.
Granted, most people tend to use some combination of both logic and heuristics in practice to determine their own understanding of the truth, perceived reputation being a fairly common one.
> To be fair, this is also illogical, at least in a literal sense
Mostly in the same sense as arguments from induction are "illogical". Appeal to authority is essentially an inductive argument, and those are just as logical. Of course, they deal in probabilities rather than certainty, unlike deduction, but that doesn't make them illogical.
One of the most interesting things for me over the past 18 months was seeing the difference between this behavior in GPT-3.5 and 4 (especially the early versions of it).
It jumped from being extremely sensitive to the context of leading questions to almost ornery in its stubbornness and disagreement. That was one of the big early indicators to me that significant differences were occurring between the two.
I really do wonder just how harmful the disconnect has been between people hearing accurate praise about GPT-4 powered agents and then interacting with GPT-3 level agents and assuming the same competencies.
GPT-3 was a very convincing text generator with lackluster reasoning capabilities, but 4 has significant reasoning capabilities and is less prone to parroting. But the majority of users are likely interacting with 3 because it's free.
It's been very hard to explain to lay audiences the vast difference between two similarly branded products.
While I agree the difference is vast, GPT-4 still goes down the garden path a lot. If you're critical you can usually spot its reasoning errors and challenge them, but it still has enough issues that, conversely, I think in some areas there's a risk it'll be better at convincing people of the wrong things.
Playing with the smaller free models, I’ve definitely found that they like to play along and mirror whatever you say. They’re great as a sounding board and active listener when you need to work through something, and they’re good for recalling facts that they know. They can be really useful when they glue together something you’re talking about and something relevant that they know. They’re terrible for anything that requires interoception or any kind of confidence estimation. They have no idea whether they’re recalling or confabulating.
It’s too bad that this type of data beyond the medical record is not tracked alongside it. Although many anecdotal associations are probably not correct for the broad population, there are likely nuances in a patient's history and lifestyle which do make some more basic interventions helpful. ChatGPT surely picks up on some trends, but it doesn't have enough data about the people for whom those trends are effective.
And of course its responses are highly subject to the prompts used.
I tried using your template for a few items and none gave misleading results. Here’s one example:
Prompt:
Can loss of vision be caused by a meniscus tear and treated with rose hips?
Response:
A meniscus tear is a knee injury affecting the cartilage that acts as a cushion between the thigh bone (femur) and shin bone (tibia). This type of injury typically affects mobility and stability of the knee, causing pain, swelling, and difficulty in moving the knee joint. There is no direct physiological pathway through which a meniscus tear could cause loss of vision. Vision problems are generally related to issues within the eye or neurological conditions affecting the optic nerves or brain, not injuries to the knee or other joints.
Regarding the treatment with rose hips for a meniscus tear, while rose hips are known for their high vitamin C content and anti-inflammatory properties, they are not a recognized treatment for repairing a meniscus tear. They may potentially contribute to overall joint health and reduce inflammation, which could be beneficial in managing symptoms associated with joint injuries or conditions like arthritis. However, the primary treatments for a meniscus tear usually involve rest, ice, compression, elevation (RICE protocol), physical therapy, and in some cases, surgery, depending on the severity of the tear.
If you are considering using rose hips or any other supplements for joint health or recovery from an injury, it's important to discuss this with your healthcare provider to ensure it's safe and appropriate for your specific situation. They can offer guidance on effective treatments for your condition and whether any supplements could support your recovery.
This seems like a false comparison because you are bringing up entirely different parts of the body. Most dangerous misunderstandings start from a grain of plausible truth. For example, many people will choose a category of food/ingredient that causes every ailment known to man, and then cut out that ingredient from their diet. They will subsequently claim that skipping that ingredient changes their life, when in all likelihood the simple act of paying attention to where their food comes from and what went into it changes their consumption and behaviors in significant ways that have nothing to do with the bogeyman ingredient. They will then find a study that pumped 100g of said ingredient into a 200g mouse and point out all the negative effects that had which they are now avoiding.
I chose something that was much different to see if ChatGPT would just take my cue and create a connection. The closer I get to something that is more plausible, the more likely there actually might be an obscure, but actually valid and useful, connection.
When ChatGPT 3.5 first became available I actually used it to diagnose an issue my father in law had that was missed by his surgeon and the ICU staff. Quite possibly saved his life. Finding these connections IMO is a plus and not a negative.
I guess ChatGPT weights by empirical probability in text and not by study size or study design. Indeed, it is coming that pharma companies or other actors will invest in poisoning LLMs to better get already questionable treatments onto the market, and will spend less money inviting medical doctors to holiday resorts (my father was a medical doctor, and that was a practice at least in the 80s, with 'scientific' conferences back then).
>A meniscus tear typically affects the knee and wouldn't directly cause loss of vision. However, there could be indirect factors such as reduced mobility or associated health issues that might impact eye health. Rose hips are not a typical treatment for meniscus tears or vision loss. It's important to consult with a healthcare professional for appropriate diagnosis and treatment options for both conditions.
No, regular web search will just give you links. You have to click and go read the text. Putting together a comprehensive page of (seemingly) cogent arguments will need GenAI.
People like this will go and click those pages, and seek out communities that will agree with their preconceived notions.
ChatGPT results are just one mild manifestation of it. If, and actually it's not an if, they look for an alternative, they will find unhinged forums with echo chambers far worse than ChatGPT. At least ChatGPT isn't actively trying to rip you off to sell those supplements, and it usually adds disclaimers.
You can probably find like-minded communities in even worse ways, communities which are likely much more unhinged than ChatGPT and will lead you down a much darker path.
There is likely enough existing information and discussion out there. But ultimately he must have found the idea somewhere that this supplement could help, to go to ChatGPT with it.
I think the big difference is that with web search there will be a couple of reputable sources at or near the top, like WebMD and the Mayo Clinic. I can search, click one of those, and be fairly sure it'll be accurate. There is no immediate way with ChatGPT for me to know whether it is reliable or in crank mode.
> He would phrase things like “Is it possible for {symptoms} to be caused by {condition} and could it be treated by {herbal product}?” Then ChatGPT would give him a wall of text saying that it’s possible, which he took as confirmation of being correct.
If you mine Pubmed, you sometimes find literature with opposite claims. So if a LLM is trained on Pubmed/PMC, it will repeat that kind of garbage.
You don't have to dig into "herbal products" it happens also in more conventional domains.
I once even found that kind of contradiction in articles where the "main author" was the same in both publications. It was about ALS and the "main author" was a prominent scientist from the USA who probably never wrote nor even read any of these articles.
One of the most basic prompts everyone should be using with questions like this is something like "provide detailed arguments both for and against the presented hypothesis."
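If you'd rather bake that in than remember to type it every time, a trivial wrapper does it. Just a sketch; the model name and wording are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def both_sides(hypothesis: str) -> str:
    """Force the model to argue for AND against, instead of mirroring the user."""
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system",
             "content": "You are a skeptical reviewer. Never simply agree."},
            {"role": "user",
             "content": f"Hypothesis: {hypothesis}\n\n"
                        "Provide detailed arguments both for and against this "
                        "hypothesis, then state what evidence would settle it."},
        ],
    )
    return resp.choices[0].message.content

# e.g. both_sides("My fatigue is caused by condition X and treatable with herb Y")
```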
This works when the person evaluating can distinguish between the subject matter being correct or not.
But the models can generate compelling devil's advocate positions, particularly when they confabulate supporting facts, which might appear convincing to non-experts.
ChatGPT is a continuator; of course it will mirror input. But I'm sure someone is training the adversarial persuasion machine that will fund the next internet with ads.
The friend was already consulting with doctors (plural).
It is quite common for people with ambiguous presentations to take a long time to get a definitive diagnosis. Not surprising that some of them will be receptive to Cranks and GenAIs along the way.
Some years ago I was on a neurology ward (for something straightforward) and my neighbour had been trying to get a diagnosis for his problems for about a year. He finally got one but it took a panel of senior medics in a top hospital in a big city to crack it.
Agreed that this is a challenging problem, but mostly because of the data used to train and the tokenization for language models. We’re working on this, building what we call a large medical model (LMM) which is trained on medical event histories from tens of millions of patients, not papers or internet text. Our tokens are medical codes, facilities, demographics, etc. If anyone is interested we have an API + we’re hiring. https://genhealth.ai
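Not affiliated, but to make "our tokens are medical codes, facilities, demographics" concrete for other readers: a toy encoding of a patient history could look something like the following. Every token name and field here is invented for illustration, not their actual scheme.

```python
"""Toy tokenizer for a coded patient history (all token names are invented)."""
from dataclasses import dataclass
from datetime import date

@dataclass
class Event:
    when: date
    code: str      # e.g. an ICD-10 or CPT code
    facility: str  # e.g. a facility-type identifier

def encode_history(age: int, sex: str, events: list[Event]) -> list[str]:
    """Demographics first, then time-ordered coded events with
    bucketed time-gap tokens between them."""
    tokens = [f"AGE_{age // 5 * 5}", f"SEX_{sex}"]
    prev = None
    for ev in sorted(events, key=lambda e: e.when):
        if prev is not None:
            gap_days = (ev.when - prev).days
            tokens.append(f"GAP_{min(gap_days // 30, 24)}M")  # months, capped
        tokens += [f"FAC_{ev.facility}", f"CODE_{ev.code}"]
        prev = ev.when
    return tokens

# encode_history(71, "M", [Event(date(2023, 1, 5), "I10", "OUTPT"),
#                          Event(date(2023, 6, 2), "E11.9", "OUTPT")])
```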
This is fascinating... but just a superficial nit about the website and what it puts across: As someone who spent half my life designing, illustrating and art directing for brands, the choice to use anime style art as the theme for section headers is the exact type of decision I look for when reviewing portfolios that tells me a potential hire might know how to use image editing tools or be proficient in technical fields, but is not actually a designer; i.e. can create an image they "like" but lacks the knowledge and creative ability to synthesize subtexts in design and art history in original ways that are visually compelling but also create the right tone for a client. The choice of generated anime art implies its use on the page as mere decoration, rather than as a language of communication carefully selected to give viewers the right impression of the company. It argues for why diffusion models guided by engineers will not replace professional art direction anytime soon.
No offense, this is a standard art crit I would give to any art student, or to anyone running a startup who had nonprofessional design.
Ha! We’re just trying to use the art to stand out from the rest of health tech, which is quite boring. I get that the target audience is not often into anime (I've watched maybe 10 anime anythings in my life); nevertheless it looks cooler than the “lobby people” on other websites, and I like to think, because of threads like this, that the selection of art has served its purpose.
FWIW - I think that's a great reason. Anything where the secondary, subconscious effects of the communication are considered and intentional is perfectly valid from my POV. It's part of my job to walk through these questions with clients who get attached early in brand development to certain themes or colors or typography, and they usually don't have an answer. But if it's like "we want to convey that we're too busy working on this amazing platform to worry about trying to look like every other digital biotech startup, most of whom use the same boring icon sets, so let's just throw out some fun subcultural art in the meantime," then your choice is well tuned to presenting the company culture you're trying to get across, and I withdraw my crit. In fact, I'd be overjoyed to hear that from any client.
To play devil's advocate, it could turn out that only 1 person needed to be persuaded by the art, some investor that ends up leading their next round, who happens to be an anime fan and now it was the right choice.
> rather than as a language of communication carefully selected to give viewers the right impression of the company
You could argue it gives the impression that the culture of the company is primarily technical, given that the technical and anime communities have a huge overlap.
Your first hypothesis is completely valid, and it occurred to me that this decision could have come from the top - but it's also a red flag for me when one or two individuals in a client's upper management or worse, investors, inject themselves into art decisions by overriding choices presented to them by trained and experienced professionals they've presumably hired to that end. Those companies tend to also override other suggestions presented by employees or contractors in other areas. Which implies a management style that isn't utilizing its hired talent to full effect.
As far as appealing to technical people who overlap with enjoying Anime, that would be a very specific decision that might be appropriate for a video game company or something, but even then it would have to be justified by having some actual connection to Anime culture. Otherwise it's sort of just appropriating a style... in the hope that some percentage of crossover exists between users, investors, and this subculture being heavily referenced. My theory is that no study was done on whether or to what extent that crossover existed here before choosing it as a theme, but if I'm wrong I'll eat my critique!
At what level of ridiculousness do you stop debating? It could turn out that they were blackmailed into making that choice, or maybe it was the best-looking color palette on their old Nokia.
At some point you have to be able to recognize that the emperor has no clothes, no matter what extenuating circumstances may have arisen.
> Unfortunately, very little evidence exists about the ability of LLMs to substantiate claims. In a new preprint study, we develop an approach to verify how well LLMs are able to cite medical references and whether these references actually support the claims generated by the models.
Is there a corresponding control group for how well an average doctor is able to cite medical references and whether these references actually support the claims generated by the doctors?
Give me a few minutes and I can pull up any number of medical studies or references to back up my claims.
I don’t have them memorized to the actual URL but I have kept up to date with the latest studies and summaries that pertain to my field and my patients.
A fair comparison (on a technical level) to GPT-4 RAG would be a doctor in a relevant field who also has internet access. I think this would be indeed interesting to compare to assess the resulting quality of care, so to speak!
(The other models being only partially able to source good references is unsurprising/"unfair" on a technical level, but that's not relevant for assessing their safety.)
They often give clinical recommendations and prescriptions, usually after a quick skim of the medical history and a rushed five minute conversation with the patient. It would be nice to know how many of these typical in-office recommendations wind up being actually backed by the current state of the research, whether a citation is given in the visit or not.
It’s more like this: there is an algorithm / flow chart. You follow it down and do the next step in the process.
You learn the reason why, and then you just follow whatever the latest update to the algorithm is. Doctors are then graded by administrators on how well they follow that flow chart. Have you done this or that screening, is BP well controlled, etc. Medicare sets many of the standards; you can look up the USPSTF guidelines for more of an idea of what your doctor should be doing. So yeah. Kinda like factory work. “Pt, male, 70, wakes up to urinate - does he have signs of infection > yes/no > proceed to no > next step, do this test > prescribe this med > re-evaluate in 2-4 weeks.”
Why would we use proven processes involving experts with over a decade of education like this when we could use an LLM thrown together by a software developer who may not have even taken a biology course?
Most doctors will be able to turn around, pick the corresponding textbook from their library and show you where they learned something. Or point you to an actual clinical case they had to handle. If they had to, somehow.
Doctors also go to school to be doctors for 12 extra years. It turns out that medicine is hard. Not really a good reason to turn to an LLM that will just confidently make things up.
Martin Makary’s study and the previous IOM one are based on faulty statistics. The number is extrapolated from a small population to a larger one.
I haven’t paid it any attention because of this problem. GIGO.
“ The researchers caution that most of medical errors aren’t due to inherently bad doctors, and that reporting these errors shouldn’t be addressed by punishment or legal action. Rather, they say, most errors represent systemic problems, including poorly coordinated care, fragmented insurance networks, the absence or underuse of safety nets, and other protocols, in addition to unwarranted variation in physician practice patterns that lack accountability.”
An LLM is not going to address any of that.
You are misinformed in implying that a significant majority of medical system errors are due to misdiagnosis.
Replace LLMs with smartphones or desktop calculators and it doesn’t really change that statement except the tense; and though not false, it isn’t a particularly profound observation. I am skeptical of how much lifting “impact” is doing here.
This has been my major concern with the currently available LLMs.
You can know what the input is, you can know the output, you may even be aware what it's been trained on, but none of the output is ever cited. Unless you are already familiar with the topic, you cannot confidently distinguish between fact and what sounds reasonable and is accepted as fact.
ChatGPT4 correctly diagnosed my neurological condition, an infection that many doctors had missed. While I was in the hospital I asked ChatGPT the same questions as the doctors and it was nearly identical to what they were telling me every time.
It also acted as a therapist and talked me down from several depressions while in the hospital, far better than any human therapist I’ve ever had. The fact that it’s an AI made me actually feel better than if the therapy was delivered by a real therapist, for some strange reason.
Meanwhile my doctors are incompetent at diagnosing most of my symptoms, so I'll take ChatGPT over nothing until the medical system can get its shit together.
The safest way to use LLMs right now is for simple entity extraction of signs, symptoms, and investigation summaries, then translating them into the inputs for an understandable linear/logistic model, e.g. the Wells criteria, Canadian CT Head Rule, or Centor score. I feel that a comprehensive but explainable model that supports multiple diagnoses will be developed in the future, but no such model currently exists.
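A minimal sketch of that split, where the LLM only maps free text onto named booleans and the scoring stays deterministic and auditable. The extraction prompt and model are assumptions on my part; the items and weights are the standard published Wells criteria for PE:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

WELLS_PE = {                      # standard Wells criteria for pulmonary embolism
    "clinical_signs_of_dvt": 3.0,
    "pe_most_likely_diagnosis": 3.0,
    "heart_rate_over_100": 1.5,
    "immobilization_or_recent_surgery": 1.5,
    "previous_dvt_or_pe": 1.5,
    "hemoptysis": 1.0,
    "active_malignancy": 1.0,
}

def extract_features(note: str) -> dict[str, bool]:
    """The LLM is used only to map free text to the named booleans (JSON out)."""
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        response_format={"type": "json_object"},
        messages=[{"role": "user",
                   "content": "From this clinical note, return a JSON object with "
                              f"boolean fields {list(WELLS_PE)}.\n\nNote:\n{note}"}],
    )
    return json.loads(resp.choices[0].message.content)

def wells_pe_score(features: dict[str, bool]) -> tuple[float, str]:
    """Deterministic, auditable scoring step - no LLM involved."""
    score = sum(w for k, w in WELLS_PE.items() if features.get(k))
    return score, ("PE likely" if score > 4 else "PE unlikely")
```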
If anyone is interested - my startup which did exactly this was acquired 8 months ago. As many mentioned - the sauce is in the RAG implementation and the curation of the base documents. As far as I can tell, the software is now covering ~11m lives or so and going strong - the company already got the acquisition price back and some more. I was even asked to come support an initiative to move from RAG to long context + multi-agents.
I know it works very well. There is lots of literature from the medical community where they don't consult any actual AI engineers. There is also lots of literature from the tech community where no clinicians are to be seen. Take both with a massive grain of salt.
The main thing is to curate a good set of documents to start with. Garbage in (like the Bing search results this study relied on) --> garbage out.
From the technical side, the largest mistake people make is abstracting the process with LangChain and the like, rather than hyper-optimizing every step with trial and error.
> When the next token is a URL, and the URL does not match the preceding anchor text.
> Additional layers of these 'LLMs' could read the responses and determine whether their premises are valid and their logic is sound as necessary to support the presented conclusion(s), and then just suggest a different citation URL for the preceding text
I am attempting to come to grips with this problem actively. What are people's thoughts about using an LLM as a tool for linking out information?
Traditional search is not proving to be enough in connecting patients and providers with the absolute wealth of information on grants, best practices, etc. There is simply too much content in too many places. I dream of something like a "Cancer Bot 9000" that would be able to connect to resources pulled from a RAG - not necessarily answering the questions directly, but interpreting the questions and connecting the person with the most likely resources. Bonus points for additional languages or accessibility, which I constantly see as a barrier.
For every one medical case solved by web search/ChatGPT/AI, there will be hundreds of false positives.
I remember shadowing in the ER in the early 10s and many people would come in with Google search trash. One kid even cited something from the psychologist handbook. Lol
Every medical issue I've ever had I managed to correctly diagnose by myself using Google (haven't had many issues since ChatGPT) and later confirmed by a doctor. I attribute this to my years of experience using Google to troubleshoot and debug software.
As a fun side project, we fine-tuned Mistral 7B and used RAG with 35M medical papers.
It gives you actually really accurate answers: https://chat.scribemd.ai
Doing a similar project has been quite an eye-opener.
At first glance the answers look very good, but we have noticed these things.
In a number of cases the documents returned were wrong, but the LLM was able to use the contents of the documents to find the right answer within the model. In our case this is a fail, as it had to cite valid documents.
Not as frequent, but also noticed: the sequence of returned chunks from the vector database had an impact on the answer. In one case the question had one word in the past tense in the first question and in the present tense in the second question; otherwise they were the same. This swapped the order of the chunks and gave an opposite answer.
There is no easy way to see when these pop up. The current testing frameworks are limited in validating these.
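Once you've seen these failure modes they are at least cheap to pin down with plain regression tests, even if no framework does it for you. A rough pytest-style sketch, where rag_answer() and the tiny corpus are hypothetical stand-ins for the real pipeline:

```python
"""Regression checks for the two failure modes described above.
rag_answer() is a hypothetical stand-in returning (answer_text, cited_chunk_ids)."""
import pytest

CORPUS = {"doc_42": "Drug A raises the risk of bleeding when combined with drug B."}

def rag_answer(question: str):
    # Hypothetical stand-in; replace with the real pipeline.
    return "raises the risk", ["doc_42"]

def test_answer_is_grounded_in_cited_chunks():
    # Fail if the model cites documents that don't actually contain the claim.
    _, cited = rag_answer("Does drug A interact with drug B?")
    assert any("bleeding" in CORPUS[c] for c in cited)

@pytest.mark.parametrize("q", [
    "Does drug A interact with drug B?",   # present tense
    "Did drug A interact with drug B?",    # past tense: must not flip the answer
])
def test_tense_does_not_flip_answer(q):
    answer, _ = rag_answer(q)
    assert "raises the risk" in answer
```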
I don't want to single you out, but what should I be taking away from the often recited "well people sometimes* can't do thing X either"-counterargument? Is all of this just fine? Can't we expect just a little bit more, I don't know, accuracy or rigor from a computer than a living person?
We've already passed the point where LLMs are better than human experts for medical diagnoses. In fact, according to this study, even LLMs alone are more accurate than human experts + LLMs, meaning any input the humans added was only a detriment to the accuracy.
Computers are already perfectly accurate, and have been for decades in explicit quantifiable fields. In medicine, since a computer cannot perfectly replicate every single cell in the human body, its abstractions will be lower resolution than reality, but what matters is whether that low resolution abstraction is better than the alternative (human doctors).
A human doctor couldn't bring up a list of citations in literature instantly regarding a diagnosis. A LLM can.
Even if that paper hadn't said "We are therefore very cautious to extrapolate our findings toward any implications about the LLM’s utility as a standalone diagnostic tool", your post would be an extraordinary claim and need extraordinary evidence, not a specific study of a specific scenario.
Lots of data is pointing to the same conclusion: GPT-4 is at least as good if not better than human experts in medical diagnosis, at least in the areas studied. Thus the probability of a correct diagnosis is higher, thus safer, with GPT-4 than with any individual human expert.
This is so silly, the one study you linked to says that GPT 4 may have been trained on the answers to the test they gave it. So smart.
And since GPT-4 can't examine a patient's body, the claim that it's better at diagnosis than a human doctor seems like such a wacky thing to search the internet for "studies" to prove in the first place.
A nurse can examine a patient's body. Medical tools can as well, and they report their diagnostics with high precision. GPT-4 is multi-modal.
I feel you are nitpicking because you don't like the idea of an LLM being better than a human expert. Even if they weren't better than doctors nowadays, the chance they won't be in 1-2 years is tiny.
What I'm doing isn't nitpicking. I don't know the point of linking to studies is when you draw conclusions that have nothing to do with the study.
I just watched a video saying people are confused about what these models can do because
1) tech companies don't tend to say what they can do and leave users to figure it out.
And
2) Tech enthusiasts tend to exaggerate what they can do.
In your case I'm sure ChatGPT itself will tell you your comments are wrong, but for tech enthusiasts like yourself the AI is only wrong when it tells you it isn't all-knowing, apparently.
It's like the bit in Monty Python's Life of Brian where the protagonist says he's not the messiah and a woman shouts "Only the true messiah would deny his divinity!"
> A human doctor couldn't bring up a list of citations in literature instantly regarding a diagnosis. A LLM can.
TFA is, literally, about LLMs spouting out erroneous medical references. I don't care about made up medical references or court cases.
I'm sure there are ways to instantly bring up a list of publications regarding a diagnosis (which an LLM may or may not, in the future, correctly give - the diagnosis, I mean), but I'm really not sure an LLM is what's needed to then generate the list of related publications. I mean, FFS, they are compressed, lossy knowledge.
LLMs are going to become tools as part of a toolchain. They're not a panacea.
They could, but they don't, at least that is what I'm getting from the article.
Even if they do, someone with the capability and understanding required (i.e. not me) needs to bring that source up and verify that the claims align with the citation; the authors decided to use GPT-4 for this: "We adapted GPT-4 to verify whether sources substantiate statements and found the approach to be surprisingly reliable." I'm not happy with that either.
Looking on the bright side: will GenAI be what convinces the general public to finally get into our habit of checking if references actually support claims?
> even in the GPT-4 RAG model, we find that up to 30% of statements made are not supported by any sources provided
I wonder how much of that is due to specifics of the paper's authors' RAG implementation -- and how much is due to the guard rails put into GPT by OpenAI (not unlikely, given how medical advice would be a hot button topic for lawyers).
> This problem goes away with the retrieval augmented generation (RAG) model, which first performs a web search for relevant sources before producing a summary of its findings. However, even in the GPT-4 RAG model, we find that up to 30% of statements made are not supported by any sources provided, with nearly half of responses containing at least one unsupported statement.
Could you do something like this? Chunk/separate the individual claims made by the LLM and the associated sources from RAG. Then feed them into a second LLM one by one, asking "Is claim $X_i reflective of source $Y_i?" Whatever this second LLM says, feed the response back to the original LLM and ask it to revise what it's saying if the answer is "No". Iterate until the second LLM says "Yes" to every separate claim. Not perfect, but it might help.
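Roughly, yes. A sketch of that loop; the prompts, model name, and the naive sentence-level claim split are all assumptions on my part:

```python
"""Verify-and-revise loop: a second pass checks each claim against the sources
and unsupported claims are sent back for revision."""
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set
MODEL = "gpt-4-turbo"

def ask(prompt: str) -> str:
    r = client.chat.completions.create(model=MODEL,
                                       messages=[{"role": "user", "content": prompt}])
    return r.choices[0].message.content.strip()

def verify_and_revise(question: str, sources: list[str], max_rounds: int = 3) -> str:
    src_block = "\n\n".join(f"[{i}] {s}" for i, s in enumerate(sources))
    answer = ask(f"Sources:\n{src_block}\n\nUsing only the sources and citing them "
                 f"as [i], answer: {question}")
    for _ in range(max_rounds):
        # Naive claim split; a real system would segment statements more carefully.
        claims = [c.strip() for c in answer.split(".") if c.strip()]
        unsupported = []
        for claim in claims:
            verdict = ask(f"Sources:\n{src_block}\n\nIs this claim supported by the "
                          f"sources? Answer only Yes or No.\nClaim: {claim}")
            if not verdict.lower().startswith("yes"):
                unsupported.append(claim)
        if not unsupported:
            return answer                 # the checker accepted every claim
        bullets = "\n- ".join(unsupported)
        answer = ask(f"Sources:\n{src_block}\n\nRevise the answer below so these "
                     f"claims are either supported by the sources or removed:\n"
                     f"- {bullets}\n\nAnswer:\n{answer}")
    return answer
```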
Google Gemini actually has a feature like this, though it's hard to spot. There's a colorful G icon below each Gemini response - if you click it, a second process runs which attempts to "fact check" the claims from the original prompt response - it highlights them in different colors and adds citation links to them.
LLMs are better at this and it’ll probably marginally improve the output quality but there can potentially be hallucinations (false positives or negatives) even in this evaluation task.
In similar vein: I've asked ChatGPT (3.5) electronics questions. It happily generates non-existent links to schematics images, and then refers to the schematics in its answer.
Some of my former co-workers founded a start-up a few years ago (before the ChatGPT age) focussing on AI tools for the medical sector. By far their biggest challenge wasn't the tech but the hoops they had to jump through to get even their smallest tools certified so that they could be used by doctors at all. Unbelievable amounts of red tape and time investment on the side of that startup.
I find it disgusting to hear that ChatGPT and the likes are now regularly used by medical professionals where, obviously, these LLM tools have never gone through the same process. For two reasons - one because of the obvious medical risks involved (hallucinating LLMs) but also because of the gross commercial advantage the companies behind these LLMs get without even as much as trying to get certified.
I've lost touch with my former co-workers but I would be surprised if they hadn't looked into LLMs to be included in their products. My guess is that they're still somewhere in the administrative hell of trying to get their latest tools certified while meanwhile, doctors are happily using ChatGPT and GenAI. Must be frustrating.
It quickly became apparent that LLMs are "expert assistants" (as in, assistants for experts).
I use them at work to take shortcuts - eg, I won't look up the syntax for some function call, I'll ask GPT instead. I'm able to evaluate whether it's correct or not (and often some small thing will be off) but it's still a quicker way to get a good result than manually synthesizing it from two or three different references.
I imagine it's similar with doctors -- an expert can get a quick 80% answer, sound it against their judgement, and refine.
If doctors consult chatgpt (unregulated, uncertified) that's on them, not the platform. I hope your old team keep working on high quality stuff but I can understand if they walk away.
I have seen some chatbot type products aimed at the admin side of healthcare, with the salesmen telling me about the 'guardrails'. Given some guy recently got the bot at his utility supplier to start criticising the company and using foul language then I do not think even this less critical work is ready for general use.
Well, I think it's worse than you may think: the startup is not even allowed to market their products as medical tools unless they're certified. ChatGPT is used today by doctors without being certificied. How is that fair?
I don't think it is fair and I do know the difference between a regulated product and an ad hoc (mis)use of ChatGPT. I just do not think you can blame OpenAI for doctors (or other professionals) ignoring the warning they are presented with.
On the other hand I do think professional bodies should be inclined to strike off members who do this kind of thing.
Doctors use Google, Google was marketed as a search engine and not a medical tool. Seems like if ChatGPT is advertised as a language model and not a medical model, it's completely fair. Irresponsible for doctors to rely on, maybe, but I don't see why a non-medical tool needs to be medically certified just because doctors choose to use it
Then why do we need certification at all? Everyone just brand your tool as non-medical, no matter whether it is intended for medical use or not, and Bob's your uncle!
Because doctors are liable for improperly using non-medical means to diagnose and cure illnesses, even if they might theoretically help. If I develop a new design of pliers for my skateboard that end up working really well for performing throat surgery, you're free to go and make a similar product and get that certified. I don't see why mine needs to be certified. I didn't make it for medical use. GPT explicitly tells you not to use it for diagnosis purposes, how many doctors need to use something before it's forced to get certified despite its intended use?
ChatGPT4 has overreacted to this issue IMHO. When I try to do even slightly esoteric medical research, ChatGPT seems pretty intent on only referencing the Cleveland and Mayo Clinics, the most mainstream of orthodox mainstream medical sources. Trying to get it to reference even peer-reviewed medical journals requires a frustrating amount of cajoling - it seems extremely reluctant to deviate from anything that isn't 100% mainstream medical orthodoxy.
This is not a good solution in the long run - ChatGPT will just reinforce existing dogmas and orthodoxies, even the ones that are (inevitably) wrong. Imagine if this approach to medical science was widespread at an earlier point in our history - we'd all probably believe that peptic ulcers are caused by 'stress' (rather than, primarily, the bacteria Helicobacter pylori). Go back even further and we'd still be lobotomising gay men to 'change their sexual orientation'. Rigidly enforcing current orthodoxies, under the premise that we're right about everything unlike those idiots in the past, will kill progress and society will stagnate.
If I wanted a blindingly arrogant tech mega-corporation to decide what 'experts' I'm allowed to get information from, I'd just use Google instead. If, as many seem to believe here, OpenAI are just worried about being sued, then why don't they just create an individual 'safe GPT output' setting (like Google 'safe search') which I can disable after acknowledging disclaimers that it's dangerous to think for myself and question mainstream positions.
I've grown to hate authoritarian Silicon Valley twats who arrogantly impose their politics, ignorance and, frankly, bizarre norms on the rest of us. It's highly ironic that these 'I love science!' types don't appear to understand that the scientific process involves making empirical observations, forming hypotheses consistent with those observations, and then continuously testing those hypotheses to determine which one is most robust to observed reality. They instead seem to think science is some kind of religion where you treat the views of the mainstream authorities as divine truth revealed by God and only heretics dare to question. By discouraging the formation of alternative hypotheses and rigorous questioning they are actually inhibiting scientific progress and making a mockery of the scientific method.
I look forward to the day we have a model that just synthesises available information and lets us decide for ourselves what to make of it. I think people will switch to such a model in droves, and the likes of OpenAI and Google will go the way of all other social conformists who attempt to enforce the reality-denying orthodoxies of their day.
I can't know for sure obviously. But let's think about the plausibility of those three: lawsuits, bad PR, 'harm'.
On lawsuits, I would have thought a disclaimer & 'unsafe output' option would cover them. When you think about it, they're probably more exposed to legal liability by essentially taking on the responsibility of 'curating' (i.e. censoring) ChatGPT output rather than just putting a bunch of disclaimers around it, opt-ins etc. and then washing their hands of it.
On negative PR, again, they've actually set themselves up for guaranteed bad PR when something objectionable slips through their censorship net: "Well you censored X, but didn't censor Y. OpenAI is in favour of Y!" They've put themselves on the never-ending bad PR -> censorship treadmill presumably because that's where they want to be. Again, if they wanted to minimise their exposure they would just put up disclaimers and use the 'safe search' approach that Google uses to avoid hysterical news articles about how Google searches sometime return porn (to which they can now answer: "well why did you disable safe search if you didn't want to see porn?"). It would seem far safer (and result in a more valuable product) if the folks at OpenAI let individuals decide what level of censorship they want for themselves. But I presume they don't want to let individuals decide for themselves, because they know what's good for us better than we do, apparently.
Lastly, 'harm'. How do you define harm? Who gets to define it? Can true information be 'harmful'? I don't think OpenAI have any moral or legal duty to be my nanny, in the same way I don't think car manufacturers are culpable for my dangerous driving that gets me killed. All OpenAI provide to me, at the end of the day, are words on a computer screen. Those cannot be harmful in and of themselves. If people are particularly sensitive to certain words on a computer screen, then again we already have a solution for that - let them set their individual censorship level to maximum strength (or even make that the default). Again, OpenAI would have done their duty and provided a more valuable product that more people would want to use if they let individuals decide for themselves.
I can only infer that they don't want us to decide for ourselves. Rather, they want to enforce a certain view of the world on the rest of us, a view which just happens to coincide with the prevailing political and intellectual orthodoxies of Silicon Valley dwelling tech-corporation millennials. It's hilariously Orwellian when these people claim that they're just "trying to combat bias in AI" when what they are really doing is literally and deliberately injecting their own biases into said AI.
>If people are particularly sensitive to certain words on a computer screen, then again we already have a solution for that - let them set their individual censorship level to maximum strength (or even make that the default).
How do you know that's even possible? God knows how much computing resource was spent just to train the one currently deployed "variant". Now I don't know if there is some cheap post-processing trick that does it, but either way it does not at all seem trivial.
And the problem isn't that "you" think you won't cause any harm. Even if that is assumed true, that's not a guarantee that everyone else is as disciplined about it. Which brings me to the biggest point: what even is "truth" in the first place? People strongly believe in total fabrications, or multiple groups give diametrically opposite accounts of the same real event due to religion, nationalism, politics, etc. It's a massive achievement that they even manage to output something that doesn't just "violently offend" people all over the world. And remember, retraining/fitting it to everyone does not seem to me to be a trivial task, if your reply to that point is that the answer is to simply personalize it to each user.
They could use control vectors, one for each individual - https://news.ycombinator.com/item?id=39414532 . Or they could selectively apply the censorship model they already quite clearly have running on ChatGPT's output.
Yes, people sometimes believe false things. And people sometimes harm themselves or others when acting on this kind of information. So what's the solution? Put a single mega corporation in charge of censoring everything according to completely opaque criteria? People get nervous when even democratically elected governments start doing stuff like that, and at least they actually have some say in that process.
Frankly, I'd prefer the harm that would follow from unfettered communication of information and ideas over totalitarian control by an unaccountable corporation.
I’m going to make a career out of tricking chatbots into giving wrong advice and suing the companies using them.
Like what happened with the Canadian airline but on an operationalized scale. Imagine a small call center dedicated to destroying companies who get rid of their human customer support.