AI models that predict disease are not as accurate as reports might suggest (scientificamerican.com)
231 points by arkj on Oct 21, 2022 | 155 comments



I worked on healthcare ML solutions, as part of my PhD & also as a consultant to a telemedicine company.

My experience dealing with data (we had sufficient, and somewhat well-labeled, data) & methods made me realize that a lot of the predictions human doctors make are multimodal - and that is something deep learning will struggle with for the time being. For example, in detecting a disease X, physicians factor in blood work, family history, imaging, racial genealogy, general symptoms (like hoarseness, gait, sweating etc.), even the texture & feel on palpation of affected regions, before narrowing down on a set of assessments & making diagnostic decisions.

If we just add more dimensions of data to the model, it just makes the search space sparser, not easier. Throwing in more data will likely just fit the more common patterns & classes well, whereas a large number of symptoms may be treated as outliers and mispredicted.
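
(A minimal sketch of that sparsity point, pure numpy with synthetic data: as you add measured dimensions, pairwise distances concentrate and the "nearest" record stops being meaningfully nearer than the farthest one.)

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500                                   # hypothetical number of patients
    for d in (2, 10, 100, 1000):              # number of measured features
        X = rng.normal(size=(n, d))
        dist = np.linalg.norm(X[0] - X[1:], axis=1)
        # relative contrast: how much nearer is the nearest neighbour than the farthest?
        print(d, round((dist.max() - dist.min()) / dist.min(), 2))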

We humans are incredibly good at elimination of factors & differential diagnosis. The findings don't surprise me. There is much more ground to cover. For straightforward conditions with limited, clear-cut symptoms, models are showing promising advancements, but they cannot be trusted with a wide array of diagnoses - especially when models don't know what 'they do not know'.


Are you really sure the doctors are doing a better job when they go through the motions of incorporating a wide range of data? Or do we just convince ourselves they're better?

I suspect we massively underestimate the amount of misdiagnosis due to incorrect analysis of data using fairly naive medical mental models of disease.


> Are you really sure the doctors are doing a better job when they go through the motions of incorporating a wide range of data? Or do we just convince ourselves they're better?

Personal story: I was diagnosed with a rare genetic disease in 2019. If I ran the symptoms through an ML gauntlet, I am sure they would cancel each other out or make little sense: chest CT (clean), fever (high), TB test (negative), latent TB marker (positive), vision difficulty (nothing unusual yet), edema in the eye socket (yes), WBC count (normal), tumors (none), hormones (normal) & retina images (severely abnormal).

My condition was zeroed in on within 5 minutes of a visit to a top retina specialist, after regular ophthalmologists were in a fix about two conflicting conditions. This was based on differential diagnosis even though the genetic assay hadn't returned yet; the assay later came back in agreement. I cannot overemphasize how good the human brain is at recalling information & connecting sparse dots into logical conclusions.

(I am one of the 0.003% unlucky ones among all ophthalmological cases & the only active patient with that affliction in one of the busiest hospitals in the country. My data is part of a 36-person NIH study & ophthalmology residents are routinely called in to see me as a case study when I go for quarterly follow-ups.)


How many specialists did you go to before it was identified?

How many other people with the condition were misidentified?

I only say this because of a family member with a rare genetic condition. For years they were told it was something else, or told 'it was in their head'. The family member started a detailed journal of their medical conditions and experiences, then brought that to their PC, who sent them to a specialist; this specialist wasn't sure and sent them to another specialist with a 3-month wait. After 5+ years of living with increasing severity of the condition, it was identified.

So, just saying, it's just as likely that the condition was identified because you kept a detailed list (on paper or in your mind) of the ailments and presented them in a manner that helped with the final diagnosis.


> How many specialists did you go to before it was identified?

2 ophthalmologists, 1 internal medicine doctor, 1 retina super-specialist & finally someone from USC Davey

> How many other people with the condition were misidentified?

Historical data: I don't know. It is fairly evenly divided between two types, one being zoonotic & the other tied to the IL2 gene. I am told this distinction of pathways was identified in 2007.

> [..] you kept a detailed list (on paper or in your mind) of the ailments and presented them in a manner that helped with the final diagnosis.

I might have been a better-informed patient, but I went in with a complaint of pink eye, flu & mild light sensitivity. Never imagined that visit would change my life forever. Thank you though, for expressing your concern & support.


This escalation through a range of increasingly specialized experts is a critical part of a difficult diagnosis. The process ensures different, unique perspectives.


That all sounds shitty but I don't see how that's valuable information. They did eventually solve the problem and there's no comparison to some ML success story.

Human minds can be really good at diagnostics and still fail sometimes when faced with very difficult cases.

In my experience, ML would just classify everything as a very common disease and people would call it a success because it has an 80% effectiveness rate.

The problem that needs to be solved is a case like your example, not diagnosing the common cold.


>They did eventually solve the problem

One of the points to take away is that, whether via ML or human diagnosis, these rare problems are either not diagnosed for long periods of time, reducing quality of life, or diagnosed posthumously.

The reduction of these measures is what we should use when making meat-vs-machine efficiency comparisons.


Maybe I'm reading it wrong, but doesn't your story confirm that doctors aren't good at diagnosis? It took a cream-of-the-crop, top-tier specialist to correctly diagnose your condition.


What is the name of this disease?


Modern medicine already incorporates wide ranges of data. Doctors use flowcharts, scales, point systems, etc, to diagnose certain conditions because those tools have been developed by studying and considering a lot of cases.

However, there's a lot that isn't covered with data. The "middle of the scale", the "almost but not quite there", the "this is weird"... Doctors are good at that, through experience, and those are the difficult cases. Those are the ones where ML will not only likely fail, but won't even explain why it fails. We're talking about human lives here. If anything, I think software engineers massively overestimate the performance of ML and underestimate doctors.


Yes, notwithstanding those factors you described, it is not uncommon for tests to reveal false-negative or false-positive results due to their intrinsic specificity and sensitivity. A normal value is not always indicative of health, either.


My view on this is framed a bit differently but probably a similar ultimate perspective:

I think it's probably going to be a long time before models only using quantifiable measurements can even meet the performance of top doctors. I can't recommend enough that someone experiencing issues doctor-shop if they haven't gotten a well-explained diagnosis from their current doctor.

But I'm very curious how good one has to be in order to be better than a below-average doctor, or a 50th-percentile doctor, or a 75th...

But I also think there may be weird failure modes similar to today's not-fully-self-driving cars along the lines of "if even the 75th-percentile-doctor uses the tool and sees an output that stops them from asking a question they otherwise might have, can it hurt things too?"


> But I'm very curious how good one has to be in order to be better than a below-average doctor, or a 50th-percentile doctor, or a 75th.

In dermatology, which I was working on, models were better at detecting skin cancers than 52% of GPs, going by images alone. In a famous Nature paper by Esteva et al., the TPR was 74% for detecting melanomas. There is a catch which probably got underreported: the skin cancer positivity rate was strongly correlated with clinical markings in the photos, and the models didn't do quite as well when 'clean' holdout data were used.

But the nature of the information in all these models was skin deep (pun intended). They were designed with a calibrated objective in place, unlike the open-ended way clinical diagnostics is approached by doctors.
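
A toy illustration of that catch (synthetic data, not the Esteva et al. setup): if a confounder like a clinician's pen marking correlates with the label, the model leans on it, and the reported held-out accuracy collapses once the images are 'clean'.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 2000
    y = rng.integers(0, 2, n)                   # 1 = malignant lesion
    signal = y + rng.normal(scale=2.0, size=n)  # weak genuine image signal
    marking = y * (rng.random(n) < 0.9)         # pen marking, present on most positives only
    X = np.column_stack([signal, marking])

    clf = LogisticRegression().fit(X[:1000], y[:1000])
    print("held out, markings present:", clf.score(X[1000:], y[1000:]))

    X_clean = X[1000:].copy()
    X_clean[:, 1] = 0                           # 'clean' images: markings removed
    print("held out, markings removed:", clf.score(X_clean, y[1000:]))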


Isn't it a tad unfair to compare an ML model for dermatology, working only with pictures, against general practitioners? IMHO comparing said model against dermatologists would be a better approach. And a model just working from images is not necessarily a dermatological model, but rather an image analysis model.


> We test its performance against 21 board-certified dermatologists

From Andre Esteva's paper:

https://www.nature.com/articles/nature21056


Interesting, could you explain more about the clinical markings? Was this mentioned in the paper itself or was it later commentary?


I remember a similar New Yorker article ~5 years ago about medical imaging ML/AI where they realized its good hit rate was actually a training data artifact. Something along the lines of: essentially all the positives had secondary scans, and so there was a set of known positives which had, say, run through the same machine/lab and had similar label color & text markings in the margin of the imaging.

When they went back and tested with clean images that didn't basically say "I'm a positive because I have this label over here in the margins", the hit rate dropped below that of humans.

The article also had anecdotes about some of the hospice cats that seemingly are able to detect when a patient is about to die. Entirely possible, as they have a keen sense of smell and patients' tumors likely give off detectable odors.

Nonetheless, the ML model & the cat were similarly inscrutable.


Nice, a model identifying the cases a doctor had enough doubts about to have a second test run.


Statistical mortality models are already more accurate than physicians for an average patient.


Not this ignorant comment again. AI will replace software engineers long before it replaces doctors. There is an arrogant ignorance of what doctors do that always shows up in comments when topics like this pop up.

And yes, I'm a physician and MLE, so I understand both worlds clearly.


Well look at that! If it isn’t another member of the medical mafia on HN.

It is kind of funny to see comments complain about the lack of perfect sensitivity and specificity of their physicians.

We complain about the same thing with the various ML techniques in radiology, which currently are pitiful and a gigantic waste of time and money. When I went into rads I was pretty worried about ML - not anymore.

I’m hoping this upcoming recession will dry out a lot of institutional use of ML. In radiology it’s not that helpful and there’s no technical fee increase for it. But you can advertise with it, I guess? Commercials with lasers, robots, and AI with pleasant voiceovers about cutting-edge techniques and getting the care you need in the 21st century and blah blah blah.


But we all saw House, Grey's Anatomy and Emergency Room. So we know exactly what doctors do!


> I suspect we massively underestimate the amount of misdiagnosis due to incorrect analysis of data using fairly naive medical mental models of disease.

I suspect software engineers massively underestimate the value of skills outside their domain.


The good ones don't; they also realize the value of working with experts from the field they develop software for and with. And yes, that includes business people. The bad ones think they can replace decades worth of experience, training and education in a certain field by deducing the necessary insight with first-principles thinking.

The same applies not just for software devs, but for every other domain as well.


I wouldn't call it massively underestimate -- if I recall the research of Meehl et al correctly, clinicians, on most types of cases considered independently, underperform simple arithmetical models of 2-5 variables by something on the order of 10%. So not a huge effect, but also humans aren't as good as they think. (They do get lucky though! Sometimes some people get very lucky and accidentally get a long string of cases right.)


I’m confused by your comment because these are exactly the type of problems that humans generally really do a poor job classifying.


Most modern ML techniques do a poor job on these types of problems too unless they have a lot of data (hence the reference to sparsity) or assume structure that requires domain specific modeling to capture.


It could be that after we train a biological neural net for decades it can get pretty good at intuiting things even if it can't explain how.

The neural net in question is the doctor's brain.


> We humans are incredibly good at elimination of factors & differential diagnosis.

I don't automatically buy this.

Didn't heart attack care in the ER get dramatically better when people started following checklists? That suggests that human doctors aren't that great at even getting the basics correct.

In addition, most doctors are below average. So, maybe the best doctors are better than the AI. However, I may not have access to that doctor and the AI may be better than all the doctors I have access to.


Using checklists means standardization, and that makes results comparable and reduces the risk of forgetting something under stress. Checklists have nothing to do with ML or AI though.


It wasn't just "forgetting". Every doctor had their own take on diagnosis and the checklist was actually better than a lot of them since the checklist was constructed from data.


It's interesting you say this. I read a book several years ago, I don't remember what it was. But it talked about how there used to be a lot of questions that were used to determine if someone was having a heart attack.

They then did a lot of stats and number crunching and determined that with just 2 questions they could accurately predict whether or not someone was having a heart attack 95% of the time which was a considerable improvement.

I wonder how many problems we are making worse by throwing more data at them, rather than making them better?
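
(Hedged sketch of how such a short rule can fall out of the data. The features here are invented, not the actual questions from whatever book that was; the point is just that a depth-2 decision tree is literally a "two question" checklist learned from data.)

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    rng = np.random.default_rng(0)
    n = 5000
    # made-up binary answers to triage questions
    st_elevation = rng.integers(0, 2, n)
    chest_pain = rng.integers(0, 2, n)
    age_over_60 = rng.integers(0, 2, n)
    p = 0.02 + 0.6 * st_elevation + 0.25 * chest_pain   # invented risk model
    y = rng.random(n) < p

    X = np.column_stack([st_elevation, chest_pain, age_over_60])
    tree = DecisionTreeClassifier(max_depth=2).fit(X, y)   # "just 2 questions"
    print(export_text(tree, feature_names=["st_elevation", "chest_pain", "age_over_60"]))
    print("training accuracy:", round(tree.score(X, y), 2))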


For ML to really make a dent in medicine, the whole system needs to be altered, in a similar vein to how building roads tailored to self driving cars would make them much more successful. Most medical diagnoses are only made when severe physiological derangement has already occurred. If we had access to longitudinal streams of data, then ML would be essential to detecting anomalies which point to early evidence of disease. For example, streaming in regular noninvasive measurements on breath and urine; wearable readouts on heart rate, oxygen saturation, blood pressure, temperature, movement; neurocognitive streams based on analysis of email, video calls, text messages etc. But this is a fundamental change on many levels, with many barriers. It will also probably fail to improve the health of people who need it the most. I’m not sure it’s even that appealing as an alternative to the current meatspace system.


> We humans are incredibly good at elimination of factors & differential diagnosis. The findings don't surprise me. There is much more work needing to be covered. For straightforward, and conditions with limited, clear cut symptoms they are showing promising advancements, but it cannot be trusted to wide arrays of diagnosis - especially when models don't know what 'they do not know'.

If it were presented this way, while accurate and honest, it would in no way get the media hype and thus funding from both state actors and private investors looking to get to be a part of the 'winner take all' model.

As a person studying AI and ML at the undergrad level, is there any advice you have on how to avoid the pitfalls that this industry has become known for?


Thanks for sharing. My belief is that we need to figure out a way to make humans interact with prediction models in a virtuous way. Prediction models suck at "connecting the dots" or considering multiple sources of information (for example: multiple models predicting different outcomes). Until we get true general artificial intelligence, I think the way to go forward is to try to quantify those unknowns through confidence intervals (conformal prediction seems to be quite nice for many models) plus some multiple hypothesis testing to handle the multiple outcomes / multiple models.

This then needs to be implemented in a real workflow where humans and prediction models interact (for example: approve these things automatically, send these other cases for humans to review).
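
For the conformal prediction part, here is a minimal split conformal sketch (using a public scikit-learn dataset as a stand-in, not the poster's models): calibrate a score threshold on held-out data, then output prediction sets that cover the true label ~90% of the time.

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
    X_cal, X_te, y_cal, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)

    # nonconformity score = 1 - probability assigned to the true class
    cal_scores = 1 - model.predict_proba(X_cal)[np.arange(len(y_cal)), y_cal]
    alpha = 0.1                                   # target 90% coverage
    q = np.quantile(cal_scores, np.ceil((len(y_cal) + 1) * (1 - alpha)) / len(y_cal))

    # prediction sets: every class whose probability "conforms"
    sets = model.predict_proba(X_te) >= 1 - q
    coverage = sets[np.arange(len(y_te)), y_te].mean()
    print("empirical coverage:", round(coverage, 3), "avg set size:", round(sets.sum(axis=1).mean(), 2))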


Personally, knowing the hit rate of these ML models & their non-explanatory nature, weighed against their low cost, I'd argue they should be used as a default automated second opinion to the radiologist's read.

Recently went through a pet cancer death, so the medical imaging, diagnostic testing, specialist escalation and second opinion workflows are pretty fresh in my mind. There is a shortage of specialists, a backlog for appointments and many astonishingly bad practitioners out there.


In my experience (multiple knee MRIs, the liver, the intestine, the ankle...), radiologists by default are the second opinion to the specialist, e.g. the oncologist, who sent you to get the scan in the first place. I have never had a radiologist come up with a diagnosis by himself.


The system itself should be built around these capabilities, not the other way around. Instead of collecting data at regular intervals, we wait until symptoms appear to go to the doctor. This is why the dataset is so sparse.


Exactly this. The features (and limitations) of medical data are inherent in the process of clinical practice, but this seems to be oftentimes overlooked.


I recently published a paper where we explain how an FDA-approved prediction model, built into a widely used cardiac monitor, was developed with an incredibly biased method.

https://doi.org/10.1097/ALN.0000000000004320

Basically, the training and validation data were engineered so that an important range of one of the predictor variables was only present in one of the outcomes, making perfect prediction possible for these cases.
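
A stylized illustration of that kind of engineering (all numbers invented, with a MAP-like predictor as a stand-in): if one range of the predictor only ever co-occurs with one outcome in the curated data, a trivial threshold looks perfect, and the headline metric says nothing about the overlapping cases seen in practice.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)

    # curated data: positives sampled only below 65, negatives only above 75;
    # the ambiguous middle range is simply absent
    x_curated = np.concatenate([rng.uniform(40, 65, 500), rng.uniform(75, 110, 500)])
    y = np.concatenate([np.ones(500), np.zeros(500)])
    print("AUROC, curated data:", roc_auc_score(y, -x_curated))          # 1.0

    # more realistic data: the two outcomes overlap in the middle
    x_real = np.concatenate([rng.normal(62, 10, 500), rng.normal(78, 10, 500)])
    print("AUROC, overlapping data:", round(roc_auc_score(y, -x_real), 2))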

I summarize the paper in this Twitter thread: https://twitter.com/JohsEnevoldsen/status/156164115389992960...


Sorry for asking, but how is this relevant to the article?


Fair question. The model we comment on suffers both from the problem described in the article and from a more severe one:

The developers sampled obvious cases of hypotension and non-hypotension, and trained the model to distinguish those. They also validated it on data that was similarly dichotomous. In reality, the outcome is often between these two scenarios.

But worse, they also introduce a more severe problem where a range of an important predictor is only present in the hypotension outcome.


I quit research forever after I was ignored when I pointed out a similar problem in our predictive model.


I can only imagine the frustration. Just getting this through peer-review took half a year, but at least there was the academic currency of a publication to motivate me.


Thanks for explaining!


Sorry for asking, but how is it not?


Do you agree that it’s ok to pose a question whenever you don’t understand?


I’m not sure where you got this form of communication where you respond to everything with a question, and I assume you mean well, but it comes across as patronizing and de-humanizing to try to follow these “rules to winning arguments passively”, or whatever it is.

Indeed, the confusion here is (I think) because your first comment

> Sorry for asking, but how is this relevant to the article?

Sounds accusatory.

Please don’t respond to this with a question.


The basic idea of that kind of question is to find the minimal place of agreement. And then understand where one deviates.

Going back the path of arguments to common ground if you will. It works quite well in my experience if you’re interested in genuine discussion.

PS: how something “sounds” is really difficult to say in a written medium. It might say more about the reader than the writer.


> PS: how something “sounds” is really difficult to say in a written medium. It might say more about the reader than the writer.

No, it’s not difficult. And no, it’s not the reader. When multiple readers all agree about the same interpretation of the writer, it might have been unintentional on the part of the writer, but that doesn’t make it “difficult” or the reader's fault.


I think what you're trying to encourage is open-ended discussion? It's my opinion that this only tends to work IRL or in online media with more moderation, e.g. Wikipedia, Stack Overflow.

Random open-ended discussion can be good, but I bet it's wise to assume that most random musings aren't really as interesting as you might think.

In any case thanks for clarifying.


Ironically, that's exactly what NovemberWhiskey is doing here :)


This has to be intentional no?


The problem is quite subtle, though obvious in retrospect. I've seen a paper from a separate, academic research group make a similar model with the exact same problem.

The problem would, however, have been clear if the model had been compared to simply using the current mean arterial pressure (MAP) as a predictor of hypotension, because MAP is the problematic predictor variable. Instead, the model was only compared to short-term changes in MAP (ΔMAP), which is obviously nonsensical and has an AUROC of ~0.55.


Hm, reading the linked tweets the problem seems like a big screaming red target on the side of a white barn, not a feature engineering subtlety. It seems like the typical case of the drunk guy looking for his keys under the streetlight. (Having insufficient data, and comparing the model to an arbitrarily picked one that just happens to be even worse. And then everyone including the FDA patting them on the back.)


I'm glad that you seem to get the severity! I'm just hesitant to ascribe malice.


I think it is the general incompetence of the "academia + R&D biz + regulation pipeline". (In the land of the blind the one-eyed is king, etc.)

It's sort of inevitable in such a non-teleological process. As in each step in it serves its own purpose, and so the whole thing doesn't really serve the purpose that we like to assume for it - ie. give us great thoughtful inventions. That's why it took so long to stop the Theranos train, that's why it takes so fucking long to roll out polyvalent vaccines (ie. all-in-one vaccines), and so on. (I'm picking on medtech here but there are many others, the Boeing + FAA MCAS fuckup, the absolute limpdick paralysis of nuclear power - it needed a combination of half the world on fire + prelude-to-WWIII to get it moving again, and so on.)


One thing I've learned is that people are definitely morons, so you're probably making the correct decision here.


> I decline to answer that question on the grounds that it might be used to incriminate me


As someone who works in healthcare, so much of what I read about AI makes me think that the people who are enthusiastic about healthcare AI don't have much experience doing it.

The scenarios rarely seem to fit with what I'm actually practicing. Most of medicine is boring, it is largely routine, and if we don't know what's going on, it's because we're not the right person to be managing the patient. Most of my time is spent talking to people - patients, colleagues, family. I explain the diagnosis, I talk about the plan, I am getting ideas of what the patient wants and values, and then actioning it. I spend very little of my time like Dr House pondering what the next most important test to perform is for a patient who is confounding us.


I work in radiology with MRI as a tech. We use AI slightly differently to the examples here, but it’s changing a lot of what we do. It’s more about enhancing images than directly about diagnosing.

The image is denoised ‘intelligently’ in k-space and then the resolution is doubled via another AI process in the image domain (or maybe the resolution is quadrupled, as it depends on how you measure it. Our pixel count doubles in x and y dimensions).

These are 2 distinct processes which we can turn on or off, and there are some parameters with which we can alter the process.
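
(A toy numpy sketch of that two-stage structure, purely to show where each step operates; the crude threshold and pixel replication here stand in for the vendor's trained networks, which I obviously don't have.)

    import numpy as np

    img = np.random.default_rng(0).random((128, 128))   # stand-in for a reconstructed slice

    # stage 1 stand-in: "denoise" in k-space
    k = np.fft.fft2(img)
    k_dn = k * (np.abs(k) > np.percentile(np.abs(k), 50))   # crude mask, not the real denoiser
    img_dn = np.abs(np.fft.ifft2(k_dn))

    # stage 2 stand-in: double the resolution in the image domain
    img_up = np.kron(img_dn, np.ones((2, 2)))   # dumb pixel replication; the product learns the detail
    print(img.shape, "->", img_up.shape)        # (128, 128) -> (256, 256), pixel count doubled in x and y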

The result is amazing and image quality has gone up a lot.

We haven’t got a full grasp yet and have a few theories. The vendors are also still getting to grips.

We think the training data set turns out to have some weird influences on the required acquisition parameters. For example, parallel imaging factor 4 works well, 3 and 2 less so, which is not intuitive. More acceleration being better for image quality is not how MRI used to work (except in a few edge cases).

Bandwidth, averages, square pixel, turbo factor and appropriate TE matter a bit more than they did pre-AI.

Images are now acquired faster, look better and sequence selection can be better tailored to the patient as we have less of a time pressure.

I’d put our images up against almost anything I’ve seen before as examples of good work. We are seeing anatomy and pathology that we didn’t previously appreciate. Sceptics ask if the things we see are really there, but after some time with the images the concern goes away and the pre-AI images just look broken.

In the below link, ignore Gain (it isn’t that great), Boost and Sharp are the vendor names for the good stuff. The brochure undersells it.

https://www.siemens-healthineers.com/magnetic-resonance-imag...


My partner had a clinician review her paperwork and say "why are you here" explaining the enhanced imaging was leading to tentative concerns being raised about structural change so small it was below the threshold for safe surgical treatment.

Moral of the story: the imaging has got so good that diagnostics is now on the fringe of over diagnosing and the stats need to catch up


This has been a thing for a long time, with MRI in particular.

It gets quite philosophical. To diagnose something you need some pattern on the images. As resolution and tissue contrast improves you see more things, and the radiologist gets to decide if the appearance is something.

When a clinician says there is a problem in some area of anatomy and there is something on the scan, the radiologist has to make a call.

The great thing about being a tech is that making the call isn’t my job. I have noticed that keeping the field of view small tends to make me more friends.

A half imaged liver haemangioma, a thyroid nodule or a white matter brain lesion as an incidental finding are a daily occurrence at least.


I think it cuts both ways though, over-testing vs under-testing, as the question is when do people actually get access to imaging, and is there more pro-active imaging screening we should have done.

A good friend recently had an unrelated routine surgical procedure go awry, which led to a CT scan to check on the damage. The CT scan ended up finding stage 2 cancer, larger than a billiard ball, in an organ that is now going to be surgically removed. Our friend had absolutely no symptoms of any kind related to the cancer. There is no reason he would have gotten a CT scan other than the unrelated surgical accident. Imagine if in 5 years he had finally had some symptoms, they do the scan & find it's stage 4, sorry.

The fact that we only have routine screening regimens for a handful of cancers (breast, colon, prostate, skin) is something that I've been thinking about a lot lately.


Yes. And the balance between under-testing, with some loss of early treatable detection, and over-testing, with some people incurring unneeded operations, is a hard one. Everyone tends to the over-test side. For some things, like knee operations, the evidence appears strong that surgery is the worst path to take in most cases. That surgery stems from improving imaging of knee joint tissue. Treatment regimes need to catch up to return to a sweet spot of detection and remediation.

My partner feels the breast tissue calcification detected in the improving imaging of her annual checks should have been left alone, and that it incurred discomfort and scarring she didn't need, but we both know breast cancer survivors who owe their lives to detection and intervention.


One of the first things AI will be really good at is image post-processing. Even then, in the case of medical diagnosis, I'd prefer to have an actual person compare the "RAW" image to whatever the AI came up with, simply because post-processing can create artifacts that can throw you off quite a bit.

Regarding the quality of imaging: I tend to agree, and the better imaging gets, the more we will have to rely on humans to judge whether or not treatment is necessary or recommended. That judgement alone is, IMHO, in the same league as full self-driving and requires general AI.


The raw image is not good and isn’t usable. AI denoises and this is what makes it usable. Then it doubles the resolution.

There is no point in reviewing the raw image as it doesn’t add anything. A study is generally 300-1000 images. If you’re going to review the raw and therefore look at 600-2000 images you’ve just wasted everything AI gained and you might as well not use it.

I acquire the images and I look at what I get and re-run anything that’s got artifact. This is not unusual and generally happens due to movement, excessive image noise or incorrect parameter selection.

I don’t review the raw, but I do review the output. Keep in mind that MR images have always been very heavily processed at every stage of image formation, as any way of squeezing more out of the acquisition has time benefits.

There are a ton of ways a tech can create a misleading or faulty image with parameter selection, which leads to an incorrect diagnosis. This could happen quite easily, and AI being part of any error is not something that keeps me awake at night, unlike other parameters with which I have seen mistakes made, and have made them myself.


Thanks for the explanation! Shows that some knowledge of digital photography doesn't translate into other picture-taking domains.

Comments like yours are what I love about HN!


I did my Masters in NMR. Can confirm a lot of ML based plug-and-play solutions are helping denoising k-space.

Trivia: I am also one of the pulse sequence developers affiliated with the Siemens LiverLab package on the Syngo platform :) [Specifically the multi-echo Dixon fat-water sequence]. SNR improvement was a big headache for rapid Dixon echoes.


Ha, small world. Thanks for your work, I used to use this daily until a year ago, now my usage is less frequent.

I guess Dixons are still a headache with their new k-space stuff, as Boost (the denoising) isn't compatible with it yet. Gain is, but looks distinctly lame when you compare it to Boost.

We are yet to see the tech applied to breath hold sequences (haste, vibe etc), Dixon, 3D, gradient sequences and probably others.

I’m looking forward to seeing it on haste and 3D T2s (space) in particular. MRI looks very different today compared to how it looked just 6 months ago.

I’d compare it to the change we saw going from 1.5T to 3T, just accelerated in how quickly progress is being made.


I have long since left the collaboration with the team at Cary, NC. But all I can say is there was a great deal of interest in 3D sequence improvement by interpolation with known k-space patterns, like in the GRASE or PROPELLER sequences, for example. They also learned a good deal from working with NYU's fastMRI.


I just went through this with my knee surgeon, he was raving about the quality now coming out of the 3T scanner, saying that after the latest software update it gives better quality than the experimental 7T scanner, to the point where they are considering abandoning the 7T scanner study they were doing prematurely.


AI denoising may be making information that actually is in the image easier to spot, but AI upscaling is just inventing detail that doesn't actually exist in the source image, which seems rather dangerous for this use case.


I disagree.

It’s ‘only’ converting each pixel into 4, so the starting point is not going to change a lot. Also keep in mind that interpolation has been part of image reconstruction for 20+ years. MR suffers from slow acquisition times, so we cheat, and image resolution is rarely symmetrical in pixel dimensions when you compare the x, y and z directions.

Previously we just made pixels square (made up data) then doubled the pixel count (more made up data). It was that dumb.

I’ve tried acquiring an image and up scaling it 2x. Then acquiring the same image at double the resolution and not upscaling.

It’s hard to compare as the longer acquisition is often hampered by patient movement.

When I set up a scan with the AI I’d estimate it as adding about 30% signal. I make up a scan that will look good, then turn the AI on, then shorten the scan or increase the resolution such that it’s 30% or so down on signal, then I press go.

We are seeing things we didn’t previously, particularly with cartilage injuries.


> We are seeing things we didn’t previously, particularly with cartilage injuries.

I am curious about the process by which you ensure that the things you see are actually there and are not side effects of the enhancement?


The same way we do for everything. We scan in multiple planes and image weightings (T1, T2 FS, etc.): ax, sag, cor. We do other angles for various things too, e.g. for knees we do dedicated views for the patellar cartilage and ACL.


So much this. I just interviewed about 10 doctors in the space of neurology and radiology to start some new projects. The truth is most of the headaches are from insurance coverage checks or, for radiologists, from filling out correct reports. The fancy AI stuff is, with maybe a few exceptions thanks to the great advancement in imaging, still far away from validation, and I haven't even started on its usage and go-to-market.

Most of the cases the doctors see are boring / regular cases - and problems like access to medical history are way more basic but more prevalent.


I'm a physician and former engineer and I agree with you. Even the giant tech companies that are in this space seem to have very little insight into how to utilize AI to improve healthcare.

The mistake everyone seems to make is thinking that making a diagnosis is what makes medicine hard. It's not. e.g. An algorithm that can diagnose diabetic retinopathy from fundus photos 'as well as a specialist' makes for a good headline, but isn't particularly useful. Medicine is nothing like "House" in real life. AI should be leveraged to either improve efficiency or develop novel tools.


I can't upvote this enough


> and problems like access to medical history is way more basic but more prevalent.

This here is a problem I wish we could solve: get a central data broker that could hold aggregate medical history and just dole it out to companies. My wife has to get medical supplies on a regular basis; the company we got them from got bought out by a PEC, so we wanted to try switching companies. Months later we still can't get all the doctors, insurance and supplier lined up.

The problem is of course, as a security guy, I know a centralized clearinghouse would be a nightmare from every standpoint. The real solution may be in the development of a good protocol, but good luck getting the med industry, which is still putzing around on AS/400s, to adopt something like that. Even with the force of law, i.e. HIPAA.


That scenario sounds like it lends itself more to AI automation than a Dr. House type one.


I don't know; compassion, understanding and a nuanced reading of individual desires when talking to someone is not what I associate AI with in my mind. But being able to assess sociological and cultural taboos and work out what a patient actually wants, rather than what they might initially express, seems like something a good doctor would get to through explorative conversation.


Maybe removing a human from the equation would lead to more honest outcomes? E.g. people google all sorts of issues more earnestly than they would describe them to a doctor. The bottleneck would be properly understanding what the user intends, which might be out of reach.


Indeed. Language has been historically difficult for AI, but I think it's even tougher here — language is less and less reliable the further we get from a shared experience, and this is a problem when describing our experiences of our own bodies, and much worse when describing our own minds.

For example, when I was coming off an SSRI, I was forewarned that I might get a sensation of "electric shocks"; the actual experience wasn't like that, though I could tell why they chose to describe it like that.

How different is the tightness in the chest during a heart attack from the tightness in the chest from exercising chest muscles?

I have no idea how doctors, GPs, and nurses manage this, though they seem to have relatively little trouble.


My experience of chatting with an internet chat-bot when trying to get some help with a product gives me little confidence we are close here.

Edit: wording


But doctors tend to be pretty good at that part of the job you're describing. It's the Dr House part that often seems lacking; difficult patients often spend years going to different doctors only to keep hearing useless suggestions. It'd be amazing if we could have a solution for those people, and AI might be able to help.


Sweeping generalizations about AI models as a class of objects seem deeply uninformative to me. Perhaps I would even go so far as to say misleading. These days any undergrad can go on GitHub and have a model that does some diagnosis, and I don't understand why we have to group that together with multi-year efforts from groups of PhDs and MDs working together to produce products.

There’s obviously way more of the first than the second and so if you analyze this group as a whole it’s easy to draw the wrong conclusion about what AI as a technique is capable of.

I can’t give too many details on specific examples I’m working on at Google but an example I’m not working on would be Caption Health who have an amazing AI-based ultrasound guidance product that has great prospective evidence, and several big fans in the clinical community.

There are also several success stories using AI on pathology in order to target clinical trials.

Can you imagine if someone made sweeping statements about webpages as if they were a coherent group of objects that you could sample from and deduce properties? “The information on websites is typically not as accurate as those websites claim”


While the notion of treating these systems as “sociotechical” rather than purely technical is probably a good move wrt actually improving people’s lives, I can say from my own experience in academia that there are still way too many academics working in this field who don’t think it’s their problem. I’ve personally raised these types of issues before and been told “we’re computer scientists, not social scientists”, as if “social scientist” is a derogatory term. The biggest impediment here is, in my opinion, overcoming the bloated egos of the people who think the social impacts of their work are somehow out of scope. All is well as long as you can continue to publish.


Preach.

There are way too many people that treat MSE or other abstract technical measurements of model performance as if they actually represent the impact a model has on a problem. Even if we could somehow perfectly predict an actual realization instead of a conditional expectation, that still forgets to ask the question of why we predicted that. Are we exploiting systemic biases, like historically racist policies? Almost definitely (unless we've consciously tried to adjust for them, and even then we've probably done that incorrectly). I've become much less interested in models that basically just interpolate (very well, I might add), and more in frameworks that attempt to answer why we see particular patterns.


Not that surprising. AI learning seems to do best with fairly predictable systems, and when it comes to individual outcomes in medicine, there's a lot of mystery involved. A group of people with similar genetic makeup and exposure history to carcinogens or pathogens won't all respond identically - some get persistent cancers, some get nasty infections, and some don't.

For example, training an AI on historical tidal data would likely lead to very good future tide timing and height predictions, without any explicit mechanistic model needed. Tides have high predictability and relatively low variability (things like unusual wind patterns accounting for most of that).
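
(Minimal sketch of the tide point with synthetic data: a plain least-squares fit of a couple of sinusoids, with no mechanistic model, predicts held-out months well. The two periods roughly mimic the semidiurnal and diurnal constituents; everything else here is made up.)

    import numpy as np

    rng = np.random.default_rng(0)
    t = np.arange(0, 24 * 365, 1.0)                                  # hourly samples, one year
    tide = 1.2 * np.sin(2 * np.pi * t / 12.42) + 0.4 * np.sin(2 * np.pi * t / 24.0)
    tide += rng.normal(0, 0.05, t.size)                              # "weather" noise

    periods = [12.42, 24.0]
    A = np.column_stack([f(2 * np.pi * t / p) for p in periods for f in (np.sin, np.cos)]
                        + [np.ones_like(t)])
    coef, *_ = np.linalg.lstsq(A[:6000], tide[:6000], rcond=None)    # fit on the first ~8 months

    pred = A[6000:] @ coef                                           # predict the remaining months
    print("held-out RMSE:", round(float(np.sqrt(np.mean((pred - tide[6000:]) ** 2))), 3))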

In contrast, there are some current efforts to forecast earthquakes by training an AI on historical seismograph data, but whether or not these will be of much use is similarly questionable.

https://sciencetrends.com/ai-algorithms-being-used-to-improv...


OK, AI is bad, but compare it to human doctors/radiologists, who are often worse. I still remember stats from some X-ray detection task where AI diagnosed with 40% accuracy and the best human doctors with 38% accuracy (and median human doctors with 32% accuracy). Now what are we supposed to do?


Oh god (science for some of us), it's the same kind of logic used to defend Tesla's FSD system. Both crappy and dangerous, but with a cult-like following.


Can you cite the source? Is it not possible to improve the 40% rate by AI? Obviously someone eventually figured out the 100%


They might have "figured it out" by cutting the patient open.


Almost any paper like this is worthless. It makes for good headlines. But it does nothing to further medical care


This is what freaks me out about AI.

People will use it for years in various fields, and one by one, after a decade or so of use, they'll come to find it was complete garbage information, and they were just putting their trust in a magic 8 ball.

But the damage is already done.


Same with self-driving cars. State-of-the-art AI-based classification has an accuracy of 90%. Even if we can get it to 99%, that's still a 1% error rate. Now imagine a car making hundreds of decisions in a single ride.
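
The back-of-the-envelope arithmetic behind that worry (assuming, unrealistically, independent errors and a made-up 200 decisions per ride):

    # probability that at least one of n independent decisions is wrong
    n = 200
    for accuracy in (0.90, 0.99, 0.9999):
        print(accuracy, round(1 - accuracy ** n, 4))
    # 0.9    -> ~1.0 (a mistake is near-certain)
    # 0.99   -> ~0.87
    # 0.9999 -> ~0.02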


This is entirely unsurprising and has a very simple solution: keep adding more data. Our measurements of the accuracy of AI systems are only as good as the test data, and if the test data is too small, then the reported accuracies won't reflect the true accuracies of the model applied to wild data.

Basically, we need an accurate measure of whether the test dataset is statistically representative of wild data. In healthcare, this means that the individuals that make up the test dataset must be statistically representative of the actual population (and there must also be enough samples).

An easy fix here is that any research that doesn't pass a population statistics test must be declared up front to be "not representative of real-world usage" or something.
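
One hedged sketch of what such a "population statistics test" could look like: compare each feature's distribution in the test set against a reference population sample, e.g. with a two-sample Kolmogorov-Smirnov test. (This only catches marginal drift, not joint structure, so it's a floor, not a guarantee; the numbers below are invented.)

    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    population_age = rng.normal(50, 18, 100_000)   # hypothetical reference population
    test_set_age = rng.normal(62, 10, 800)         # a test set skewed toward older patients

    stat, p = ks_2samp(test_set_age, population_age)
    if p < 0.01:
        print(f"test set not representative on 'age' (KS={stat:.2f}, p={p:.1e})")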


From the article:

> Here’s why: As researchers feed data into AI models, the models are expected to become more accurate, or at least not get worse. However, our work and the work of others has identified the opposite, where the reported accuracy in published models decreases with increasing data set size.


That's not a contradiction per se. It's easier to get spuriously high test scores with smaller datasets. It does not clearly demonstrate that the models are actually getting worse.


But if diagnoses are multimodal and rely upon large, multidimensional analysis of symptoms/bloodwork/past medical history, wouldn't adding more dimensions just increase dimensional sparsity and decrease the number of useful conclusions you are able to draw from your variables?

It's been a long time since I learned about the curse of dimensionality, but if you increase the number of datapoints you collect by half, wouldn't you have to quadruple the number of samples to retrieve any meaningful benefit?


I did mean samples (n size), not the number of features. But also, no, your point isn't right. If you have a ton of variables, you'll be better able to overfit your models to a training set (which is bad). However, that's not to say that a fairly basic toolkit can't help you avoid doing that even with a ton of variables. What really matters is the effect size of the variables you're adding. That is, whether or not they can actually help you predict the answer, distinctly from the other variables you have.

Stupid example: imagine trying to predict the answer of a function that is just the sum of 1,000,000 random variables. Obviously having all 1,000,000 variables will be helpful here, and the model will learn to sum them up.

In the real world, a lot of your variables either don't matter or are basically saying the same thing as some of your other variables so you don't actually get a lot of value from trying to expand your feature set mindlessly.

> if you increase the amount of datapoints you collect by half you would have to quadruple the amount of samples you have to retrieve any meaningful benefit, no?

I think you might be thinking about standard error, where you divide the standard deviation of your data by the square root of the number of samples. So quadrupling your sample size will cut the error in half.
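
(Concretely, by simulation: quadrupling the sample size roughly halves the standard error of the mean.)

    import numpy as np

    rng = np.random.default_rng(0)
    for n in (100, 400):
        sample_means = rng.normal(0, 1, size=(20_000, n)).mean(axis=1)
        print(n, round(sample_means.std(), 3))   # ~0.10, then ~0.05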


You are right, but I feel you misunderstood the OP.

I understood that the OP meant increasing the number of samples, not variables.


I think I did misunderstand; you are right. Definitely, increasing the number of samples will increase the feasibility of the model. I was incorrect.


If there's one thing I learned with biomedical data modeling and machine learning, it's that "it's complicated". For biomedical scenarios, getting more data is often not simple at all. This is especially the case for rare diseases. For areas like drug discovery, getting a single new data point (for example, the effect of a drug candidate in human clinical settings) may require a huge expenditure of time and money. Biomedical results are often plagued with confounding variables, hidden and invisible, and simply adding in more data without detection and consideration of these bias sources can be disastrous. For example, measurements from lab #1 may show persistent errors not present in lab #2, and simply adding in more data blindly from lab #1 can make for worse models.

My conclusion is that you really need domain knowledge to know if you're fooling yourself with your great-looking modeling results. There's no simple statistical test to tell you if your data is acceptable or not.


I think this is a key point - the training set is very important, because biases, over-curation, or wrong contexts will mean the model may perform very poorly for particular scenarios or demographics.

I can't find the reference now, but there was a radiology AI system which had a good diagnosis rate for finding a pneumothorax on a chest X-ray (air in the lining of the lung). This can be quite a serious condition, but it is easy to miss. It turns out that the training set had a lot of 'treated' pneumothoraces. The outcome was correct - they did indeed have a pneumothorax - but they also had a chest drain in, which was helping the prediction.

Similar to asking what the demographics of the training set are is asking what the recorded outcome was and how the diagnosis was made. There is often no 'gold standard' of diagnosis, and some diagnoses are made with varying degrees of confidence. Even a post-mortem can't find everything...


You don't know which data to add.

> statistically representative of wild data

It should be "statistically representative" wrt to the true causes, and all other factors should be independent. Instead, ML models, and certainly large NNs, allow every bit of data that correlates a tiny bit to contribute.

Since we don't know what the true causes are, nor how to represent them in the data, adding more data might just as well not work.


> This is entirely unsurprising and has a very simple solution: keep adding more data

Nope. Won't work. Biased data made bigger only results in bias confirmation. Which is the real problem.


The solution to failures of AI in healthcare is transparency of data. OpenAI's models work because they have virtually unlimited data to train on; the scale of training data for doctor bots is one millionth the size. Different countries, organizations and universities need to be as open as possible about sharing and collaborating, realizing that improvements in medicine benefit all of humanity with almost no downsides.


There should be a standardization committee tasked with standardizing the collection of anonymized, semi-synthetic medical data from hospitals/hospital networks. It seems like so much research data is just locked up in the IMS systems the hospitals use for their patients and never sees the light of day.


I read about this researcher using GANs to create synthetic patient data, cool stuff. She had a problem she couldn't solve though: how do you validate that your synthetic patients look real?


You cannot imagine just how deep the medical data rabbit hole goes.

Plenty of institutions have already semi-standardized their collection and do multi-hospital (typically research hospital) aggregation. Whether this data is any good as training data for supervised or unsupervised algorithms is really questionable.


I work in machine learning for digital pathology and I think the big problem here is the divergence between publishing papers and real-life helpful models. What we see in the literature is that a model is often trained on data from a single lab and gets crazy good results. However, apply it to a different lab (not in the paper, of course) and it sucks. So what we do in practice for our models is to train on many different labs at the same time and have a hold-out test set with labs not covered during training. That way you get a robust model which works well in practice (but doesn't have the 99.9% metrics reported in papers). The second thing is looking at what task to let the machine do. We typically go for tasks which are boring / repetitive for the doctor, but visualize the result, so the doctor double-checks it before making the diagnosis. Still saves a lot of time.
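
A minimal sketch of that hold-out-whole-labs idea with scikit-learn's GroupKFold (synthetic data; each sample is tagged with the lab it came from, so no lab contributes to both training and validation within a fold):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GroupKFold, cross_val_score

    rng = np.random.default_rng(0)
    n = 3000
    lab = rng.integers(0, 8, n)                          # which lab each slide came from
    X = rng.normal(size=(n, 20)) + 0.3 * lab[:, None]    # lab-specific staining/scanner shift
    y = (X[:, 0] + rng.normal(size=n) > 0).astype(int)

    scores = cross_val_score(RandomForestClassifier(n_estimators=200, random_state=0),
                             X, y, groups=lab, cv=GroupKFold(n_splits=4))
    print("leave-lab-out accuracy per fold:", scores.round(3))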


I also work in this field and what we typically do is a lot of nested cross-validations to get some bounds on a model building process and some idea of how it would perform on repeated unseen data. Data leakage is always on our mind and we do our best at all stages to avoid that. We also train on data from many sites. It can be done and it can be done properly. As you say, it is always best to collect some completely naive test set to back up the model-building process. If you design your pipeline properly, the test set should fall within the bounds you got during cross-validation. It all depends on how much data you have and I think as long as you design your pipelines with that in mind and acknowledge limitations with smaller datasets, then the research is valid and useful.


Do you have any specific example use cases where you're seeing success?

I'm asking this both with regards to ML/prediction related work and the "boring" work.

I'm working on a direct-to-consumer digital pathology service with a clinical background, as opposed to data science/ML. Really curious as to what type of new services we could try to invest in to improve what we offer.


'Brunelleschi had just the solution. To get around the issue, the contest contender proposed building two domes instead of one — one nested inside the other. "The inner dome was built with four horizontal stone and chain hoops which reinforced the octagonal dome and resisted the outward spreading force that is common to domes, eliminating the need for buttresses," Wildman says. "A fifth chain made of wood was utilized as well. This technique had never been utilized in dome construction before and to this day is still regarded as a remarkable engineering achievement.'

Brunelleschi was not an engineer; he was a goldsmith. AI will advance in the same way architecture did during the Renaissance: through those with the winning ideas, not those with the right credentials.

https://science.howstuffworks.com/engineering/architecture/b...


It is still unclear to me exactly what data they were looking at/referring to in this article.

If you take into account bloodwork, family history, demographics, etc. then it seems like you are still only getting a few dozen data points. At this scale it seems like traditional statistics or human checks for abnormalities are going to be about as good.

Although I personally know very little (apologies for conjecturing), it does seem like there could be a lot of uses for AI in specific diagnoses. For example, when they take your blood pressure/heartbeat they only get data for one particular moment where you are sitting in a controlled environment. I would think that if you had a year's worth of data (along with activity data from an Apple Watch) you might be able to diagnose/predict things that traditional doctors/human analysis could not.

I would also imagine anything that deals with image analysis (like looking for tumors in scans) will be vastly better with computer AI systems than with humans.


Look, simple rule-based algos can be just as effective. The hard part is not diagnosis; it's getting the human to comply with treatment.


So a model calibrated on a backtest says nothing about its predictive capacity. Who would have thought? Well, I think at least anyone who has worked even a little bit in quantitative finance. The only way to validate a model is to make predictions and test whether those predictions actually happen in a repeatable way, which in certain circles is referred to as an "experiment".

That's why I distrust any model built purely on backtested data unless it can be shown to predict something other than history. And AI is not the only area that blindly trusts those kinds of models.
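
A minimal sketch of that "predict forward, not backward" discipline with scikit-learn's TimeSeriesSplit, where every validation fold lies strictly after the data used to fit (synthetic series, Ridge as a placeholder model):

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import TimeSeriesSplit

    rng = np.random.default_rng(0)
    n = 1000
    X = rng.normal(size=(n, 5))
    y = 0.5 * X[:, 0] + rng.normal(size=n)

    for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
        model = Ridge().fit(X[train_idx], y[train_idx])
        # scored only on data that comes strictly later than anything it was fit on
        print(f"fit through t={train_idx[-1]}, test t={test_idx[0]}..{test_idx[-1]}: "
              f"R^2={model.score(X[test_idx], y[test_idx]):.2f}")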


Technical (Honest) Solution: two holdouts

1. Involved in the build process

2. Never touched until paper metrics are being written, only run once

Realistically, this is unlikely to happen, however, due to the incentives behind publication bias.


A third (better) option: have a regulating body keep a separate, undisclosed test set. If you can't beat it, you can't deploy your model. If you can beat it, you still need to have your model peer reviewed and scrutinized.


This sounds simple yet I expect data governance will be the bottleneck.


So the models that fail this one test never get published, and the models that succeed get published. And all you have done is publish a model that predicts that particular history, in other words data fitting.


> the models that fail this one test never get published

Not necessarily -- for comparison purposes one should include them, and a negative result is not a bad outcome. But consider Edison's light bulb: most of the failures don't matter, and a few might have interesting properties to reconsider or tweak down the line. But the major one folks care about is the one that worked.

> data fitting

Yes, models trained on data and not logic alone fit data.


Color me, personally, surprised. Between publication bias, the general public's ignorance of AI and its evolving capabilities, and over a decade of AI health results being overblown before transformers, how could we have predicted that post-transformer results in AI health would continue to be overblown?


Surprise, surprise. People hugely overestimate the data retrieval capabilities of healthcare systems. And if you really put clinical 'AI' systems to the test in day-to-day settings (which is in fact never done), results would be much, much worse.

Shit data in, shit prediction out.


Look up prospective validation or clinical impact trials. These are validated in day-to-day settings somewhat often, though not as often as they should be.


I'm a professional clinical researcher. Those 'validations' would be comical if the results, when witnessed in person, weren't so sad.


They are validated though. Not as much as they should be, but it happens. IMO the real blocker is that hospitals don't want to / can't spend resources on these trials.


We are working on some transformer-based models trained on millions of patients across a broad set of features. We'll launch our first model into a private alpha in a few weeks. Basically it uses a seq2seq Transformer-XL and FHIR data. https://labs.1up.health/ We're hiring; you can email me directly at ricky@. We don't do any of the subset optimization or training for specific diseases or conditions. It's all or nothing, like the way most LLMs work.


So many in AI are chasing software solutions when the problem is hardware. Limited power means limited learning. Mix lab-grown neurons with software and you have a winning proposition.


I feel it would be safe to say that "AI models that x are not as accurate as reports might suggest" for the current hype values of x.


AI seems to be a hyped solution in search of a problem. ML, sure, for some use cases. But wherever I have encountered ML / AI / "expert systems" in my life so far, they have been sub-par compared to actual biological intelligence. In all cases they were heavily hyped.


Yet.

One thing the media consistently gets wrong is the rate of innovation that is happening. The media also doesn't have access to state-of-the-art models, only to those from trigger-happy startups too eager to release half-baked versions.

It's akin to downloading image-generation tools from the App Store and concluding that's the state of the art.


It baffles me that people can watch the trendline of

"Job X can be automated in 40 years" (5 years ago)

"Job X can be automated in 10 years" (2 years ago)

"Job X can be automated in 5 years" (1 week ago)

and still feel comfortable poking holes in current AI models, pointing out where they fail. Obviously they fail today, but nobody three years ago thought that graphic design or creative writing was on death row either.

You have to spend a modicum of effort looking at how predictions have evolved over the past couple of years, but once you do, it's very clear that mocking current AI systems makes you look like a clown.


There's also the timeline of:

"Radiology will be automated in 5 years" (10 years ago)

"Radiology will be automated in 5 years" (5 years ago)

"Radiology will be automated in 5 years" (last year)

or

"Full self driving will arrive within 5 years" (5 years ago) "Full self driving is still a ways off" (last year)

Assuming you're referring to generative models, I don't think that anyone (knowledgeable) thinks that graphic design or creative writing is at death's door. They might change with new tools, but skilled practitioners are still required. That's basically the point of the article.


Having seen some of the automation available in radiology, I’m a bit baffled as to why I still have a job as an MRI tech.

5 years ago I watched automated cardiac MRI, and it worked well. I was told about a site that was having good results with fetal cardiac MRI via a related bit of software.

These scans are hard to do, and the machines did well. In some cases they got confused and did a good functional analysis, but of the stomach rather than the heart. Oops, but that's easily fixed by almost anyone after a few minutes of explanation.

Why are basic MSK scans still done by a tech with years of training?

I don't know the answer to that, as it's basic stuff, and if I end my career without machines having taken over the basic stuff, I'll be a bit disappointed.


But even if all these analyses were done automatically, I guess you wouldn't be out of a job soon (good news, I guess), just a different one. I worked on the automation of the diagnostic lab, and what happened is that a detective-style job turned into running a factory: a business running 24 hours a day, turnaround times, and less and less qualified personnel to feed the machines…


We are 18 years on from the first DARPA Grand Challenge, in which none of the vehicles finished.

Do you think a self-driving car can make it from LA to NYC by itself now?

What do you think 2040 AI will look like?


As always, AI performs poorly in the real world, until it doesn't.


I’m shocked. SHOCKED.


The issue of data leakage can be partly handled through k-fold cross-validation, in which all of the data takes turns as either training data or test data. It doesn't catch everything, though: if records from the same patient land in both the training and test folds, the leak is still there unless the folds are grouped by patient.
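
A sketch of what that grouping looks like with scikit-learn (everything here is synthetic):

    # Patient-grouped cross-validation: no patient's records appear in both
    # the training and test folds, which plain KFold does not guarantee.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GroupKFold, cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))                # toy features
    y = rng.integers(0, 2, size=200)             # toy labels
    patient_ids = np.repeat(np.arange(50), 4)    # 4 records per patient

    scores = cross_val_score(
        RandomForestClassifier(random_state=0),
        X, y, groups=patient_ids, cv=GroupKFold(n_splits=5))
    print(scores.mean())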


Yet.


My humble opinion: AI is supposed to be the acronym for artificial intelligence, but marketing has usurped it to refer to machine learning, which is nothing more than a neo-language for defining statistical equations in a semi-automated way. An attempt to dispense with mathematicians to develop models.

How much energy does it take for an event to be reflected in a statistic? Picture a 2x2-meter box full of data balls and a loop of string 1 meter in diameter with which to surround the highest concentration of balls you can; whatever remains outside stays there. Statistics and imprecision go hand in hand (some even say statistics is not a science).


> My humble opinion: AI is supposed to be the acronym for artificial intelligence, but marketing has usurped it to refer to machine learning, which is nothing more than a neo-language for defining statistical equations in a semi-automated way.

Sure. Hardly controversial.

> An attempt to dispense with mathematicians to develop models.

What...? No. Definitely not.

> How much energy does it take for an event to be reflected in a statistic? Picture a 2x2-meter box full of data balls and a loop of string 1 meter in diameter with which to surround the highest concentration of balls you can; whatever remains outside stays there. Statistics and imprecision go hand in hand (some even say statistics is not a science).

I have no idea what this is saying. It sounds like you're shitting on statistics all of a sudden, which is weird, given that you seemed to favor mathematicians in the first part.


>I have no idea what this is saying. It sounds like you're shitting on statistics all of a sudden, which is weird, given that you seemed to favor mathematicians in the first part.

Mathematicians specialize in problem solving, and as humans, their ability to predict and analyze data makes them more reliable at developing models than a statistical equation. They have many more tools at their disposal than statistics alone.

In a way, it is as if using the acronym AI for statistical algorithms creates a false sense that they are more reliable than such human review, or even that deep human review is not needed. ML statistics takes algorithms out of the oven long before mathematicians would, at the cost of a big difference in accuracy.

The problem, I think, is that people may make important decisions based on the results of such statistical algorithms without questioning them.


I don't think most mathematicians have spent a great deal of time analyzing data tbh. Unless you mean statisticians.


Having built models, I'd claim that it's art based upon science, perhaps not too different from engineering a building. At every stage there are decisions to be made, with tradeoffs. Over time, the resulting model could be invalidated or perhaps perform better. It's remarkably difficult to approach, or even define, a "best" model.

What's most peculiar to me is that somehow AI is becoming more distinct from math or stats, and that there's a notion that by running PyTorch one is able to play god and create sentience.


Statistics is not a science; it's an application of probability theory and some other branches of math to hypothesis selection (among other things).

It is scientific, though. We only use stats because that's the best method for dealing with imprecise and noisy data.

Statistical thermodynamics contains all the tools you need to answer your balls-in-a-box question.


>Statistical thermodynamics contains all the tools you need to answer your balls-in-a-box question

The balls-in-a-box example shows how ML statistics works. The string is adjustable and can be adapted to different contours, but you have to discard data.

How do you compensate for including some data in the model without discarding other data? The string has a limited diameter by design, and you need to know the content of most of the data to make good decisions.


> Statistics and imprecision go hand in hand (some even say statistics is not a science).

Statistics is the mathematics of being precise about your level of imprecision. It's fairly fundamental to all science, and has been for a while now.
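
For instance, a headline accuracy number is itself a noisy estimate; a quick normal-approximation confidence interval makes the imprecision explicit (the 0.90 and n = 100 below are made up):

    # 95% normal-approximation confidence interval for a reported accuracy.
    import math

    accuracy, n = 0.90, 100                      # hypothetical: 90/100 cases correct
    stderr = math.sqrt(accuracy * (1 - accuracy) / n)
    low, high = accuracy - 1.96 * stderr, accuracy + 1.96 * stderr
    print(f"{accuracy:.2f} (95% CI {low:.2f} to {high:.2f})")  # roughly 0.84 to 0.96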


It's not that AI has been conflated with machine learning—those are words that are supposed to refer to the same thing. The confusion is conflating either with slapdash applied statistics.


>which is nothing more than a neo-language for defining statistical equations in a semi-automated way.

That's why it's called artificial intelligence.


Of course. No surprise there. Especially the ones made with 'Deep Learning'.

At this point, each time AI and 'Deep Learning' are applied and then scrutinised, it almost always turns out to be pure hype generated by investors, with broken models producing garbage, unexplainable results. The exact same goes for the self-driving scam.

'AI' is slowly starting to be outed as an exit scam.



