AI Models Predict Breast Cancer with Radiologist-Level Accuracy (ibm.com)
146 points by tiagobraw on June 25, 2019 | 106 comments



This is not exactly new. I remember seeing models that did really well many years ago. And again, they caught many that humans had missed.

The problem is that they fail differently than humans do, in a way that humans wind up not trusting the results.

It turns out that there are parts of the breast that are easy to spot tumors in, and parts that are hard. A human scans quickly over the easy areas, and focuses on the hard. The result is that humans make careless errors on the easy areas, and catch hard tumors. Computers make no careless errors, but can't catch the hard ones. Thus when a human sees what the computer caught that the human did not, the mistake is easily dismissed. But when the human sees the ones that the computer missed, it becomes, "It doesn't know how to do the real work."

Ideally the two would be used together for better results than either alone. But humans wind up resenting the computer...


> The problem is that they fail differently than humans do, in a way that humans wind up not trusting the results.

Not only that, but human fallibility is accepted where machine fallibility is not. There's something about being a "person" which makes it acceptable for you to just take the blame for something. A senior radiologist makes a glaring error and "it happens, people make mistakes". A computer makes the same error and it's a problem which must be fixed before the computer can be trusted.

Ultimately I believe this is a cognitive bias that we're just going to have to learn to let go of.


> Ultimately I believe this is a cognitive bias that we're just going to have to learn to let go of.

Unfortunately I don't think this is merely cognitive bias. It's actually built into our legal system at a pretty fundamental level: machines are held to a higher standard than humans when it comes to failures with grave consequences.

And keep in mind, this system only achieves the same accuracy as doctors. What is the end benefit, other than shifting where money flows?

Do you really think this benefit is substantial enough that we will see major overhauls of tort law in all US states and in every country in the rest of the world?


> Unfortunately I don't think this is merely cognitive bias. It's actually built into our legal system at a pretty fundamental level: machines are held to a higher standard than humans when it comes to failures with grave consequences.

Well, laws reflect culturo-cognitive biases, don't they? And also they evolve.

For the last part, that seems to generally be a wise "bias": when a machine fails, chances are good that all the identical machines will fail the same way.

When a human errs, chances are good that it won't be in a way you would expect to be repeated identically by all their peers, although common cultural biases can obviously occur.

Also, so far our machines are far less expected to self-correct their behavior when they err, especially regarding some unspoken social exceptions that were not met because of some unexpected side effects.

Ultimately, machines don't care about the consequences of their acts because they feel no responsibility to anyone they love, not even themselves.


I worked on one of those systems. For some screening tasks, it has been at or beating average-radiologist performance since the mid or late 90s.

There are a number of issues, but it's true that raw algorithmic performance is a small part of the whole picture.


This is often the case in almost all ml-augmented workflows. The augmentation begins and ends with people looking at things, because the machine only knows what you told it, and doesn't care.

The real work is building systems around people who do the interpretation and labeling to make their jobs easier.


Yes, very much this. And the importance of workflow impact cannot be overstated.

The (clinical) system I mention above, the ML part was a few percent of the total effort, max.


can you tell us more about the rest ?


Pigeons can do better than humans when they work as a team: https://www.scientificamerican.com/article/using-pigeons-to-... but obviously no one is going to trust pigeon diagnosis anytime soon. Computers have a bit more credibility.


> The problem is that they fail differently than humans do

That is a great argument for giving such a model as an aid to a human doctor. Together they will be better than either one alone.


In Thinking, Fast and Slow, the author details a double-blind trial where they did this. It was worse with humans and AI than with just AI. Humans think they can use AI as a guide and move it in the right direction. But the movements they made, on average, were bad.


Surely in this type of instance (looking at a scan to answer a yes/no question) the human and AI act independently, with the computer being a useful aid because it separately picks up a few of the human's false negatives. Assuming false negatives are a lot worse than false positives, this can only be a good thing.


If they lead to an unnecessary mastectomy then false positives are pretty bad. Not as bad as dying, obviously, but still a severe blow to a woman's identity and sense of self worth.

It's going to be a hard pill to swallow if you have to tell a woman "sorry, we removed your healthy breast because the computer made a mistake."


I think the idea of "screening" is that you don't just race off to a mastectomy the minute some AI model goes off. Of course, putting more false positives through a fallible process of review does run the risk you speak of.


It does cause unnecessary biopsies for sure. And some stress on the patients.


Even a false positive that leads to telling the patient that they may have cancer is bad. It leads to a life-long anxiety for many people.


It sounds like a smart hospital would run a patient through both human and AI screenings separately, and have a different doctor examine both results and evaluate the discrepancies. This way you would keep the strengths of both approaches, lowering the failure rates, and depending on the country's health care funding it can be good business from the hospital's POV, as they get to charge for the extra work as well as use the better success rates to drive business.

And I wonder what happens if you apply machine learning to looking at the difference between AI and human screening results.


Radiologists are really bad at detection, even after many years of study. That's quite often due to the coarse level of detail in scans, where only large tumors can be observed or recognized with some certainty. Surpassing humans there is not so difficult, but improving accuracy from e.g. 32% to 34% doesn't really sound like a win :(


2% more accuracy could still be millions of people if it's a common enough cancer like breast cancer.


> 32% to 34% doesn't really sound like a win :(

We are talking about human lives here, not about beating some CPU benchmark. Detection improvement by 2% is huge in almost any sickness.


remember when ensembles were the cool word before they got erased from collective consciousness and replaced with deep things? it can't even be a decade, was it 2012 or something?


They haven't been erased, more like subsumed? If you use dropout to train your model, that is basically equivalent to using an ensemble of deep neural networks.


That is not even close to the same thing.

If you train an ensemble of models with random dropout, you have an ensemble. Models trained with dropout will still have significant variation from run to run.


> That is not even close to the same thing.

It's a common interpretation: https://arxiv.org/abs/1706.06859


There may be a paper on it, but it’s not a common view.

In particular, this paper neglected to do the obvious thing: ensemble networks trained with dropout. It improves performance over dropout alone.


Why shouldn't you employ an ensemble of deep neural networks?


Correlated errors. Naive averaging will lead to overconfidence and it is not trivial to model the correlation. Boosting is worth a shot though.
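
The overconfidence point can be made concrete with a small sketch. It assumes N ensemble members whose errors all have the same variance and the same pairwise correlation rho (a simplification, not something stated in the thread); the function name is hypothetical. Under that assumption the variance of the averaged prediction is sigma^2 * (rho + (1 - rho) / N), so it stops shrinking once the correlated part dominates.

```python
def ensemble_variance(sigma2: float, n: int, rho: float) -> float:
    """Variance of the mean of n estimators with equal variance sigma2
    and equal pairwise correlation rho (simplifying assumption)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Var(mean) = (1/n^2) * (n*sigma2 + n*(n-1)*rho*sigma2)
    #           = sigma2 * (rho + (1 - rho) / n)
    return sigma2 * (rho + (1 - rho) / n)


if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.9):
        # With rho = 0 the variance falls as 1/n; with rho > 0 it is floored at rho * sigma2.
        print(rho, ensemble_variance(1.0, 10, rho))
```

Treating the members as independent (rho = 0) when they are actually correlated understates the variance of the average, which is one way to read "naive averaging will lead to overconfidence".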


My point was that ensembles of deep neural networks are commonly used and yield higher accuracies.


More importantly, what happens if you put a radiologist opinion (or multiple) in such an ensemble?


This. The argument was never to replace doctors. These are valuable tools that augment what doctors can do.


No, the point very much is to eventually replace doctors. You just can't easily get there before first going through a doctor-machine cooperation period.

Automation is a friend of society, but is not a friend of individuals working particular jobs. I think doctors are acutely aware of that.


I don't get the relevance of comments like this which I hear all the time. Say everything in the comment is true. It's all within the noise of a decade of research from now till 2030.

The big picture is, this patchy performance is the writing on the wall. It's over for radiologists, for the most part.

The nature of this problem is a great fit for ML and it will in short order (10 years) be superior in the vast majority of scenarios to expert level humans.

People say, but psychology, fear, unknowns, will require human supervision indefinitely. Of course that's true.

The problem is, radiologists will effectively be relegated to proofreaders. The number of minutes required of them per patient will plummet and so will their job market (unless changes allow many more untreated people to get imaged).

What about the researchers? Even they will take a hit as the imaging analysis part of radiology research moves more along the spectrum toward yet another computer science problem.


Algorithms have been competitive with humans for 20 years on radiology.

And yet humans have not yet been replaced.

Tell me what you expect to be different about the next 10 years that wasn't in the last 20? I'm open to being convinced. But you have to not just say that computers are going to be better - you have to explain why there wasn't already a switch.


Have you ever done anything entrepreneurial in healthcare or just tried to introduce a new or different treatment standard?

1) It just takes longer in this field. Takes longer, but not necessarily more difficult (technically).

It's a nightmare, and when I consulted for a couple of teams for a bit I was shocked at how slowly everything gets done. There are many reasons for that, but it's hard to describe how unlike a typical fast-moving startup it is (unless you're not changing patient protocols).

2) ML is moving faster than it was 10 years ago. Way faster. So comparing work done 20 years ago is difficult, they just didn't have the same resources, ecosystem, and momentum.

3) It's only a hard problem in practice not in principle and well suited to ML. Unlike some problems where difficulty in practice doesn't match difficulty in principle (like warp drive space travel), the remaining engineering and open research questions don't suggest anything that will hit a wall, or prevent ultimate success on the order of the time frame suggested.


> It just takes longer in this field. Takes longer, but not necessarily more difficult (technically).

This immediately made me think "yeah they still use faxes", yet I don't think anyone doubts faxes are on the way out. Still, "it just takes longer" understates just how slow it is. It looks to me like a critical mass of older fax-using doctors will have to die off before the change can happen.

It does make you wonder how they get away with this. In most industries competition and cost pressures will force the change. Not so in medicine, apparently.


Over for radiologists?

Call me back when your machine can produce a comprehensive analysis of an anatomic configuration in relation to every element in the patient's file.

The real problem with those ML systems is that people don't understand what a radiologist is. Let's perhaps solve that problem first?


1) I understand there's more to it than that which is why I called out imaging analysis at some point during my post.

2) I said over "for the most part". There are still travel agents today you know.

Please understand, the argument is not that they will not exist, but that there will be a greatly diminished job market unless imaging demand grows tremendously. And also that the nature of the job will be much different and probably see salaries stagnate.

3) Given your "call me when" scenario I would need to ask for specifics to properly respond. However, at a superficial level I'm not sure I see any kind of millennium prize class problem.


The job of a radiologist is to construct a story that holds water around the diagnosis and communicate that to other specialists, so they can use that to understand the problem at hand.

Concretely, a radiologist may say "there is a lesion compatible with your presumption of diagnosis X, however another neighbouring lesion Y speaks against diagnosis X. In light of patient history, further examination of lesion Y by modality Z is advised". That's the minimum we expect, otherwise we wouldn't need radiologists because many specialists can analyze the images relevant to their specialty themselves.

I don't deny it will be possible to automate in the future, but currently not possible. Radiologists are useful, and their job lies beyond matters of image description.


Sounds like we're in perfect agreement on the critical points and the only debate would be around the timeframe (10 years or ?).

- It will be possible to automate a large % of a radiologist's job in the future, it's currently not possible.

- Radiologists are useful and their job lies beyond matters of image description


> People say, but psychology, fear, unknowns, will require human supervision indefinitely. Of course that's true.

The humans that supervise that will become the new radiologists. The "best" of those humans will have cross-field disciplines in ML model development and traditional radiology. My capitalist side sees a huge opportunity here in consulting and helping the existing radiology departments(who are interested) bridge the gap from current practice to a hybrid approach.

Adapt or die, etc.


- What about the point that the total headcount demand for these supervisors/proofreaders will likely be drastically reduced?

- The idea of cross-training radiologists to understand ML model development might be an ideal, but it's hard to see more than single digit percentages of folks who could make that transition. The jobs are way different and require years of ramp up. Even the mindset and the cultural approaches to work seem different, based on only the couple of times I've worked on cross-discipline teams.

But hey, it's still unfolding, I could be wrong about the consulting. If you figure out a model and how to sell it ping me, I'd be glad to provide feedback to help you refine it quickly.


This is a crucial point. There's a good chance these are problems that could be solved with more data. However, I think philosophically the challenge of integrating these models in a modern healthcare system is more fundamentally related to explainability. A crucial part of a radiologist's job is the "why?" - why did you make this diagnosis, why did you flag this patient, etc. While there are models (especially in image processing) that tackle this problem, I'm not personally aware of them being used in a clinical setting. It's difficult from not only a machine learning perspective, but also in terms of HCI/UX.


That's a helpful perspective. Would it be possible for IBM to create a service that allows patients to submit their own pictures for scanning?


We are at a tricky stage with this. Images are too big to be emailed and many people no longer have CD drives. There is increasing use of tomographic imaging in examinations too, and the files are pretty big.


On the other hand, cheap micro SD cards and thumb drives hold far more than CDs or DVDs.


We prefer to send the images to the PACS the patient wants them on, as for most imaging that’s all that’s wanted. Obstetric imaging is one massive exception.


Auth workflow so whoever does the imaging stores the image, and the patient can grant access and a pointer to systems consuming the imaging.


> Thus when a human sees what the computer caught that the human did not, the mistake is easily dismissed.

You mean dismissed as in not believing the result or not being impressive? (unclear if that's what you mean later by resenting and if it is tied to this statement)


I mean dismissed as in not being impressive. "Yeah, yeah, I should have seen that. It is obvious. No big deal."

And if you use the software as a backup, people don't respond well to their dumb mistakes being pointed out. And the result is that people put effort into not making dumb mistakes...and therefore either slow down or have less time for what humans can do better than computers.

In theory it should work a lot better than it actually does.


What if you use the human as the backup? So they are looking for things the ML system missed. (As well as confirming the things it found.)


So a lot like the centaur era in chess competitions.


> humans wind up resenting the computer...

basis for this assertion?


I think this could be a new milestone, compared to what you saw, because it uses deep neural networks.

I also am not sure saying computers can't catch the hard ones is true in light of this. It seems like a deep neural network would be useful to catch the hard ones.

I agree with using the two together, but I don't see why the two can't be different AI subsystems. That seems to be what they're going for at IBM. Some of the power comes from a scientific model, while more power comes from well-trained deep learning networks.

Before Alpha Zero, advances in self-driving cars, and face detection advances, I'd have agreed with you.


Anything related to AI coming out of IBM should be viewed with a huge dose of skepticism. They're honestly one of the worst offenders in overselling the capabilities of their products, bordering on outright fraud. There is certainly a lot of promise to the application of recent computer vision algos on medical imaging data, but I wouldn't bet much on IBM being anything close to a leader in this space.


I don't doubt what you say. Just want to point out that this is published in a peer-reviewed journal, so hopefully the academic community will judge it objectively.

https://pubs.rsna.org/doi/10.1148/radiol.2019182622


OTOH, can even radiologists (or anyone else) predict cancer at all before it happens? I thought that radiologists diagnose cancer once it is already there.


> can even radiologists (or anyone else) predict cancer at all before it happens?

Sure, some of the time it's easy.

Let's all recall the words of my mother's medical school instructor, "There's a bit of cancer in everyone's prostate".

(The context was a lab exercise in which medical students were supposed to find which of a set of slides of prostate tissue was cancerous. The reminder was necessary because many of the slides were cancerous, just not at levels high enough to be considered medically alarming.)

Predicting that a man will develop prostate cancer is basically the same thing as predicting that he'll experience old age.


not all tumors are malignant


I was lucky to date a girl who was into math, and who was coding those "machine learning" algorithms for a radiology startup here in Shenzhen.

She had a lot of scepticism for what she did. One of the biggest showstoppers, she said, was the unpredictability of errors.

An algo can catch 99% of tumors, including tiny ones, but can randomly pass over very obvious ones which a human radiologist will spot with his eyes closed.

They had a demo day with radiologists, who threw tricky edge case xrays at the computer. Edge cases were all ok, but one radiologist pulled his own xray from his bag, with a 100% obvious, terminal stage tumor, and to the company's embarrassment, the algo failed to detect it no matter how they twisted and scaled the xray. The guy then just walked out.


Had a similar problem ages ago, and ended up adding a "blindingly obvious tumor" detector pass before the regular pipeline, just to avoid this cognitive dissonance.

This is one of the (many) reasons that practical classification systems, as against research systems, tend to become Frankenstein's monsterish over time. It's naive to think that a single approach and pipeline will cover your domain well.


It seems to me the use case should be to have the radiologist look at a scan for tumors. Then, the algo should look. If they disagree, then the radiologist should look at the difference.

It'll be the best of both.

And in the scans where the algo is wrong, have the scan added to the machine learning database of the algo.
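
A minimal sketch of that read, compare, re-read loop, under stated assumptions: `ReviewQueue`, `read_scan` and the `second_look` callback are hypothetical names, both readers give a plain yes/no answer, and "the algo is wrong" means only that the radiologist's second look disagreed with it (there is no ground truth at read time).

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ReviewQueue:
    # Scan IDs where the second look sided against the algorithm;
    # these are the candidates for the algorithm's training database.
    retrain_pool: List[str] = field(default_factory=list)

    def read_scan(
        self,
        scan_id: str,
        radiologist_positive: bool,
        algo_positive: bool,
        second_look: Callable[[str], bool],
    ) -> bool:
        """Return the final call for one scan."""
        if radiologist_positive == algo_positive:
            # Agreement: no extra reading time is spent.
            return radiologist_positive
        # Disagreement: the radiologist looks again at the difference.
        final = second_look(scan_id)
        if final != algo_positive:
            self.retrain_pool.append(scan_id)
        return final
```

The extra reading cost is paid only on disagreements, which is the appeal of the scheme; how often the two readers disagree in practice is not something the thread quantifies.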


Unfortunately, a lot of times hospitals can only afford one or the other. These systems are very expensive and radiologists and cytologists aren't exactly cheap either. But, I agree, both would be good, especially considering the volume a Cytologist is expected to screen in a single day.


> These systems are very expensive

Seems like a business opportunity for a cloud AI screening provider.


You point out another business opportunity: a developer who understands exactly what regulatory hurdles you need to jump in order to release medical software. I'm not sure exactly what's required in this case, but I'm doubtful there are many cloud providers who are HIPAA compliant.


I'm not sure it would need to be regulated, any more than a medical textbook needs to be regulated. The radiologist would be making any decisions. As for privacy, an x-ray is sent. No personal information whatsoever.


If the radiologist has to look at and double check every scan that the algo looked at, then what is the point of the algo? Seems like a useless middleman that gets in the way.


Screening is hard work and tedious, so even trained professionals regularly miss things. TP incidence rate is under 1% in the screening population.

There have been studies showing significant improvement from double-reading mammo, for example (i.e. two radiologists, independently). Using an ML approach is trying to give you some or all of this benefit without the cost of redundant reads.
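
As a rough illustration of why a second, independent read helps, here is a sketch that combines two readers with an "either one flags it" rule. It assumes the two readers' errors are statistically independent, which is optimistic, and the function name and the example numbers are made up for illustration.

```python
from typing import Tuple


def either_flags(
    sens_a: float, spec_a: float, sens_b: float, spec_b: float
) -> Tuple[float, float]:
    """Sensitivity and specificity when a case is recalled if EITHER reader flags it.

    Assumes the two readers err independently (a simplifying assumption).
    """
    # A cancer is missed only if both readers miss it.
    sensitivity = 1 - (1 - sens_a) * (1 - sens_b)
    # A healthy case is cleared only if both readers clear it.
    specificity = spec_a * spec_b
    return sensitivity, specificity


if __name__ == "__main__":
    # Hypothetical readers, each with 85% sensitivity and 90% specificity.
    print(either_flags(0.85, 0.90, 0.85, 0.90))
```

The same arithmetic shows the cost: sensitivity goes up, but specificity goes down, so the recall rate rises unless disagreements are arbitrated rather than recalled automatically.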


>then what is the point of the algo?

The point is that the algorithm can improve results. This isn't ad placement, it's peoples' lives. Checking and double checking should be the norm.


Better to implement a system with a high rate of false positives (more importantly, low rate of false negatives) from the machine learning component, with all positive findings passed onto the radiologist. If the system can reliably (big if) filter out 98% of the chaff then the radiologist can spend a lot more time separating the false positives from the true positives. This approach has worked well for me so far.
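
One way to sketch this kind of triage filter: pick the score threshold from a labeled validation set so that a target fraction of known positives is still flagged, then measure how much of the workload falls below it. The function names, the 0.98 default and the toy data are assumptions for illustration, not the commenter's actual system.

```python
from typing import Sequence, Tuple


def pick_threshold(
    scores: Sequence[float], labels: Sequence[int], target_sensitivity: float = 0.98
) -> float:
    """Largest threshold that still flags at least the target share of positives."""
    positives = sorted(s for s, y in zip(scores, labels) if y == 1)
    if not positives:
        raise ValueError("need at least one positive case to set a threshold")
    # Number of positives we may let fall below the threshold.
    allowed_misses = int((1 - target_sensitivity) * len(positives))
    allowed_misses = min(allowed_misses, len(positives) - 1)
    return positives[allowed_misses]


def triage_stats(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Tuple[float, float]:
    """Return (sensitivity, fraction of all scans filtered out) at a threshold."""
    n_pos = sum(labels)
    caught = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    filtered = sum(1 for s in scores if s < threshold)
    return caught / n_pos, filtered / len(scores)


if __name__ == "__main__":
    # Toy validation data: model scores and ground-truth labels (1 = positive).
    scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.80, 0.90, 0.95]
    labels = [0, 0, 0, 0, 1, 0, 0, 1, 1, 1]
    t = pick_threshold(scores, labels)
    print(t, triage_stats(scores, labels, t))
```

Everything at or above the threshold goes to the radiologist; the sensitivity measured this way only holds if the validation set resembles the screening population.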


This approach is problematic in medical screening applications. Mainly because you don't want to increase the work up rate for false positives since if they involve biopsy and a large screening population, eventually you will kill people this way (indirectly) so there is a pressure to control FP rate.


Because the scan check by the radiologist becomes a _double_ check.


I guess this falls into the category where the ML algo learns a particular type very well and cannot recognise obvious cases if they look different from the training data. Human intelligence is a mix of pattern matching and attention focus that is hard to provide with a single pattern matching ML project. Aren't there projects that try to combine multiple pattern matching models to decrease the number of false negatives?


  1. 99% > 95%, or whatever the radiologist's accuracy is.
  2. Combine both systems for obvious gains.


The reason I'm skeptical of this is that there is no actual comparison to human-level performance, i.e. they didn't have radiologists actually read their images to compare against the model. Notice that the title of the paper is "Predicting Breast Cancer by Applying Deep Learning to Linked Health Records and Mammograms"; it's only in the press release that they seem to imply a comparison to radiologists was actually done.


https://pubs.rsna.org/doi/full/10.1148/radiol.2016161174

This is their comparison point for actual radiologists. Citation number 6. It doesn't look comparable, though. Radiologists are around 90% specificity and sensitivity, which varies a good amount from the model's 77.3% and 87%, respectively.


This is not on this dataset though (right?), so not really a solid comparison point. Plus, like you mentioned, they seem to be doing worse than this benchmark.


No positive predictive value reported, imbalanced test data, IBM. Garbage.


What makes this an "AI Model" instead of just a "Model"? That is, in what way does it have "artificial intelligence"?


"Model" ---> [[marketing department]] --> "AI Model"


[Student in school] Implemented MinMax algorithm in checkers ---> [student looking for work] Implemented state of the art AI algorithm, which successfully will beat the human opponent EVERY time. ---> [HR/Marketing dept at some corp] Wow you are HIRED!!!!!!!! ---> [Lead dev] Oof this guy can't program for shit.


---> [student about to get fired from work] Why on Earth did they put "AI expertise" in the job requirements if all they want me to do is to shovel CSS and JS, and the closest thing to AI they have in the office is a 1960s thermostat?


Can I call myself an AI specialist if I successfully fed a plain support vector machine once or twice for diagnostic support? Feels like driving an old timer here...


"AI" has been used routinely as a synonym for machine learning for the last decade or more, hasn't it? This employed neural networks.


Probably because it was obtained using some kind of machine learning.


Yes, the AI part is marketing/fluff


The real question here is when society will allow a machine to diagnose it, and whether society is ready to accept that most diagnoses are made probabilistically.

To date we allow humans to be at a 70% error level without problems, but we ask machines to be 100% effective.

The very same happens with autopilot, the big numbers say they drive better than humans but...


I remember seeing a statistical analysis here on HN that said the numbers for Tesla autopilots are neither great compared to drivers of Teslas nor do they seem to be fair. (They found a case where a human driver had a crash in what would have been counted as 0 miles in the analysis, indicating that something is inflating the "crashes per miles" metric)


Using autopilot in parking?


This isn't about group think. It is when individuals will allow diagnosis. If you give a woman two options, diagnosis by machine with a record of 80% or a human with a record of 70%, it is a really easy decision to make. The desire to not suffer cancer is strong enough to override almost all emotional arguments. And if you can afford it, you will likely choose both or get a second opinion.


These types of algorithms don't give a confidence interval for their predictions, so I don't believe these diagnostics are based on probability at all.

Having a confusion matrix for what the model predicts correctly or not is not the same as having a CI for the model's prediction.


Isn't it possible to derive a CI from the confusion matrix? https://stats.stackexchange.com/questions/363382/confidence-... or using bootstrap? The authors of this research also provide CIs.
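
Both routes can be sketched for a single confusion-matrix rate such as sensitivity (TP out of TP + FN). The code below is a minimal illustration, not the paper's method: the Wilson score interval is one standard choice for a binomial proportion, the bootstrap is a plain percentile bootstrap, and the counts in the example are hypothetical.

```python
import math
import random
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def bootstrap_interval(
    successes: int, n: int, resamples: int = 2000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the same proportion."""
    rng = random.Random(seed)
    outcomes = [1] * successes + [0] * (n - successes)
    # Resample the n outcomes with replacement and recompute the rate each time.
    stats = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(resamples))
    lo = stats[int((alpha / 2) * resamples)]
    hi = stats[int((1 - alpha / 2) * resamples) - 1]
    return lo, hi


if __name__ == "__main__":
    # Hypothetical counts: 87 of 100 cancers flagged (sensitivity 0.87).
    print(wilson_interval(87, 100))
    print(bootstrap_interval(87, 100))
```

Note that this gives an interval for the rate over a test set, which is a different thing from a per-patient confidence on an individual prediction, the distinction the parent comment is drawing.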


Old news from a major source of AI hype.

Here's some previous results https://med.stanford.edu/news/all-news/2018/11/ai-outperform...


@moderators: would it make sense to change the link to the journal article rather than IBM's article? It's free access.

https://pubs.rsna.org/doi/10.1148/radiol.2019182622


Is this new AI Model under Watson?


There’s no such thing as “Watson”. IBM have put the Watson name on basically everything, to the point where its information content was reduced to zero bits.

Watson for IBM is like the i-prefix for Apple.


Watson is a brand, so that doesn't really mean anything. If Watson refers to anything it would be the NLP functionality that IBM sells, and that's not relevant here.


What's the False Positive Rate?


FP = system (or person) flags this as true when it is not

TP = ... flags as true and it is

FN = ... flags as false but it is true

TN = ... flags as false and it is false

To turn these into rates, you normalize them.

e.g. TPR = TP/P = TP/(TP + FN) = 1 - FNR

etc.

These are characteristics of a classification system

You will also hear sensitivity (TPR) vs. specificity (TNR) often, particularly in medical contexts. In other contexts you'll hear Type I (FP) vs. Type II (FN) error.

In most cases you have a set of trade-offs in your algorithm, and will need to pick a balance between sensitivity and specificity.

c.f. ROC: https://en.wikipedia.org/wiki/Receiver_operating_characteris...
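
The definitions above translate directly into a few lines of code. This is a sketch with a hypothetical `Confusion` class and made-up counts; the formulas are the standard ones (each rate is normalized by the actual positives or the actual negatives).

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # flagged as true, and it is
    fp: int  # flagged as true, but it is not
    fn: int  # flagged as false, but it is true
    tn: int  # flagged as false, and it is false

    def tpr(self) -> float:
        """Sensitivity: TP / P = TP / (TP + FN)."""
        return self.tp / (self.tp + self.fn)

    def fnr(self) -> float:
        """Miss rate: FN / P, equal to 1 - TPR."""
        return self.fn / (self.tp + self.fn)

    def tnr(self) -> float:
        """Specificity: TN / N = TN / (TN + FP)."""
        return self.tn / (self.tn + self.fp)

    def fpr(self) -> float:
        """False positive rate: FP / N, equal to 1 - TNR."""
        return self.fp / (self.tn + self.fp)


if __name__ == "__main__":
    # Hypothetical counts: 100 actual positives, 1000 actual negatives.
    c = Confusion(tp=87, fp=227, fn=13, tn=773)
    print(c.tpr(), c.tnr(), c.fpr(), c.fnr())
```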


I think the OP is asking about the _value_ of the FPR instead of the definition.


Ah, if I misread then from the results section of the linked paper:

For the malignancy prediction objective, the algorithm obtained an area under the receiver operating characteristic curve (AUC) of 0.91 (95% CI: 0.89, 0.93), with specificity of 77.3% (95% CI: 69.2%, 85.4%) at a sensitivity of 87%.

I haven't read the paper's methods, but the data set size is small-ish for this sort of analysis.


The value of the false positive rate is that it lets you know the probability of a true-miss. Depending on the classification exercise, you may be concerned with false positives, where the consequence of a missed call is significantly greater than an unwarranted checkup from a human doctor.


The numeric value



Thanks for linking. If I'm reading that correctly, it's pretty bad in comparison to an average human radiologist at ~6% false positive rate [1]. There's probably a bias factor in there, however, where humans are hesitant to predict potential cancer due to the cost/time involved in follow-up screening.

[1] https://www.ncbi.nlm.nih.gov/pubmed/21643887
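
To see what the gap in false positive rate means for a patient who gets flagged, here is a sketch that turns sensitivity, specificity and prevalence into positive predictive value via Bayes' rule. The model's 87% sensitivity and 77.3% specificity and the ~6% human false positive rate are the figures quoted in this thread; the 1% prevalence and the 90% human sensitivity are assumptions taken loosely from other comments, and the numbers come from different datasets, so the comparison is illustrative only.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value: P(disease | flagged), by Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    prevalence = 0.01  # assumed screening prevalence, not from the paper
    # Model operating point quoted in the thread.
    print("model:", ppv(0.87, 0.773, prevalence))
    # Radiologist with ~6% FPR and an assumed 90% sensitivity.
    print("human:", ppv(0.90, 0.94, prevalence))
```

Under these assumptions roughly 4% of the model's positives and roughly 13% of the radiologist's positives would be true cancers, which is why the false positive rate matters so much at screening prevalence.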


Doctors will be some of the first to be replaced by AI. My physicians walk around with a computer already checking all the boxes for symptoms and seeing what it says. I wish I could find one with a true intuition for medicine


Apparently you're not familiar with the documentation burden in the medical field. EHRs don't diagnose for you.

There is no "true" intuition in medicine, just years of study and practice leading to quick recognition of common problems like any other field.



