Microsoft Finds Cancer Clues in Search Queries (nytimes.com)
148 points by hvo on June 7, 2016 | 66 comments



For those curious about what kind of queries the researchers were interested in: "it typically produces a series of subtle symptoms, like itchy skin, weight loss, light-colored stools, patterns of back pain and a slight yellowing of the eyes and skin that often don’t prompt a patient to seek medical attention." [1]

The article name is: J. Paparrizos, R.W. White, E. Horvitz. Screening for Pancreatic Adenocarcinoma using Signals from Web Search Logs: Feasibility Study and Results, Journal of Oncology Practice, June 2016.[2]

[1] https://blogs.microsoft.com/next/2016/06/07/how-web-search-d...

[2] http://jop.ascopubs.org/content/early/2016/06/02/JOP.2015.01...


First, this is a very scary first step in the assault on privacy. It is an opening for people to argue that a person who fits a certain search profile should be de-anonymized. After all, wouldn't you want to know if you had cancer? In addition, as it is not about law enforcement but public health, there are a lot fewer limitations on what information the government can access.

Second, after seeing the type of queries, I do not think that this is all that helpful. If a person has unexplained weight loss or yellow skin or eyes, they should always go see their doctor right away. My guess is that most of the specificity of this study comes from those two terms (weight loss combined with yellow skin). Just getting out that message will do a lot more to save people's lives than violating their privacy in this manner.


Not chumming for sympathy here, but I've mentioned before on HN that my own pancreatic cancer first manifested itself as a vague feeling of unease at the sternum, and persistent nighttime acid reflux. No other symptoms. My GP didn't recognize it for what it was, and treated the reflux symptom.

It would sure be spiffy if mining queries could identify early symptoms, working backwards if you will, rather than be used to find cancer patients by working forward from symptoms already identified by the medical establishment.


I have exactly the same symptoms and a history of pancreatic cancer in my family. Not sure how to try to escalate this further or if that seems ridiculous. What did you do to get further exams?


I became severely anemic after a few months and the ER did an MRI, looking for internal bleeding. They found the tumor then.

My GP had x-rayed me, looking for a cardiac anomaly - he suspected an aortic aneurysm. What he should've done is order some soft tissue radiology instead. Having your GP consider a CT or MRI is my advice.

Hope in your case it's all for naught. Best to check though.


I don't think I have anemia so maybe it is just GERD, but I will mention what you said on my next visit.

I really hope you are better.


Agreed. Perhaps this could be addressed by a personal medical 'agent' that could access condition profiles & statistics on your behalf, rather than having your data mined en masse by a search engine, ISP, etc.


> they should always go see their doctor

GPs often do not catch rare conditions or determine the underlying cause of milder symptoms.


I had a collapsed lung and went away from my GP with an inhaler for asthma (I've never had asthma). Luckily I went in after a few days and got x-rays and then got treated immediately, and I'm fine (I hope:-) 20 years later. Still though, it reduced my trust in doctors dramatically.


You're a cyclist, so you probably have very good lungs. I was in a similar situation (saxophone player, avid cyclist) and also ended up with a misdiagnosed collapsed lung. If you're not showing the regular symptoms (such as very low oxygen levels in your blood) then they are bound to assume something else is the cause. If you're otherwise in very good condition, sharp pain in your chest is a good reason to suspect a collapsed lung or a partial detachment of the lung from the chest wall, and it should be taken very seriously.


My 4 year old was treated for asthma by his ped then admitted to hospital for pneumonia a few days later. Two surgeries and 30 days in hospital later, he is all better. It is frustrating how subjective medicine still is.


The fact that these cancers go undetected until it's too late at present is evidence enough that GPs are not adequate for this purpose.


The rule GPs are taught to follow is "when you hear the horses running, don't think of zebras". Symptoms are frequently either absent or nonspecific for such a long time that GPs stick to what fits and is most probable, and only look further if the current approach isn't working.


The squeaky wheel gets oil. I would say that people who are less acquiescent will get more tests, sooner, especially in a scenario where demand outstrips supply. Women tend to be better in that way in my experience, men are more easily fobbed off.


This is Occam's razor. The horses saw is new to me.


This is unfortunately a systematic problem with medicine that probably arises from the fact that the signal to noise ratio is so low. If you see 100 patients/year who have indigestion but think they have stomach cancer, then the 1 patient that really has cancer is impossible to detect (without also spending unreasonable resources on the other 99).


I don't think they necessarily have to deanonymize the data; it could be something like an alert that pops up when the search engine detects you may have a medical condition.


> Just getting out that message

How do you suggest getting out the message? Blanket public health advertising would be very expensive and would be ignored by most people.

I think search histories are potentially very useful but privacy is important too. It would be interesting to see if there's a way of using search history information in a way that preserves privacy.


We should get the word out that jaundice is a serious medical condition? I'm pretty sure that work has been done. "If your eyes turn yellow, you need to see a doctor" is an easy sell for anyone.


OTOH, there were, as recently as a few months ago, ads in Toronto's subway system exhorting people to go to the doctor if they found blood in their urine. And this is in Canada, where there's no financial barrier to going to see your doctor for "maybe it's nothing"-style ailments. I fear you might be overestimating the average person.


Should be an iPhone app


Thank you! Your comment is 10x more informative than that NYTimes article. NYTimes used to be much better but now it's just no different than other small newspapers.


It's a fantastic concept, I search for medical symptoms all the time.

But it's funny, I was thinking the same thing the whole time I was reading the article... what are the search queries?!?


The insight into user health gleaned from this is phenomenal. The question is how a search engine can provide helpful feedback in a non-alarming manner (with all the interpretations of that), so that it helps the user seek professional help in a purely disinterested way: with the user's health and well-being in mind rather than commercial interest, even incidental.


> "it typically produces a series of subtle symptoms, like itchy skin, weight loss, light-colored stools, patterns of back pain and a slight yellowing of the eyes and skin that often don’t prompt a patient to seek medical attention."

But these symptoms map closely to other serious diseases as well, notably liver disease or liver failure. (I've had them; I know.)

While it might make sense to flag such symptoms as serious and requiring professional advice, in this case I don't see how you could possibly distinguish between the two given the vagueness of the input data.

That raises a real question about how you inform the user/potential patient without freaking the person out, or driving him or her down the wrong path.

Some other commenters have mentioned that they have been misdiagnosed by a GP, and that perhaps this system would help. I think misdiagnosis is a different problem. This system's real potential value is in driving more people to get professional help, most likely from a GP.

Tragic exceptions aside, most GPs are very good at what they do, which most days involves keeping "regular" people healthy. Perhaps the biggest problem in health care (at least in the US) is the lack of enough GPs in many communities and the inability of people to get access to them conveniently and affordably.


I don't understand why this function is needed if the search engine performs its purpose. Shouldn't searching on any of those symptoms return results that suggest the possibility of pancreatic cancer? As designed, it returns those results to the very person with the most vested interest in both the results and their own privacy.

How is this project anything more than "other people snooping around in my search queries," or any better than simply tuning the search engine to highlight those results more if they are believed under-represented?


Looking up any symptoms at all on the web is pretty much the worst possible thing a health anxiety sufferer can do.

As it is, nearly any symptom you put in will bring up the possibility of cancer, so it’s all just noise.

Being able to connect disparate symptoms that the patients don’t connect themselves is a good thing.

Snooping on searches isn’t necessarily the greatest way of achieving that given all the privacy implications, but it may be reasonably effective.


It could find correlations based on search that were previously unknown. Imagine a world where the link between tobacco and cancer isn't known: perhaps it emerges that people who search for cancer were, 10 years earlier, also searching for where to buy cigarettes.

I think it's interesting that this research can happen.


Well, I admit that does sound like a promising concept, but it's not quite the same one described in the article. By my reading, this project seems oriented more towards correlating some number of searches and connecting the dots in a present-tense, Clippy "Your search patterns indicate you might have pancreatic cancer. Can I show you patient reviews of several good Oncologists in your area?" - sort of way.

That seems rather different than "Hey, 10 years ago you searched on some stuff that indicated you might have had pancreatic cancer. Sorry we didn't catch it sooner, since the 10 year survival rate is below 5%..."

To be clear, either one seems potentially quite cool and interesting. But if someone else besides me and Clippy is privy to these results - unless I explicitly shared them - then it seems a little creepy.

In retrospect, I suppose I wouldn't be too terribly offended if I died already, but my results were later able to help prevent someone else's untimely demise...


How long before Google and Microsoft put up an automated warning:

Your recent search queries suggest you may have cancer, please seek advice from your doctor.

Or some time in the future...

The pattern of your searches seems to indicate you are a terrorist, we aren't telling you and we have called the thought police.


What if they put this warning up to their (actual) customers, like the US insurance companies that could hike premiums or ditch their customers before they realized they had costly conditions?


I would be quite likely willing to pay Google for their services more than they are making out of me currently, if:

1. They agree, in a legally binding way, not to give any of my data to anyone.

2. They do not show me a single ad anywhere.

Obviously, that would mess up the current business model of everyone counting on AdWords revenue, so I am not holding my breath here.


They would quickly have a potential revenue ceiling that would likely decline over time.

At the moment it's the opposite.


Google is trying this with YouTube Red, which removes all advertising from YouTube.

I believe it is actually more profitable than advertising. With ads, an average person is worth between $0.01 and $1 a month (depending on the type of ads), much less than the $9 for YouTube Red.


I had a humorous image of that information being relayed to you by Clippy.


You appear to be writing a hacker news comment, do you want me to really fuck that up for you?


Methods: We identified searchers in logs of online search activity who issued special queries that are suggestive of a recent diagnosis of pancreatic adenocarcinoma. We then went back many months before these landmark queries were made, to examine patterns of symptoms, which were expressed as searches about concerning symptoms. We built statistical classifiers that predicted the future appearance of the landmark queries based on patterns of signals seen in search logs.

Results: We found that signals about patterns of queries in search logs can predict the future appearance of queries that are highly suggestive of a diagnosis of pancreatic adenocarcinoma. We showed specifically that we can identify 5% to 15% of cases, while preserving extremely low false-positive rates (0.00001 to 0.0001).
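
For anyone trying to picture the labeling step in that abstract, here is a rough Python sketch of "find the landmark query, then look back". It is only an illustration under stated assumptions: the phrase list, the symptom terms, the 180-day window and the function name are all made up for the example, and the paper's real query lists and classifier are not reproduced here.

```python
from datetime import datetime, timedelta

# Hypothetical phrases and terms, chosen only for illustration.
LANDMARK_PHRASES = ("just diagnosed with pancreatic cancer",)
SYMPTOM_TERMS = ("itchy skin", "weight loss", "light-colored stool",
                 "back pain", "yellow eyes")


def symptom_counts_before_landmark(log, lookback=timedelta(days=180)):
    """Count symptom searches in the window before the first landmark query.

    `log` is a list of (timestamp, query) pairs for one anonymous searcher.
    Returns None when the searcher never issued a landmark query.
    """
    landmarks = [t for t, q in log
                 if any(p in q.lower() for p in LANDMARK_PHRASES)]
    if not landmarks:
        return None
    first = min(landmarks)
    counts = {term: 0 for term in SYMPTOM_TERMS}
    for t, q in log:
        # Only look at searches made before the landmark, inside the window.
        if first - lookback <= t < first:
            for term in SYMPTOM_TERMS:
                if term in q.lower():
                    counts[term] += 1
    return counts


example_log = [
    (datetime(2015, 1, 10), "unexplained weight loss causes"),
    (datetime(2015, 3, 2), "itchy skin at night"),
    (datetime(2015, 6, 1), "I was just diagnosed with pancreatic cancer what now"),
]
features = symptom_counts_before_landmark(example_log)
print(features)
```

Counts like these would presumably be one input to the statistical classifiers the abstract mentions; the sketch stops short of training anything.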


It would be really interesting if after a series of queries, the search engine displayed a little text block that said something like:

"We don't often do this, but did you make the following searches regarding the health of yourself or a loved one? ... SEARCHES FOLLOW ...

Studies show that a large proportion of people making these searches for medical purposes should talk to a doctor about these symptoms. Here's a number to call if you do not have a personal physician: (555) 555-5555"


So, pancreatic cancer incidence rates are about 10 in 10 000, and they detect ~10% of that.

Meanwhile, their false positive rate is as high as 1 in 10 000. Can anyone weigh in on whether that's per user in total?

If so, half their warnings are right, half are wrong. Which is not bad at all, but it is quite important for the wording of your warning.
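
The "half right, half wrong" estimate is just Bayes' rule, which can be checked with a few lines of Python. The inputs below are the figures assumed in this comment (incidence of 10 in 10 000, about 10% of cases detected), not numbers verified against the paper, and the function name is made up for the sketch.

```python
def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    """Share of warnings that are correct, by Bayes' rule."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * false_positive_rate
    return true_pos / (true_pos + false_pos)


# Assumed inputs from the comment: 10 in 10 000 incidence, ~10% detected.
prevalence = 10 / 10_000
sensitivity = 0.10

# The two ends of the false-positive range quoted from the abstract.
ppv_high_fpr = positive_predictive_value(prevalence, sensitivity, 0.0001)
ppv_low_fpr = positive_predictive_value(prevalence, sensitivity, 0.00001)

print(f"FPR 0.0001:  {ppv_high_fpr:.3f}")
print(f"FPR 0.00001: {ppv_low_fpr:.3f}")
```

With these inputs the share of correct warnings is about one half at the high end of the false-positive range and roughly 0.9 at the low end, so the answer depends heavily on which end applies and on whether the rate is per user.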


  "The data used by the researchers was anonymized, meaning
   it did not carry identifying markers like a user name,
   so the individuals conducting the searches could not be
   contacted."
As when AOL or Yahoo released their anonymized data set, it is often easy to take someone's search history and work backwards to find out who they are. How can they ensure that personally identifiable information has been scrubbed 100% from all queries? Maybe a user searched a courier tracking number, and that info can now be looked up on the courier's site and tracked back to their home or office address. Each additional piece of info gets you one step closer to identifying who they are.

Yet one more reason to use DuckDuckGo for your general search needs.


Not sure why you are getting downvoted?

Predicting users' health data based on simple searches is terrifying. I'm not sure why anyone would be happy with Google or Microsoft having this information, especially when their customers could use this against you.


Google and Microsoft already have this information


Target actually did something similar, though their intent was to eventually find better ads for families who were expecting. In their case, fully deanonymizing and being straightforward - outright advertising baby products - turned out to be a nightmare, and they eventually turned to more subtle advertising by inserting baby products into the weeklies.

Could we see a case where, when someone searches for one thing, instead of seeing results that pertain to that immediate query we see results that match common future searches?

http://www.nytimes.com/2012/02/19/magazine/shopping-habits.h...


I was just reading this on mobile, and I got redirected to another site that told me my device had "problems"!

The NYT can't even keep their site clean of malware-laden advertisers. Ridiculous.


The NYT on desktop has generally good ad experience. The mobile website is awful awful awful with ads for some reason. Use their app instead if you can


Isn't the real story here that Bing is keeping track of users' search queries for months?

Google Flu works differently. They try to predict a flu epidemic by counting related search queries.

But Microsoft is predicting the health of a single person based on his search history.

Edit: Thinking about it: of course Google, Facebook and others could do the same because they also gather user data.


Everything you do is logged by all these sites, and tied to your identity there, no matter what they say. They have so much storage space to spare that it's more compelling to save the data for potentially analyzing it later than to throw it away.

A few years back, Google decided to recalculate YouTube view and subscriber counts to counter bot usage. They store so much information about every request that they have been able to detect views made by bots in the past, from patterns in this data.


The thing is, this is illegal in the EU.

If you collect data for one purpose, you can't use it for any other purpose, unless you explicitly and in easily readable language told the user about it before.

You can't retroactively get permission to use data for other purposes either.

And currently medical or research purposes are not listed in Microsoft's or Google's ToS.


If people are searching for medical symptoms on a search engine, aren't they already ending up at WebMD or whatever and finding possible diagnoses?

This would have been a lot more interesting if the keywords were a lot more subtle - like a change in behavior marked by a sudden craving for salty foods or whatever.


This is searches over a period of time before any diagnosis was actually made. Not "light stool, eyes slightly yellowed, cancer, " etc.


Assuming sites like WebMD are tracking users and analyzing symptom searches over a period of time.

I expect people would find this more of an invasion of privacy than the search engines doing it.

Hopefully this is a wake-up call to people about how simple searches quickly build very personal profiles about you that you yourself may not even be aware of (for better or for worse).


I'm not sure that this is a good way to demonstrate data mining skills. The survival rate for pancreatic cancer is abyssal:

> While five-year survival rates for pancreatic cancer are extremely low, early detection of the disease can prolong life in a very small percentage of cases. The study suggests that early screening can increase the five-year survival rate of pancreatic patients to 5 to 7 percent, from just 3 percent.

WP claims 20%[1] though a glance at the referenced source suggests that the WP summary is bogus.

So the only ones who benefit from this data mining would be health insurers, who could get rid of people who'll incur very high treatment costs with a low expectation of success.

[1] https://en.wikipedia.org/wiki/Pancreatic_cancer


You mean abysmal.


Indeed, I do. Thank you.


Cancer can arise anywhere. Symptoms are dependent on where it comes from. I have seen too many doctors succumb to cancer, with little warning.

If MRIs were faster, I think whole-body MRIs would be a decent screening tool. The problem is that they 1. are expensive, 2. take a long time to do and are uncomfortably loud, and 3. generate heat in the body.

There are also tumor markers for many cancers.

Screening guidelines unfortunately have to be feasible for whole populations (lowest common denominator). More informed people with resources can do better if they take initiative (with some trade-offs of time, risk).

In general, our bodies can use a lot of tuning. The more you look, the more you find. Some tuning has trade offs.

If you want to be proactive, you also have to ask your doctor for trials of particular tests or treatments. Doctors are conservative, and the first thing they will want to try is wait and see. That leaves you with maybe 100 experiments you can do on yourself in a lifetime. We need to be able to do hundreds per week to make significant progress towards giving our bodies 99.99999% uptime.

The future is Star Trek type doctor, but a personal one for everyone. The major hurdles are economic and regulatory. Some physical.


To translate the false-positive rate into concrete numbers:

According to the American Cancer Society (http://www.cancer.org/cancer/pancreaticcancer/detailedguide/...), about 53,070 people will be diagnosed with pancreatic cancer this year. The abstract says this method detects 5% to 15% of cases: that's about 2,700 to 8,000 correct detections. Assuming there are 100 million people using Bing (https://www.quantcast.com/bing.com), between 1,000 and 10,000 cases will be wrongly detected (0.00001 to 0.0001 false positive rate).
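
Here is the same arithmetic as a small Python sketch, in case anyone wants to plug in other numbers. The inputs are the ones quoted in this comment; applying the detection rate to all US diagnoses and the false-positive rate to every Bing user are the comment's assumptions, not something the abstract states. Integer arithmetic is used to avoid floating-point rounding surprises.

```python
diagnoses_per_year = 53_070    # American Cancer Society estimate quoted above
bing_users = 100_000_000       # assumed user base from the comment

detection_percent = (5, 15)            # 5% to 15% of cases identified
false_positives_per_100k = (1, 10)     # 0.00001 to 0.0001

# Correct detections: share of diagnosed cases flagged by the classifier.
true_detections = [diagnoses_per_year * p // 100 for p in detection_percent]

# False alarms: share of all users wrongly flagged.
false_alarms = [bing_users * n // 100_000 for n in false_positives_per_100k]

print("correct detections:", true_detections)
print("false alarms:", false_alarms)
```

The exact figures come out slightly below the rounded "2,700 to 8,000" above, and the false alarms land at 1,000 to 10,000 as stated.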


It wasn't entirely clear from the article, but I assumed that the false positive rate referred to the ratio of people with matching search queries, not of all Bing users. In that case the absolute number of false positives would be much lower.

Also, the detection of 5% to 15% of cases would seem to me to refer to only Bing users; I doubt they're claiming to be able to detect 5-15% of all cases of pancreatic cancer.

Would've been nice if these things were actually spelled out in the article.


> It wasn't entirely clear from the article, but I assumed that the false positive rate referred to the ratio of people with matching search queries, not of all Bing users. In that case the absolute number of false positives would be much lower.

If that's true, how did they have enough real positives to measure such a low false positive rate?

> Also, the detection of 5% to 15% of cases would seem to me to refer to only Bing users; I doubt they're claiming to be able to detect 5-15% of all cases of pancreatic cancer.

Yeah, I'm being dumb, there's no way that percentage is out of all cases!


Insurance companies would just love this data...


The final confidence of the "diagnosis" is about 50% (you can use the Bayes formula to get that, see here http://www.visualab.org/index.php/cyberchondria-microsoft-ba...). Yes, 50% is better than nothing, but the NY Times article does exactly what serious newspapers should not do: it lets people think that "the Internet" is a great place to diagnose yourself. It is not.


No mention of what type of queries they believe are associated :/


> " We showed specifically that we can identify 5% to 15% of cases, while preserving extremely low false-positive rates (0.00001 to 0.0001)."

Several years ago Google Flu Trends also claimed to have 97% accuracy compared to CDC data. But later on it was found to be way off from the real data. Did the authors compare their study to Google Flu Trends?

Also it's not clear how they reached the conclusion of a low false-positive rate. Did they randomize their sample pool and run their predictive model for several rounds?


Can't one already run ad campaigns based on a user's search history? What if a cancer foundation or similar organisation ran a targeted ad campaign?


Any links to the non-paywalled technical paper? I'm curious as to the learning models they built to run the actual predictions.


Wait ... they used Bing for such a long time and lived to tell the tale? :)


I was going to make a joke about Microsoft violating privacy as they scan your search queries and are able to ID you but I see HN posters, here, have already taken up that torch.



