Covid-19 Open Research Dataset (semanticscholar.org)
219 points by shinryudbz on March 19, 2020 | 47 comments



Call me cynical but isn't this a bit... Redundant? I mean sure, nlp, parse some papers and make some sort of search/q&a type of thing? Fine, whatever.

Nice work on organizing the data and to a fair degree taking care of the tedious pre-processing associated with any ml task but... Idk, I fail to see how this could help. Personally I'll fiddle around with it in my spare time but mostly for fun, I don't expect anything significantly useful to come out of it.

On a higher level I feel like the success of the IT industry has made a lot of people feel like it is the answer to every possible question. And sure, bioinformatics is an incredible subject and I've invested a lot in studying it in my spare time in the last couple of years but __purely__ out of curiosity. But my 2 cents on the subject is that we (developers, ml engineers, etc.) cannot provide the answer to the meaning of life. In this situation our safest course of action is to work from home as much as possible and avoid making decisions that could potentially make the situation worse. Basically protect ourselves and those around us as much as possible. And by doing so, let people who have the adequate training, knowledge and experience take care of the situation. Sure, play around with it in your spare time, and if you come up with something - share it. But the whole "stand back boys, let us men handle it" mentality is what really bugs me. If anything, history has taught us that this goes badly 990 out of 1000 times (to be more in line with Bayesians). We all want the underdog to win the game but... Come on...


> I mean sure, nlp, parse some papers and make some sort of search/q&a type of thing? Fine, whatever.

I think you severely underestimate the effort and expertise required for getting decent content-aware search, let alone a fully functional question-answering pipeline.

These are their own subfields of research in text mining. The fact that you conflate these as if they were some trivial task on a new dataset shows your lack of understanding of the field.

Biomedical text mining [1] is a wide subfield with plenty of open datasets and competitions such as the bi-annual ACL BioNLP workshop [1]. Furthermore, existing knowledge-base creation and information extraction pipelines such as protein-protein interaction extraction, NER, event extraction, drug-drug interaction mining, etc. could be applied to this novel dataset and provide useful insights for researchers and staff.


> These are their own subfields of research in text mining. The fact that you conflate these as if they were some trivial task on a new dataset shows your lack of understanding of the field.

On the contrary, I've built many, but in this context I see it as a waste of time. As I said in another reply, it's a resume/portfolio task with no real world application.


Yeah, there's some criticism about this feeling improvised, but guess what? It IS improvisation.

Remember what Hickey says about improvising? (https://www.infoq.com/presentations/Design-Composition-Perfo...)

It is art and years of preparation right there performing real-time.

So that guy criticizing is only declaring himself incapable of improvising on this matter... and that's ok, it's only for the best ones after all. It is called a challenge for a reason.


Redundant how? You say you don’t see how this could help, but the opposite - not doing it - is certain not to help. I don’t understand your criticism and cynicism here.


Redundant as in anything that comes out of it will be nothing more than unsolicited help. In 3 key points:

1. Building something no one has asked for(the people trying to fight the virus).

2. Unlikely for someone to use whatever comes out of it.

3. It's a build-a-portfolio type of challenge at best.


I very much agree here. I recently ran part of the data through some NLP (incl. AWS Comprehend) and nothing significant came out. Ended up doing simple free-text or keyword search and only landed on one interesting finding so far: https://twitter.com/yazijys/status/1240465780715683841


It takes downloading one of these files, gunzipping them, extracting the tar and opening up a JSON file at random to really understand just how distant the title of the project is from its contents. I fully realize this is about natural language processing, but... this is beyond reach. You'd have to teach a computer to become a doctor first.
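For anyone who wants to repeat that look-inside step without unpacking the whole archive, here is a minimal sketch. It assumes only that the download is a gzipped tar containing `.json` files; the function names are made up for illustration, and nothing about the JSON schema is assumed beyond each file holding a JSON object at the top level.

```python
import json
import random
import tarfile


def sample_json_from_tarball(path, seed=None):
    """Return (member_name, parsed_json) for one randomly chosen .json file in a .tar.gz."""
    rng = random.Random(seed)
    with tarfile.open(path, mode="r:gz") as archive:
        members = [
            member for member in archive.getmembers()
            if member.isfile() and member.name.endswith(".json")
        ]
        if not members:
            raise ValueError("no .json files found in archive")
        chosen = rng.choice(members)
        # extractfile() yields a binary file object; json.load accepts bytes input.
        with archive.extractfile(chosen) as handle:
            return chosen.name, json.load(handle)


def summarize(document):
    """Map each top-level key to the type name of its value, to get a feel for the schema."""
    return {key: type(value).__name__ for key, value in document.items()}
```

Calling `summarize(sample_json_from_tarball(path)[1])` on a downloaded archive prints the top-level layout of one paper, which is usually enough to decide whether the content is workable for a given task.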

Feels like where AI (and computing in general) should go in case of Covid-19 and other illnesses is analyzing and simulating how the heck we work to begin with. Take a few minutes to watch these videos of supercomputer simulations that show -- fragments -- of the fundamentals of how we exist.

Multi Scale Modeling of Chromatin and Nucleosomes - https://www.youtube.com/watch?v=4Z4KwuUfh0A

DNA animation showing realtime DNA replication - https://www.youtube.com/watch?v=7Hk9jct2ozY

Somewhere in these mind-boggling processes is where the disruption called Covid-19 puts a stick in our wheels. Compared to the complexity of the simulations, the abstract ideas contained in these files are so macroscopic. It's both humbling and awe-inspiring.


Even though the Multi Scale Modeling video is extremely impressive - it is still an MD simulation that uses classical mechanics. A full atomistic quantum simulation of such a large system is out of reach for even the largest super computers. We barely know anything about biology.


They're already using MD simulation in certain areas of drug discovery (computer aided drug design). Naturally, as you mentioned, we're compute-limited so the models are simpler and the questions we ask of them are also 'simple'. But that is not stopping us at all. While not exactly MD, my S.O. works in drug discovery and they use simple computational screening models to simulate the chemistry and screen out tens of thousands of possible drug candidates.


How can applying a machine learning algorithm to this data (a collection of research papers) help fight Covid? It’s possible most papers here are bogus or low quality. Garbage in, garbage out?


Yeah I don’t see any current AI models that can combine all the knowledge from a subject and do something really profound with it. I do see you can generate more AI covid19 articles from it.


I imagine they have some algorithm that takes an input consisting of the title, abstract, and authors with the information about how impactful their previous work has been (this is probably the most important factor) and outputs some ranking or a likelihood that the research will be cited/clicked on.


That's feasible, it's not even a hard thing to do (maybe hard to do very accurately), I worked for a company that offered exactly that service, as a small project.

Still, that probably doesn't answer the question, how does that really help the fight.

I suppose we'll learn what they have in mind when those questions are asked in the competition.


Might be useful to cluster the papers into various topics. E.g. a person interested in drug discovery needs a different (biochem biased) set of papers than one doing vaccine development (immunology biased). *(caveat: obvious overlaps)
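A rough sketch of what such topic grouping could look like, using only the standard library: TF-IDF vectors, cosine similarity, and a simple "leader" clustering pass where each paper joins the most similar existing group or starts a new one. The function names, the 0.2 threshold, and the toy abstracts are all assumptions for illustration; a real attempt would more likely use a proper library and better text representations.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(documents):
    """Return one sparse {term: weight} TF-IDF vector per document."""
    counts_per_doc = [Counter(tokenize(doc)) for doc in documents]
    doc_freq = Counter(term for counts in counts_per_doc for term in counts)
    total = len(documents)
    return [
        {term: tf * (math.log((1 + total) / (1 + doc_freq[term])) + 1.0)
         for term, tf in counts.items()}
        for counts in counts_per_doc
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def leader_clusters(documents, threshold=0.2):
    """Group document indices; the first index in each group acts as its leader."""
    vectors = tfidf_vectors(documents)
    clusters = []
    for index, vector in enumerate(vectors):
        best, best_score = None, threshold
        for cluster in clusters:
            score = cosine(vector, vectors[cluster[0]])
            if score >= best_score:
                best, best_score = cluster, score
        if best is None:
            clusters.append([index])  # nothing similar enough: start a new topic
        else:
            best.append(index)
    return clusters


abstracts = [
    "protease inhibitor drug binding assay",
    "drug candidate inhibitor screening binding",
    "vaccine antibody immune response",
    "immune response antibody titer vaccine trial",
]
print(leader_clusters(abstracts))  # [[0, 1], [2, 3]]
```

Leader clustering depends on input order and has no notion of overlapping topics, so it does nothing about the overlap caveat; it is only meant to show the shape of the idea.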


Create a generative model that makes more covid-19 papers. Bonus points if it's possible to manipulate the model into creating papers that tell us how to create a cure or vaccine /s.


Where are the most up to date, most reliable case numbers? I'm tracking US day-by-day case growth.

These all have different numbers:

http://covid19.fyi/

https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_t...

https://coronavirus.1point3acres.com/


For global numbers, the WHO daily situation reports:

https://www.who.int/emergencies/diseases/novel-coronavirus-2...

The same data is also presented graphically here:

https://www.who.int/redirect-pages/page/novel-coronavirus-(c...


I recently had to deal with this problem when building out Covidly (www.covidly.com)

Initially I tried using WHO and JHU, but quickly found their data to be riddled with discrepancies, occasional bugs, and direct contradictions with official statements from various countries.

I ended up aggregating multiple sources (including WHO/JHU/etc), performing some sanity checks to remove outliers, then doing my best to merge the remaining results.
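One possible shape for that aggregate / sanity-check / merge step, as a sketch only (this is not the actual Covidly pipeline; `merge_reports`, the `tolerance` parameter, and the numbers are invented for illustration). It drops impossible values, takes a median across sources per region, discards reports that stray too far from it, and merges what is left.

```python
from statistics import median, median_low


def merge_reports(reports, tolerance=0.25):
    """Merge {source: {region: count}} into {region: merged_count}.

    Values that are missing or negative are dropped, then any report further
    than `tolerance` (as a fraction) from the per-region median is treated as
    an outlier and ignored.
    """
    by_region = {}
    for counts in reports.values():
        for region, value in counts.items():
            if value is None or value < 0:
                continue  # basic sanity check: counts cannot be negative
            by_region.setdefault(region, []).append(value)

    merged = {}
    for region, values in by_region.items():
        # median_low always returns an actual reported value, so at least
        # that value survives the outlier filter below.
        center = median_low(values)
        kept = [v for v in values if abs(v - center) <= tolerance * max(center, 1)]
        merged[region] = median(kept)
    return merged


# Made-up numbers: source_c has a factor-of-ten glitch for X and a bogus negative for Z.
reports = {
    "source_a": {"X": 100, "Y": 50},
    "source_b": {"X": 104, "Y": 52},
    "source_c": {"X": 1000, "Y": 51, "Z": -5},
}
print(merge_reports(reports))  # {'X': 102.0, 'Y': 51}
```

A median-based filter needs at least three sources per region to reject anything meaningfully; with only two disagreeing sources there is no way to tell which one is wrong.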

Happy to share this data publicly if there's interest!


"Explosive growth" and "Virus is largely out of control" is dangerous risk communication.

How often are you updating the data? If there's manual deconfliction, do you clearly indicate how old data for a country or state/province is, or how accurate the reporting is that your massaged summary comes from?

If you're meaning to put this out in the world as a source of information please get some feedback first from people that do this sort of thing for a living. Inaccuracy or excitable language can do more harm than good in emergencies.


You make a very good point about risk communication. I already had to make a few updates (e.g. hiding the mortality rate) that were causing unnecessary panic. I'll work on optimizing the existing language as well.

Regarding the data, it's updated and processed automatically every 10 minutes.

I really appreciate the feedback! Let me know if anything else stands out to you.


Here, I would argue that showing mortality rates is not bad. The goal in risk communication is to instill a level of concern equal to the current threat. It's all about context.

If you show infection, death and recovery rates, you have to provide context and help people understand what a thing means.

1-10 scales can make parsing difficult (3 and 4 have the same description right now, for example). Governments, militaries and emergency aid orgs put a lot of effort into color and coding systems.

Give Peter Sandman a Google, and check out his site here:

https://www.psandman.com/

He's an expert in how to talk about scary, hard-to-visualize things (like a viral pandemic).

Also, how old is the data being drawn from, what algorithm do you use to de-conflict the sources, and how do you disclose this to your audience (other than the general about page)? If a source has different refresh rates for countries that it tracks, how are you reflecting that to your audience?

A note, China is missing from your nifty "First 20 days" graph, which maybe you should just call "First 20 days after 200 cases" or something like this to make it clearer what's being tracked.
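To make the suggested label concrete: assuming the chart lines each country up at the first day its cumulative count reached 200 (that is only a guess at what the graph does), the alignment could be sketched like this. The function name and sample numbers are hypothetical.

```python
def align_to_threshold(series, threshold=200, days=20):
    """Return up to `days` cumulative counts, starting at the first day >= threshold.

    `series` is a list of cumulative case counts ordered by day. An empty list
    is returned if the threshold was never reached.
    """
    for start, value in enumerate(series):
        if value >= threshold:
            return series[start:start + days]
    return []


print(align_to_threshold([10, 150, 220, 400, 900], threshold=200, days=2))  # [220, 400]
```

A country whose series already starts above the threshold is aligned at its first recorded day, which may not be its true crossing day; a case like that would need an explicit note on the chart.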


I would love to see what you're doing here. I'm pulling it from JHU, it's been pretty consistent, but not as up to date as I would like. But I'm looking into aggregating data from other places such as Wikipedia. It would be great if there was more of a group effort here.

I've seen some efforts such as: https://github.com/covid19-data/covid19-data which is looking to separate out the data aggregation from the dashboard. However, they are scrubbing out the state-based information which I rely on.


I am a researcher, and I fail to find detailed data, please help! In all datasets we see cumulative confirmed, recovered and death numbers organized by day. We would need cumulative confirmed by onset time (first symptoms), and confirmed by test time (the current value). We would also need recovered and death by onset time and confirmation time. Where are these numbers?


Unfortunately the existing public data sets I've seen lack this information. The level of detail for each confirmed case is largely dependent on which country is reporting the data (plus it's often in an unstructured and inconsistent format). I would love to know if you find any data source with more details.



What about by city?


I have been using NYT: https://www.nytimes.com/interactive/2020/us/coronavirus-us-c...

They update it many times a day.



I think https://www.coronawiki.org/ uses that data to visualize it.


Here are the CDC's #s - they're released @ noon for up to 4pm the previous day.


It might be fun to check out https://covid19.doctorevidence.com/ - they have loaded the CORD dataset with a dashboard-like interface and a query language. E.g. https://search.doctorevidence.com/search?query=ss(6f1da786-6... (user/pass covid19/covid19) for a direct link, and it provides integration with the other medically relevant feeds.


Without proper search assists like Elasticsearch/Lucene etc., it’s not useful for most people trying to read it. Maybe someone can set up a site with Elasticsearch with the articles on there?


Check out https://covid19.doctorevidence.com/ - they have loaded the CORD dataset with a dashboard-like interface and a query language. E.g. https://search.doctorevidence.com/search?query=ss(6f1da786-6... (user/pass covid19/covid19)


I thought so too, I’m working on this. If anyone wants to collaborate please let me know how I can get in touch (email, Twitter).


I'm not sure what the cost would be to have ES/Solr hosted (is there a free option?).

I'm too poor/cheap to do so; if it were up to me I'd set this up with a static js page that uses something similar in the browser (e.g. lunr.js) and allows you to download that dataset on the spot.


I have a few apps that I host with a cloud provider (scalingo.com) and initially I can cover the costs. If there’s a lot of traffic because it proves to be helpful there are a few groups I could ask to help with some funding.

For now I’m building a custom index with mongoDB - if anyone is familiar with building ES queries, would be keen to chat about how that might be better.

I'll post a link here later today.


A Lucene-based solution would probably scale better and would allow you to implement more complex behavior. You get analyzers, stopword filters, synonyms etc. out of the box, and you can express things like "virus within 5 words of lung" or "covid or virus but covid is way more important". I believe given the rather small dataset and the fact that you won't even expose more complex queries on your interface, the main benefits are the analyzers and various filters you can use.

If you are already doing stemming, stopword filtering and maybe synonyms you're probably fine.
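To illustrate what a query like "virus within 5 words of lung" involves, here is a toy positional inverted index in plain Python. It is a sketch of the general idea only, not Lucene's implementation or its exact slop semantics; the class, the stopword list, and the sample sentences are made up, and stemming and synonyms are left out.

```python
import re
from collections import defaultdict

STOPWORDS = frozenset({"a", "an", "and", "in", "of", "the", "to"})


def analyze(text):
    """Return (position, term) pairs: lowercased words with stopwords removed.

    Positions are counted before stopword removal, so distances still reflect
    the original word order.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [(i, word) for i, word in enumerate(words) if word not in STOPWORDS]


class PositionalIndex:
    def __init__(self):
        # term -> doc_id -> list of positions where the term occurs
        self.postings = defaultdict(lambda: defaultdict(list))

    def add(self, doc_id, text):
        for position, term in analyze(text):
            self.postings[term][doc_id].append(position)

    def near(self, first, second, max_distance=5):
        """Doc ids in which the two terms occur within max_distance words of each other."""
        first_docs = self.postings.get(first.lower(), {})
        second_docs = self.postings.get(second.lower(), {})
        hits = []
        for doc_id in first_docs.keys() & second_docs.keys():
            if any(abs(p - q) <= max_distance
                   for p in first_docs[doc_id] for q in second_docs[doc_id]):
                hits.append(doc_id)
        return sorted(hits)


index = PositionalIndex()
index.add(1, "The virus replicates in the lung")
index.add(2, "A virus was isolated from patients; several weeks later "
             "imaging showed no damage to the lung")
index.add(3, "Lung function tests")
print(index.near("virus", "lung"))  # [1]
```

Keeping positions in the postings is what makes proximity queries possible; a keyword-only index (term to document ids) cannot answer them without rescanning the text.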

(My email is in my profile if you want to discuss this in further detail)


Lucene does sound good, but I don't have any experience with it so I can't get anything happening quickly. But very happy to have help if you can?



CEO at Scalingo here. Scalingo may help you cover the cost cf our latest announcement https://twitter.com/ScalingoHQ/status/1240282733659774979 Contact the support for more details!


Thanks Yann, much appreciated!


Maybe algolia could help?



Many of the criticisms voiced in this thread stem from a lack of expertise in biomedical Natural Language Processing and text mining.

Various annotated datasets and models already exist within the field which can extract potentially useful information and be used in downstream tasks for targeted document and information retrieval. Biomedical text mining [1] is a wide subfield with plenty of open datasets and competitions such as the bi-annual ACL BioNLP workshop [2].

- Biomedical Named Entity Recognition: extract names of proteins, drugs, diseases, symptoms, etc. and classify their biomedical category [3]. Extracting symptom terms is crucial for document discovery, modeling, and knowledge-base creation. Several open datasets can be found here [4].

- Biomedical relation and event extraction: Traditionally focused on extracting protein-protein interactions, which are crucial for virtually every process in a living cell. Information about these interactions provides the foundations for new therapeutic approaches. Recently interest has shifted to the extraction of complex relations such as biomolecular events [2]. These methods can detect and classify the causal relations between the genes and proteins in a sentence like "TNF-alpha is a rapid activator of IL-8 gene expression by...".

- Document retrieval: Helping researchers and medical staff find relevant topic-specific papers by improving search with topic modeling, document similarity, named entities, etc.

These are only some examples of common biomedical text mining tasks and there are plenty more. Now of course, relying on previously annotated data is an issue because the tagged categories might not be relevant for many of the issues related to COVID-19. However, even unsupervised modeling like using SciBERT to create topic models or clusters of related documents can be helpful for scientific discovery.
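As a very rough illustration of the entity-recognition task mentioned in the list above, here is a dictionary (gazetteer) lookup, which is about the simplest possible baseline. Real biomedical NER systems are trained models and handle ambiguity, spelling variants, and unseen names; the tiny lexicon and its category labels below are hand-made assumptions, not taken from any of the cited datasets.

```python
import re

# Hand-made toy lexicon for illustration only; keys must be lowercase.
LEXICON = {
    "tnf-alpha": "GENE_OR_PROTEIN",
    "il-8": "GENE_OR_PROTEIN",
    "fever": "SYMPTOM",
    "cough": "SYMPTOM",
    "chloroquine": "DRUG",
}

# Longest terms first so that longer names win over shorter overlapping ones.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in sorted(LEXICON, key=len, reverse=True)) + r")\b",
    flags=re.IGNORECASE,
)


def tag_entities(text):
    """Return (surface_form, category, start_offset) for every lexicon match in the text."""
    return [
        (match.group(0), LEXICON[match.group(0).lower()], match.start())
        for match in PATTERN.finditer(text)
    ]


sentence = "TNF-alpha is a rapid activator of IL-8 gene expression"
for surface, category, offset in tag_entities(sentence):
    print(offset, surface, category)
```

The tagged spans are what a downstream step (relation extraction, search facets, knowledge-base population) would consume; the lookup itself says nothing about how the two entities relate.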

1. https://en.wikipedia.org/wiki/Biomedical_text_mining

2. https://aclweb.org/aclwiki/BioNLP_Workshop

3. https://www.hindawi.com/journals/cmmm/2015/571381/

4. http://gcancer.org/clstmdata/

5. https://bmcbioinformatics.biomedcentral.com/articles/10.1186...


This would be the perfect time for IBM to apply all Watson technologies and resources to develop new insight into Covid-19.


Watson is basically a brand name for IBM's data analytics consulting services. My understanding is they're not that great at it; they haven't scored any major wins outside of that Jeopardy run. I don't have any articles on hand but I seem to recall reading about some failures with a medical partner in particular, but then that's been a tough field for other big names like Google, too.



