Fighting the global coronavirus pandemic will take a collaborative effort like no other. Immediate, free open access to research results is vital to accelerating the global research community’s progress towards COVID-19 testing, treatments and vaccines.
The COVID-19 Open Research Dataset (CORD-19), a free and growing resource with 59,000 scholarly articles related to the virus, is a glimmer of hope in the quest for answers. The dataset, hosted by the Allen Institute for AI and developed in partnership with the National Library of Medicine (NLM) and others, enables researchers to apply novel artificial intelligence and machine learning strategies to identify new knowledge to help end the pandemic. The White House Office of Science and Technology Policy (OSTP) kicked off the CORD-19 initiative as it looked for ways to leverage AI and machine learning to address COVID-19.
“The real value will be to get new insight out of the corpus of text that leads to new knowledge about COVID-19 and advances the response to it,” says Jerry Sheehan, deputy director of NLM. “There is urgency around doing it for COVID now. I’m hoping this experience can help us develop approaches and techniques that can be applied in other scientific areas as well.”
The dataset leverages work that the NLM launched following another call from the White House OSTP to expand access to coronavirus-related publications and associated data. Many publishers were already posting COVID-related articles on their individual websites. But Sheehan says that, for the information to be most useful, it needed to be not only open but also formatted in a standard way and easily accessible in one location.
Under NLM’s COVID-19 Initiative, publishers can voluntarily deposit articles, which are made available as quickly as possible after publication for discovery in NLM’s PubMed Central (PMC). Inclusion of an article as part of the COVID-19 Initiative requires an article-level license that allows for reuse and secondary analysis. Articles under traditional copyright restrictions are typically ineligible for this type of redistribution and use.
To date, 50 publishers have made their coronavirus content available in PMC. Within the first two weeks, articles from the COVID-19 subset had been accessed 2 million times. By mid-May, articles in PMC’s coronavirus collection had been accessed more than 8 million times.
As with PMC’s overall collection, Sheehan anticipates that those using the COVID-19 subset include researchers, clinicians, students, educators, and the general public.
When OSTP subsequently asked about participating in the CORD-19 project to make text available for machine processing, NLM jumped in. “We saw an opportunity to partner with others to expand the corpus of coronavirus text that was publicly available and make it available to the machine learning community,” Sheehan says. “From our earlier work with OSTP and PMC, we knew the floodgates were about to open and we were about to have a sizable addition to our own content around coronavirus.”
Within about 10 days of OSTP floating the idea of developing a corpus of coronavirus text for machine learning, CORD-19 launched with 29,000 articles on March 19. Other partners in the initiative included Microsoft Research, the Chan Zuckerberg Initiative (CZI), and Georgetown University’s Center for Security and Emerging Technology (CSET).
CORD-19 demonstrates the value of having content that is easily discoverable, text mineable, and in a consistent, standardized, machine-readable format. “To really unlock the value of open science, we need to make things FAIR: findable, accessible, interoperable, and reusable,” says Sheehan.
The PMC and CORD-19 initiatives have benefitted from the goodwill of the publishing community. Publishers are voluntarily providing content they would not otherwise open up for access and reuse. As an archive, PMC can continue to provide perpetual access to all articles deposited under the COVID-19 Initiative for which the copyright holder provides such permission.
However, it remains unclear what will happen when the pandemic is over. While some publishers have made their articles permanently available, others have contributed content under customized licenses that provide only temporary access.
In the meantime, NLM is hardening its information systems to support changes to submission processes needed to handle the influx of coronavirus articles. “We did things quickly under a public health emergency rubric,” says Sheehan. “As much as we hope there won’t be another crisis, sooner or later there will be, and we want our systems to be prepared.”