Although government agencies manage massive amounts of information, little is known about exactly how it is used and by whom: it turns out, there is little data on federal data.
A new effort is underway to leverage artificial intelligence to better track what research is being done with what data. Ultimately, it could result in a data usage scorecard that makes it easier for researchers to find who else has used a dataset in similar research, enabling them to reproduce results, advance science, and support government's push toward evidence-based decision making.
“Currently, the only way to find which databases have been used is to try and figure it out from reading publications – and there are millions,” said Julia Lane, co-founder of the Coleridge Initiative, a nonprofit started in 2018 as a spin-off from New York University, where Lane is on faculty.
By building a modern machine learning (ML) and natural language processing (NLP) approach to find which datasets are used in which publications, agencies can break down barriers to the access and use of public data. “The approach could demonstrate the value of data as a strategic asset,” she says.
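The core task can be sketched in a few lines. The winning competition entries relied on learned NLP models, but a much simpler, illustrative version is to scan publication text for known dataset names and their aliases. The alias list and find_dataset_mentions function below are hypothetical, a minimal sketch rather than the actual approach used:

```python
import re

# Hypothetical alias list; a real system would learn to recognize mentions
# rather than rely on a fixed dictionary.
DATASET_ALIASES = {
    "American Community Survey": ["American Community Survey", "ACS"],
    "National Education Longitudinal Study": [
        "National Education Longitudinal Study", "NELS"
    ],
}

def find_dataset_mentions(text):
    """Return canonical names of datasets mentioned in a publication's text."""
    found = set()
    for canonical, aliases in DATASET_ALIASES.items():
        for alias in aliases:
            # Word-boundary match so short acronyms don't hit inside other words.
            if re.search(rf"\b{re.escape(alias)}\b", text):
                found.add(canonical)
                break
    return found

if __name__ == "__main__":
    abstract = (
        "We link American Community Survey (ACS) records to administrative "
        "earnings data to study economic mobility."
    )
    print(find_dataset_mentions(abstract))
    # {'American Community Survey'}
```

A dictionary lookup like this misses datasets referred to by unlisted names, which is exactly the gap the competition's learned models were meant to close.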
New presidential executive orders push the use of evidence to address health, jobs and economic mobility, social justice and climate change issues. But you can’t make evidence bricks without data straw, Lane says: “What we really want to do is figure out how the data are used and then how to make it more useful.”
The Coleridge Initiative and Kaggle recently sponsored a competition challenging data scientists to develop a model to assess the use of publicly funded data. More than 1,600 teams from around the world entered, and $90,000 in prize money was divided among the top seven winners in late June.
The Kaggle data science competition was focused on building an open and transparent approach to help the government make smarter investments. The challenge to develop algorithms to automate the search and discovery of references to data sparked tremendous global interest. The information is all open source and winners share their solutions to move the project forward.
First place went to a two-person team, Tuấn Khôi Nguyễn, a senior data scientist, and Nguyễn Quán Anh Minh, both of VNG Corporation in Vietnam; second place went to Chun Ming Lee, an NLP data scientist at start-up companies in Singapore; and third place to Mikhail Arkhipov, a data scientist in Russia.
“The original goal of the competition was to help U.S. government agencies, but I hope the model results will help the data science community to keep track of papers and extract information on a large scale, faster. I know it’s a long shot, but I hope it will have some impact,” says Nguyễn, 27, who worked on the project at his home in Ho Chi Minh City. After entering several competitions, he says he was pleased to add the title “Kaggle Grandmaster” to his credentials and take the top prize of $30,000.
Lee says there was no single “ah-ha” moment in writing code for his submission; it was more like solving a puzzle, a compilation of ideas, some of which came while he was taking a break from his computer and cycling. The problem was among the most difficult and ambitious Lee says he’s tackled, one he worked at as a hobby late at night on top of his regular job.
In Moscow, Arkhipov says he was intrigued by the competition and felt the task was an important one to address. “There’s a huge amount of data available on the web and it’s really easy to get lost,” he says. “It would be super great to have a system that highlights some of the data sets that are influential or analyze how the data is used.”
Agencies are required by the 2018 Foundations of Evidence-based Policymaking Act to modernize their data management. Lane outlined just how ML and NLP can achieve this in a paper, Using AI and ML to Document Data Use.
Nancy Potok, former chief statistician of the U.S. and chair of the Scientific and Technology Advisory Board for Coleridge, says that, for agencies, implementing the Evidence Act and getting feedback on their datasets has been very difficult. As a result, she was excited to see Coleridge help find a way to address this requirement in the law.
“The purpose of the legal mandate is to engage the agencies in making data more accessible and more useful to the public,” Potok says. “It’s important for agencies to collect data to run their own programs. But for much of the data, if you really want evidence-based policy, you’ve got to make it more accessible to the public and researchers to gain valuable insights from the data.”
“The beauty of the scorecard is that by using these algorithms that have been developed, you have a starting point, scientific journals and publications, to understand how researchers have been using agency data sets, and this information is updated in real time,” says Potok.
The scorecard could, for example, be useful to the Census Bureau as they work to release privacy protected data sets from the American Community Survey, or to the National Oceanic and Atmospheric Administration as they wrestle with managing and releasing climate data. Government agencies like the U.S. Department of Agriculture and the U.S. Department of Commerce can learn which of their datasets have been used to examine, for example, racial disparities, the digital divide, or the economic impact of coastal inundation.
For agencies, it’s valuable to know what data are being used so they can better manage their scarce resources and improve datasets that aren’t generating much interest or are lesser known, Potok says.
The scorecard has potential benefits for users as well. It could help researchers click on a dataset and get a list of others who have previously used these data in their analyses. Early-career researchers can both find which areas are understudied and identify emerging research hot spots. The public can see how their data are being used, improving transparency.
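As a rough illustration of how such a lookup might work, the sketch below inverts extracted publication-to-dataset mentions into a dataset-to-publications index. The example identifiers and build_usage_index function are hypothetical, not part of the Coleridge system:

```python
from collections import defaultdict

# Hypothetical extraction output: publication identifier -> datasets it mentions.
mentions = {
    "doi:10.0000/paper-a": {"American Community Survey"},
    "doi:10.0000/paper-b": {"American Community Survey", "Current Population Survey"},
}

def build_usage_index(mentions):
    """Invert publication->datasets mentions into a dataset->publications index."""
    index = defaultdict(set)
    for publication, datasets in mentions.items():
        for dataset in datasets:
            index[dataset].add(publication)
    return index

index = build_usage_index(mentions)
print(sorted(index["American Community Survey"]))
# ['doi:10.0000/paper-a', 'doi:10.0000/paper-b']
```

Kept up to date as new publications are processed, an index like this is what would let a researcher click on a dataset and see who has used it before.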
Another partner on the Show US the Data initiative and October conference is CHORUS, a not-for-profit public-private partnership to increase public access to peer-reviewed publications. Executive Director Howard Ratner says open data is a key project for scholarly communication and it’s expected that the Coleridge Initiative will deliver an impactful reporting system for federal agencies to assess the value of their research datasets. Since much of the content being analyzed will come from CHORUS publisher members, Ratner says the network is promoting the collaborative effort.
Just how the winning ideas can be used by agencies to develop a system to document usage and citations will be discussed further at the Show US the Data conference at the National Academies’ Keck Center in Washington, D.C., October 20. The event will include some of the challenge winners, along with scientific journal publishers, philanthropic foundations, government agencies, and members of the broader research community.