Open Data Transition Report: An Action Plan for the Next Administration

Goal III: Share scientific research data to spur innovation and scientific discovery
Recommendation 22: Identify and publish large, high-quality datasets across all fields for use in machine learning to support advances in artificial intelligence.
First Year
National Science and Technology Council & Intelligence Advanced Research Projects Activity
Action Plan:
  • The National Science and Technology Council (NSTC), in partnership with Intelligence Advanced Research Projects Activity (IARPA), should review feedback from the recent IARPA solicitation on artificial intelligence (AI) to identify key government datasets that can be opened up for use in machine learning to support AI development across all fields by January 2018.
  • The NSTC should work with IARPA to develop a prioritized list of datasets for release.
  • The NSTC should then convene members of the AI community to provide input on these priorities, finalize a plan for releasing datasets, and develop a timeline and resources to do so.

Artificial intelligence (AI) already plays a significant role in the daily lives of Americans through voice-activated personal assistants in smartphones, website translation, and automated driving features. Additional AI advances are around the corner, and the technology industry is planning for the next wave.

Researchers believe AI will have a great impact on the future economy, including public benefits in the medical field, transportation, public safety, and more. Eventually, AI could help doctors diagnose patients and suggest treatments tailored to the individual, or serve as a tool for teachers to customize lesson plans for each student’s personal needs. AI could also help to allocate government funds efficiently.

Machine learning, a process in which computers continually improve analytic capacity as they work with more and more data, drives AI. Machine learning requires large, unbiased datasets to create accurate models within the domain of interest. For example, developers used ImageNet, an annotated database of more than 14 million pictures, to “train” computers to properly classify images. Other datasets may be valuable for geospatial analysis, language analysis, or other aspects of machine learning.

The National Science and Technology Council (NSTC), in partnership with the Intelligence Advanced Research Projects Activity (IARPA) and other interested agencies, should launch a project to ensure that relevant, government-owned “training datasets,” as identified by AI researchers and other stakeholders, are made readily available. This would address a current bottleneck in AI development: the relatively small amount of public data available to train AI systems and enable them to reach their full potential.

As a first step, the Machine Learning and Artificial Intelligence subcommittee of the NSTC should work with IARPA, which should have strong indicators of demand from responses to a recent solicitation on overarching questions in AI. The NSTC subcommittee and IARPA should draft recommendations for prioritizing datasets for use in AI training. The NSTC should then convene a group of government and industry AI specialists to determine the datasets in greatest demand, analyze the barriers to opening those datasets, and develop plans to open them.

If privacy concerns prevent a dataset from being opened publicly, the NSTC should develop a data enclave similar to the Centers for Medicare & Medicaid Services data enclave within the Department of Health and Human Services or the National Renewable Energy Laboratory’s Secure Transportation Data Center. Research and industry partners would have to submit an application detailing their use case before being granted access to the enclave and would assume legal liability for protecting the sensitive information.

Opening “training datasets” will be an important step toward encouraging scientists to open the data generated through AI research. The AI community should consider this topic as the field advances. This discussion should address developing open AI benchmarks, open learned representations, open learned parameters, and even open code when the research is publicly funded.

Additional Reading: