Open government data is most useful when it is accurate, timely, and of high quality—but many federal datasets fall short. In 2016, the White House Office of Science and Technology Policy (OSTP) and the Center for Open Data Enterprise hosted an Open Data Roundtable focused on data quality with over 75 attendees. Several expert participants confirmed that most government data requires considerable cleaning and improvement to make it useful. Additionally, the Government Accountability Office (GAO) has issued hundreds of reports with recommendations on improving data quality across dozens of federal agencies, such as the Census Bureau and Internal Revenue Service.
The federal government can improve data quality by applying a practice the tech community has already shown to be effective: direct user participation.
Just as citizen coders fix bugs in open source software, data users can help identify and fix quality problems in government data. OSTP should draft a memo encouraging all agencies that administer major open data portals, including data.gov, to provide channels for feedback and to proactively invite the public to evaluate and improve data against basic quality requirements. These channels would go well beyond the “report a bug” option that is currently available on many websites but seldom used. More effective feedback channels can give government data stewards input ranging from simple data corrections (for example, correcting an address or name in a database) to expert insights on addressing deeper quality issues. The memo should describe methods for accomplishing this goal. Several government agencies and initiatives have already developed effective models for using public input to improve data quality.
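As one illustration of what such a channel could capture, the sketch below defines a structured correction record that a data steward could triage. It is a minimal sketch only; the field names, dataset identifier, and workflow states are assumptions for illustration, not any agency’s actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DataCorrection:
    """A single user-submitted correction awaiting review by a data steward."""
    dataset_id: str          # which published dataset the correction targets
    record_key: str          # the row or entity being corrected
    field_name: str          # the column or attribute in question
    current_value: str       # value as currently published
    proposed_value: str      # value the submitter believes is correct
    rationale: str           # evidence or explanation supplied by the submitter
    status: str = "pending"  # later set to "accepted" or "rejected" by the steward
    submitted_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# Hypothetical example: a user reports that a facility's street address is out of date.
correction = DataCorrection(
    dataset_id="facility-locations-2016",
    record_key="facility-00421",
    field_name="street_address",
    current_value="100 Main St",
    proposed_value="120 Main St",
    rationale="Facility relocated; new address confirmed on its public website.",
)
print(correction.status)  # "pending" until a data steward reviews it
```

Even a simple structure like this lets stewards track, prioritize, and close the loop on corrections rather than losing them in a generic inbox.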
One such model is the Department of Health and Human Services’ (HHS) Demand-Driven Open Data project. This project provides users a pathway to tell HHS what data they need and creates a transparent feedback loop that ensures follow-up and follow-through. This ongoing project has had a range of positive effects on HHS data quality, including improving machine-readability, helping identify and eliminate manual mistakes, and surfacing opportunities for standardization.
Streetwyze, a tool built through the White House’s Opportunity Project, provides another useful engagement model. Streetwyze collects local neighborhood information from residents and pairs that data with existing government datasets to form open maps and actionable recommendations—allowing citizens to re-engage to make corrections. For example, citizen input can clarify that a building marked as a grocery store is actually a liquor store.
The USAID Loan Guarantee Map used a third model, convening a crowd of experienced geospatial data volunteers in the Standby Task Force and GIS Corps to review 117,000 loan records and clarify 10,000 difficult-to-identify data points. The volunteers accomplished this task in only 16 hours. Working in partnership with the private sector, USAID customized the event to share data with a community of experts before opening the data to the public.
The private sector has already embraced user feedback initiatives, including crowdsourcing, to improve data quality. Google’s Map Maker program allows any Google Maps user to share information about places they know, identifying errors and thus boosting the quality of Google Maps data. The platform gives increased moderation and editing authority to users who have regularly submitted accurate information. This two-tiered system allows average users to build expertise, rewards power users, supports engagement, and results in timely, accurate data.
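The tiered approach can be sketched in a few lines: contributors with a track record of accepted edits publish changes directly, while other edits wait for a moderator. This is a simplified illustration of the general pattern rather than Google’s actual implementation; the threshold and contributor histories are assumptions.

```python
# Simplified tiered moderation: contributors with enough accepted edits
# publish directly; everyone else's edits go to a review queue.
TRUSTED_EDIT_THRESHOLD = 25                  # assumed cutoff for auto-publish rights
accepted_edits = {"alice": 112, "bob": 3}    # hypothetical contributor track records

published, review_queue = [], []

def route_edit(user: str, edit: str) -> str:
    """Send an edit straight to publication or to the moderation queue."""
    if accepted_edits.get(user, 0) >= TRUSTED_EDIT_THRESHOLD:
        published.append((user, edit))
        return "published"
    review_queue.append((user, edit))
    return "queued"

print(route_edit("alice", "Corrected business hours for a local shop"))  # published
print(route_edit("bob", "Added a missing bus stop"))                     # queued
```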
The challenge model, in which government agencies hold competitions inviting the public to solve problems, can also improve data quality.
The U.S. Patent and Trademark Office (USPTO) launched a public competition to provide solutions for data disambiguation, addressing instances where the identities of inventors or organizations are not clear. This problem results from repetition or overlap: a single inventor or organization may appear in the database under slightly different names, or, conversely, different inventors may share the same name (there are several different inventors registered as “Steve Jobs” and “Steven Jobs,” for example). The winning team created a solution that “uses discriminative hierarchical coreference as a new approach to increase the quality of PatentsView data.” The USPTO publishes the improved data on PatentsView, a prototype platform for opening and visualizing U.S. patent data.
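To make the disambiguation task concrete, the sketch below shows a deliberately naive approach: normalize inventor name strings and flag near-duplicate records as candidate merges for human review. The sample records and similarity threshold are hypothetical, and this is only an illustration of the problem space, not the winning team’s discriminative hierarchical coreference method.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical raw inventor records, each with its own database identifier.
records = [
    {"id": 1, "name": "Steve Jobs"},
    {"id": 2, "name": "Steven Jobs"},
    {"id": 3, "name": "Steven P. Jobs"},
    {"id": 4, "name": "Jane Q. Inventor"},
]

def normalize(name: str) -> str:
    """Lowercase the name and drop punctuation and single-letter initials."""
    tokens = [t.strip(".,") for t in name.lower().split()]
    return " ".join(t for t in tokens if len(t) > 1)

def likely_same(a: str, b: str, threshold: float = 0.8) -> bool:
    """Flag two name strings as probable variants via string similarity."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

# Pair up records whose names look like variants of one another, producing
# candidate merges for a data steward (or a smarter model) to confirm.
candidates = [
    (r1["id"], r2["id"])
    for r1, r2 in combinations(records, 2)
    if likely_same(r1["name"], r2["name"])
]
print(candidates)  # [(1, 2), (1, 3), (2, 3)]
```

Pairwise string matching like this breaks down at the scale of millions of patent records, which is part of why the challenge sought more sophisticated approaches.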
The USPTO has also developed a channel and tools for ongoing feedback on its data resources. The USPTO’s Developer Hub, which provides access to the agency’s extensive data collection and APIs to improve accessibility, includes an online community for gathering demand-driven requirements from users, as well as resources for data visualization. Another project, the USPTO Open Data and Mobility program, is advancing how the USPTO provides data, promotes transparency, and empowers data-driven decision making.