Federal Government Sets Its Sights on Data
by Richard Huffine
The U.S. federal government is making a significant investment in a next-generation workforce that can take advantage of the information deluge created by data collection in every aspect of government operations today. The most important example of this investment was the White House announcement on Feb. 18, 2015, that DJ Patil was joining the Office of Science and Technology Policy as deputy CTO for data policy and the chief data scientist for the Obama administration.
Patil is credited, along with Jeff Hammerbacher, with coining the term “data scientist.” At the time, Patil and Hammerbacher were forming data science teams at LinkedIn and Facebook, respectively. In his ebook, Building Data Science Teams, Patil succinctly defines data scientists as “those who use both data and science to create something new.” He lists the traits that make a good data scientist: technical expertise, curiosity, storytelling, and cleverness. Patil clarifies that a data scientist doesn’t necessarily need a background in computer science. Rather, he says that the top data scientists “are interested in understanding many different areas of the company, business, industry and technology.” Based on this definition, it would seem that librarians would be great additions to any data science team.
“Data science has several very different components and I believe that librarians are capable in working in all of them,” says Amy Affelt, author of the recent book The Accidental Data Scientist. “While there are certainly librarians and information professionals who do computer programming, I think that our skill sets are more directly transferable to roles in data verification, data analysis, data policy, data curation, and algorithm accountability.”
The federal government is an excellent place for data scientists to practice. Patil has some history with the federal government already. He was a Science and Technology Policy Fellow with the American Association for the Advancement of Science (AAAS). During his fellowship at the U.S. Department of Defense, he directed efforts to leverage social network analysis to anticipate emerging threats to the U.S. As a doctoral student and faculty member at the University of Maryland, Patil used open datasets from the federal government and others to improve weather forecasting. In addition to LinkedIn, Patil has also held positions at Skype, PayPal, and eBay. His most recent position was at RelateIQ, a data intelligence company.
The U.S. federal government has started to take data science seriously in recent years. The National Science Foundation (NSF) has committed to improving the nation’s capacity in data science by investing in the development of workforce skills and infrastructure. Based on objectives articulated on the performance.gov website, NSF has pledged to support the training and workforce development of future data scientists, increase the number of partnerships that address Big Data challenges, and increase investments in current and future data infrastructure. These commitments signal the administration’s understanding that data represents a transformative new currency for science, education, government, and commerce, and an acknowledgment that data is everywhere and is being produced in rapidly increasing volume and variety.
“Working with data is something that we have always done as librarians, but the volume, velocity and variety of data has changed,” Affelt says. “In recent years, there has been a clarion call for us to add value to information, whether textual or numeric. The fact that there are a lot more sources of data, and that data is available in so many different formats, provides us with more opportunities to demonstrate the value of our skills. We can use data to create visual deliverables and tell stories to address our constituents’ challenges and at the same time, market ourselves as critically important information professionals.”
Patil is not the only new leader in the federal government who’s charged with managing data assets and improving the use of government data. The federal government’s Chief Information Officers Council has started to maintain a list of chief data officers. In March 2015, the U.S. Department of Commerce followed the White House by hiring its first chief data officer, Ian Kalin, formerly director of open data at the startup Socrata and a Presidential Innovation Fellow for the U.S. Department of Energy. At WorldAffairs 2014, while still at Socrata, Kalin said, “[O]pen data is not new. It’s as old as the very first library in human civilization. Public money being used to do something, build something that informs people.” Having someone lead a government data program who understands the concepts of data, information, and libraries is potentially revolutionary.
Of the working relationships between data scientists and librarians in the future, Affelt says, “My hope is that we will become more embedded in data science teams and that data scientists will look to us as experts in data verification and data analysis. It is up to us to market our skills so that we are the first people that come to mind when organizations are looking to fill these roles.”
Kalin’s role is to help unleash the power of the Department of Commerce’s data to strengthen the nation’s economic growth. He is working to make commerce data easier to access, understand, and use. Kalin oversees the development and implementation of a vision for managing the diverse data resources across the department. Additional agencies that have added positions for data scientists or data officers include the Federal Reserve System, the Environmental Protection Agency, and the Federal Communications Commission.
Other agencies within the U.S. federal government are making significant investments in data science, especially as they confront large questions that data science may help answer. The National Institutes of Health (NIH) has created an Office of Data Science, which is working to take advantage of the exponential growth of research datasets to drive biomedical research. It developed an initiative called Big Data to Knowledge (BD2K), which focuses on helping the broader scientific community update knowledge and skills in the areas of data science and data management. BD2K works to meet data science challenges, including the storage, management, sharing, and analysis of biomedical Big Data.
NIH has also created the bioCADDIE (biomedical and healthCAre Data Discovery and Indexing Ecosystem) initiative to engage a broad spectrum of stakeholders in biomedicine to make data discovery easier. These stakeholders include researchers, publishers, librarians, library scientists, and technologists, across all domains. As Affelt observes, “The US federal government is the gold standard and the go-to source when using data. Their data is rock solid. The problem lies in knowing how to find it.”
Within the U.S. Department of Energy, the Los Alamos National Laboratory has established a team focused on managing data science at scale. That team, similar to those involved with the NIH initiatives, is working with “[e]xtremely large datasets and extremely high-rate data streams. … [S]ensors, embedded computing, and traditional high-performance computing” are producing the data. The Data Science at Scale team is developing strategies for interactive analysis with these datasets. It considers the effort a new frontier at the intersections of information science, mathematics, computer science, and computer engineering. The Data Science at Scale initiative aims to provide tools capable of making quantifiably accurate predictions for complex problems through the use of data and computing resources.
The Board of Governors of the Federal Reserve System created an Office of the Chief Data Officer (OCDO) in 2013 and has established a mission, scope, and organizational structure. The OCDO provides the board with support for architecture and data management. The board has also created a separate Board Data Council that is composed of key enterprise stakeholders and supports enterprise data governance policies, processes, definitions, standards, and metrics. Together, the two entities develop data standards that support monetary policy, financial stability, consumer protection, and economic research, while striving to meet the needs of the research and supervision communities within the Federal Reserve System.
It will be up to Patil within the Office of Science and Technology Policy to make sure that the outcomes of all of these efforts inform and instruct future government initiatives to collect and analyze data and ensure that government data is available to the public to do its own analysis. In a recent interview on the website FiveThirtyEight, Patil said his new role is to “responsibly unleash the power of data for the benefit of the American public and maximize the nation’s return on investment in data[.]” That includes the ability for the public to find, get, and use government data.
Available but Lacking Findability
The Obama administration’s effort to make government data more accessible to the public started back in 2009 when Vivek Kundra, the first federal CIO, announced the creation of data.gov. The effort was continued in 2013 with the release of the administration’s Open Data Policy. It called for “agencies to collect or create information in a way that supports downstream information processing and dissemination activities. This includes using machine-readable and open formats, data standards, and common core and extensible metadata for all new information creation and collection efforts.”
In March 2015, Patil told The Washington Post, “We’ve got data.gov, which has really changed the game. Think about the billions of dollars that rest on open data infrastructure. People do research on that data, that research turns into insight, that insight turns into wisdom and that wisdom is put back into models and scientific results. The foundation of all this is open data.” Currently, data.gov identifies more than 130,000 datasets that are available for reuse by the public.
However, as Affelt explains, “findability is a problem. I know that great federal data exists, but if it is difficult or impossible to find, it is worthless. It would be great if data.gov was a robust, intuitive search engine that searched across all agency and department datasets, along with providing an option for transparent drilldowns, instead of a webpage that housed datasets.” But data.gov has been changing, and hopefully improving, since its debut in 2009. In 2013, data.gov announced the open source Open Government Platform (OGPL), a collaboration between the U.S. and India. The site also now uses the Comprehensive Knowledge Archive Network (CKAN), a web-based open source data management system for the storage and distribution of data.
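Because data.gov’s catalog is CKAN-backed, it can also be queried programmatically through the standard CKAN Action API rather than through the webpage alone. The Python sketch below, offered as a minimal illustration, assumes the public endpoint at catalog.data.gov and the documented `package_search` action; verify both against current CKAN and data.gov documentation before relying on them.

```python
import urllib.parse

# data.gov's catalog is CKAN-backed; this is its public Action API base.
# The exact endpoint is an assumption to check against current docs.
CKAN_BASE = "https://catalog.data.gov/api/3/action"

def build_search_url(query, rows=5):
    """Build a CKAN package_search URL for a free-text dataset query."""
    params = urllib.parse.urlencode({"q": query, "rows": rows})
    return f"{CKAN_BASE}/package_search?{params}"

def dataset_titles(response):
    """Pull dataset titles out of a decoded package_search JSON response."""
    if not response.get("success"):
        return []
    return [pkg["title"] for pkg in response["result"]["results"]]

# For a live search, fetch build_search_url("weather") with
# urllib.request.urlopen() and pass json.load(resp) to dataset_titles().
```

The two helpers separate URL construction from response parsing, so the parsing step can be exercised on sample data without a network connection.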
Another 2013 memo on public access to federally funded research requires that scientific data “be stored and publicly accessible to search, retrieve, and analyze.” The data addressed in that memo, however, will largely be collected and made available by academic and research institutions that have been funded at some level by the federal government to conduct that research. It is unclear at this time whether those datasets will be included in data.gov in the future. In recent months, federal agencies have been releasing their public access plans to address these requirements. The proposed strategies for managing scientific data don’t specify exactly how data will be available. Several of the plans, such as that of the National Institute of Standards and Technology (NIST), provide for three components: data management plans (DMPs), an enterprise data inventory (EDI), and a common access platform (CAP) providing a public access infrastructure.
A Common Platform
According to the NIST Public Access Plan, the EDI is a “catalog of the datasets that are generated via NIST-sponsored research.” It is part of the comprehensive public listing of agency data that was required by the 2013 Open Data Policy. The EDI must include datasets used in an agency’s information systems. The NIST Public Access Plan extends the purpose of the EDI to include all datasets produced using NIST funding. The EDI is intended to be an index containing information that describes datasets and information about where and how to access the data, not a repository of study data. NIST has established an interagency technical advisory group to provide input and ensure that the NIST EDI meets the needs of a wide range of stakeholders.
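In practice, agency inventories under the 2013 Open Data Policy are published as data.json files conforming to the Project Open Data metadata schema, a DCAT-based format in which each entry describes a dataset and where to access it rather than holding the data itself. The sketch below checks a catalog entry against a core-field list; both the sample record and the exact required-field set are illustrative assumptions drawn from that schema, not actual NIST records.

```python
# Core fields in the style of the Project Open Data metadata schema
# (the required set here is an assumption; consult the schema itself).
REQUIRED_FIELDS = {
    "title", "description", "keyword", "modified",
    "publisher", "contactPoint", "identifier", "accessLevel",
}

def missing_fields(entry):
    """Return the required fields absent from a catalog entry, sorted."""
    return sorted(REQUIRED_FIELDS - entry.keys())

# A hypothetical catalog entry, for illustration only.
sample_entry = {
    "title": "Hypothetical Measurement Dataset",
    "description": "Illustrative record; not a real agency dataset.",
    "keyword": ["metrology", "open data"],
    "modified": "2015-06-01",
    "publisher": {"name": "Example Agency"},
    "contactPoint": {"fn": "Data Steward", "hasEmail": "mailto:data@example.gov"},
    "identifier": "https://data.example.gov/id/dataset-0001",
    "accessLevel": "public",
}
```

A check like this is the kind of lightweight validation an inventory pipeline might run before an entry is added to the public listing.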
The CAP that NIST is recommending will put in place a technical infrastructure and populate it with persistent identifiers and metadata for all publicly available NIST data. It will provide interoperability among datasets within NIST and potentially with data from other federal agencies. NIST will assess the long-term needs for the preservation of scientific data in fields that the agency supports and outline options for developing and sustaining repositories for scientific data in digital formats, taking into account the efforts of public- and private-sector entities. NIST expects to have the CAP operational by October 2015.
Some key aspects of this new data revolution are considerations and concerns about the challenge of ensuring that government-funded data is used responsibly. In March 2015, Patil told The Washington Post, “Privacy is essential, but there’s also the question of bioethics—the issue of what is the acceptable use of that data. There’s a whole rich field within bioinformatics on the ethical use of data and genetics.” To address that concern, the public access plans have caveats for their release of scientific data that state it will be made available “[t]o the extent feasible and consistent with applicable law and policy; agency mission; resource constraints; [and] U.S. national, homeland, and economic security. …”
Federal libraries often provide critical support for agencies to acquire and use commercial information resources, including data. They are also one of the most accessible and helpful organizations within an agency for academic and corporate researchers who are looking to identify and access data that is collected by an agency. In many cases, however, compliance with mandates such as open data and public access to federally funded research is addressed by very different components of the organization. Establishing data science teams that integrate program specialists, information architecture and technology experts, and librarians would ensure that what is built delivers the most value to the broadest group of stakeholders.
One municipal government has determined that the most qualified organization to help the public find, get, and use its data is its public library. In January, the Knight Foundation announced that it had given Boston a $475,000 grant to hire a new team of librarians who will take the data that the city releases and make it usable and understandable for everyone. The program will include training and the development of reference materials for residents. It is unclear to what extent U.S. federal government agencies are using the skills of librarians in their efforts to make their data available and useful both internally and externally. It is clear, however, that the new leaders of data science within the U.S. federal government are looking for curious, clever, adept, and savvy partners to drive data science and change the way government works. Librarians in the federal workforce today are an excellent place to start looking for those skills.