The Rise and Rise of Open Data
by Daniel Hook
When did data enter our daily lives? Sometime in the last 20 years, data transcended our work lives and started to become part of our every day. In the context of our online personal lives, such as our banking details and medical records, most of us would probably be deeply suspicious of open data. However, it is open data that is changing the way that research is done at a fundamental level and is fueling innovation, job creation, and the technologies of tomorrow. It is open data that is changing both how research is executed and how it is being communicated.
The standard format of a scholarly publication has changed little in the past 350 years—a reliance on an article-based explanation of research carried out, with details of methods, context for the work, and references to related works. The presentation of a new result, effect, or observation is the critical ingredient to secure publication in an influential journal. Peer review has been the method establishing the veracity and validity of the result(s) and hence the authors’ claim to an idea or result. This format has driven the research enterprise—careers have been made and lost based on the publication of key results in a timely and accurate manner.
The research environment is highly competitive. Presenting results first and then publishing in an influential journal is a main factor in attracting an academic’s next round of funding, next job, and advantageous access to people, equipment, or travel funds. But this high-pressure environment leads to adverse effects—results are not documented transparently, and data is often not shared, so it can be almost impossible for a researcher to reproduce the work of a peer.
The Reproducibility Crisis
Striving for reproducibility is at the heart of the whole scientific enterprise. If a result cannot be reproduced, it’s not scientific. Yet many published results turn out to be difficult or impossible to reproduce, a problem known as the reproducibility crisis. A high-profile study by the Board of Governors of the Federal Reserve System attempted to replicate results from 67 papers published in 13 prestigious economics journals, but could reproduce only 49% of the results (federalreserve.gov/econresdata/feds/2015/files/2015083pap.pdf). Retraction Watch reports on problems of this kind, tracking papers with results that have been shown to be inaccurately or improperly reported, including papers that have been removed by the journal.
Why has this come to light only in the last 20 years? In part, the reproducibility crisis is a modern problem: There are more people in research than ever before, there is a lot more data, and there are insufficient tools and insufficient time to properly check every result unless a piece of work turns out to be on the critical path for another result. It’s a complex ecosystem with many players. But how is the research enterprise helping to combat the reproducibility crisis?
A pivotal change in the research world began with the open access (OA) movement, which started to gain traction around 2006. Inspired by the changes the internet brought to the music and newspaper industries, OA was motivated not by improving reproducibility in science but by freeing content. The open data movement is the natural extension of the OA revolution, but it is different in character. Whereas OA sought to democratize access to the libraries of research, open data seeks to empower researchers and democratize research itself.
The Challenges of Open Data
If we agree that having more minds to solve a problem is a good thing, then sharing research data makes sense. But life is not so simple. As a researcher, I might be skilled at knowing how to run an experiment and collect data, but not so talented at processing the data and interpreting it. So I should turn to a colleague who complements my skill set. But due to the aforementioned complex research environment, my colleague may be in a different country with different regulations regarding research ethics, data sharing, personal privacy, data storage, and other issues. Making many types of research data open can be extremely challenging—and even impossible.
There are also infrastructure and ownership issues. If my experiment creates vast volumes of data, then the costs of sharing the data could be colossal. Other questions arise: How long should the data be made available? Should it be curated to allow others to access it more transparently and easily? Should it be documented and come with a user manual complete with warnings about potential issues and deficiencies? Who owns the data at the end of the day? Should it be the researcher who produced it? The researcher who analyzed and processed it? The institution that paid the most toward the experiment? The funder who contributed? The industry partner that sponsored the original work, without whom the current experiment wouldn’t have been possible? Let’s assume an enlightened model, in which attribution is shared or given away under a permissive license: Who is responsible for the cost of maintaining the data, version control, and potentially continued updates? Our system of funding research simply isn’t set up to deal with any of these issues, even on a “simple” institutional or national level, let alone in a complex international collaborative setting, which is more and more frequently the standard case.
Separating the signal from the noise is an old research problem, and with the volumes of data the research community now produces, it is a more pertinent problem than ever. If all data could be made available (while accepting ethical standards, issues of confidentiality, and technological limitations), there would be more than we could possibly consume. Even with excellent data provenance coming from smart machines and APIs, the documentation challenge to enable data discovery is a pressing one.
So where does all this leave us? Open data is moving forward. Research institutions, funders, and startups are all creating infrastructure to support researchers and data sharing. We are in an exciting time, when technological barriers are being pushed back but ethical structures have to respond quickly. National policies need to align, as research collaboration will become more and more international. Heterogeneity in policies will create difficulties that drive researchers to move to the countries that adopt the most sensible policies.
Research culture has a long way to go before it becomes socially unacceptable to do anything but share your data openly (and appropriately). But researchers, as drivers of change themselves, will change more quickly than people generally imagine. We predict that open data will alter the execution and communication of research, for the benefit of us all, but how that will happen, we can’t yet say with clarity.