The environment around research data management and open data has become remarkably complex, and the pace of change shows no sign of slowing. At the core of many of today's challenges are machine learning, natural language processing, and predictive analytics: the methods used to process tremendous quantities of data for a wide variety of purposes.
On a daily basis, the news is full of stories about private sector and government agencies mining massive, internally collected datasets for all sorts of outcomes. Technology is making it easier for organizations to respond proactively to patterns in data. For example, with early alert systems, universities can now identify students who might be on the cusp of dropping out in time for an advisor to intervene. Companies mine customer data to improve profitability and inventory data to forecast product demand.
But these types of methods aren’t restricted to closed data. In fact, one of the advantages of open data is that it allows data from disparate datasets to be combined, remixing or merging many “small data” sets to convert them into “big data.” From the perspective of funding agencies, this type of reuse is one of the intended benefits of open data. If datasets use common variables, include well-structured and organized data elements, are deposited into interoperable repositories that can be found by harvesters, and carry Creative Commons Attribution (CC-BY) licenses (or another license allowing reuse), other researchers are encouraged to find, access, and reuse these datasets without restrictions.
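The mechanics of that remixing are straightforward when datasets share a common variable. As a minimal sketch (the field names, sites, and values here are invented for illustration), two small open datasets joined on a shared site identifier become one richer dataset:

```python
# Two hypothetical "small data" sets that share a common variable (site_id).
bird_counts = [
    {"site_id": "A1", "species": "heron", "count": 12},
    {"site_id": "B2", "species": "egret", "count": 5},
]
water_quality = [
    {"site_id": "A1", "ph": 7.2},
    {"site_id": "B2", "ph": 6.8},
]

def merge_on(key, left, right):
    """Join two record lists on a shared variable, mimicking a dataset remix."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup[row[key]]} for row in left if row[key] in lookup]

combined = merge_on("site_id", bird_counts, water_quality)
print(combined[0])
# {'site_id': 'A1', 'species': 'heron', 'count': 12, 'ph': 7.2}
```

The same join, repeated across many open datasets, is exactly the aggregation that funders hope for and that the incidents below illustrate the risks of.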
Reuse without restrictions is what sparks fear in many researchers. Once data has been published and is out in the world, you lose all control over your dataset. It can be used, combined, and repurposed in all sorts of ways, including ways you never considered, ways that could put someone else in harm's way, or ways that are morally ambiguous.
Although we’re proponents of open data, it’s useful to know about some incidents in which open data has led to problems.
Fitness Tracking and Military Intelligence
If you, someone in your family, or a friend is serious about running or cycling, you probably know about a wildly popular fitness tracking tool called Strava. Strava is a smartphone app that allows runners and cyclists to track their workouts, including generating GPS maps of workout routes. One feature that makes Strava so popular is that it functions as a social network for fitness enthusiasts, allowing them to share data on their training.
On Nov. 1, 2017, Strava announced a new feature: a global heatmap aggregating more than 1 billion fitness activities from 10 million users, with bright spots marking where the most activity was taking place.
By January 2018, security analysts were taking to Twitter to report an unintended consequence of making all this data freely available. It turns out many members of the military are devoted Strava users, and so it was possible to locate military installations, even secret facilities, by looking for bright regions on Strava's heatmap. The New York Times on Jan. 29, 2018, reported on the phenomenon ("Strava Fitness App Can Reveal Military Sites, Analysts Say," Richard Pérez-Peña and Matthew Rosenberg; nytimes.com/2018/01/29/world/middleeast/strava-heat-map.html). Once this unintended use came to light, Strava responded by making it easier for individuals to opt out of having their data included on the heatmap, as reported by Paul Snyder in Runner's World (March 2, 2018; runnersworld.com/news/a20866081/strava-changes-heat-maps-settings).
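The mechanism behind the leak is nothing more than aggregation: GPS points are binned into grid cells, and cells with many points render brightly. A minimal sketch (the coordinates below are invented) shows why a cluster of workouts in an otherwise empty region stands out, even with no names attached:

```python
from collections import Counter

# Hypothetical GPS track points (lat, lon) pooled from many users' workouts.
points = [
    (34.5021, 69.1801),   # three workouts clustered in a remote area...
    (34.5023, 69.1803),
    (34.5022, 69.1802),
    (40.7128, -74.0060),  # ...and one isolated workout elsewhere
]

def heatmap_bins(points, cell=0.001):
    """Bin coordinates into grid cells; cells with high counts become bright spots."""
    return Counter((round(lat / cell), round(lon / cell)) for lat, lon in points)

bins = heatmap_bins(points)
hottest_cell, count = bins.most_common(1)[0]
print(count)  # 3 -- three anonymous workouts are enough to light up one cell
```

No individual is identified here; the location itself is what the "anonymous" aggregate gives away.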
Another fitness application that was exploited for security intelligence came from Polar, a Finnish company that also makes fitness-tracking devices. In July 2018, Dutch researchers were able to use data freely available through the Polar app, cross-referenced with other data freely available on the internet, to find the names and addresses of more than 6,000 intelligence and military employees in 69 countries, as reported by Maurits Martijn in de Correspondent (July 8, 2018; tinyurl.com/y8q2aubk). Shortly after the researchers announced their findings, Polar suspended its global activity map, according to Andrew Liptak in The Verge (July 8, 2018; theverge.com/2018/7/8/17546224/polar-flow-smart-fitness-company-privacy-tracking-security).
Endangering Endangered Species
In July 2015, a ranger stopped a suspicious-looking pickup truck in the Knersvlakte Nature Reserve in the Western Cape, South Africa, which is home to hundreds of unique and endangered plant species. A search revealed a trove of endangered plants, along with materials used to find and collect them and information on a website where they were being offered for sale. Authorities pieced the information together and concluded the poachers were using GPS coordinates harvested from the public websites JSTOR Global Plants (plants.jstor.org) and iSpot (ispotnature.org) to locate rare plants that were then collected and sold.
Researchers are just beginning to talk about the issues related to the proliferation of GPS tools now used in conservation biology. In a 2016 paper, published in Conservation Biology, Steven J. Cooke, a biologist at Carleton University in Canada, and his co-authors summed up the problem: “Animal tracking can reveal animal locations (sometimes in nearly real-time), and these data can help people locate, disturb, capture, harm, or kill tagged animals” (fecpl.ca/wp-content/uploads/2016/12/Cooke_et_al-2017-Conservation_Biology-1.pdf).
‘Re-identification’ of Anonymous Medical Data
Because of the sensitive nature of the data they collect, researchers in the field of medicine have been especially cautious in adopting open data policies. Recently, much of the debate has been focused around a policy proposal announced by the International Committee of Medical Journal Editors (ICMJE) that would require making data public as a requirement for publication.
While there are various arguments encouraging caution in data sharing, as Shelley Wood explains in her tctMD article (July 4, 2018; tctmd.com/news/open-questions-cardiology-editors-and-academics-mull-data-sharing-pitfalls-and-potential), there is one that particularly raises red flags. Even when data is anonymized, it may be possible to cross-check a dataset with freely available data from other sources (social media accounts, geospatial data, online directories) to re-identify data, linking it back to the participants in the clinical trial. (Indeed, such a "re-identification" process was used to pinpoint the identities of the military and intelligence community members in the Strava example.) What types of organizations might be interested in going through the effort of re-identifying clinical trial subjects? One possibility is the health insurance industry, which could use the information to make decisions about providing coverage to individuals who participated in a medical trial.
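The cross-checking described above is a linkage attack: "anonymized" records often retain quasi-identifiers (a ZIP code, a birth year, a sex) that also appear in public records. A minimal sketch, with entirely fictional records and field names, shows how little it takes to relink a name to a diagnosis:

```python
# Hypothetical "anonymized" trial records: names removed, quasi-identifiers kept.
trial = [
    {"zip": "02138", "birth_year": 1971, "sex": "F", "diagnosis": "condition X"},
    {"zip": "90210", "birth_year": 1985, "sex": "M", "diagnosis": "condition Y"},
]

# Hypothetical public records (voter roll, social media, online directory).
public = [
    {"name": "J. Doe", "zip": "02138", "birth_year": 1971, "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")

def reidentify(trial, public):
    """Link 'anonymous' records to names via quasi-identifiers both sets share."""
    matches = []
    for t in trial:
        for p in public:
            if all(t[k] == p[k] for k in QUASI_IDENTIFIERS):
                matches.append({"name": p["name"], "diagnosis": t["diagnosis"]})
    return matches

print(reidentify(trial, public))
# [{'name': 'J. Doe', 'diagnosis': 'condition X'}]
```

If only one person in a ZIP code shares that birth year and sex, removing the name alone provides no protection, which is exactly the worry behind the ICMJE debate.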
In 2017, a group of Australian researchers demonstrated that this type of patient re-identification through public medical records is indeed possible. The group analyzed a publicly available dataset of government billing records and reported that it was able to positively identify a group of prominent Australian citizens and link them to their medical histories, according to Chris Duckett's article in ZDNet (Dec. 18, 2017; zdnet.com/article/re-identification-possible-with-australian-de-identified-medicare-and-pbs-open-data).