Assessing the Impact of Open Data
by Elaine M. Lasda
Increasingly, research funders and journal publishers require or recommend that researchers make their data open and shareable. Making raw data available can provide a level of transparency to research that is increasingly necessary for the scientific ecosystem.
Retraction Watch (retractionwatch.com), a great source for understanding all sorts of research transgressions, frequently spotlights research papers retracted due to falsified or misleading data. Even 2019 Nobel Prize winner Gregg Semenza had four publications retracted because they contained misleading data manipulations. “Together, the papers have been cited more than 750 times, according to Clarivate’s Web of Science. From 1998 until 2013, Semenza was principal investigator on NIH grants totaling more than $9 million” (retractionwatch.com/2022/09/03/nobel-prize-winner-gregg-semenza-retracts-four-papers/#more-125592).
Many other papers itemized on Retraction Watch demonstrate that experiments and data are often fudged. This may be understandable. After all, competition is stiff for research dollars. The number of peer-reviewed journal articles (PRJAs) has increased exponentially over time, while research funding has remained flat. Thus, transparency, replicability, and reuse are all compelling reasons to make research data open and shareable. The open science movement’s success hinges on open, shareable data, and maintaining the integrity of scientific inquiry is contingent on having the materials to back up researchers’ claims.
It doesn’t always happen. A 2014 article in Current Biology, “The Availability of Research Data Declines Rapidly With Article Age,” found that a stunning 90% of raw data from biology studies published between 1991 and 2011 was inaccessible (cell.com/action/showPdf?pii=S0960-9822%2813%2901400-0). The situation has not improved. A June 2022 article in Nature, “Many Researchers Say They’ll Share Data — but Don’t,” cited as reasons lack of informed consent or ethics approval to share, misplaced data, and researchers no longer involved with the project (nature.com/articles/d41586-022-01692-1).
Before open data can be reused or shared, other researchers must be able to discover it. Discoverability of open data, like that of other information objects, depends largely on metadata. But datasets do not always come with standardized metadata. Proper attribution of data reuse, or any research reuse for that matter, depends on standard citation practices. FORCE11, a grassroots research advocacy group, offered up a Joint Declaration of Data Citation Principles in 2014 (force11.org/info/joint-declaration-of-data-citation-principles-final). DataCite (datacite.org), a nonprofit, provides persistent identifiers (DOIs) for research data and other research outputs to enhance standardization and increase the discoverability of data. DataCite also advocates for responsible practices in data sharing and attribution.
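To make the citation-principles idea concrete, here is a minimal sketch of assembling a human-readable data citation from the core elements the FORCE11 declaration and DataCite emphasize (creators, year, title, repository, version, persistent identifier). The helper function, field names, and the DOI shown are all hypothetical illustrations, not an actual DataCite API or mandated format.

```python
# Hypothetical sketch: building a data citation string from the core
# elements recommended for data citation (creators, year, title,
# repository/publisher, version, persistent identifier). The formatting
# here is illustrative only; real repositories vary in citation style.

def format_data_citation(creators, year, title, publisher, version, doi):
    """Return a human-readable citation string for a dataset."""
    authors = "; ".join(creators)
    return (f"{authors} ({year}). {title} (Version {version}) "
            f"[Data set]. {publisher}. https://doi.org/{doi}")

citation = format_data_citation(
    creators=["Doe, J.", "Roe, R."],   # invented names for illustration
    year=2016,
    title="Example Injury Mortality Dataset",
    publisher="figshare",
    version="1.0",
    doi="10.0000/example.doi",          # placeholder DOI, not a real record
)
print(citation)
```

The point of routing every citation through the persistent identifier is that citation counts can then be aggregated reliably, which is exactly what ad hoc data attribution makes impossible.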
Descriptive metadata can be more complicated for datasets than for journal articles, which generally conform to a standard structure: abstract, introduction, lit review, methods, analysis, discussion, conclusion, and references. The components of datasets are less standardized, and the means for describing a dataset are even less consistent. However, research communities are developing best practices and standards for data description. The National Institutes of Health and other medical research entities have standardized variables, known as Common Data Elements, or CDEs (nexus.od.nih.gov/all/2021/06/24/common-data-elements-increasing-fair-data-sharing). Although many other funders are taking strides to standardize how to describe and share data, in other venues, data structure, format, and content don’t always have a prescribed standard to follow. Poor variable labeling, inconsistent file-naming conventions, missing data dictionaries, and vague descriptions of collection methods are common areas where greater documentation and standardization are needed to enhance the discoverability and shareability of open data.
Metadata and description, formatting, DOI assignment, and best practices all contribute to a more open data environment that strengthens the open research ecosystem. Inevitably, however, there will be a need to measure the reuse, sharing, and impact of data. Although metrics for PRJAs are certainly far from perfect, impact indicators at least give some sense of the value of published research. But if citing research data is still not fully standardized, and determining what metadata belongs in a record for shareable data is likewise not standardized, measuring the impact of data through bibliometric and related methods is extremely problematic.
WHAT ABOUT IMPACT?
Efforts are underway to identify citation counts, usage, and other metrics around research data. Two tools are moving toward demonstrating impact: Google’s Dataset Search (GDS; datasetsearch.research.google.com) and Clarivate’s Data Citation Index (DCI; clarivate.com/webofsciencegroup/solutions/webofscience-data-citation-index).
GDS searches more than 25 million datasets. Out of beta since 2020, the tool has not seen wide publicity. GDS is a focused search engine that crawls the web for appropriately web-formatted content in the form of datasets. It indexes datasets marked up with Schema.org, a general-purpose web formatting schema for all types of online searchable objects rather than a data-specific schema (blog.google/products/search/discovering-millions-datasets-web).
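Because GDS relies on Schema.org rather than a data-specific standard, what it actually crawls is JSON-LD markup embedded in repository webpages. The sketch below builds a hypothetical record of that kind in Python; the property names (`@context`, `@type`, `name`, `creator`, `distribution`) are standard Schema.org `Dataset` vocabulary, but all the values are invented for illustration.

```python
import json

# Hedged sketch of the Schema.org "Dataset" JSON-LD markup that a
# repository page might embed for a crawler like Google's Dataset Search
# to discover. The values below are invented; only the property names
# come from the Schema.org vocabulary.
dataset_markup = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "County Poverty and Injury Mortality (example)",
    "description": "Illustrative county-level poverty and mortality data.",
    "creator": {"@type": "Person", "name": "Jane Doe"},
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.org/data/poverty_injury.csv",
    },
}

# A webpage would embed this inside <script type="application/ld+json">.
jsonld = json.dumps(dataset_markup, indent=2)
print(jsonld)
```

Note what this schema does not require: a data dictionary, variable definitions, or a dataset-level DOI, which helps explain some of the gaps observed in GDS results below.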
I did a search for “health disparities poverty united states” in the GDS search box. Unlike Google Scholar, GDS does not have an advanced search. Instead, it has filters, more akin to a Google Image search. The search turned up 31 results. The first is from Statista (statista.com) and only covers Pennsylvania and Alabama. Four results are varying records that all appear to be for the same study, titled “RAND Center for Population Health and Health Disparities (CPHHD) Data Core Series: Decennial Census Abridged, 1990–2010,” which comes up in hits from both RAND and ICPSR. Redundant items from different sources are also typical of what we see in Google Scholar results for journal articles. For example, a record from a journal publisher and the same content in a repository are separate hits in Scholar. Dataset results come from a wide variety of sources such as ArcGIS, Kaggle, figshare, dot-gov sites, Harvard Dataverse, and Statista. Eight of the results concerned countries other than the U.S., which, although bothersome, sometimes happens when searching bibliographic databases as well. So out of 31, let’s say about half are somewhat relevant.
I took one example that looked to be fair to middling in terms of descriptive content, “County Poverty Concentration and Disparities in Unintentional Injury Deaths: A Fourteen-Year Analysis of 1.6 Million U.S. Fatalities,” and dug in a bit. The GDS result shows it was used in 20 publications. Looking at the original data source, which is figshare, I see that it has an Altmetric “donut” and an Altmetric Attention Score of 18. At first, I was amazed to see an impact indicator for the dataset, but the kicker is in the fine print: the Altmetric data listed for the dataset comes from one PRJA in PLOS ONE. Also, the DOI included on the GDS result is the DOI for that one publication (doi.org/10.1371/journal.pone.0153516).
When I clicked through to figshare, what’s displayed are three tables with no data dictionary or other supporting documentation, save for the Altmetric data and some usage statistics. Again, those map to the article’s usage, not to the data itself. What about the impact of the other 19 articles that are based on this data? Well, GDS is quite misleading here: the other 19 articles cite the PLOS ONE article that uses this data, not the data itself. The more I dug in, the less useful the information in this data record turned out to be.
Clarivate’s DCI is a tool to facilitate the discovery and measure the impact of datasets, repositories, data studies, and software. Found on the Web of Science platform, DCI was released in 2012. It is astounding to me that this tool has been around for 11 years. Although I have been aware of it since its inception, I have not heard any scuttlebutt to the effect that DCI is a fantastic tool.
DCI’s selection process is at the repository level, and, as expected, the selection criteria are touted to be more rigorous than what an automated web crawler would include. I did the same “chicken scratch” search in DCI as a Topic search: “health disparities poverty united states”. This time, I got 32 results. Results come from eight repository sources, including figshare. I easily found the same study about unintentional injuries. The DCI indicated the dataset had no citations and no reuse beyond the PLOS ONE article. The listed source URL for the data does go to the figshare page (figshare.com/articles/dataset/County_Poverty_Concentration_and_Disparities_in_Unintentional_Injury_Deaths_A_Fourteen-Year_Analysis_of_1_6_Million_U_S_Fatalities/3248488), where there are more metadata and classification (both subject headings and author-supplied keywords).
Overall, the results from the DCI search were more relevant than those from the GDS search. I did not see a single result that did not cover a U.S. population, although there were some duplicate-looking versions of a couple of titles (including that RAND title above). The overlap between the two tools’ result sets was very low.
Using these two sources, I looked for some unique data impact metrics to evaluate. Citation count is probably the easiest metric to grasp, provided attribution is consistent in citing a given dataset. GDS is downright misleading with its inclusion of publication metrics instead of data metrics. DCI has enhanced metadata, particularly in the subject and keyword fields, and sticks to the citation count of the data itself, while still allowing you to click through to the single associated publication and see that work’s metrics. Neither is especially robust, but at least DCI is straightforward.
TO BE CONTINUED …
When it comes to measuring the impact of research data, we are not quite ready. Best practices are still maturing for both data attribution/citation and discoverability. In 2016, NISO released NISO RP-25-2016, the recommended practice from its Alternative Assessment Metrics Project (niso.org/publications/rp-25-2016-altmetrics). This project included a working group dedicated to the thorny problem of data citation and impact. Results focused mainly on citation practices. The report states that there is little interest in altmetric indicators for data (nor much for PRJAs, for that matter).
Researchers’ motivation to make their data open could be more directly tied to acknowledgment of its impact in the research sphere through research metrics. As Make Data Count has described, researchers often make their data open merely to “tick a box and meet compliance requirements” (makedatacount.org/data-metrics-2); the impetus to share data may not provide a practical reward for the researcher. With the rise of computational social science and humanities, research data sharing and management are only going to grow in importance and, well, impact.
Even though NISO did not identify demand for altmetrics to measure research data impact, I think its best practice criteria for altmetrics, which center on transparency, accuracy, and replicability, should be considered for any metrics used to indicate the impact of research datasets. As the amount of open research data increases, reuse becomes more common, and citation practices become more firmly established, we should expect to see more demand for effective measurement of research data impact.