Making the Case for ‘Open’ in Libraries

Going beyond the article, researchers will also need to seek out the original components of previously conducted research (their own or others') in order to reproduce or reuse the work. These components include the data, the methods, and the computational environment that was used to conduct the analysis. The pathway to them is often the article, which, in theory, includes references to the code, data, and methods that were used in the research. However, it is precisely here that a problem occurs. As a 2017 article in The Atlantic (theatlantic.com/science/archive/2017/01/what-proportion-of-cancer-studies-are-reliable/513485) points out:

The hardest part, by far, was figuring out exactly what the original labs actually did. Scientific papers come with methods sections that theoretically ought to provide recipes for doing the same experiments. But often, those recipes are incomplete, missing important steps, details, or ingredients. In some cases, the recipes aren’t described at all; researchers simply cite an earlier study that used a similar technique.

The solution lies in the ways in which we create the underlying research components and the tools used to openly share and disseminate output.

Accessing and Reusing Code and Data

Back in 2017, Simon Adar, a Runway postdoc awardee at the Jacobs Technion-Cornell Institute, observed the following (medium.com/codeocean/one-postdocs-path-to-reproducibility-d165cb9a8065):

‘Everything could have been done much faster.’ This was my main reflection, just after finishing my Ph.D. Like many scientists, I relied on previously published works and tried to build upon them. If you have ever tried to reuse somebody else’s research, chances are it was a challenge.

Adar proceeds to comment on the roadblocks that he encountered trying to reuse someone else’s code: obtaining the right computing environment and language, finding possible missing files and dependencies, and endlessly troubleshooting errors. Noting the incredible waste of time, Adar set out to devise a solution that would make the material available to any user. To do so, he founded Code Ocean.

Code Ocean is an online open access code execution platform that allows authors to find, create, and share code and data. With Code Ocean, authors can publish their code, data, and computing environment in a compute capsule. The Code Ocean compute capsule can be freely accessed, embedded in an article or web page, cited, deposited in a repository, and/or preserved.

To illustrate the point, take a look at the graph shown below, which is from an F1000Research article by Yoav Gilad and Orna Mizrahi-Man (“A Reanalysis of Mouse ENCODE Comparative Gene Expression Data,” https://f1000research.com/articles/4-121).

Other than illustrating the analysis, the graph itself does not provide insight into the underlying code and data or the exact computational environment that would allow anyone to reproduce the work. The F1000 article, however, links to the Code Ocean capsule, shown on page 28, while also embedding the capsule within the article, making it publicly accessible and ensuring that a researcher gains access to precisely those components.

Through the compute capsule, the research community has access to an executable instance of the analysis that will always run—today, tomorrow, and in the future. Researchers can then create, share, collaborate, and curate the findings in the article itself or preserve them in an approved repository. This will ultimately allow researchers to build upon the original findings without concerns about reproducibility or wasting time setting up the environments.

Accessing and Reusing Research Methods

Computational code and data are only two important aspects of research. Of additional significance are the research methods: the steps, techniques, or processes used in the course of research. These methods too must be readily found, accessed, and shared to support the advancement of science. But again, at times, the process is far from flawless. Lenny Teytelman, who in 2012 was a postdoctoral researcher at MIT, grew increasingly frustrated trying to correct a single step of a previously published method. Moreover, as he was correcting an existing technique rather than developing a new one, he did not have a way to share the correction with anyone else who would be using the method. It was this frustration that led Teytelman, along with Alexei Stoliartchouk and Irina Makkaveeva, to co-found protocols.io.

protocols.io is integral to the reuse and reproducibility of research. An open access repository, protocols.io supports the sharing, discovering, and discussing of research methods and is recommended by publishers and funders. Research methods can be shared, cited, deposited, and preserved. Plus, researchers can collaborate on methods in real time.

Any researcher can find and view the method or previous versions of the published method. The method can be copied, forked by others, and kept up-to-date by the author through versioning. And researchers can comment on the method—on individual steps or on the method in its entirety.

From Research Creation to Preservation

Thus far, I’ve considered only the tools that researchers require to find, create, publish, and reuse research output, specifically code, data, and methods. The institution—including the library—may deliver these tools and offer its constituents centralized platforms in order to work more efficiently and advance open science. The provisioning of these platforms to constituents carries an added benefit—the ability for the library to gain important stewardship over the output: to understand its impact, collect, safeguard, and preserve it.

Recognizing that we must safeguard and preserve research output for generations to come, we witness different approaches to, or rather different levels of, digital preservation. Services such as LOCKSS ensure the persistence of digital information by keeping lots of copies of the generated research output. But we may need to go further as we consider a myriad of additional factors. Matthew Addis, founder and CTO of Arkivum Ltd., and previously a research group leader at the University of Southampton IT Innovation Centre, discusses the distinctive dimensions of long-term data management and digital preservation in a blog post (ebsco.com/blog/article/long-term-data-management-automating-digital-preservation). He writes:

Long-term data management includes stewardship of digital content over decade timescales or longer. Digital curation and digital preservation are key activities and help ensure that content is properly protected and safeguarded, it remains findable and discoverable, and it remains accessible and usable for whoever needs to access it in the future.

Indeed, we can start with lots of copies kept across geographic locations and then go on to consider automated workflows that best serve our preservation and usability objectives. We may consider processes to ensure that files can always be read, to recognize and convert formats, to better organize the files we seek to preserve, to automatically extract metadata, and to support policies for secure sharing of data. Addis founded Arkivum, which initially provided a data archiving service, primarily for higher-education institutions that need to safeguard their research data. Today, Arkivum provides Perpetua as a SaaS solution for long-term data management and digital preservation of a wide range of content types that fall within the remit of libraries. This includes the outputs of research projects, such as research data and publications, as well as curated digital resources within the library, including special collections and archives, that are used in research projects. Institutions may connect Perpetua to external platforms where content is created (such as Code Ocean and protocols.io) or to the institutional repository where it is deposited, and then put the content through a safeguarding and preservation process to ensure that, in the end, it can be consumed by anyone at any given time.
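To make two of these automated steps a little more concrete, here is a minimal Python sketch of what a workflow might do with replicated files: compare checksums across copies to spot a damaged one, and extract a few basic metadata fields. It is an illustration only and does not represent how LOCKSS, Arkivum, or Perpetua work internally. The function names (sha256_of, describe, find_damaged_copies) are hypothetical, and the format guess relies solely on the file extension, which a real preservation system would not consider sufficient.

```python
import hashlib
import mimetypes
from collections import Counter
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute a SHA-256 checksum by reading the file in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe(path: Path) -> dict:
    """Extract minimal metadata for one file (hypothetical field names)."""
    # Assumption: the format is guessed from the extension only.
    guessed_type, _ = mimetypes.guess_type(path.name)
    return {
        "name": path.name,
        "size_bytes": path.stat().st_size,
        "format": guessed_type or "unknown",
        "sha256": sha256_of(path),
    }


def find_damaged_copies(copies: list[Path]) -> list[Path]:
    """Return the copies whose checksum differs from the most common one.

    Assumption: most copies are intact. With only two copies, or with an
    even split, this simple vote cannot tell which copy is the good one.
    """
    checksums = {copy: sha256_of(copy) for copy in copies}
    majority, _ = Counter(checksums.values()).most_common(1)[0]
    return [copy for copy, value in checksums.items() if value != majority]
```

A library might run a check of this kind on a schedule against copies held in different locations and replace any copy that is flagged from one of the intact ones.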

Tamir Borensztajn is a librarian and VP of SaaS strategy, EBSCO Information Services.