DATA-C: Five Steps to Better Data Curation
by Matthew Benzing
Research findings have long been communicated through journal articles. By publishing their articles, scholars have been able to share their methodology and their findings, with perhaps some samples of data in tables or graphs. But until recently, the full datasets were difficult, if not impossible, to share. Still, sharing data is important because it not only permits the results to be duplicated and verified but can also lead to new discoveries. With the advent of the internet, it has become relatively simple to share data across networks. This has led to a new era of scholarly communications and a new role for librarians as data curators.
Data curation is not some esoteric practice. Digital photos are an example of how technology has contributed to a proliferation of data, scattered it, and created the need to curate it. Most of us are experiencing data curation firsthand as we attempt to manage our own digital photo collections. The issues we face in doing so are not that much different from the issues involved in managing research data that, as librarians, we are now tasked with curating.
The past few decades have seen drastic changes in the way that we take and store photographs. Many of us used 110 or 35mm film cameras. The resulting photos had to be developed and printed to be useful. We often kept the negatives so that we could get a perfect copy in the future if need be, since without the negative, we could only get an inferior, but adequate, copy from a print. The same issue applies in the research data-curation space. One of the first decisions to be made is whether to keep a lossless file that can make perfect reproductions or a lossy file that can make adequate reproductions.
Our personal photo collections also reveal another point about the current data-curation landscape: Digital datasets quickly become huge, and the ability to capture data expands with every new device we use. Once the first point-and-shoot digital cameras became available, photographs became data; as digital files, they could easily be copied, transmitted, and stored. Instead of being limited to a 24- or 36-exposure roll of film, thousands of pictures could be stored on a memory card. So people began taking a lot more pictures. Some of those pictures were printed, but most were just filed away. As memory cards filled up, the photos were often transferred to a computer hard drive. As scanners became widely available, it became possible to make digital copies of old prints, and these could be added to one’s digital photo collection. Finally, smartphones arrived, allowing all of us to photograph everything.
By now, many of us are deeply into data-curation decision making as we face the question of how to deal with photos that are scattered among old hard drives or flash drives or are burned onto CDs. The process is complicated by multiple file types used by different devices, multiple file-naming conventions, many duplicates, and little-to-no metadata to tell us the details of the subject, time, and place of the photo. Multiple online repositories (such as Google Photos, Amazon Photos, iCloud, and others) exist to provide a place to gather our images, but if we really want to be able to use them in the future, we will want to organize them, throw out the duplicates and the bad photos, and add descriptions.
These are essentially the same issues we must face when, as librarians, we go about helping faculty members manage their research datasets.
Research Data Curation
I tend to think of data curation as the human side of data management. Curation suggests human decision making. When online music services claim that their playlists are curated, they typically mean that a person, not an algorithm, chose those songs. In the library context, I think of data curation as the point at which a human curator takes data from a researcher and uses various tools and techniques to turn it into something that can be handled by a machine. There are many things that curators might find themselves doing, but one of the most important and most human things they must do is ask questions. Research data curation is the process of asking questions about datasets and optimizing that data based on the answers. It is a value-added process that takes data and enhances its usability for the research community.
The Data Curation Network (DCN) is an organization that was formed to facilitate sharing data-curation resources and staff across six major academic libraries (Johnston, et al., 2016), with an eye toward involving more institutions over time. In 2017, the DCN did a study of the current state of data curation among researchers at universities (Johnston, et al., 2018). The study identified a set of data-curation activities that are desired by researchers (datacurationnetwork.org/data-curation-activities), but revealed, “Our study found gaps in support for data curation activities that are very important … but that are either not happening or not happening in a satisfactory way. …” and that “These may be areas of opportunity for libraries to invest in new services and/or heavily promote services that may already exist but are not reaching the researchers who value them” (Johnston, et al., 2018).
Many researchers understand the need for these practices but do not have the time or the support staff to properly curate data. This opens up an opportunity for librarians to provide these services. If libraries do not, others will; many publishers are beginning to offer data-curation services as part of the publication process.
Some of the areas in which libraries can supply curation support include the following:
- Creating adequate documentation
- Providing secure storage
- Performing quality assurance for data
- Creating or applying metadata
- Visualizing data
- Educating researchers
Faculty might not know the importance of a data-management plan, adequate documentation, security, open data, or privacy. Offering workshops, presenting at departmental meetings, or engaging in one-on-one training sessions are some of the ways that librarians can offer support.
Unfortunately, even among faculty members who recognize the need for data management, many do not realize that this is an area in which librarians can help. I am a liaison at my university for computer science, computer engineering, electrical engineering, statistics, analytics, and physics. I used a survey to find out where the faculty in those disciplines at my institution were in terms of data management; the survey also acted as a way to let them know that, as a librarian, I have data skills that could be of use to them. The findings were interesting. One of the questions had to do with whether or not they had engaged in any data-archiving activities.
While many of the respondents were required to provide data-management plans in the past, most had never used the data-management plan generator DMPTool, a service that our university makes available to them; in fact, in follow-up comments, several noted that they had never heard of it. It was also surprising that none of the respondents had used Dryad, Mendeley, figshare, or Open Science Framework (OSF), either as repositories or as research resources.
We also asked researchers if they knew where their research data was being stored. The results are in the chart below.
The fact that so much data is being stored in a physical medium that can be easy to lose—and is prone to failure—is an indicator that more needs to be done in terms of education.
Steps in a Data-Curation Workflow
Once research faculty members are persuaded of the need to properly curate and deposit their data in a reliable repository, it is necessary to follow a standardized curation workflow. There are many competing models for doing this. I use one that I have developed, DATA-C, which stands for the following:
- Documenting the data
- Asking questions
- Translating into open formats
- Assessing FAIRness
- Cleaning and validating
1. Documenting the Data
The first thing we need to do is make sure that the data is adequately documented. Is there metadata attached? Is there a readme file or an abstract to explain the data, or a DOI for any articles that the data informs? We also have to make sure the documentation makes sense. As curators, we are often the first people other than the researchers who created the data to use it and possibly spot errors. In many respects, it’s the same as reviewing a document. When any of us have our written work proofread, the proofreader discovers errors that we miss because our mind knows what we were trying to say and unconsciously fills in the gaps for us. It is the same with data. The labels and descriptive metadata that make perfect sense to the researcher who generated the data might be total gibberish to someone trying to use the data for a new project. Part of the curator’s role is evaluating how much sense the data makes. Then the curator is in the position to improve the data’s intelligibility and ease of reuse.
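As a rough illustration of the first pass described above, a few lines of Python can flag the most obvious documentation gaps before a human reads anything. This is only a sketch under stated assumptions: the function name, the idea of a separate data dictionary file, and its "column" and "description" fields are hypothetical conventions chosen for the example, not part of any standard or of the DATA-C workflow itself.

```python
import csv
from pathlib import Path


def documentation_gaps(dataset_dir, data_file, dictionary_file="data_dictionary.csv"):
    """Return a list of plain-language documentation problems.

    Assumes (hypothetically) that the dataset is a folder holding one CSV
    data file and, ideally, a readme plus a data dictionary CSV with
    'column' and 'description' fields.
    """
    root = Path(dataset_dir)
    gaps = []

    # Accept any file whose name starts with "readme", in any letter case.
    if not any(p.name.lower().startswith("readme") for p in root.iterdir()):
        gaps.append("no readme file")

    # Only the header row of the data file is needed here.
    with open(root / data_file, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])

    # Collect the columns that have a non-empty description.
    described = set()
    dict_path = root / dictionary_file
    if dict_path.exists():
        with open(dict_path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                if (row.get("description") or "").strip():
                    described.add((row.get("column") or "").strip())
    else:
        gaps.append("no data dictionary")

    for name in header:
        if name.strip() not in described:
            gaps.append(f"column '{name}' has no description")
    return gaps
```

A check like this cannot judge whether a description actually makes sense to an outsider; that judgment remains the curator's job. It simply narrows down where to look first.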
2. Asking Questions
Curation is not just a technical profession; you do need people skills. Start by developing a list of questions that researchers need to answer about their dataset. Since most researchers are pressed for time, this list is advantageous to them; it saves time by avoiding a lot of back-and-forth. It also lets you get your work done efficiently. Based on the answers, your job as curator is to make whatever changes to the metadata are necessary for the data to be intelligible to another researcher. Not every dataset needs to be intelligible to the average person on the street; however, it should at least be comprehensible to a graduate student in the discipline for which it was created, since the audience is other researchers in the same field.
In this process, librarians often end up having to work with subjects they are not terribly familiar with. Fortunately, there are two sets of resources that can help you provide curation services for unfamiliar disciplines. The first is the Data Curation Profiles, a series of documents that explain the data-collection and usage practices for a number of disciplines. They can be useful in understanding how to discuss data issues and ask questions of researchers. The Data Curation Profiles are housed at Purdue University and can be accessed at datacurationprofiles.org.
Second, Data Curation Primers are reference resources provided by the DCN that explain the curation issues surrounding various file types. They are especially useful for curators dealing with the need to curate output from unfamiliar software. The Data Curation Primers can be found at datacurationnetwork.org/resources/data-curation-primers.
3. Translating Into Open Formats
Datasets need to be compiled in formats that are open and reliable. Proprietary formats are usually not desirable. For one thing, the whole point of having open data is to make it available to anyone who wishes to use it for research. Having that data in a format that can only be read by proprietary software limits the audience to those who can afford the necessary product. Also, proprietary formats have a history of becoming obsolete when competing products prevail in the marketplace. If data is stored in open formats (such as CSV), there is a much better chance that it will be readable well into the foreseeable future. The Library of Congress has a page that suggests the best file formats for a variety of data types (loc.gov/preservation/resources/rfs/index.html).
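One way to begin this step is simply to list which deposited files are in formats that probably need translating. The sketch below does that by file extension. The mapping is a small illustrative assumption made for this example, not the Library of Congress list, and the function name is hypothetical; the actual conversion would still be done in the originating software or a suitable tool.

```python
from pathlib import Path

# Illustrative mapping only (an assumption for this sketch): a few
# proprietary extensions and an open format they are commonly exported to.
OPEN_ALTERNATIVES = {
    ".xls": ".csv",  # legacy Excel workbook
    ".sav": ".csv",  # SPSS data file
    ".dta": ".csv",  # Stata data file
    ".psd": ".tif",  # Photoshop document
}


def flag_for_translation(filenames):
    """Map each proprietary-format file to a suggested open-format file name.

    Files whose extension is not in OPEN_ALTERNATIVES are left out of the
    result, which means "no action suggested", not "confirmed open".
    """
    suggestions = {}
    for name in filenames:
        path = Path(name)
        target = OPEN_ALTERNATIVES.get(path.suffix.lower())
        if target:
            suggestions[name] = path.with_suffix(target).name
    return suggestions
```

For example, passing in `["survey.SAV", "readings.csv"]` would flag only the SPSS file and suggest `survey.csv` as the open-format counterpart.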
4. Assessing FAIRness
What do we mean by FAIRness (findable, accessible, interoperable, reusable)? The FAIR Data Principles were established in 2016 by a consortium of publishers, librarians, and academics as best practices to follow in optimizing the usefulness of data. They have no binding power but have been quickly adopted by many curators and are therefore recommended as a checklist for them to follow (Wilkinson, et al., 2016).
Findable— Is it possible to find the data? Does it have a unique identifier such as a DOI or handle? Most of the time, these codes are assigned by a publisher when the article is published. If not, there are services that will provide identifiers, such as DataCite (datacite.org/dois.html) and OSF (help.osf.io/m/sharing/l/524208-create-dois).
Accessible— In order to be used, the data needs to be kept within an open repository that does not limit access through a paywall or other means. It also has to be indexed so that a reasonable search strategy will lead a researcher to it, and it should be properly cited so that future researchers who use the data can give credit to the researchers who compiled it.
Interoperable— In order to make sure that the data can be used properly, it needs to be in an open format (as was discussed previously) and it has to have appropriate metadata attached. There are many metadata schemas, and applying the most suitable one will increase the data's usefulness. A list of schemas can be found at the RDA Metadata Directory (rd-alliance.github.io/metadata-directory).
If the data is computer code, you should decide if an operating environment and dependencies have to be included (for example, as a Dockerfile) or if the code can be archived separately in a repository that provides those services.
Reusable— Has a Creative Commons license been approved by the researcher? In order for the data to be useful, future researchers need to have the legal rights to utilize it. The different types of Creative Commons licenses are detailed here (creativecommons.org/share-your-work).
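The four questions above can be kept as a simple checklist record for each dataset. The following is a minimal sketch, assuming a handful of yes/no and free-text fields that I have named for illustration; the field names and the pass/fail rules are my own simplification of the questions in this article, not an official FAIR scoring method.

```python
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    """Hypothetical checklist fields a curator might fill in by hand."""
    identifier: str = ""           # DOI or handle, if one exists
    open_repository: bool = False  # no paywall or other access barrier
    indexed: bool = False          # discoverable through a reasonable search
    open_format: bool = False      # files are in open formats
    metadata_schema: str = ""      # name of the schema applied, if any
    license: str = ""              # e.g., a Creative Commons license name


def assess_fairness(record):
    """Return a pass/fail answer for each of the four FAIR headings."""
    return {
        "findable": bool(record.identifier.strip()),
        "accessible": record.open_repository and record.indexed,
        "interoperable": record.open_format and bool(record.metadata_schema.strip()),
        "reusable": bool(record.license.strip()),
    }
```

A record with a DOI, an open and indexed repository, open file formats, and a license, but no metadata schema, would pass every heading except "interoperable". That tells the curator which question to take back to the researcher.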
5. Cleaning and Validating
This is the last element of the DATA-C workflow: checking for duplicates, misspellings, clerical errors, and other low-level issues before files are ready to be ingested into a repository. OpenRefine (openrefine.org) is a free, open source tool that takes much of the tedium out of this process.
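For readers who want to see what the simplest of these checks look like, here is a small sketch in plain Python. It is not how OpenRefine works internally; it only illustrates two basic cleaning operations, tidying stray whitespace and dropping rows that are exact duplicates once tidied. The function name is hypothetical, and misspellings or near-duplicates would need the clustering features of a tool such as OpenRefine or a human eye.

```python
import csv
import io


def clean_rows(rows):
    """Tidy whitespace in every cell and drop blank and duplicate rows.

    Rows are compared after tidying, so " Quercus  alba " and
    "Quercus alba" count as the same value. The first occurrence of each
    row is kept and the original order is preserved.
    """
    seen = set()
    cleaned = []
    for row in rows:
        # split() + join trims the ends and collapses runs of whitespace.
        normalized = tuple(" ".join(cell.split()) for cell in row)
        if not any(normalized):  # skip rows with no content at all
            continue
        if normalized in seen:   # skip exact duplicates
            continue
        seen.add(normalized)
        cleaned.append(list(normalized))
    return cleaned


# Usage with a small in-memory CSV (made-up sample values):
raw = "site,species\nA1, Quercus  alba \nA1,Quercus alba\n,\nB2,Acer rubrum\n"
rows = list(csv.reader(io.StringIO(raw)))
cleaned = clean_rows(rows)
```

Because the header row is treated like any other row, it passes through unchanged as long as it is not repeated later in the file.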
Once the data has been cleaned, it will be ready for ingestion. Whether it’s uploaded to your institutional data repository or to one of the many open repositories on the internet depends on your institution’s policies.
It’s Up to Us
The massive shift in scholarly communication that is being brought about by the data revolution has increased the responsibilities of researchers, while opening up new avenues of investigation. It also creates new opportunities for librarians as curators of the new collections of data that will change the nature of research. It is an exciting time to be in the library profession, especially if we take the initiative and position ourselves as the best people to fill this gap.