Submissions that include data and code for journal publications

Hi IM friends,

I recently received a data submission that includes raw data files plus multiple R scripts used to run a complex workflow of analyses on those data for an upcoming journal submission. The data + code submissions I’ve handled until now have been fairly simple (just a data file or two and one R script), and those were published in EDI. The submitter in this case would like to make the data + code available for journal reviewers to run in a reproducible way, and I’m not sure that simply uploading everything to EDI would be best practice. Do any other IMs have experience with data + code submissions, and how would you suggest making such data and code available to reviewers?

I found one strategy used by Christensen et al. where raw JRN data in EDI were cited, and the authors made all data + code available in a Zenodo copy of a GitHub repository. Would publishing the raw data to EDI and making the files + code available through Zenodo be a good practice to follow if a researcher is proficient in GitHub?

Thanks


We at the LNO had a working group with a similar issue that did the following:

  1. Archived all data in a public data repository
  2. Edited their scripts to assume data were downloaded from the data repository in step 1
  3. Created a GitHub “release” to tag that version of the scripts
  4. Put the ZIP file created by that GitHub release in a public code archive
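Step 2 above usually means replacing a hard-coded local file read with a download-if-missing step. The submitter's scripts are in R, where the same pattern is a `download.file()` call guarded by `file.exists()`; here is the shape of it sketched in Python (the function name and URL handling are mine, purely illustrative):

```python
# Sketch of step 2: instead of reading a file assumed to be on the
# analyst's machine, the script fetches the archived copy from the
# data repository the first time it runs, then reuses the local copy.
import urllib.request
from pathlib import Path

def fetch_data(url: str, cache_path: str) -> Path:
    """Download the archived data file once; reuse the cached copy afterwards."""
    path = Path(cache_path)
    if not path.exists():
        with urllib.request.urlopen(url) as response:
            path.write_bytes(response.read())
    return path
```

A reviewer can then run the scripts from a clean checkout with no manual data staging, since every input resolves to the archived copy.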

The script-editing phase can be cumbersome, though, depending on how the scripts were originally written, so that may not be the best solution in this case. Hope this helps as one option, though!


Is there a good public code archive out there? Zenodo is convenient if using GitHub, but are there discovery or access advantages compared to archiving the code in EDI?


I think there are different advantages of putting code in different places. We support various combinations of:

  1. Code in same data package with associated data in KNB, ADC, EDI, Dryad, Zenodo, etc.
    • code stays close to the data, but it is harder to maintain a reusable software-package structure and releases; it is easy to include “literate” computational notebooks that help tie the computation to the data and the work
  2. Code in its own package in any of those same repositories
    • works fine; the code has its own release stream and can be cited independently, but it is harder to mirror the code-release process (except with Zenodo)
  3. Code in its own package in Software Heritage, mirrored from git
    • the code has its own release stream; the package structure mirrors the packaging structure of the source code repository; software releases/tags in the code repo automatically create new releases in Software Heritage; and the repository supports the CodeMeta software metadata standard

We generally recommend a combination of (1) for simple scripts and notebooks that are tightly tied to an analysis or a particular paper/report, and (3) for reusable software that is nicely packaged, such as R or Python packages.
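For option (3), the CodeMeta standard is just a small JSON-LD file (`codemeta.json`) kept at the top of the code repository. A minimal sketch might look like the following, where every field value is a hypothetical placeholder but the property names come from the CodeMeta standard:

```json
{
  "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
  "@type": "SoftwareSourceCode",
  "name": "example-analysis-scripts",
  "description": "Analysis scripts supporting an example journal submission",
  "codeRepository": "https://github.com/example/example-analysis-scripts",
  "version": "1.0.0",
  "programmingLanguage": "R",
  "license": "https://spdx.org/licenses/MIT"
}
```

Archives that support CodeMeta can read this file to populate their own metadata records when the repository is mirrored.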

Another issue is how data are linked into analytical code. We recommend either using a content-based identifier, or at least a stable repository URI that the repository commits to supporting over decadal time periods. This is often not the URI of the dataset landing page. We have a discussion of approaches to reproducible data access, illustrated with some EDI data packages: NCEAS Open Science Synthesis for the Delta Science Program - 11 Reproducible Data Access
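A content-based identifier is essentially a cryptographic hash of the data bytes, so any copy of a file can be named and verified independently of where it was downloaded from. A minimal sketch of the idea (the `hash://sha256/...` form follows the convention used in the content-identifier ecosystem; the function names are mine):

```python
# Sketch of content-based identification: the identifier names the
# bytes themselves, not a location, so a re-downloaded copy can be
# checked against the identifier recorded in the analysis code.
import hashlib
from pathlib import Path

def content_id(path: str) -> str:
    """Return a content-based identifier for a file's bytes."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return f"hash://sha256/{digest}"

def verify(path: str, expected_id: str) -> bool:
    """Check that a local copy still matches the recorded identifier."""
    return content_id(path) == expected_id
```

This is the property that makes content-based identifiers more durable than landing-page URLs: if the repository moves or mirrors the file, the identifier still resolves to exactly the same bytes.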

Matt
