The case for reproducibility
An explanation of reproducibility and resources to start practicing it
Science is hard. Why make it harder?
Scientists and researchers spend a lot of time on data preparation and analysis, and some of these analyses are quite computationally intensive. The time required to conduct an analysis grows with the complexity of the calculations, the amount of data, and the number of datasets to be analyzed. Many researchers use spreadsheets for their analysis workflows, manually applying the same tasks to each dataset. Much of the time in this type of workflow is spent copying and pasting equations from one spreadsheet or column to another.
What if I told you there was a better way?
By focusing on reproducibility from the start of a project, you can quickly re-run analyses, easily share your methods with colleagues and collaborators, apply the same methods to multiple datasets with little effort, and reduce errors.
What does “reproducibility” really mean?
The term “reproducibility” refers to the ability of others, and of your future self, to easily recreate your work. You should be able to send one or two files and a few instructions that let someone else complete your analysis. There shouldn’t be a laundry list of items to change, directories that must already exist, or old versions of software required.
The best way to accomplish reproducibility is to start scripting your analyses. Scripting is the practice of writing code in a file to perform a certain task or calculation. Rather than producing “one-off” analyses, script your work so you can reference the exact method in the future, re-run the same method on other data, and easily share your processes with colleagues, collaborators, and the public. Any scripting language will do; however, USGS Water is using R.
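As an example, here is a minimal sketch of what a scripted analysis might look like in R. The file paths and the `discharge` column are hypothetical placeholders for your own data:

```r
# Read the raw data from a file instead of opening it by hand
# (the path and column name here are hypothetical)
flow_data <- read.csv("data/streamflow.csv")

# Compute the statistic in code rather than in a spreadsheet cell
mean_flow <- mean(flow_data$discharge, na.rm = TRUE)

# Write the result out so re-running the script regenerates it exactly
write.csv(data.frame(mean_discharge = mean_flow),
          "output/mean_discharge.csv", row.names = FALSE)
```

Every number this produces can be traced back to the code that made it, and re-running the script reproduces the output exactly.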
Tips and tricks to having reproducible workflows
- script your work!
- comment the steps in your code
- use relative (not absolute) filepaths (see the sketch after this list)
- limit the number of applications/programs being used when possible
- keep up to date with software versions
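To illustrate the filepath tip, here is a small sketch (the paths are made up): a script with an absolute path only works on one machine, while a relative path works for anyone who copies the project folder and runs the script from the project root.

```r
# Absolute path: breaks as soon as the project moves to another computer
# flow_data <- read.csv("C:/Users/jsmith/projects/flow-analysis/data/streamflow.csv")

# Relative path: works wherever the project folder lives
flow_data <- read.csv("data/streamflow.csv")
```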
What can you do?
If you don’t know where to start, try learning some basic R. There are many resources: tryR from Code School, swirl, and the USGS Introduction to R Course.
Next, try to script just one piece of your analysis. Pick a set of tasks that need to be applied to several similar datasets or need to be run repeatedly. Write a script that automates one iteration through those tasks. Then reduce your analysis time and mistakes by applying that same script to all of the datasets or runs. Better still, move your code into a loop so that the script automates the repetition, too.
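As a sketch of that idea, the loop below applies one summary task to every CSV file in a data folder. The folder name, column name, and `summarize_flow()` helper are all hypothetical placeholders:

```r
# A hypothetical task to repeat for each dataset
summarize_flow <- function(path) {
  flow_data <- read.csv(path)
  data.frame(file = path,
             mean_discharge = mean(flow_data$discharge, na.rm = TRUE))
}

# Find every CSV in the data folder and apply the same method to each
files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
results <- do.call(rbind, lapply(files, summarize_flow))
```

Adding another dataset then costs nothing more than dropping a new file into the folder.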
Reproducibility can go beyond your local files. Maybe your plots and tables are scripted, but you’re still having to copy and paste into slides or a manuscript. R Markdown can automate the process of inserting figures and tables into PDFs, Word documents, and slides.
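For instance, a minimal R Markdown document might look like the sketch below; the title, data file, column name, and output format are placeholders. Knitting the document recomputes the inline number, regenerates the figure, and inserts both into the rendered Word document automatically:

````markdown
---
title: "Hypothetical analysis report"
output: word_document
---

```{r setup, echo=FALSE}
# Hypothetical data file and column name
flow_data <- read.csv("data/streamflow.csv")
```

The mean discharge was `r round(mean(flow_data$discharge, na.rm = TRUE), 1)`.

```{r flow-plot, echo=FALSE}
# Regenerated and re-inserted on every knit
plot(flow_data$discharge, type = "l", ylab = "Discharge")
```
````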
Shuffling files back and forth between contributors and peer reviewers is time consuming and can quickly get confusing. Version control is a way to avoid this mess: it tracks every deletion, every addition, and every contributor that interacts with your code. It is especially useful when there are multiple contributors, because you never have to email around files at varying stages. In fact, this post was created using version control. Our group uses Git and GitHub as version control tools, but that’s not the only choice.
And finally, encourage colleagues and collaborators to strive for reproducible science!
In conclusion…
Watch this video on the “horrors” of non-reproducible workflows by Ignasi Bartomeus and Francisco Rodríguez-Sánchez.