Reproducible Data Analysis Workflow
Work in Progress Updated Feb 6, 2026
Notes from https://raps-with-r.dev/
Quotes
“Any human interaction with your analysis is a source of errors”
“copying and pasting is forbidden”
Motivation
How can you put in tests to make sure that updates to data go smoothly?
How easy is it to reuse code for another project?
For reusability, nothing beats structuring your code as functions and, ideally, packaging them.
Main Tools/Skills
Version control with git
Functional programming
Literate programming
Docker = Snapshot/Freeze computational environment
Fusen = turn Rmds into a package
testthat = for unit testing
targets = pipeline package
Main Files for Project
two .Rmd files
save_data.R
analysis.R
R
domain specific (statistical)
interpreted = results return immediately when code is executed in the console
this differs from compiled languages (e.g. C), which require code to be compiled into binaries before execution
text editor
RStudio
packages/library
extend base R capabilities
library = collection of packages installed
dplyr = package for data manipulation
purrr = package for functional programming
stringr = package for manipulating strings
readxl = reads Excel workbooks
janitor = common cleaning tasks, such as renaming data frame columns to “snake case” (ex: word_word_word)
objects
datasets, plots, models = objects
saved in the global environment
see list of global objects with ls() typed into console
do not save workspace
Minimal Reproducible Example (MRE)
should be able to run code by copying/pasting into R console
reprex package can help write one
sessionInfo()
built in datasets in R (mtcars)
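A minimal sketch of what an MRE could look like, assuming the built-in mtcars dataset stands in for your own data. The model is purely illustrative and not from the original notes.

```r
# Minimal reproducible example: uses only base R and a built-in dataset,
# so anyone can paste it into a fresh console.
data(mtcars)

# Illustrative model (hypothetical question): fuel consumption vs. weight
fit <- lm(mpg ~ wt, data = mtcars)
print(coef(fit))

# Record the R version and loaded packages alongside the example
print(sessionInfo())
```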
Quarto
qmd/rmd is a flavor of markdown
include = FALSE (run, but show neither code nor output)
echo = FALSE (run and show output, but not code)
eval = FALSE (do not run)
[^1] = footnote
results = "asis" (don’t want the parser to worry about this bit of output, it’s already good)
plots need child documents
helpful packages
flextable, modelsummary
Version Control with git
Keep track of changes to text files
Keep commit messages as short and explicit as possible
- short, clear messages, do a commit after each change
Not a good idea to code all day then push one, single big fat commit
push = upload
pull = download
git log = commit messages + unique identifier (hash)
git blame
git revert <hash>
ssh public/private key for github
Main tabs on github website = Issues, Pull Requests, Actions, and Settings
Branches = copy of current project - if an experiment does not work, the branch can be discarded
“Trunk-based development”
Github Actions = “continuous integration service”
Do not put projects tracked with Git in DropBox, OneDrive… - They do not deal gracefully with conflicts
trunk-based coding
Small branches, merge everyday
Merge as soon as possible even if feature wanted is not done yet
simplified trunk based development
new branch for new feature
at end of day at latest, branches need to all get merged
conflicts need to be taken care of at that point
main branch always contains working, production-grade code
to enforce discipline it may be worth making opening pull requests mandatory for merging back to the trunk and require review
how to unstage from git
git rm --cached <files>
example code/how to create and merge branch
git checkout -b "branch-name"
git add .
git commit -m "fixed #1"
- The #1 refers to the number of the issue in the repo. The issue gets closed automatically once the commit is merged into the default branch
git push origin branch-name
“Compare and pull request” on github
Review
Merge pull request
delete branch
Git = by line
Dropbox/OneDrive = by file
Functional Programming
R is a functional and object oriented programming language
Loops that require you to interact with the global environment = bad
functional programming is a programming paradigm that relies exclusively on the evaluation of functions to achieve the desired result
two main elements are functions and lists
state of your program
f <- function(name) { print(paste0(name, " likes lasagna")) }
f("Bruno")
when running a function, the state of the program doesn’t get altered.
avoid functions that change state
Predictable functions
Referentially transparent - does not use any variable that is not also one of its inputs
pure functions - don’t interact with the global environment - don’t write to or require anything from the global environment
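A small sketch contrasting the two kinds of function described above. The function names and the `suffix` variable are made up for illustration.

```r
# Impure: reads `suffix` from the global environment, so the result
# depends on state that is not visible in the function signature.
suffix <- " likes lasagna"
impure_greet <- function(name) {
  paste0(name, suffix)
}

# Pure: everything the function needs is passed in as an argument,
# and nothing outside the function is modified.
pure_greet <- function(name, suffix) {
  paste0(name, suffix)
}

impure_greet("Bruno")
pure_greet("Bruno", " likes lasagna")

# Changing the global variable silently changes impure_greet(),
# while pure_greet() with the same inputs returns the same output.
suffix <- " likes cassoulet"
impure_greet("Bruno")
pure_greet("Bruno", " likes lasagna")
```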
functions = first-class objects
- can define a function that takes another function as input
if you don’t know how many arguments a function that you are wrapping uses, you can use "..." as a placeholder for any additional arguments
- ?dots in R console to learn more
Function Factories - functions that return functions
make arguments optional with NULL
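The ideas above (functions as inputs, the dots, optional NULL arguments, function factories) can be sketched in a few lines. `apply_and_round()` and `make_power()` are hypothetical names chosen for this illustration.

```r
# Higher-order function: takes another function `f` as input and
# forwards any additional arguments to it through `...`.
apply_and_round <- function(x, f, ..., digits = NULL) {
  result <- f(x, ...)
  # `digits` is optional: NULL means "do not round"
  if (is.null(digits)) {
    return(result)
  }
  round(result, digits)
}

apply_and_round(c(1, 2, NA), mean, na.rm = TRUE)  # na.rm is passed on to mean()
apply_and_round(c(1, 2, 2), mean, digits = 1)     # rounded to one decimal

# Function factory: returns a new function that remembers `exponent`.
make_power <- function(exponent) {
  force(exponent)
  function(x) x^exponent
}

square <- make_power(2)
square(4)
```

Arguments placed after `...` in a signature have to be named in full by the caller, which is why `digits` is always passed by name here.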
lists
to make functions work together by sharing a common interface = lists
data(mtcars) = list
model = list
ggplot = list
“lists all the way down”
lists can hold different types of objects, can even hold other lists
reduce()
lapply()
purrr = helps abstract loops away
before diving into loops, check if functions you are using are vectorized
- or if there is a simple way to express that computation in terms of matrix multiplication
recursive function - a function that calls itself
write functions that do one thing and do it well. write functions that work together. write functions that handle lists because that is a universal interface
a |> f() |> g() |> h()
is better than big_function(a)
- can reason about smaller functions more easily. “eat elephant one bite at a time”
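A rough sketch of small, list-handling functions chained with the pipe, plus Reduce() and a recursive function. It uses base R only (lapply() and Reduce()); purrr::map() and purrr::reduce() play the same roles. All function names are invented for the example.

```r
# Three small functions, each doing one thing, all taking and
# returning lists so they can be chained.
drop_missing <- function(xs) lapply(xs, function(x) x[!is.na(x)])
square_all   <- function(xs) lapply(xs, function(x) x^2)
sum_each     <- function(xs) lapply(xs, sum)

inputs <- list(a = c(1, 2, NA), b = c(3, 4))

# Pipeline of small functions (the native pipe needs R >= 4.1)
totals <- inputs |> drop_missing() |> square_all() |> sum_each()
totals$a  # 1 + 4 = 5
totals$b  # 9 + 16 = 25

# Reduce() folds a list into a single value by applying a
# two-argument function repeatedly.
grand_total <- Reduce(`+`, totals)
grand_total  # 5 + 25 = 30

# A recursive function calls itself until it hits a base case.
factorial_rec <- function(n) {
  if (n <= 1) return(1)
  n * factorial_rec(n - 1)
}
factorial_rec(5)  # 120
```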
data frame
data frame = a special type of list of atomic vectors
can use lapply
unlike a list, the elements of a data frame must be of the same length
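A quick check of these claims in base R, using the built-in mtcars dataset. The column selection is arbitrary.

```r
data(mtcars)

# A data frame is a list under the hood
is.list(mtcars)
typeof(mtcars)

# so lapply() iterates over its columns
col_means <- lapply(mtcars[c("mpg", "wt")], mean)
str(col_means)

# Unlike a plain list, columns must have equal length:
# building a data frame from vectors of length 3 and 2 fails.
ragged <- tryCatch(
  data.frame(x = 1:3, y = 1:2),
  error = function(e) conditionMessage(e)
)
ragged  # holds the error message, not a data frame
```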
dplyr
group_nest() = creates a list column of data frames, one per group
helpful packages
purrr: map(), reduce()
withr
Literate programming
A mix of code and prose
Packaging code
goal is not to publish on CRAN
{fusen}
if you have written an Rmd, you have almost written a package
need to give adequate names to code elements
{roxygen2}
#' these comments are automatically built into documentation
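For illustration, a function documented with roxygen2-style comments might look like this. The function itself is a made-up example; the tags (@param, @return, @examples, @export) are standard roxygen2 tags.

```r
#' Convert a temperature from Celsius to Fahrenheit
#'
#' @param celsius Numeric vector of temperatures in degrees Celsius.
#' @return Numeric vector of temperatures in degrees Fahrenheit.
#' @examples
#' celsius_to_fahrenheit(100)
#' @export
celsius_to_fahrenheit <- function(celsius) {
  celsius * 9 / 5 + 32
}
```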
“.” = right here
{testthat}
Freezing packages
{renv}
reproducible environments
creates a per project library (isolate from main default library)
file = renv.lock (json file listing all packages and dependencies)
- “Docker Blueprint” lists version of R used to record packages
init = renv::init()
update = renv::snapshot()
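A hedged sketch of the init/snapshot steps. `freeze_project()` is a hypothetical wrapper, written as a function so that nothing touches your library until you call it from inside a project directory.

```r
# Hypothetical helper: wraps the usual renv steps in one function so
# that nothing runs until it is called inside a project directory.
freeze_project <- function(first_time = FALSE) {
  if (!requireNamespace("renv", quietly = TRUE)) {
    stop("Install renv first: install.packages('renv')")
  }
  if (first_time) {
    renv::init()      # create the per-project library and renv.lock
  } else {
    renv::snapshot()  # record the currently used package versions
  }
}

# On another machine, after cloning the project:
# renv::restore()     # reinstall the versions listed in renv.lock
```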
.Rprofile = file that gets read by R automatically at startup
- System wide or per project
You can have as many .gitignore files as necessary
You need to install the right version of R yourself
include dataset - need package to already be built
Testing
Unit testing = written while developing and executed while developing
Assertive programming = checks executed at runtime
Test driven - use when need to write a function but don’t know where to start
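To make the distinction concrete, here is a sketch with one runtime assertion and one unit test. `normalize()` is an invented example function; the testthat calls are guarded in case the package is not installed.

```r
# Function under test (hypothetical example)
normalize <- function(x) {
  # Assertive programming: these checks run every time the function
  # is called, i.e. at runtime
  stopifnot(is.numeric(x), length(x) > 0, !anyNA(x))
  (x - min(x)) / (max(x) - min(x))
}

# Unit test: run while developing, not at runtime.
# Guarded so the script still runs if testthat is not installed.
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("normalize() rescales to [0, 1]", {
    testthat::expect_equal(normalize(c(0, 5, 10)), c(0, 0.5, 1))
    testthat::expect_error(normalize("a"))
  })
}
```

In a test-driven workflow the test_that() block would be written first, and normalize() filled in until the test passes.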
Docker
Build automation with {targets}
“recipe”
prebuilt R images from the Rocker project. R ready Docker images
Write Dockerfile -> build an “image” -> run “containers”
{renv} takes care of installing the right R packages but not system-level dependencies
R -e = runs R code non-interactively
build process takes some time
Docker images can only be updated at build time, not run time
Volume = a shared folder between Docker container and the host machine
RUN = at build time
CMD = at run time
“Dockerize pipelines” or “dockerize the dev environment and use for many pipelines”
Use same version of R on your computer and Docker
Use same package library on your computer and Docker with renv.lock
Unlimited public repos on Docker Hub
if you work in research but cannot push the data to github, you could always work on the code and the infrastructure using synthetic data
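One possible way to build such a stand-in dataset in base R. The column names and distributions are pure assumptions and should be adapted to mimic the structure of the real data.

```r
# Fixed seed so the synthetic data set is the same on every run
set.seed(123)
n <- 100

# Hypothetical schema mimicking the confidential data
synthetic <- data.frame(
  id     = seq_len(n),
  age    = sample(18:90, n, replace = TRUE),
  group  = sample(c("control", "treatment"), n, replace = TRUE),
  income = round(rlnorm(n, meanlog = 10, sdlog = 0.5))
)

str(synthetic)
```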