Reproducible Data Analysis Workflow

Work in Progress Updated Feb 6, 2026

Notes from https://raps-with-r.dev/

Quotes

“Any human interaction with your analysis is a source of errors”

“copying and pasting is forbidden”

Motivation

How can you put in tests to make sure that updates to the data go smoothly?

How easy is it to reuse code for another project?

For reusability, nothing beats structuring your code as functions and, ideally, packaging them.

Main Tools/Skills

Version control with git

Functional programming

Literate programming

Docker = Snapshot/Freeze computational environment

Fusen = turn Rmds into a package

testthat = for unit testing

targets = pipeline package

Main Files for Project

two .Rmd files

save_data.R

analysis.R

R

domain specific (statistical)
interpreted = results are returned immediately when code is executed in the console

this is different from compiled languages (C), which require code to be compiled into binaries before execution

text editor

RStudio

packages/library

extend base R capabilities

library = collection of packages installed

dplyr = package for data manipulation

purrr = package for functional programming

stringr = package for manipulating strings

readxl = reads Excel workbooks

janitor = common cleaning tasks, such as renaming the columns of a data frame to “snake case” (ex: word_word_word)

objects

datasets, plots, models = objects

saved in the global environment

see list of global objects with ls() typed into console

do not save workspace

Minimal Reproducible Example (MRE)

should be able to run code by copying/pasting into R console

reprex package can help write one

sessionInfo()

built in datasets in R (mtcars)
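
A minimal sketch of what such an example might look like, assuming the question is about fitting a linear model. It uses only base R and the built-in mtcars dataset, so it can be pasted into any R console:

```r
# Minimal reproducible example: base R plus a built-in dataset only
data(mtcars)

# The smallest piece of code that shows the behaviour in question
fit <- lm(mpg ~ wt, data = mtcars)
coefs <- coef(fit)
print(coefs)

# Report the R version and loaded packages alongside the example
print(sessionInfo())
```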

Quarto

qmd/rmd is a flavor of markdown

include = FALSE (code runs, but neither code nor output is shown)
echo = FALSE (code runs and output is shown, but the code is hidden)
eval = FALSE (code is not run)

^[footnote text] = inline footnote

results = "asis" (the chunk's output is already valid markdown, so it is passed through as-is instead of being treated as code output)

plots need child documents

helpful packages

flextable

modelsummary

Version Control with git

Keep track of changes to text files

Keep commit messages as short and explicit as possible

short, clear messages, do a commit after each change

Not a good idea to code all day then push one, single big fat commit

push = upload

pull = download

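git log = lists commits, each with its message and a unique identifier (hash)

git blame = shows who last changed each line of a file

git revert <hash> = creates a new commit that undoes the commit with that hash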

SSH public/private key for GitHub

Main tabs on the GitHub website = Issues, Pull Requests, Actions, and Settings

Branches = copy of the current project; if an experiment does not work, the branch can be discarded

“Trunk-based development”

GitHub Actions = “continuous integration service”

Do not put Git-tracked projects in Dropbox, OneDrive… They do not deal gracefully with conflicts

trunk-based coding

Small branches, merge everyday

Merge as soon as possible even if feature wanted is not done yet

simplified trunk based development

new branch for new feature
at end of day at latest, branches need to all get merged
conflicts need to be taken care of at that point
main branch always contains working, production-grade code
to enforce discipline, it may be worth making pull requests mandatory for merging back to the trunk and requiring review

how to unstage files from git

git rm --cached <files>

example code/how to create and merge branch

git checkout -b "branch-name"

git add .

git commit -m "fixed #1"

The #1 refers to the number of the issue in the repo. GitHub closes the issue automatically once the commit reaches the default branch (for example, when the pull request is merged)

git push origin branch-name

“Compare and pull request” on GitHub

Review

Merge pull request

delete branch

Git = tracks changes line by line

Dropbox/OneDrive = track changes file by file

Functional Programming

R is a functional and object-oriented programming language

Loops that require you to interact with the global environment = bad

functional programming is a programming paradigm that relies exclusively on the evaluation of functions to achieve the desired result

two main elements are functions and lists

state of your program

f <- function(name) { print(paste0(name, " likes lasagna")) }

f("Bruno")

when running a function, the state of the program doesn’t get altered.

avoid functions that change state

Predictable functions

Referentially transparent - does not use any variable that is not also one of its inputs

pure function - does not interact with the global environment: it does not write to or require anything from it
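
A small sketch contrasting the two styles. The function names and the tax example are hypothetical, chosen only to illustrate the definitions above:

```r
# Impure: reads `rate` from the global environment and overwrites `total` there
rate <- 0.2
total <- 0
add_tax_impure <- function(price) {
  total <<- price * (1 + rate)  # side effect outside the function
  invisible(NULL)
}

# Pure and referentially transparent: everything it needs is an argument,
# and the result is returned instead of being written somewhere else
add_tax <- function(price, rate) {
  price * (1 + rate)
}

add_tax_impure(100)                 # silently changes `total`
result <- add_tax(100, rate = 0.2)  # same inputs always give the same output
```

The impure version gives a different answer whenever someone changes `rate` elsewhere in the session, which is what makes it hard to predict and test.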

functions = first-class objects

can define a function that takes another function as input

if you don’t know how many arguments a function that you are wrapping uses, you can use “…” as a placeholder for any additional arguments

?dots in R console to learn more
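
A sketch of a function that takes another function as input and forwards extra arguments through the dots. `summarise_clean` is a hypothetical helper, not something from the book:

```r
# Hypothetical wrapper: applies any summary function `f` after dropping NAs.
# `...` forwards whatever extra arguments the caller supplies to `f`.
summarise_clean <- function(x, f, ...) {
  f(x[!is.na(x)], ...)
}

values <- c(1, 2, 3, 4, 100, NA)

plain_mean   <- summarise_clean(values, mean)                    # no extra arguments
trimmed_mean <- summarise_clean(values, mean, trim = 0.2)        # `trim` goes to mean()
upper_q      <- summarise_clean(values, quantile, probs = 0.75)  # `probs` goes to quantile()
```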

Function Factories - functions that return functions
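
A minimal sketch of a function factory. `make_power` is an illustrative name:

```r
# Function factory: returns a new function that remembers `exponent`
make_power <- function(exponent) {
  force(exponent)  # evaluate the argument now rather than lazily later
  function(x) x^exponent
}

square <- make_power(2)
cube   <- make_power(3)

square(4)
cube(2)
```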

make arguments optional with NULL
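
One way this could look, assuming a hypothetical `mean_of` helper where rounding is the optional behaviour:

```r
# `digits` is optional: NULL means "the caller did not ask for rounding"
mean_of <- function(x, digits = NULL) {
  result <- mean(x)
  if (!is.null(digits)) {
    result <- round(result, digits)
  }
  result
}

unrounded <- mean_of(c(1, 2, 2))
rounded   <- mean_of(c(1, 2, 2), digits = 1)
```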

lists

to make functions work together by sharing a common interface = lists

data(mtcars) = list

model = list

ggplot = list

“lists all the way down”

lists can hold different types of objects, can even hold other lists

reduce()

lapply()
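
A short sketch of lists as a common interface, using base R only. Note that it uses base `Reduce()`, which plays the same role as `reduce()` from purrr:

```r
# A list can mix types, and can hold other lists
stuff <- list(numbers = 1:3, label = "cars", nested = list(flag = TRUE))

# lapply() applies a function to each element and always returns a list
pieces <- list(1:3, 4:6, 7:9)
sums <- lapply(pieces, sum)

# Reduce() folds a list into a single value by combining elements pairwise
grand_total <- Reduce(`+`, sums)
```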

purrr = helps abstract loops away

before diving into loops, check if functions you are using are vectorized

or if there is a simple way to express that computation in terms of matrix multiplication
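
A sketch of both checks on made-up numbers: first a loop that a vectorized function makes unnecessary, then a weighted sum written as a matrix product:

```r
x <- c(1, 4, 9, 16)

# Loop version: fills a pre-allocated result one element at a time
roots_loop <- numeric(length(x))
for (i in seq_along(x)) {
  roots_loop[i] <- sqrt(x[i])
}

# sqrt() is vectorized, so the loop is unnecessary
roots_vec <- sqrt(x)

# A weighted sum written as a matrix product instead of a loop
w <- c(0.1, 0.2, 0.3, 0.4)
weighted <- drop(x %*% w)
```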

recursive function - a function that calls itself
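
The classic illustration is a factorial, sketched here with a hypothetical `fact` function:

```r
# Recursive factorial: the function calls itself on a smaller input
# until it reaches the base case
fact <- function(n) {
  if (n <= 1) {
    return(1)
  }
  n * fact(n - 1)
}

fact(5)
```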

write functions that do one thing and do it well. write functions that work together. write functions that handle lists because that is a universal interface

a |> f() |> g() |> h()

is better than big_function(a)

can reason about smaller functions more easily. “eat elephant one bite at a time”
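
A sketch of the same idea with three small, hypothetical functions chained by the native pipe (which assumes R 4.1 or later):

```r
# Three small steps, each easy to reason about and test on its own
drop_missing <- function(x) x[!is.na(x)]
rescale      <- function(x) (x - min(x)) / (max(x) - min(x))
describe     <- function(x) c(mean = mean(x), max = max(x))

a <- c(2, NA, 4, 6)

# Read left to right: each function receives the previous result
result <- a |> drop_missing() |> rescale() |> describe()
```

If `result` looks wrong, each step can be run and inspected separately, which is much harder inside one large function.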

data frame

data frame = a special type of list of atomic vectors

can use lapply

unlike a list, the elements of a data frame must be of the same length
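
These properties can be checked directly on the built-in mtcars dataset. A minimal sketch:

```r
data(mtcars)

# A data frame is a list whose elements are its columns
is.list(mtcars)

# So lapply() walks over columns, just as it would over list elements
col_means <- lapply(mtcars[c("mpg", "wt")], mean)

# Every column has the same length (the number of rows)
col_lengths <- vapply(mtcars, length, integer(1))

# A plain list has no such constraint
ragged <- list(a = 1:2, b = 1:5)
```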

dplyr

group_nest() = nests grouped data into a list column of data frames

helpful packages

purrr: map(), reduce()

withr

Literate programming

A mix of code and prose

Packaging code

goal is not to publish on CRAN

{fusen}

if you have written an Rmd, you have almost written a package

need to give adequate names to code elements

{roxygen2}

#' = these comments are automatically built into documentation
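
A sketch of what such comments look like on a hypothetical function. The tags used (`@param`, `@return`, `@examples`, `@export`) are standard roxygen2 tags; the function itself is made up for illustration:

```r
#' Convert a temperature from Celsius to Fahrenheit
#'
#' @param celsius A numeric vector of temperatures in degrees Celsius.
#' @return A numeric vector of temperatures in degrees Fahrenheit.
#' @examples
#' celsius_to_fahrenheit(100)
#' @export
celsius_to_fahrenheit <- function(celsius) {
  celsius * 9 / 5 + 32
}
```

Inside a package, running `devtools::document()` turns these comments into the help page for the function.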

“.” = the current directory (“right here”)

{testthat}

Freezing packages

{renv}

reproducible environments

creates a per project library (isolate from main default library)
file = renv.lock (json file listing all packages and dependencies)
- renv.lock also lists the version of R used to record the packages, so it can serve as a “Docker blueprint”

init = renv::init()

update = renv::snapshot()

.Rprofile = file that gets read by R automatically at startup

System wide or per project

You can have as many .gitignore files as necessary

You need to install the right version of R yourself

include dataset - need package to already be built

Testing

Unit tests = written and executed while developing

Assertive programming = checks executed at runtime
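
A sketch of runtime checks using base R's `stopifnot()` with named conditions (this form assumes R 4.0 or later). `growth_rate` is a hypothetical helper:

```r
# The checks run every time the function is called
growth_rate <- function(new, old) {
  stopifnot(
    "`new` and `old` must be numeric" = is.numeric(new) && is.numeric(old),
    "`old` must be strictly positive" = all(old > 0)
  )
  (new - old) / old
}

ok <- growth_rate(110, 100)

# Bad input fails loudly at runtime instead of silently returning Inf
bad <- tryCatch(growth_rate(110, 0), error = function(e) conditionMessage(e))
```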

Test-driven development - useful when you need to write a function but don’t know where to start
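
A sketch of how this might go with {testthat}, assuming the package is installed. `to_snake_case` is a hypothetical function: the expectations are drafted first to pin down the behaviour, then the simplest implementation that satisfies them is written:

```r
library(testthat)

# Written second: the simplest implementation that satisfies the tests below
to_snake_case <- function(x) {
  gsub(" +", "_", tolower(x))
}

# Written first: the tests describe what the function should do
test_that("to_snake_case() lowercases and replaces spaces", {
  expect_equal(to_snake_case("Total Sales"), "total_sales")
  expect_equal(to_snake_case("already_snake"), "already_snake")
})
```

In a package, the test would normally live in its own file under tests/testthat/ rather than next to the function.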

Docker

Build automation with {targets}

“recipe”

Rocker project = prebuilt, R-ready Docker images

Write Dockerfile -> build an “image” -> run “containers”

{renv} takes care of installing the right R packages but not system-level dependencies

R -e = runs an R expression non-interactively

build process takes some time

Docker images can only be updated at build time, not run time

Volume = a shared folder between Docker container and the host machine

RUN = at build time

CMD = at run time

“Dockerize pipelines” or “dockerize the dev environment and use for many pipelines”

Use same version of R on your computer and Docker

Use same package library on your computer and Docker with renv.lock

Unlimited public repos on Docker Hub

If you work in research but cannot push the data to GitHub, you can still work on the code and the infrastructure using synthetic data
