Lab 02: Data and Reproducible Workflow

Week 5

Maghfira Ramadhani

Sep 17, 2025

Reproducible workflow

The perils of bad data cleaning

Published in the American Economic Review (2007).

Deschênes and Greenstone’s (DG) baseline climate measure (dd89_7000) has a value of zero degree days for 163 counties. If correct, this measure implies that temperatures do not exceed 8°C (46.4°F) in those counties during the growing season of April through September. Temperatures this low would seem implausible in any state, yet many of these counties are in warm southern states such as Texas.
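A check of this kind is easy to automate. The sketch below is a minimal illustration in Python (the course itself uses R), assuming a simplified degree-day definition: the sum over days of how far the daily mean temperature exceeds the 8°C base. The function names, county labels, and temperatures are hypothetical toy values, not the DG data or DG's exact formula.

```python
BASE_TEMP_C = 8.0  # base temperature quoted in the text above


def growing_degree_days(daily_mean_temps_c, base=BASE_TEMP_C):
    """Sum the degrees by which each daily mean temperature exceeds the base.

    Simplified definition, assumed for illustration only.
    """
    return sum(max(t - base, 0.0) for t in daily_mean_temps_c)


def flag_zero_degree_days(degree_days_by_county):
    """Return the counties whose degree-day total is exactly zero."""
    return sorted(c for c, dd in degree_days_by_county.items() if dd == 0)


# Hypothetical toy data: three daily mean temperatures per county.
toy_temps = {
    "county_a": [5.0, 7.5, 8.0],
    "county_b": [10.0, 25.0, 30.0],
}

dd = {county: growing_degree_days(temps) for county, temps in toy_temps.items()}
flagged = flag_zero_degree_days(dd)
print(dd)
print("Inspect before use:", flagged)
```

A zero is not automatically an error, but a list of flagged counties is a prompt to look at where they are on a map before running any regression.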

Contrary to the results in DG (2007), the corrected data suggest that an immediate shift to the projected end-of-the-century climate would reduce agricultural profits.

Another example

The study originally reported that “the intervention, compared with usual care, resulted in a fewer number of mean COPD-related hospitalizations and emergency department visits at 6 months per participant.”
In fact, there were more COPD-related hospitalizations and emergency department visits in the intervention group than in the control group.
The authors had mixed up the intervention and control groups when using “0/1” coding.
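One way to guard against this kind of mix-up is to attach explicit labels to the codes in one place and summarize by label, never by raw 0/1. The sketch below is a minimal Python illustration (the course itself uses R). The codebook, function name, and records are hypothetical toy values and are not taken from the COPD study.

```python
from statistics import mean

# Hypothetical codebook: the single place where 0/1 is given a meaning.
ARM_LABELS = {0: "control", 1: "intervention"}

# Hypothetical toy records: (arm_code, number_of_visits).
records = [(1, 2), (1, 3), (0, 1), (0, 0)]


def mean_by_arm(records, labels):
    """Average the outcome by labeled arm; fail loudly on unknown codes."""
    groups = {}
    for code, visits in records:
        if code not in labels:
            raise ValueError(f"Unknown arm code: {code!r}")
        groups.setdefault(labels[code], []).append(visits)
    return {arm: mean(values) for arm, values in groups.items()}


means = mean_by_arm(records, ARM_LABELS)
print(means)
```

Because every table downstream shows "intervention" and "control" instead of 1 and 0, a reversed codebook is a one-line fix and is far easier to spot in review.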

Transparency and reproducibility

Avoiding errors is only the first step. It’s also critical to make your work reproducible.

In the private sector, the benefits may be more obvious.

Your code has to work together with other people’s code.
Eventually, someone else will take over your code.

In academic research, it’s equally important.

To trust the results – many research findings fail to replicate.
To build on your work and collaborate with others.
Many journals now require a full “replication package” of data and code.
The push for transparency and reproducibility is known as the open science movement.

Reproducibility: Can someone else run your code and get the exact same results?

Replication: If another analyst attempts the same question, do they get the same answer?

Transparency: Can everyone see what choices you made and how you got your results?
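Reproducibility in the sense defined above often comes down to small details, such as fixing the seed of any random number generator. The sketch below is a minimal Python illustration (the course itself uses R); the function name, data, and seed value are hypothetical.

```python
import random
from statistics import mean


def bootstrap_mean(data, n_draws, seed):
    """Average of bootstrap sample means, using a generator seeded locally."""
    rng = random.Random(seed)  # local generator: no hidden global state
    draws = [mean(rng.choices(data, k=len(data))) for _ in range(n_draws)]
    return mean(draws)


# Hypothetical toy data.
data = [2.0, 4.0, 4.0, 5.0, 9.0]

first_run = bootstrap_mean(data, n_draws=200, seed=20250917)
second_run = bootstrap_mean(data, n_draws=200, seed=20250917)
print(first_run == second_run)  # same code, data, and seed give the same result
```

Without the fixed seed, two runs of the same script would generally report slightly different numbers, and nobody could reproduce the tables exactly.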

Reproducibility checklist

What does it mean for an analysis to be reproducible?

Near term goals:

✔️ Can the tables and figures be exactly reproduced from the code and data?

✔️ Does the code actually do what you think it does?

✔️ In addition to what was done, is it clear why it was done?

Long term goals:

✔️ Can the code be used for other data?

✔️ Can you extend the code to do other things?

Toolkit

Scriptability → R
Literate programming (code, narrative, output in one place) → Quarto
Version control → Git / GitHub

R and RStudio

R is a statistical programming language
RStudio is a convenient interface for R (an integrated development environment, IDE)

Source: Statistical Inference via Data Science

RStudio IDE

Quarto

Fully reproducible reports – the analysis is run from the beginning each time you render
Code goes in chunks and narrative goes outside of chunks
A visual editor makes editing a document feel similar to using a word processor (Google Docs, Word, Pages, etc.)

How will we use Quarto?

Every application exercise and assignment is written in a Quarto document
You’ll have a template Quarto document to start with
The amount of scaffolding in the template will decrease over the semester

Version control with git and GitHub

What is versioning?

Versioning, with human-readable messages

Why do we need version control?

Version control provides a clear record of how the analysis methods evolved. This makes the analysis auditable and thus more trustworthy and reliable (Ostblom and Timbers 2022).

git and GitHub

git is a version control system, similar to the “Track Changes” feature in Microsoft Word.
GitHub is the home for your git-based projects on the internet (like Dropbox but much better).
There are a lot of git commands and very few people know them all. 99% of the time you will use git to add, commit, push, and pull.

Caveat

Image from xkcd (source)

Ostblom, Joel, and Tiffany Timbers. 2022. “Opinionated Practices for Teaching Reproducibility: Motivation, Guided Instruction and Practice.” Journal of Statistics and Data Science Education 30 (3): 241–50. https://doi.org/10.1080/26939169.2022.2074922.