• 4. Reproducibility summary


Chapter Summary

Reproducibility is a cornerstone of good science, ensuring that research is transparent, reliable, and easy to build upon. This chapter covered best practices for collecting, organizing, and analyzing data in a reproducible manner.

Before collecting data, establish clear rules for measurement, naming conventions, and data entry to maintain consistency. Field sheets should be well structured and tidy. In addition, creating a data dictionary and a README document ensures that variables and project details are well documented. Finally, storing data and scripts in public repositories supports transparency and open science.

In analysis, using an R Project helps keep files organized, and loading data with relative paths avoids problems that arise when files sit in different locations on different computers. Writing well-structured R scripts with clear comments makes workflows understandable and repeatable. By prioritizing reproducibility, you not only strengthen the integrity of your work but also make future analyses smoother for yourself and others.
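As a rough illustration of the analysis habits summarized above, here is a minimal sketch of a script that lives in the top folder of an R Project and loads data with a relative path. The folder name data, the file name example_weights.csv, and the column names are hypothetical and chosen only for this sketch; the script writes its own small example file so that it runs as-is.

```r
# Hypothetical script saved in the top folder of an R Project
library(readr)

# For this sketch only: create a small example file so the script runs as-is.
# In a real project the raw data would already sit in the data/ folder.
dir.create("data", showWarnings = FALSE)
write_csv(
  data.frame(id = c("a", "b"), weight_g = c(104, 210)),
  "data/example_weights.csv"
)

# Relative path: resolved from the project folder, so it does not depend
# on where the project lives on any one person's computer
weights <- read_csv("data/example_weights.csv", show_col_types = FALSE)

# Record the R version and loaded packages alongside the results
sessionInfo()
```

Because the path starts at the project folder rather than at the root of one computer, the same line should work for a collaborator who opens the project elsewhere.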

Chatbot tutor

Please interact with this custom chatbot (link here) that I made to help you with this chapter. I suggest at least ten back-and-forth exchanges to ramp up, then stopping when you feel you have gotten what you needed from it.

Practice Questions

Try these questions!

Q1) What is the biggest mistake in the table below?

- ID should be lower case.
- It's perfect, change nothing.
- The column name weight is not sufficiently descriptive; it should include the units.
- date_collected_empty_means_same_as_above is too wordy; replace it with date.
- Values for date_collected_empty_means_same_as_above are implied.
- date is in Year-Month-Day format, while Month-Day-Year format is preferred.

ID	weight	date_collected_empty_means_same_as_above
1-A1	104	2024-03-01
1-1B	210
3-7	150
2-B	176	2024-03-15
1-A5	110

While some of these (like the long name for the date column) are clearly shortcomings, the biggest mistake is the implied values: spreadsheets should never leave values implied.
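If a sheet like the one in Q1 has already been collected, the implied dates can be made explicit in R. The sketch below is one possible repair, not a method from this chapter: it uses fill() from the tidyr package, shortens the column name to date_collected for readability, and assumes that every blank cell really does mean "same as above". Entering every value at collection time remains the safer practice.

```r
library(tidyr)

# The Q1 table, with the blank date cells entered as NA
# (column name shortened to date_collected for this sketch)
birds <- data.frame(
  ID = c("1-A1", "1-1B", "3-7", "2-B", "1-A5"),
  weight = c(104, 210, 150, 176, 110),
  date_collected = c("2024-03-01", NA, NA, "2024-03-15", NA)
)

# Copy each date down into the empty cells below it
birds_filled <- fill(birds, date_collected, .direction = "down")

birds_filled$date_collected
# "2024-03-01" "2024-03-01" "2024-03-01" "2024-03-15" "2024-03-15"
```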

Q2) What would you expect in a data dictionary accompanying the table above? (select all correct)

- The units for weight.
- A statement that date is in Year-Month-Day format.
- A statement explaining that, in the date collected column, empty means same as above.

Q3) How do you read data from an Excel sheet called raw_data, in an Excel file named bird_data.xlsx, located inside the R project you are working in?

- You cannot load Excel files into R. You must save the sheet as a csv and read it in with read_csv().
- Assuming the readxl package is installed and loaded, type read_xlsx(path = "bird_data.xlsx", sheet = "raw_data").
- While you can read Excel files into R, you cannot specify the sheet.

Q4) What should you do to make code reproducible? (pick the best answer)

- Specify the working directory with setwd().
- Show the packages installed with install.packages().
- Restart R once you're done, and rerun your script to see if it works.

Glossary of Terms

Absolute Path – A file location specified from the root directory (e.g., /Users/username/Documents/data.csv), which can cause issues when sharing code across different computers. Using relative paths instead is recommended.

Data Dictionary – A structured document that defines each variable in a dataset, including its name, description, units, and expected values. It helps ensure data clarity and consistency.

Data Validation – A method for reducing errors in data entry by restricting input values (e.g., dropdown lists for categorical variables, ranges for numerical values).

Field Sheet – A structured data collection form used in the field or lab, designed for clarity and ease of data entry.

Metadata – Additional information describing a dataset, such as when, where, and how data were collected, the units of measurement, and details about the variables.

R Project – A self-contained environment in RStudio that organizes files, code, and data in a structured way, making analysis more reproducible.

Raw Data – The original, unmodified data collected from an experiment or survey. It should always be preserved in its original form, with any modifications performed in separate scripts.

README File – A text file that provides an overview of a dataset, including project details, data sources, file descriptions, and instructions for use.

Reproducibility – The ability to re-run an analysis and obtain the same results using the same data and code. This requires careful documentation, structured data storage, and clear coding practices.

Relative Path – A file path that specifies a location relative to the current working directory (e.g., data/my_file.csv), making it easier to share and reproduce analyses.

Tidy Data – A dataset format where each variable has its own column, each observation has its own row, and each value is in its own cell.
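To make the Data Dictionary and Data Validation entries above more concrete, here is a minimal sketch in base R. The variable names, descriptions, and units (including grams for weight) are hypothetical and chosen only for illustration. The sketch stores the dictionary as a small table, checks that the dataset and the dictionary describe the same columns, and then applies one simple range rule.

```r
# Hypothetical data dictionary: one row per variable in the dataset
data_dictionary <- data.frame(
  variable = c("id", "weight_g", "date_collected"),
  description = c(
    "Unique sample identifier",
    "Body weight",
    "Date the sample was collected"
  ),
  units_or_format = c(NA, "grams (assumed for this sketch)", "Year-Month-Day"),
  stringsAsFactors = FALSE
)

# Hypothetical dataset that the dictionary is meant to describe
samples <- data.frame(
  id = c("1-A1", "2-B"),
  weight_g = c(104, 176),
  date_collected = as.Date(c("2024-03-01", "2024-03-15"))
)

# Every column should be documented, and every documented variable present
undocumented <- setdiff(names(samples), data_dictionary$variable)
missing_cols <- setdiff(data_dictionary$variable, names(samples))
stopifnot(length(undocumented) == 0, length(missing_cols) == 0)

# A simple validation rule: weights must be present and positive
weight_ok <- !is.na(samples$weight_g) & samples$weight_g > 0
stopifnot(all(weight_ok))
```

Checks like these run after data entry; restricting inputs in the spreadsheet itself (for example with dropdown lists) catches errors earlier.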

Key R functions

clean_names(data) – Standardizes column names (from the janitor package).
drop_na(data) – Removes rows with missing values (from the tidyr package).
read_csv("file.csv") – Reads a CSV file into R as a tibble (from the readr package).
read_xlsx("file.xlsx", sheet = "sheetname") – Reads an Excel sheet into R as a tibble (from the readxl package).
rename(data, new_name = old_name) – Renames columns in a dataset (from the dplyr package).
pivot_longer(data, cols, names_to, values_to) – Converts wide-format data to long format (from the tidyr package).
sessionInfo() – Displays session details, including loaded packages (useful for reproducibility).
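The sketch below strings several of the functions listed above together on a small in-memory table, so no file is needed. The data, the column names, and the unit suffix in mass_g are hypothetical. Each step is written as a separate assignment to mirror the function(data, ...) forms in the list.

```r
library(janitor)
library(dplyr)
library(tidyr)

# Hypothetical wide-format data with untidy column names
raw <- data.frame(
  `Bird ID` = c("a1", "b2", "c3"),
  `Mass 2023` = c(104, 210, NA),
  `Mass 2024` = c(110, 205, 150),
  check.names = FALSE
)

# Standardize names: "Bird ID" becomes bird_id, "Mass 2023" becomes mass_2023
cleaned <- clean_names(raw)

# Rename a column using new_name = old_name
renamed <- rename(cleaned, bird = bird_id)

# Wide to long: one row per bird per year
long_all <- pivot_longer(
  renamed,
  cols = starts_with("mass_"),
  names_to = "year",
  names_prefix = "mass_",
  values_to = "mass_g"
)

# Drop the row where no mass was recorded
long <- drop_na(long_all, mass_g)
```

Whether dropping rows with missing values is appropriate depends on the analysis; here it only shows what drop_na() does.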

R Packages Introduced

readr – Provides fast and flexible functions for reading tabular data (here we revisited read_csv() for CSV files).
dplyr – A grammar for data manipulation. Here we introduced the rename(data, new_name = old_name) function to give columns better names.
tidyr – Helps tidy messy data. Here we introduced pivot_longer() to make wide data long.
janitor – Cleans and standardizes data, including clean_names() (https://sfirke.github.io/janitor/reference/clean_names.html) for formatting column names.

Additional resources

R Recipes:
- Read a .csv: Learn how to read a csv file into R as a tibble.
- Read an Excel file: Learn how to read an Excel file into R as a tibble.
- Obey R’s naming rules: You want to give a valid name to an object in R.
- Rename columns in a table: You want to rename one or more columns in a data frame.

Other web resources:

Data Organization in Spreadsheets (Broman & Woo, 2018).
Tidy Data (Wickham, 2014).
Ten Simple Rules for Reproducible Computational Research (Sandve et al., 2013).
NYT article: For Big-Data Scientists, 'Janitor Work' Is Key Hurdle to Insights.
Style guide: Chapter 9 of Data management in large-scale education research by Lewis (2024). Includes sections on general good practices, file naming, and variable naming.
Data Storage and security: Chapter 13 of Data management in large-scale education research by Lewis (2024).

Videos:

Data integrity: By Kate Laskowski, who was the victim of data fabrication by her collaborator (and my former roommate) Jonathan Pruitt.
Tidying data with pivot_longer (from Stat454).