4. Reproducibility summary

Links to: Summary. Chatbot tutor. Questions. Glossary. R functions. R packages. More resources.

Chapter Summary

Reproducibility is a cornerstone of good science, ensuring that research is transparent, reliable, and easy to build upon. This chapter covered best practices for collecting, organizing, and analyzing data in a reproducible manner.

Before collecting data, establish clear rules for measurement, naming conventions, and data entry to maintain consistency. Field sheets should be well-structured and tidy. Additionally, creating a data dictionary and a README document ensures that variables and project details are well-documented. Finally, storing data and scripts in public repositories supports transparency and open science.
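As a rough illustration of what a data dictionary might look like, the sketch below builds one as a small table in base R and saves it as a CSV. The variable names, units, and expected values are hypothetical and only meant to show the shape of such a document.

```r
# A minimal data dictionary sketch.
# All variable names, units, and expected values are hypothetical.
data_dictionary <- data.frame(
  variable        = c("id", "weight_g", "date_collected"),
  description     = c("Unique identifier for each bird",
                      "Body mass at capture",
                      "Date the bird was measured"),
  units           = c(NA, "grams", "YYYY-MM-DD"),
  expected_values = c("Text label", "Positive number", "Valid calendar date"),
  stringsAsFactors = FALSE
)

# Save the dictionary next to the data so it travels with the project.
# tempdir() is used here only so the sketch runs anywhere.
dictionary_path <- file.path(tempdir(), "data_dictionary.csv")
write.csv(data_dictionary, dictionary_path, row.names = FALSE)
```

In a real project the file would live in the project folder alongside the README rather than in a temporary directory.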

In analysis, using an R Project helps keep files organized, and loading data with relative paths avoids location issues. Writing well-structured R scripts with clear comments makes workflows understandable and repeatable. By prioritizing reproducibility, you not only strengthen the integrity of your work but also make future analyses smoother for yourself and others.
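To make the idea of relative paths concrete, here is a small sketch that imitates a project folder and then loads a file from it with read_csv(). The folder name my_project, the file bird_data.csv, and its contents are all hypothetical; in practice, opening the .Rproj file sets the working directory for you and no setwd() call is needed.

```r
library(readr)

# Simulate a project folder (hypothetical layout) so this sketch runs anywhere.
project_dir <- file.path(tempdir(), "my_project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
write_csv(
  data.frame(id = c("1-A1", "2-B"), weight = c(104, 176)),
  file.path(project_dir, "data", "bird_data.csv")
)

# Opening the .Rproj file in RStudio sets the working directory to the
# project folder; setwd() here only imitates that step.
old_wd <- setwd(project_dir)

# Relative path: resolved from the project folder, so the same line
# works on any computer that has a copy of the project.
birds <- read_csv("data/bird_data.csv", show_col_types = FALSE)

setwd(old_wd)  # restore the previous working directory
```

Because the path never mentions a user name or drive, the script does not need editing when the project folder is moved or shared.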

Chatbot tutor

Please interact with this custom chatbot (link here) that I made to help you with this chapter. I suggest at least ten back-and-forth exchanges to ramp up, then stopping when you feel you have gotten what you need from it.

Practice Questions

Try these questions!

Q1) What is the biggest mistake in the table below?
ID      weight   date_collected_empty_means_same_as_above
1-A1    104      2024-03-01
1-1B    210
3-7     150
2-B     176      2024-03-15
1-A5    110

While some of these choices (like the long name for the date column) are clear shortcomings, the biggest mistake is the blank cells: spreadsheets should never leave values implied.
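If you inherit a sheet like this one, the implied values can be written out explicitly in R. The sketch below uses fill() from tidyr, which is not one of the functions introduced in this chapter, and it assumes the blank cells were read in as NA. It is one possible repair, not a substitute for entering complete values in the first place.

```r
library(tidyr)

# The table from Q1, with blank date cells represented as NA (an assumption
# about how the sheet would be read into R).
birds <- data.frame(
  ID     = c("1-A1", "1-1B", "3-7", "2-B", "1-A5"),
  weight = c(104, 210, 150, 176, 110),
  date_collected_empty_means_same_as_above =
    c("2024-03-01", NA, NA, "2024-03-15", NA)
)

# Copy the most recent non-missing date downward into each empty cell.
birds_filled <- fill(
  birds,
  date_collected_empty_means_same_as_above,
  .direction = "down"
)
```

After this step every row carries its own date, so the rows can be sorted or filtered without losing information.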



Q2) What would you expect in a data dictionary accompanying the table above? (select all correct)

Q3) How do you read data from an Excel sheet called raw_data in an Excel file named bird_data.xlsx, located inside the R project you are working in?

Q4) What should you do to make code reproducible? (pick the best answer)

Glossary of Terms

Absolute Path – A file location specified from the root directory (e.g., /Users/username/Documents/data.csv), which can cause issues when sharing code across different computers. Using relative paths instead is recommended.

Data Dictionary – A structured document that defines each variable in a dataset, including its name, description, units, and expected values. It helps ensure data clarity and consistency.

Data Validation – A method for reducing errors in data entry by restricting input values (e.g., dropdown lists for categorical variables, ranges for numerical values).

Field Sheet – A structured data collection form used in the field or lab, designed for clarity and ease of data entry.

Metadata – Additional information describing a dataset, such as when, where, and how data were collected, the units of measurement, and details about the variables.

R Project – A self-contained environment in RStudio that organizes files, code, and data in a structured way, making analysis more reproducible.

Raw Data – The original, unmodified data collected from an experiment or survey. It should always be preserved in its original form, with any modifications performed in separate scripts.

README File – A text file that provides an overview of a dataset, including project details, data sources, file descriptions, and instructions for use.

Reproducibility – The ability to re-run an analysis and obtain the same results using the same data and code. This requires careful documentation, structured data storage, and clear coding practices.

Relative Path – A file path that specifies a location relative to the current working directory (e.g., data/my_file.csv), making it easier to share and reproduce analyses.

Tidy Data – A dataset format where each variable has its own column, each observation has its own row, and each value is in its own cell.
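As an illustration of the tidy data definition, the sketch below reshapes a small wide table into tidy form with pivot_longer(). The table, its column names, and its values are hypothetical.

```r
library(tidyr)

# Hypothetical wide data: the month is hidden in the column names,
# so "month" is a variable without its own column.
wide <- data.frame(
  id           = c("A1", "A2"),
  weight_march = c(104, 210),
  weight_april = c(110, 205)
)

# Make the data long: one row per bird per month, one column per variable.
long <- pivot_longer(
  wide,
  cols         = c(weight_march, weight_april),
  names_to     = "month",
  names_prefix = "weight_",
  values_to    = "weight"
)
```

The result has three columns (id, month, weight) and one row for each observation, which matches the definition above.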


Key R functions


R Packages Introduced

  • readr – Provides fast and flexible functions for reading tabular data (here we revisited read_csv() for CSV files).

  • dplyr – A grammar for data manipulation. Here we introduced the rename(data, new_name = old_name) function to give columns better names.

  • tidyr – Helps tidy messy data. Here we introduced pivot_longer() to make wide data long.

  • janitor – Cleans and standardizes data, including clean_names() (https://sfirke.github.io/janitor/reference/clean_names.html) for formatting column names.
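The sketch below shows how clean_names() and rename() might be combined on a small table with awkward column names. The data and the column names are hypothetical.

```r
library(dplyr)
library(janitor)

# Hypothetical data with column names that are awkward to type in R.
messy <- data.frame(
  `Bird ID`    = c("1-A1", "2-B"),
  `Weight (g)` = c(104, 176),
  check.names  = FALSE  # keep the messy names exactly as written
)

# clean_names() converts names to lowercase snake_case.
cleaned <- clean_names(messy)

# rename() uses the pattern new_name = old_name.
renamed <- rename(cleaned, id = bird_id)
```

Running clean_names() first handles spaces and punctuation in bulk, leaving rename() for the few columns that still need a better name.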

Additional resources

R Recipes:
- Read a .csv: Learn how to read a CSV file into R as a tibble.
- Read an Excel file: Learn how to read an Excel file into R as a tibble.
- Obey R’s naming rules: You want to give a valid name to an object in R.
- Rename columns in a table: You want to rename one or more columns in a data frame.

Other web resources:

Videos: