ID | weight | date_collected_empty_means_same_as_above |
---|---|---|
1-A1 | 104 | 2024-03-01 |
1-1B | 210 | |
3-7 | 150 | |
2-B | 176 | 2024-03-15 |
1-A5 | 110 | |
4. Reproducibility summary
Chapter Summary
Reproducibility is a cornerstone of good science, ensuring that research is transparent, reliable, and easy to build upon. This chapter covered best practices for collecting, organizing, and analyzing data in a reproducible manner.
Before collecting data, establish clear rules for measurement, naming conventions, and data entry to maintain consistency. Field sheets should be well structured and tidy. Additionally, creating a data dictionary and README document ensures that variables and project details are well documented. Finally, storing data and scripts in public repositories supports transparency and open science.
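As a rough illustration of what a data dictionary might look like, here is a minimal sketch in base R. The variable names, descriptions, and file name are hypothetical and are not taken from the chapter's dataset; a real dictionary would describe your own columns.

```r
# Hypothetical variables, for illustration only
data_dictionary <- data.frame(
  variable        = c("bird_id", "weight_g", "date_collected"),
  description     = c("Unique identifier for each bird",
                      "Body mass at capture",
                      "Date the measurement was taken"),
  units           = c(NA, "grams", "YYYY-MM-DD"),
  expected_values = c("text, one value per bird",
                      "positive number",
                      "valid calendar date")
)

# Save the dictionary next to the data so it travels with the project.
# tempdir() stands in for a real project folder here.
dictionary_path <- file.path(tempdir(), "data_dictionary.csv")
write.csv(data_dictionary, dictionary_path, row.names = FALSE)
```

Keeping the dictionary as a plain CSV means anyone can open it without special software.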
In analysis, using an R Project helps keep files organized, and loading data with relative paths avoids location issues. Writing well-structured R scripts with clear comments makes workflows understandable and repeatable. By prioritizing reproducibility, you not only strengthen the integrity of your work but also make future analyses smoother for yourself and others.
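To make the idea of relative paths concrete, here is a small sketch. It assumes the readr package is installed. The folder and file names (bird_project, data/weights.csv) are made up for illustration, and the setwd() call only simulates what opening an R Project does for you; you would not normally write it in your own scripts.

```r
library(readr)

# Build a throwaway folder that stands in for an R Project (hypothetical layout)
project_dir <- file.path(tempdir(), "bird_project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
write_csv(
  data.frame(id = c("a", "b"), weight = c(104, 210)),
  file.path(project_dir, "data", "weights.csv")
)

# Opening the .Rproj file sets the working directory to the project folder;
# setwd() imitates that here so the example can run on its own
old_wd <- setwd(project_dir)

# The relative path works on any computer that has the same project folder
weights <- read_csv("data/weights.csv", show_col_types = FALSE)

setwd(old_wd)  # restore the original working directory
```

Because the path starts from the project folder rather than from the root of one particular computer, a collaborator can run the same line without editing it.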
Chatbot tutor
Please interact with this custom chatbot (link here) that I made to help you with this chapter. I suggest at least ten back-and-forth exchanges to ramp up, then stopping once you feel you have what you needed from it.
Practice Questions
Try these questions!
While some of these (like the long name for date) are clearly shortcomings, spreadsheets should never leave values implied.
Q2) What would you expect in a data dictionary accompanying the table above? (select all correct)
Q3) How do you read data from an Excel sheet called raw_data in an Excel file named bird_data.xlsx located inside the R project you are working in?
Q4) What should you do to make code reproducible? (pick the best answer)
Glossary of Terms
Absolute Path – A file location specified from the root directory (e.g., /Users/username/Documents/data.csv), which can cause issues when sharing code across different computers. Using relative paths instead is recommended.
Data Dictionary – A structured document that defines each variable in a dataset, including its name, description, units, and expected values. It helps ensure data clarity and consistency.
Data Validation – A method for reducing errors in data entry by restricting input values (e.g., dropdown lists for categorical variables, ranges for numerical values).
Field Sheet – A structured data collection form used in the field or lab, designed for clarity and ease of data entry.
Metadata – Additional information describing a dataset, such as when, where, and how data were collected, the units of measurement, and details about the variables.
R Project – A self-contained environment in RStudio that organizes files, code, and data in a structured way, making analysis more reproducible.
Raw Data – The original, unmodified data collected from an experiment or survey. It should always be preserved in its original form, with any modifications performed in separate scripts.
README File – A text file that provides an overview of a dataset, including project details, data sources, file descriptions, and instructions for use.
Reproducibility – The ability to re-run an analysis and obtain the same results using the same data and code. This requires careful documentation, structured data storage, and clear coding practices.
Relative Path – A file path that specifies a location relative to the current working directory (e.g., data/my_file.csv), making it easier to share and reproduce analyses.
Tidy Data – A dataset format where each variable has its own column, each observation has its own row, and each value is in its own cell.
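As a quick illustration of the tidy data definition, the sketch below reshapes a small wide table so that each variable has its own column. It assumes the tidyr package is installed. The bird weights and column names are invented for this example.

```r
library(tidyr)

# Wide format: the month is hidden inside the column names (made-up data)
wide <- data.frame(
  bird_id      = c("b1", "b2"),
  weight_march = c(104, 210),
  weight_april = c(110, 205)
)

# Long (tidy) format: one row per bird per month,
# with month and weight each in their own column
long <- pivot_longer(
  wide,
  cols         = c(weight_march, weight_april),
  names_to     = "month",
  names_prefix = "weight_",  # strip the shared prefix from the new month values
  values_to    = "weight"
)
```

After reshaping, `long` has four rows (two birds by two months) and the columns bird_id, month, and weight.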
Key R functions
clean_names(data) – Standardizes column names (from the janitor package).
drop_na(data) – Removes rows with missing values (from the tidyr package).
read_csv("file.csv") – Reads a CSV file into R as a tibble (from the readr package).
read_xlsx("file.xlsx", sheet = "sheetname") – Reads an Excel sheet into R as a tibble (from the readxl package).
rename(data, new_name = old_name) – Renames columns in a dataset (from the dplyr package).
pivot_longer(data, cols, names_to, values_to) – Converts wide-format data to long format (from the tidyr package).
sessionInfo() – Displays session details, including loaded packages (useful for reproducibility).
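Several of these functions are often chained together in a short cleaning script. The sketch below is one possible arrangement, not a prescribed workflow. It assumes the janitor, dplyr, and tidyr packages are installed and R 4.1 or later (for the native pipe). The data and the new column name weight_grams are hypothetical.

```r
library(janitor)
library(dplyr)
library(tidyr)

# Made-up raw data with awkward column names and one missing value
raw <- data.frame(
  `Bird ID`    = c("b1", "b2", "b3"),
  `Weight (g)` = c(104, NA, 150),
  check.names  = FALSE  # keep the messy names exactly as typed
)

cleaned <- raw |>
  clean_names() |>                      # "Bird ID" becomes bird_id, "Weight (g)" becomes weight_g
  rename(weight_grams = weight_g) |>    # new_name = old_name
  drop_na(weight_grams)                 # drop rows where weight is missing

# Record the R version and loaded packages alongside the results
session_details <- sessionInfo()
```

Note that the raw data frame itself is never modified; the cleaned version is stored under a new name, which keeps the original values available.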
R Packages Introduced
readr – Provides fast and flexible functions for reading tabular data (here we revisited read_csv() for CSV files).
dplyr – A grammar for data manipulation. Here we introduced the rename(data, new_name = old_name) function to give columns better names.
tidyr – Helps tidy messy data. Here we introduced pivot_longer() to make wide data long.
janitor – Cleans and standardizes data, including clean_names() (https://sfirke.github.io/janitor/reference/clean_names.html) for formatting column names.
Additional resources
R Recipes:
- Read a .csv: Learn how to read a CSV file into R as a tibble.
- Read an Excel file: Learn how to read an Excel file into R as a tibble.
- Obey R’s naming rules: You want to give a valid name to an object in R.
- Rename columns in a table: You want to rename one or more columns in a data frame.
Other web resources:
Ten Simple Rules for Reproducible Computational Research (Sandve et al., 2013).
NYT article: For Big-Data Scientists, "Janitor Work" Is Key Hurdle to Insights.
Style guide: Chapter 9 of Data management in large-scale education research by Lewis (2024). Includes sections on general good practices, file naming, and variable naming.
Data Storage and security: Chapter 13 of Data management in large-scale education research by Lewis (2024).
Videos:
Data integrity: By Kate Laskowski, who was the victim of data fabrication by her collaborator (and my former roommate) Jonathan Pruitt.
Tidying data with pivot_longer (from Stat454).