• 5. Summarizing summary

Links to: Summary. Questions. Glossary. R functions. R packages. More resources.

Chapter Summary

A close-up photograph of a vibrant pink *Clarkia xantiana* flower with delicate, deeply lobed petals. The petals have a soft gradient, fading from a rich pink at the center to a lighter shade towards the edges. The reproductive structures—dark purple stamens with pollen-covered anthers and a protruding stigma—are prominently visible. The background is softly blurred, showing additional flowers and green stems in what appears to be a greenhouse or controlled growth environment. — A beautiful *Clarkia xantiana* flower.

Because they can be used to parameterize an entire distribution, the mean and variance (or its square root, the standard deviation) are the most common summaries of a variable’s center and spread. However, these summaries are most meaningful when the data resemble a bell curve. To make informed choices about how to summarize a variable, we must first consider its shape, typically visualized with a histogram. When data are skewed or uneven, we can either transform the variable to make its distribution more balanced, or use alternative summaries like the median and interquartile range, which better capture the center and spread in such cases.

Practice Questions

Try these questions! By using the R environment you can work without leaving this “book”.

The tabs above – Iris, Faithful, and Rivers – all attempt to make histograms, but include errors, and may have improper bin sizes. (Click the Iris tab if they are initially empty).

Q1) Iris, Faithful, and Rivers – all attempt to make histograms, but include errors. Which code snippet makes the best version of the Iris plot (ignoring bin size)?

ggplot(iris,aes(x = Sepal.Width, fill = 'white'))+ geom_histogram() ggplot(iris,aes(x = Sepal.Width, color = 'white'))+ geom_histogram() ggplot(iris,aes(x = Sepal.Width))+ geom_histogram(color = 'white') ggplot(iris,aes(x = Sepal.Width))+ geom_histogram(color = white) ggplot(iris,aes(x = Sepal.Width))+ geom_histogram(fill = 'white')

Before addressing this set of questions fix the errors in the histograms of Iris, Faithful, and Rivers, and adjust the bin size of each plot until you think it is appropriate. (Click any of the tabs if they are initially empty).

Q2a) Which variable is best described as bimodal?

Q2b) Which variable is best described as unimodal and symmetric?

Q2c) Which variable is best described as unimodal and right skewed?

Penguins

Q3) I calculate means in two different ways above and get different answers. Which is correct?

Q4) What went wrong in calculating these means?

n() counts the number of entries, but we need the number of non-NA entries. na.rm should be set to TRUE, not T The denominator for the mean is (n - 1), not n. Nothing – they are both potentially correct depedning on your goals.

Q5) Accounting for species differences in mean body mass, which penguin species shows the greatest variability in body mass?

library(palmerpenguins)
library(ggplot2)
library(dplyr)

penguins |>
  group_by(species)|>
  summarize(mean_mass = mean(body_mass_g, na.rm = T),
            sd_mass   = sd(body_mass_g, na.rm = T),
            coef_var_mass =   sd_mass / mean_mass 
  )

# A tibble: 3 × 4
  species   mean_mass sd_mass coef_var_mass
  <fct>         <dbl>   <dbl>         <dbl>
1 Adelie        3701.    459.        0.124 
2 Chinstrap     3733.    384.        0.103 
3 Gentoo        5076.    504.        0.0993

For the next set of questions consider the boxplot below, which summarizes the level of Lake Huron in feet every year from 1875 to 1972.

Q6a) The mean is roughly

Six One and three quarters Five hundred and seventy nine We cannot estimate the mean from this plot.

Q6b) The median is roughly

Six One and three quarters Five hundred and seventy nine We cannot estimate the median from this plot.

Q6c) The mode is roughly

Six One and three quarters Five hundred and seventy nine We cannot estimate the mode from this plot.

Q6d) The interquartile range is roughly Q6e) The range is roughly Q6f) The variance is roughly

Brooke planted RILs at four different locations, and found tremendous variation in the proportion of hybrid seed across locations. The first step in quantifying this variation is to calculate the sum of squares, so let’s do it. Use the image below (Figure 1) to calculates sum of squares for proportion hybrid seeds across plants planted at four different locations.

Q7a) The sum of squares for differences between proportion hybrids at each location and the grand mean equals:

Q7b) So the variance is

Q7c) The standard deviation is

Q7d) Accounting for differences in their means, how does the variability in the proportion of hybrid seed across locations compare to the variability in petal area among RILs? (refer to this section for reference)

They are very similar Petal area among RILs is roughly thirty times as variable The proportion of hybrid seed among sites is roughly two times as variable You cannot compare variability for different traits measured on such different scales

We can compare the variability of traits measured on different scales by dividing the standard deviation by the mean. This gives us the coefficient of variation (CV), a standardized measure of spread.

Q8) Why is it important to standardize by the mean when comparing variability between variables?

First: Think and draft an answer.
Then: Interact with this chatbot to refine your answer.

My prompt: “You are a helpful assistant for a statistics textbook. This is for college students and graduate students who have likely heard the terms mean median and mode before. This comes after our section on univariate summaries of shape, center and variability. The students have just had a homework question to ‘compare the variability (accounting for differences in means) in proportion hybrid seed set across experimental field sites (measured as proportion) and petal area of Recombinant Inbred Lines aka RILs (measured in mm2)’. Help them work through this next question – ‘Why is it important to standardize by the mean when comparing variability between variables?’. Here are some rules for you. 1) Do not give any help until they hazard an attempt to answer the question, if they ask for help without trying to answer, say something like ‘Im happy to help once you give it a first shot’. 2) Do not tell them the answer, but do guide them to the answer. 3) Be encouraging and supportive. 4) Do tell them when their answer is pretty good. 5) After their first attempt do give them hints, and work them through examples / logic. For example, you can work them through what the standard deviation in proportion hybrid seed would be if it was measured as percent rather than proportion, or what the standard deviation in petal area would be if it was measured in cm2 rather than mm2. 6) Remember ‘good enough is good enough.’ Once they have tried more than one AND have a complete sentence, which shows that they clearly understand this issue, say something like ‘great job, this is good enough to submit – but I am here to help you if you want to further refine your answer.’ – it can be annoying and dispiriting when you keep asking for a better version when the version the students have is already good. Ok enough rules, here is some additional useful context: the standard deviation in proportion hybrid seed across locations is 0.08165 and the mean is 0.15 (so the coefficient of variation is 0.566), while the standard deviation for petal area is 14.26, and the mean is 62 (so the coefficient of variation is 0.23).”

Glossary of Terms

📐 1. Shape and Distribution

Skewness: A measure of asymmetry in a distribution.
- Right-skewed: Most values are small, with a long tail of large values.
- Left-skewed: Most values are large, with a long tail of small values.
Mode: The most frequently occurring value (or values) in a dataset.
Unimodal / Bimodal / Multimodal: Describes the number of peaks (modes) in a distribution.
- Unimodal: One peak
- Bimodal: Two peaks, possibly indicating two subgroups
- Multimodal: More than two peaks

🔁 2. Transformations and Data Shape

Transformation: A mathematical function applied to data to change its shape or scale. Often used to reduce skew or satisfy model assumptions.
Monotonic Transformation: A transformation that preserves the order of values (e.g., if \(x_1 > x_2\), then \(f(x_1) > f(x_2)\)). Required for valid shape-changing operations.
Log Transformation (log(), log10()): Reduces right skew by compressing large values.
- ✅ Use for right-skewed data (e.g., area, income, growth).
- ⚠️ Don’t use with zero or negative values — log is undefined in those cases. A workaround is log(x + 1) for count data.
Square Root Transformation (sqrt()): Less aggressive than log. Preserves order while compressing large values.
- ✅ Use for right-skewed data like enzyme activity or count data.
- ⚠️ Not defined for negative values.
Reciprocal / Inverse (1/x): Emphasizes small values and compresses large ones.
- ✅ Use for rates or time-based data (e.g., reaction time).
- ⚠️ Undefined for zero values; extremely sensitive to small values.
Square / Cube (x^2, x^3): Spreads data out, emphasizing large values.
- ✅ Can reduce left skew.
- ⚠️ Squaring loses sign if data contains negatives; avoid if data include both positive and negative values.

🎯 3. Summarizing the Center (Central Tendency)

Mean (mean()): The arithmetic average. Sensitive to outliers.
- \(\overline{X} = \frac{1}{n} \sum_{i=1}^{n} x_i\)
Median (median()): The middle value of a sorted dataset. Robust to outliers.
Mode: Most frequent value or value bin.
Trimmed Mean: Mean after removing fixed percentages of extreme values. Balances robustness and efficiency.
Geometric Mean: The nth root of the product of values.
- ✅ Appropriate for multiplicative data (e.g., growth rates, ratios, log-normal data).
- ⚠️ Don’t use with zeros or negative values — the geometric mean is undefined.
- 🧠 Tip: Especially useful for right-skewed, strictly positive data that spans multiple orders of magnitude.
Harmonic Mean: The reciprocal of the mean of reciprocals.
- ✅ Useful when averaging ratios or rates (e.g., speed, population size in genetics).
- ⚠️ Very sensitive to small values and undefined for zero or negative numbers.
- 🧠 Tip: Use when the quantity being averaged is in the denominator (e.g., “miles per hour”).

📉 4. Summarizing Variability

Range: Difference between maximum and minimum. Sensitive to outliers.
Interquartile Range (IQR) (IQR()): Middle 50% of data. Robust and often paired with the median.
Mean Absolute Deviation (MAD) (mad()): The average absolute deviation from the mean or median. Robust and intuitive.
- \(\text{MAD} = \frac{1}{n} \sum |x_i - \bar{x}|\)
Sum of Squares (SS): Total squared deviation from the mean.
- \(SS = \sum (x_i - \bar{x})^2\)
Variance (var()): The average squared deviation from the mean.
- \(s^2 = \frac{SS}{n - 1}\)
Standard Deviation (sd()): Square root of variance. Easier to interpret due to linear units.
- \(s = \sqrt{s^2}\)
Coefficient of Variation (CV): Standard deviation divided by the mean. Unitless and good for comparing across traits or units.
- \(CV = \frac{s}{\bar{x}}\)

📊 5. Visualizing Distributions

Histogram: Shows frequency of values within bins. Useful for assessing shape, skewness, and modes.
Boxplot: Summarizes median, quartiles, range, and outliers in a compact visual form.

Key R Functions

📊 Visualizing Univariate Data

geom_histogram() ([ggplot2]): Makes histograms for visualizing distributions.
geom_boxplot() ([ggplot2]): Visualizes the distribution using a box-and-whisker plot.
geom_col() ([ggplot2]): Creates bar plots from summarized data.
geom_bar() ([ggplot2]): Bar plot for raw count data.

📈 Summarizing Center

mean() ([base R]): Computes the arithmetic mean.
median() ([base R]): Computes the median.
mutate() ([dplyr]): Adds new variables or transforms existing ones.
summarise() ([dplyr]): Reduces multiple rows to a summary value per group.

📏 Summarizing Variability

var() ([base R]): Computes variance.
sd() ([base R]): Computes standard deviation.
mad() ([base R]): Computes the median absolute deviation — a robust summary of variability.
IQR() ([base R]): Computes the interquartile range.
quantile() ([base R]): Returns sample quantiles. Useful for computing percentiles and quartiles.
sum() ([base R]): Used in calculating the sum of squared deviations (sum((x - mean(x))^2)).

🔁 Transformations

log() ([base R]): Natural log (base e) transformation.
log10() ([base R]): Base 10 log transformation.
sqrt() ([base R]): Computes square roots.
^ ([base R]): Exponentiation (x^2, x^3, etc.).
[1/x]: Reciprocal transformation. Beware of dividing by zero!

R Packages Introduced

ggforce: Provides advanced geoms for ggplot2. This chapter uses geom_sina() to reduce overplotting by jittering points while preserving density.

Additional resources

R Recipes:

Compute summary statistics for a table: Learn how to find summary stats within a summarise() call.
Compute summary statistics for groups of rows within a table Discover how to calculate summary stats by group.
Visualize a Distribution with a Histogram: Learn to plot histograms to visualize the distribution of a continuous variable.
Visualize a Boxplot: Find out how to create boxplots to summarize the distribution of a continuous variable and identify potential outliers.

Videos:

Data summaries from Calling Bullshit (Bergstrom & West, 2020). Fun video to help with thinking about various summaries of center, and when to use which.
The shape of data from crash course in statistics.