Motivating Scenario:
You are continuing your exploration of a new dataset. After checking its shape and making transformations you thought were appropriate, you’re now ready to explore how two numeric variables are associated.
Learning Goals: By the end of this subchapter, you should be able to:
Calculate and explain a covariance, both as the “mean of the products minus the product of the means” and as the “mean of the cross products”.
Calculate and explain a correlation coefficient and why this standardized measure can be more useful than covariance when comparing associations.
The covariance
The covariance can also be used to describe the association between two numeric variables. For example, in our Clarkia RIL data, we could describe the association between \(\text{log}_{10}\) petal area and the proportion of hybrid seeds using a covariance. As with two categorical variables, the covariance between two numeric variables reflects how much the observed association differs from what we’d expect if the variables were independent. There are two ways to calculate covariance — I introduce both here because each provides a different lens for understanding the concept, and each connects deeply to core ideas in statistics.
The covariance as a deviation from expectations
In the previous section, I introduced the covariance as the difference between the proportion of observations with a specific pair of values for two variables (e.g., pink flowers and being visited by a pollinator) and how frequently we would expect to see this pairing if the variables were independent: \(\text{Covariance}_{A,B} = (P_{AB}-P_{A} \times P_{B})\). Because we can think of proportions as a mean, we can use this same math to describe the covariance of two numeric variables, X and Y, as the difference between the mean of the products and the product of the means:

\[\text{Cov}_{X,Y} = \overline{XY} - \overline{X} \times \overline{Y}\]
As in the previous section, this formula is slightly wrong because it implicitly has a denominator of \(n\), not \(n-1\). We apply Bessel’s correction (multiplying our answer by \(\frac{n}{n-1}\)) to get the precise covariance. But when \(n\) is big, this is close enough.
So, we can find the covariance between (\(\text{log}_{10}\)) petal area and the proportion of hybrid seeds as the mean of a plant’s (\(\text{log}_{10}\)) petal area times its proportion of hybrid seeds minus the mean (\(\text{log}_{10}\)) petal area times the mean proportion of hybrid seeds, which equals 0.00756 (after applying Bessel’s correction).
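The “mean of the products minus the product of the means” calculation can be sketched in R. The variable names below are simulated stand-ins (not the real Clarkia RIL data), just to show that the formula, after Bessel’s correction, matches R’s built-in `cov()`:

```r
# Simulated stand-ins for the chapter's variables (hypothetical values)
set.seed(42)
n                <- 100
log10_petal_area <- rnorm(n, mean = 1, sd = 0.2)
prop_hybrid      <- 0.3 + 0.2 * log10_petal_area + rnorm(n, sd = 0.05)

# Mean of the products minus the product of the means (implicit denominator n)
naive_cov <- mean(log10_petal_area * prop_hybrid) -
  mean(log10_petal_area) * mean(prop_hybrid)

# Bessel's correction: multiply by n / (n - 1)
corrected_cov <- naive_cov * n / (n - 1)

# Matches R's built-in cov() (up to floating point)
all.equal(corrected_cov, cov(log10_petal_area, prop_hybrid))
```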
Alternatively, we can think of the covariance as how far an individual’s value of X and Y jointly differ from their means. In this formulation,
Find the deviation of X and Y from their means for each individual– \((X_i-\overline{X})\), and \((Y_i-\overline{Y})\), respectively (Figure 1, left).
Take the product of these values to find the cross product (the area of a given rectangle in Figure 1, left).
Sum them to find the sum of cross products (Figure 1, right, top).
Divide by the sample size minus one (Figure 1, right, bottom).
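The four steps above can be followed literally in R. Again, the data here are simulated placeholders rather than the real Clarkia values:

```r
# Hypothetical data standing in for log10 petal area (x) and
# proportion hybrid seeds (y)
set.seed(1)
x <- rnorm(20)
y <- 0.5 * x + rnorm(20)

dev_x          <- x - mean(x)               # Step 1: deviations from the mean
dev_y          <- y - mean(y)
cross_products <- dev_x * dev_y             # Step 2: cross products
sum_cp         <- sum(cross_products)       # Step 3: sum of cross products
covariance     <- sum_cp / (length(x) - 1)  # Step 4: divide by n - 1

# Matches R's built-in cov() (up to floating point)
all.equal(covariance, cov(x, y))
```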
The equation for the covariance, \(\text{Cov}_{X,Y} = \frac{\Sigma{(X_i-\overline{X})(Y_i-\overline{Y})}}{(n-1)}\), should remind you of the equation for the variance, \(\text{Var}_{X} = \frac{\Sigma{(X_i-\overline{X})(X_i-\overline{X})}}{(n-1)}\) (compare Figure 1 to Figure 2 from 5. Summarizing variability). In fact, the variance is simply the covariance of a variable with itself (see our section on summarizing variability for a refresher). Likewise, you can calculate the variance as the mean of the squares minus the square of the mean.
This essentially finds the mean of the cross products, with Bessel’s correction:
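We can check both claims about the variance with a quick sketch (simulated data, not the Clarkia RILs):

```r
# The variance is the covariance of a variable with itself
set.seed(2)
x <- rnorm(50)
all.equal(var(x), cov(x, x))

# And, like the covariance, the variance is the mean of the squares minus
# the square of the mean, times n / (n - 1) for Bessel's correction
n <- length(x)
all.equal(var(x), (mean(x^2) - mean(x)^2) * n / (n - 1))
```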
Figure 1: An animation to help understand the covariance. Left: We plot each point’s deviation of x and y from their means. The area of that rectangle is the cross product. Middle: Shows how these cross products accumulate. Right: The cumulative sum of cross products and the running covariance estimate. The lower plot (covariance) is simply the top plot divided by (n-1).
The flipbook below walks you through these calculations:
Both ways of computing the covariance — as the mean of the cross products and as the mean of the products minus the product of the means — are helpful for understanding association. But students are practical and often ask: “Which of these formulae should we use to calculate the covariance?” There are a few answers to this question — the first is “it depends,” the second is “whichever you like,” and the third is “neither, just use the cov() function in R.” Here’s how:
The use = "pairwise.complete.obs" argument tells R to ignore NA values when calculating the covariance — just like na.rm = TRUE does when calculating the mean. You can use this argument or filter out NA values first.
gc_rils |>
  summarise(covariance = cov(log10_petal_area_mm, prop_hybrid, use = "pairwise.complete.obs"))
# A tibble: 1 × 1
covariance
<dbl>
1 0.00756
The correlation
Much like the variance and the difference in means, the covariance is a very useful mathematical description, but its biological meaning can be difficult to interpret and communicate. We therefore usually present the correlation coefficient (represented by the letter r) – a summary of the strength and direction of a linear association between two variables. This also corresponds to how closely the points fall along a straight line in a scatterplot: the stronger the correlation, the more tightly the points cluster along a line (positive or negative).
Large absolute values of r indicate that we can quite accurately predict one variable from the other (i.e., points are near a line on a scatterplot).
Values of r near zero mean that we cannot accurately predict values of one variable from the other (i.e., points are not near a line on a scatterplot).
The sign of r describes whether the variables increase together (\(r > 0\), a positive slope), or one variable decreases as the other increases (\(r < 0\), a negative slope).
Mathematically, r is simply the covariance divided by the product of the standard deviations, \(r = \frac{\text{Cov}_{X,Y}}{s_X \times s_Y}\), and we can find it in R with the cor() function:
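A quick sketch (again with simulated stand-in data) confirms that scaling the covariance by the two standard deviations matches R’s cor():

```r
# Hypothetical data standing in for the two Clarkia variables
set.seed(3)
x <- rnorm(40)
y <- 0.6 * x + rnorm(40)

r_by_hand <- cov(x, y) / (sd(x) * sd(y))  # covariance scaled by sds
r_by_cor  <- cor(x, y)                    # R's built-in correlation

all.equal(r_by_hand, r_by_cor)
```

Like cov(), cor() accepts use = "pairwise.complete.obs" to handle missing values.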
As with Cohen’s d, what counts as a “large” or “small” correlation coefficient depends on the study, the question, and the field of study, but there are rough guides (see table on right). So our observed correlation between \(\text{log}_{10}\) petal area and proportion hybrid is worth paying attention to, but not massive.
Coming up next
These summaries — covariance and correlation — give us tools to describe how two numeric variables relate. Later, we’ll return to these ideas in the context of linear models, where we formalize the idea of one variable predicting another.