Covariance, Correlation, and Causation: When Two Variables Dance, Who’s Leading the Show?
So, you have crunched some numbers, plotted your data, and now you're staring at a tangled web of relationships between variables, trying to make sense of them. Welcome to the world of covariance, correlation, and causation: a place where what you see isn't always what you get.
Unpacking the Concepts: Covariance, Correlation, and Causation
Let's start with an example: the VP of Fundraising at your company needs to know how many friends each member of the company has, so they can include this figure in their pitch to potential donors. We performed a preliminary analysis of this data using the mean, median, mode, variance, standard deviation, and interquartile range (IQR) in the previous blog, Crunching Numbers: From Averages to Outliers and Everything In Between. Now, let's dig deeper and find out whether the number of friends is related to the amount of time each member spends on social media.
# Our dataset
num_friends = [100.0,49,41,40,25,21,21,19,19,18,18,16,15,15,15,15,14,14,13,13,13,13,12,12,11,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,8,8,8,8,8,8,8,8,8,8,8,8,8,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
daily_minutes = [1,68.77,51.25,52.08,38.36,44.54,57.13,51.4,41.42,31.22,34.76,54.01,38.79,47.59,49.1,27.66,41.03,36.73,48.65,28.12,46.62,35.57,32.98,35,26.07,23.77,39.73,40.57,31.65,31.21,36.32,20.45,21.93,26.02,27.34,23.49,46.94,30.5,33.8,24.23,21.4,27.94,32.24,40.57,25.07,19.42,22.39,18.42,46.96,23.72,26.41,26.97,36.76,40.32,35.02,29.47,30.2,31,38.11,38.18,36.31,21.03,30.86,36.07,28.66,29.08,37.28,15.28,24.17,22.31,30.17,25.53,19.85,35.37,44.6,17.23,13.47,26.33,35.02,32.09,24.81,19.33,28.77,24.26,31.98,25.73,24.86,16.28,34.51,15.23,39.72,40.8,26.06,35.76,34.76,16.13,44.04,18.03,19.65,32.62,35.59,39.43,14.18,35.24,40.13,41.82,35.45,36.07,43.67,24.61,20.9,21.9,18.79,27.61,27.21,26.61,29.77,20.59,27.53,13.82,33.2,25,33.1,36.65,18.63,14.87,22.2,36.81,25.53,24.62,26.25,18.21,28.08,19.42,29.79,32.8,35.99,28.32,27.79,35.88,29.06,36.28,14.1,36.63,37.49,26.9,18.58,38.48,24.48,18.95,33.55,14.24,29.04,32.51,25.63,22.22,19,32.73,15.16,13.9,27.2,32.01,29.27,33,13.74,20.42,27.32,18.23,35.35,28.48,9.08,24.62,20.12,35.26,19.92,31.02,16.49,12.16,30.7,31.22,34.65,13.13,27.51,33.2,31.57,14.1,33.42,17.44,10.12,24.42,9.82,23.39,30.93,15.03,21.67,31.09,33.29,22.61,26.89,23.48,8.38,27.81,32.35,23.84]
Fig. 1 shows the scatter plot of our dataset (number of friends vs. daily minutes on social media).
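If you want to recreate Fig. 1 yourself, a minimal matplotlib sketch along these lines should do it (the axis assignment here is an assumption; swap x and y if your figure is oriented the other way):
# Scatter plot of the dataset defined above (sketch, assuming matplotlib is installed)
import matplotlib.pyplot as plt

plt.scatter(num_friends, daily_minutes)
plt.title("Fig. 1: Daily minutes on social media vs. number of friends")
plt.xlabel("# of friends")
plt.ylabel("daily minutes")
plt.show()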
Covariance: The Relationship Detective
Covariance helps us understand the direction of the relationship between two variables. Think of it as a detective who notices whether two suspects (variables) tend to move together or apart.
Let’s understand this mathematically — if X and Y are two features/variables of a dataset with n data points, and x̄ and ȳ are their respective means/averages, the covariance is written as,
Cov(X, Y) = Σ (x_i - x̄)(y_i - ȳ) / (n - 1)
(x_i - x̄) and (y_i - ȳ) are the deviations of each data point from their respective means; they measure how far each point sits from the centre of its variable. The product of these deviations captures the direction of the relationship between the two variables: if one deviation is positive and the other is negative, the product is negative, which means the variables vary inversely. Summing these products over all data points captures the overall relationship, and dividing by (n - 1) normalises the sum so it acts as an average measure of co-variation.
A positive covariance indicates that as one variable increases, the other tends to increase too (or both decrease together). A negative covariance indicates they move in opposite directions. A covariance near 0 suggests no linear relationship (though not necessarily independence).
# Python code for covariance
from typing import List
from scratch.linear_algebra import dot
from scratch.statistics import de_mean  # deviations from the mean, discussed in the previous blog

def covariance(x: List[float], y: List[float]) -> float:
    assert len(x) == len(y), "xs and ys must have same number of elements"
    # Average of the products of paired deviations from the mean
    return dot(de_mean(x), de_mean(y)) / (len(x) - 1)
You can find the scratch module in the git repo here, which has all the methods defined from scratch. de_mean, the deviation of each data point from the mean, was discussed in the previous blog.
# For our data the covariance is around 22.4
assert 22.42 < covariance(num_friends, daily_minutes) < 22.43
The positive covariance of about 22.4 suggests that members with more friends tend to spend more time on social media; the two variables generally move in the same direction. However, because covariance depends on the scale of the data (number of friends and minutes), the value of 22.4 alone does not tell us whether this relationship is strong or weak. For a visual sense, check the scatter plot in Fig. 1.
What’s the limitation?
Covariance is not standardised, so its value is difficult to interpret directly without additional context. For example, the covariance between variables measured in different units, as here, can be misleading.
Another problem is scale dependence: if x is replaced by 2x, the covariance also doubles, even though the underlying relationship has not changed.
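A quick sanity check of this scale dependence, reusing the covariance function and data defined above (the bounds below are just the earlier assertion doubled):
# Doubling one variable doubles the covariance
doubled_friends = [2 * x for x in num_friends]
assert 44.84 < covariance(doubled_friends, daily_minutes) < 44.86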
Thus, covariance is often used in conjunction with correlation, which standardises the covariance by dividing it by the product of the standard deviations of the two variables, making it easier to interpret.
Correlation: The Profiling Expert
If covariance is our detective, correlation is the profiling expert who clarifies how tightly two variables are related. It measures how well the data points cluster around a straight line (Yes, the same line that you fit using linear regression). A higher absolute value indicates tight clusters around the fitted line and vice-versa.
Mathematically, the Pearson correlation coefficient (𝑟) between two variables 𝑋 and 𝑌 is given as:
r = Cov(X, Y) / (𝜎_x ⋅ 𝜎_y)
The division by the product of the two standard deviations (𝜎_x ⋅ 𝜎_y) removes the units and makes the correlation coefficient a unitless measure that ranges between -1 and 1.
A value of 1 is a perfect positive correlation (the variables move together in perfect harmony), -1 is a perfect negative correlation (one variable increases as the other decreases), and 0 means no linear correlation.
# The python code for correlation
from typing import List
from scratch.statistics import standard_deviation  # defined in the previous blog

def correlation(x: List[float], y: List[float]) -> float:
    """Measures how much x and y vary in tandem about their means"""
    stdev_x = standard_deviation(x)
    stdev_y = standard_deviation(y)
    if stdev_x > 0 and stdev_y > 0:
        return covariance(x, y) / stdev_x / stdev_y
    else:
        return 0  # if either variable shows no variation, correlation is zero
# For our data the correlation is around 0.25
assert 0.25 < correlation(num_friends, daily_minutes) < 0.252
A correlation of 0.25 indicates that there is a slight tendency for members with more friends to spend more time on social media. However, this relationship is not strong, meaning there are many members who might not follow this pattern closely. The weak value suggests that other factors might play a more significant role in determining how much time someone spends on social media.
Be cautious! Correlation alone doesn’t always tell the whole story.
Simpson’s Paradox: The Sneaky Data Trickster
Imagine you’re looking at two variables, like ice cream sales and drowning cases. You might find that as ice cream sales go up, so do drowning cases. Does this mean eating ice cream causes drowning? Not at all!
What's happening here is a confounding variable at work. The real culprit is a third factor, temperature: on hot days, both ice cream sales and swimming (which can lead to drowning) go up. So, while it seems like there's a direct connection between ice cream and drowning, it's really the heat driving both. Simpson's Paradox is a closely related trap: a trend that appears in several groups of data can weaken, vanish, or even reverse when the groups are combined, often because of lurking variables like this one.
Keep in mind — Correlation can sometimes trick us, especially if we don't account for other influencing factors, like temperature in this example. As a data scientist, it is crucial to dig deeper and look for these hidden factors, called confounding variables.
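To see Simpson's Paradox in numbers, here is a tiny synthetic illustration (the values are made up purely for demonstration), reusing the correlation function from above: within each group the trend is perfectly negative, yet the combined data shows a strong positive correlation.
# Two groups, each with a perfectly negative trend
group_a_x, group_a_y = [1, 2, 3], [10, 9, 8]
group_b_x, group_b_y = [11, 12, 13], [20, 19, 18]
assert correlation(group_a_x, group_a_y) < 0  # negative within group A
assert correlation(group_b_x, group_b_y) < 0  # negative within group B

# Combined, the difference between the groups dominates and the trend flips
combined_x = group_a_x + group_b_x
combined_y = group_a_y + group_b_y
assert correlation(combined_x, combined_y) > 0.9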
Zero Correlation Doesn’t Mean Zero Relationship!
Correlation measures how well data points cluster around a straight line. But sometimes, data can have a relationship even if the correlation is zero. Take this example:
X = [-2, -1, 0, 1, 2]
Y = [2, 1, 0, 1, 2]
Here, the correlation between X and Y is zero, but there is a clear relationship: y is the absolute value of x.
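You can verify this with the correlation function defined above (a small sketch; the tolerance just absorbs floating-point noise):
X = [-2, -1, 0, 1, 2]
Y = [2, 1, 0, 1, 2]  # y = |x|
# The linear association is zero, even though y is fully determined by x
assert abs(correlation(X, Y)) < 1e-9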
Keep in mind — Just because two variables aren’t correlated doesn’t mean they aren’t related in some other way. Always explore the data beyond the correlation coefficient.
Correlation Shows Direction, Not Magnitude!
Let’s say you have two datasets:
# Dataset 1
X1: [1, 2, 3, 4, 5]
Y1: [2, 4, 6, 8, 10]
# Dataset 2
X2: [1, 2, 3, 4, 5]
Y2: [10, 20, 30, 40, 50]
Both datasets have a correlation of 1, which means they have a perfect linear relationship. But notice something: in the first dataset, for each unit increase in X, Y increases by 2 units; in the second dataset, Y increases by 10 units for each unit increase in X.
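A quick check with the correlation function from above (math.isclose absorbs floating-point rounding):
import math

X1, Y1 = [1, 2, 3, 4, 5], [2, 4, 6, 8, 10]      # Y changes by 2 per unit of X
X2, Y2 = [1, 2, 3, 4, 5], [10, 20, 30, 40, 50]  # Y changes by 10 per unit of X

# Both relationships are perfectly linear, so both correlations are 1
assert math.isclose(correlation(X1, Y1), 1.0)
assert math.isclose(correlation(X2, Y2), 1.0)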
Keep in mind — Correlation tells us about the direction (positive or negative) and consistency of the relationship, but not the size of the change.
Therefore, during data analysis it is important to consider both the correlation and the actual values to understand the relationship fully.
Correlation and Causation: They’re Not the Same Thing!
Just because two things are correlated doesn't mean one causes the other. This is a fundamental rule in data science.
Why Correlation Doesn’t Mean Causation:
- Third Variable Problem: Sometimes, a third factor (like temperature in the ice cream example) influences both variables, creating a misleading correlation.
- Reverse Causation: The correlation might exist, but the cause-and-effect relationship could be the opposite of what you expect.
- Coincidence: The correlation might just be a coincidence, with no real connection between the variables.
Keep in mind — To prove causation, you need more than just correlation — like controlled experiments or strong theoretical evidence. Correlation is a useful tool, but it’s not the whole story.
The Full Picture
So, what's the real deal with covariance, correlation, and causation? Think of covariance as a detective figuring out if two suspects are moving together or apart. Correlation is the profiler, showing how tightly they're clustered around a straight line. But beware! Correlation can be sneaky; traps like Simpson's Paradox and confounding variables can hide the real story in your data. Remember, correlation doesn't mean causation: just because two things dance together doesn't mean one is leading.
This statistical adventure is inspired by Joel Grus’s book Data Science from Scratch.
For more details and code, swing by my GitHub repository.