Crunching Numbers: From Averages to Outliers and Everything In Between
Ever wondered how we turn raw data into actionable insights? It’s not magic — it’s statistics! In this blog, we’ll roll up our sleeves and dive into the basics of statistics using nothing but Python code. Forget about using fancy libraries or shortcuts; we’re going old-school with raw data and pure coding. This approach will help us uncover not just what the numbers are telling us, but also how and why they matter in our data-driven world.
Let’s say the VP of Fundraising at your company needs to know how many friends the company’s members have. They want to include this info in their pitch to potential donors. Now, if we just throw the raw data at them, it’s like handing them a jigsaw puzzle without the picture on the box. It could be useful, but it isn’t very helpful. Here is our raw data, representing the number of friends each member has:
num_friends = [100.0,49,41,40,25,21,21,19,19,18,18,16,15,15,15,15,14,14,13,13,13,13,12,12,11,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,8,8,8,8,8,8,8,8,8,8,8,8,8,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
For small datasets, we might read through the data directly. But when the data gets big, we need statistics to help us summarise and interpret it. Think of statistics as our data’s personal translator, making the complex stuff easier to understand.
To help the VP visualise the data better, we can start by creating a histogram. This simple visual tool helps show the distribution and trends in the number of friends the members have. Here’s a snapshot of what it looks like:
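Since we’re sticking to plain Python, a rough text histogram can be sketched with the standard library alone. This is an illustrative sketch on a small sample of the data, and the bucket width of 10 is my own choice, not a fixed rule:

```python
from collections import Counter

# A small illustrative sample; in the post, num_friends holds the full dataset
num_friends = [100.0, 49, 41, 25, 10, 10, 9, 8, 6, 6, 5, 3, 2, 1, 1]

# Bucket each friend count into bands of width 10 (0-9, 10-19, ...)
histogram = Counter(int(x) // 10 * 10 for x in num_friends)

for bucket in sorted(histogram):
    # One '#' per member in the bucket
    print(f"{bucket:3d}-{bucket + 9:<3d} | {'#' * histogram[bucket]}")
```

A plotting library would produce a nicer picture, but even this crude version makes the shape of the distribution visible at a glance.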
Now, this histogram is super helpful because it gives us a quick visual overview. It shows us where most of the action is happening — where the bulk of the members fall in terms of friend count. But while it’s great for spotting trends and distributions, it doesn’t tell the whole story.
Beyond the Histogram: The Need for Deeper Insights
The histogram showed us the distribution, but it didn’t answer questions like:
- What’s the average number of friends per member? Are members generally popular, or are there just a few with lots of friends while the rest have only a handful?
- Are there any outliers? Do we have a few members with an unusually high or low number of friends that might skew the overall picture?
- How spread out are these friend counts? Are most members close to the average, or is there a wide variety?
This is where statistics comes in, letting us zoom in and understand the finer points.
Central Tendencies
So, we’ve got a bunch of numbers and need to describe them without sounding like a robot. Central tendencies help us find the “middle” of our data so we can keep things simple. Let’s break it down:
Mean
The mean is like finding the average number of friends per member. To get this, we add up the total number of friends and divide by the number of members.
from typing import List

def mean(my_dataset: List[float]) -> float:
    """
    Mean of my_dataset
    """
    return sum(my_dataset) / len(my_dataset)

assert 7.33 < mean(num_friends) < 7.34  # Mean of data without outliers
But there is a problem: if most members have between 1 and 100 friends, but two members have 2000 and 5000 friends, the mean will be skewed. It will suggest a much higher average, which doesn’t reflect the typical member’s social circle. Let’s add these outliers to our data:
num_friends_with_outliers = num_friends + [5000, 2000]

assert 41.2 < mean(num_friends_with_outliers) < 41.3  # Highly skewed mean with outliers
Therefore, the mean is best used when the data is evenly distributed without outliers. If we can’t clean the outliers, we should check the central tendency of the data using the median.
Median
The median is the middle value of a dataset sorted from smallest to largest. For a dataset with an odd number of values, the median is the middle value; for an even number of values, it is the average of the two middle values.
# For an odd-length dataset (assumes the data is already sorted)
# The underscores mark these as "private" helpers used by median()
def _median_odd(my_dataset: List[float]) -> float:
    return my_dataset[len(my_dataset) // 2]

# For an even-length dataset (assumes the data is already sorted)
def _median_even(my_dataset: List[float]) -> float:
    high_midpoint = len(my_dataset) // 2
    low_midpoint = high_midpoint - 1
    return (my_dataset[low_midpoint] + my_dataset[high_midpoint]) / 2

# Median
def median(my_dataset: List[float]) -> float:
    """
    Median of my_dataset
    """
    v = sorted(my_dataset)
    return _median_even(v) if len(v) % 2 == 0 else _median_odd(v)

assert median(num_friends) == 6  # Median without outliers
The median is not affected by outliers or extremely skewed data: even the median of the dataset with outliers stays at 6.
assert median(num_friends_with_outliers) == 6 # Median with outliers
Quantiles
Quantiles are values that divide a dataset into intervals containing equal proportions of the data. Quartiles and percentiles are the most commonly used quantiles. The median, which divides the dataset into two equal halves, is the second quartile.
For example:
- First Quartile (Q1 or 25th Percentile): 25% of the data falls below this value.
- Second Quartile (Q2 or 50th Percentile): This is the median.
- Third Quartile (Q3 or 75th Percentile): 75% of the data falls below this value.
- Percentiles generalise this concept further. For instance, the 90th percentile represents the value below which 90% of the data falls.
# Define quantile
def quantile(my_dataset: List[float], my_quantile: float) -> float:
    """
    Value from my_dataset below which (my_quantile*100)% of values lie
    """
    my_sorted = sorted(my_dataset)
    return my_sorted[int(len(my_dataset) * my_quantile)]

assert quantile(num_friends, 0.10) == 1
assert quantile(num_friends, 0.25) == 3
assert quantile(num_friends, 0.50) == 6
assert quantile(num_friends, 0.75) == 9
assert quantile(num_friends, 0.90) == 13
Quantiles help us understand the distribution of data, especially when dealing with outliers.
Mode
Unlike the mean and median, the mode is not about averages or middle points — it’s about popularity. While the mean might be skewed by that one super-popular member with 2000 friends, and the median might just sit in the middle, the mode represents what’s most common.
from collections import Counter

def mode(my_dataset: List[float]) -> List[float]:
    """
    Returns a list of the most common values in my_dataset
    """
    count_dict = Counter(my_dataset)
    max_counts = max(count_dict.values())
    return [x for x, y in count_dict.items() if y == max_counts]

assert set(mode(num_friends)) == {1, 6}  # Most members have either 6 or 1 friends
The mode is especially useful when working with categorical data.
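For instance, the same mode() logic works unchanged on non-numeric data, where mean and median make no sense. The membership-tier labels below are made up purely for illustration:

```python
from collections import Counter
from typing import List

def mode(my_dataset: List) -> List:
    """
    Returns a list of the most common values in my_dataset
    """
    count_dict = Counter(my_dataset)
    max_counts = max(count_dict.values())
    return [x for x, y in count_dict.items() if y == max_counts]

# Hypothetical membership tiers -- categorical, so we can't average them
tiers = ["free", "free", "premium", "free", "premium", "gold"]
assert mode(tiers) == ["free"]  # "free" appears most often (3 times)
```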
Data Dispersion
Dispersion helps us understand the “spread” of our data — how far individual data points are from the average. For our dataset, knowing the dispersion helps us grasp whether most members are hanging out in a close-knit group or whether a few outliers have wildly different friend counts.
Range
First, let’s talk about the range. The range is the difference between the highest and lowest values in the dataset. It gives a quick sense of the spread, but it can be heavily influenced by outliers.
def data_range(my_dataset: List[float]) -> float:
    """
    Difference between the max and min values in my_dataset
    """
    return max(my_dataset) - min(my_dataset)
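To see how fragile the range is, here is a tiny illustrative check (my own numbers, not the blog’s dataset) showing a single outlier dominating it:

```python
def data_range(my_dataset):
    """Difference between the max and min values in my_dataset"""
    return max(my_dataset) - min(my_dataset)

# Illustrative: one extreme value dominates the range entirely
friends = [1, 3, 5, 6, 9]
assert data_range(friends) == 8               # 9 - 1
assert data_range(friends + [5000]) == 4999   # a single outlier blows it up
```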
Variance
While the range gives a quick snapshot, it doesn’t account for how the data is distributed between the extremes. Variance, on the other hand, provides a more comprehensive view by considering every data point and their deviations from the mean.
It is (almost) the average squared deviation from the mean: the higher the spread, the higher the variance. (The code below divides by n - 1 rather than n, the standard correction when working with a sample — hence the “almost”.)
# Find the deviation of each value from the mean
def de_mean(my_dataset: List[float]) -> List[float]:
    """
    Translate dataset by subtracting its mean (so the result has mean 0)
    """
    my_mean = mean(my_dataset)
    return [(x_i - my_mean) for x_i in my_dataset]

# Sum of squared values, used by variance()
def sum_of_squares(my_dataset: List[float]) -> float:
    return sum(x_i ** 2 for x_i in my_dataset)

# Find variance
def variance(my_dataset: List[float]) -> float:
    """
    Almost the average squared deviation from the mean
    """
    assert len(my_dataset) >= 2  # Variance requires at least two elements
    return sum_of_squares(de_mean(my_dataset)) / (len(my_dataset) - 1)
As it depends on the mean, variance can also be influenced by outliers, but it reflects them in a more integrated way than the range: if the dataset contains extreme values, variance helps you see their impact more clearly. Also, because variance is in squared units, it can be harder to interpret directly.
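A tiny worked example (my own illustrative numbers) makes the formula concrete:

```python
def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    """Almost the average squared deviation from the mean (n - 1 in the denominator)"""
    m = mean(xs)
    deviations = [x - m for x in xs]
    return sum(d ** 2 for d in deviations) / (len(xs) - 1)

# mean = 3, deviations = [-2, -1, 0, 1, 2], squared sum = 10, 10 / 4 = 2.5
assert variance([1, 2, 3, 4, 5]) == 2.5
```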
Standard Deviation
The problem of squared units in variance is resolved with standard deviation. It is simply the square root of the variance. Since it’s in the same units as the data, it’s easier to interpret and understand. It tells us how much individual data points typically deviate from the mean in practical terms. For example, if we are discussing friend counts, saying “the standard deviation is 5 friends” is more immediately understandable than saying “the variance is 25 friends squared.”
import math

def standard_deviation(my_dataset: List[float]) -> float:
    """
    The standard deviation is the square root of the variance
    """
    return math.sqrt(variance(my_dataset))
The outlier problem persists with standard deviation, because it’s the square root of variance itself.
assert 9 < standard_deviation(num_friends) < 9.04
assert 374 < standard_deviation(num_friends_with_outliers) < 375 # Outlier problem
Interquartile Range (IQR)
As discussed, variance and standard deviation are important measures of data dispersion, but both face a serious outlier challenge. When dealing with outliers, the IQR is a more robust measure for understanding data dispersion. The IQR measures the range within which the central 50% of the data lies — between the 25th percentile (Q1) and the 75th percentile (Q3). This middle range is less influenced by extreme values at either end of the distribution, allowing us to get a clearer picture of the “typical” spread of friends among members without letting a few extreme values skew the results.
def interquartile_range(my_dataset: List[float]) -> float:
    """
    Returns the difference between the 75%-ile and the 25%-ile
    """
    return quantile(my_dataset, 0.75) - quantile(my_dataset, 0.25)

assert interquartile_range(num_friends) == 6
assert interquartile_range(num_friends_with_outliers) == 6  # Barely affected by a small number of outliers
Once we have an idea of the general spread from the IQR, we can calculate variance and standard deviation to get a comprehensive view of data dispersion. These measures are particularly useful if our dataset is relatively clean and we need precise statistical metrics.
The Full Picture
From visualising friend counts with histograms to understanding central tendencies and data dispersion, we’ve seen how each measure offers unique insights. While histograms give us an overview, measures like mean, median, and quantiles help pinpoint the central values. We moved from range to variance and standard deviation to grasp data spread, and finally, the Interquartile Range (IQR) provided a robust view unaffected by outliers.
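As a sanity check on our from-scratch versions, Python’s built-in statistics module (part of the standard library, no install needed) implements the same measures. A quick sketch on illustrative data of my own:

```python
import statistics

# Illustrative data (mine, not the blog's dataset)
data = [2, 4, 4, 4, 5, 5, 7, 9]

assert statistics.mean(data) == 5
assert statistics.median(data) == 4.5            # even length: average of the two middle values
assert statistics.mode(data) == 4                # most common value
assert statistics.pstdev(data) == 2.0            # population standard deviation (divides by n)
assert round(statistics.stdev(data), 2) == 2.14  # sample standard deviation (divides by n - 1)
```

On Python 3.8+, statistics.quantiles covers percentiles as well. Writing the functions by hand first, as above, is what makes these one-liners feel earned.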
This exploration into statistics has been inspired by a section of Joel Grus’s book, Data Science from Scratch.
For more details and to explore the code, check out my GitHub repository here.