Data Science and Probability: Breaking Down the Basics
When diving into data science, probability is one of the first concepts you will encounter. Understanding it is crucial because, at its core, data science is about making predictions and informed decisions under uncertainty. Probability provides the foundation for these decisions, helping us quantify the likelihood of events and the potential risks associated with them.
Unpacking Basic Probability Concepts
To navigate through probability, we first need to grasp some fundamental terminologies:
Sample Space: The sample space is simply all the possible outcomes. Imagine rolling a die — in this case, {1, 2, 3, 4, 5, 6}.
Event: Any specific subset of the sample space. For example, rolling an even number is an event, with the subset {2, 4, 6}.
Probability of an Event (P(E)): This is the likelihood that event E occurs. Probability values range between 0 and 1, where 0 indicates the event is impossible, and 1 means it’s a certainty.
Complementary Events: What’s the probability that event E doesn’t happen? That’s where the complement of E, denoted as E’, comes in. It’s calculated as:
P(E’) = 1 − P(E).
Union of Events: What if we’re interested in the probability of either event E or V happening? This is called the probability of the union of events, and it’s given by:
P(E ∪ V) = P(E) + P(V) − P(E ∩ V).
Intersection of Events: Sometimes, we want to know the likelihood of two events happening simultaneously. This is the intersection of events, expressed as:
P(E ∩ V) = P(E) × P(V), for independent events.
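As a quick sanity check of the union rule, we can enumerate a fair die’s sample space in Python (the events E and V below are illustrative choices, not from the text above):

from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
E = {2, 4, 6}      # event: roll an even number
V = {4, 5, 6}      # event: roll a number greater than 3 (an illustrative choice)

def prob(event: set) -> Fraction:
    """Probability of an event for a fair die: favourable outcomes / total outcomes."""
    return Fraction(len(event), len(sample_space))

# The union computed directly matches the inclusion-exclusion formula
assert prob(E | V) == prob(E) + prob(V) - prob(E & V)   # both sides equal 2/3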
But what is meant by independent events?
Independent vs. Dependent Events
Independent events are those where the outcome of one event doesn’t affect the outcome of another. For example, rolling a die twice — each roll is independent of the other. The probability space remains the same for each roll, always giving you the same {1, 2, 3, 4, 5, 6} possibilities.
On the other hand, Dependent events are those where the outcome of one event influences the next. Consider drawing balls from a bag. If you don’t replace the ball, the sample space changes with each draw, altering the probability of future draws.
For dependent events, the probability of both events E and V occurring together is given by:
P(E ∩ V) = P(E ∣ V) * P(V).
Here, P(E∣V) is the probability of event E occurring given that event V has already happened, and P(V) is the prior probability of event V occurring.
P(E ∩ V) is also written as P(E, V); the two notations are used interchangeably.
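To see the dependent-events formula at work, here is a small simulation of drawing two balls from a bag without replacement (the bag’s contents and the seed are illustrative choices):

import random

random.seed(42)                      # illustrative seed
bag = ["red"] * 3 + ["blue"] * 2     # illustrative bag: 3 red balls, 2 blue balls

def both_red() -> bool:
    """Draw two balls without replacement and report whether both are red."""
    first, second = random.sample(bag, 2)   # sampling without replacement
    return first == "red" and second == "red"

trials = 100_000
estimate = sum(both_red() for _ in range(trials)) / trials
# Theory: P(red first) * P(red second | red first) = 3/5 * 2/4 = 0.3
print(round(estimate, 3))            # close to 0.3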
A Real-World Example: Predicting Child Gender
Let’s take a family with two children and explore the probabilities based on the following assumptions:
- Each child is equally likely to be a boy or a girl.
- The gender of the second child is independent of the first child.
Here’s the sample space: {BB, BG, GB, GG}.
P(no girl) = 1/4,
P(1G and 1B) = 2/4 = 1/2,
P(GG ∣ elder is G) = 1/2,
P(GG ∣ either is G) = 1/3.
Interestingly, if all we know is that at least one child is a girl, the probability that the other child is a boy becomes 2/3, not 1/2! This might seem counterintuitive, but it’s a great demonstration of how probability can reveal surprising truths.
To further explore this, let’s simulate the gender distribution of 10,000 families with two children:
import enum, random

class Kid(enum.Enum):
    BOY = 0
    GIRL = 1

def random_kid() -> Kid:
    """Pick a boy or a girl with equal probability."""
    return random.choice([Kid.BOY, Kid.GIRL])

GG = 0         # both children are girls
GX = 0         # the older child is a girl
GX_or_XG = 0   # at least one child is a girl

random.seed(0)
for _ in range(10000):
    # Generate a random pair of kids
    younger = random_kid()
    older = random_kid()
    # Count how many of the 10,000 families have an older girl (GX),
    # two girls (GG), or at least one girl (GX_or_XG)
    if older == Kid.GIRL:
        GX += 1
    if older == Kid.GIRL and younger == Kid.GIRL:
        GG += 1
    if older == Kid.GIRL or younger == Kid.GIRL:
        GX_or_XG += 1

assert 0.48 < GG / GX < 0.53          # P(both girls | elder is a girl), close to 1/2
assert 0.31 < GG / GX_or_XG < 0.36    # P(both girls | either is a girl), close to 1/3
The results align closely with theoretical expectations, showcasing the power of probability in predicting outcomes.
Bayes’s Theorem
Bayes’s theorem is a data scientist’s best friend: it is a way of reversing conditional probabilities. Bayes’s theorem tells us how to update our belief in the probability of A given the occurrence of B, by considering the probability of B given the occurrence of A and the prior probability of A.
We can say that Bayes’s theorem is a way of thinking. It allows data scientists to incorporate new evidence into existing beliefs, making it easier to navigate the complexities of real-world data. Whether applied to medical testing, spam filtering, or predictive modelling, Bayes’s Theorem is a crucial tool for making sense of uncertainty.
Mathematically, Bayes’s Theorem can be expressed as:
P(A∣B) = P(B∣A) × P(A) / P(B)
Where:
- P(A∣B) is the probability of event A occurring given that B has already occurred.
- P(B∣A) is the probability of event B occurring given that A has already occurred.
- P(A) is the prior probability of A.
- P(B) is the total probability of B occurring, which can be calculated as:
P(B) = P(B∣A) × P(A) + P(B∣-A) × P(-A)
This formula may look a bit complex, but it becomes much clearer with an example.
The derivation of Bayes’s Theorem is quite straightforward, and you can explore more detailed explanations in my GitHub repository.
Let’s understand it with an example.
An Example: Medical Testing
Imagine a scenario where we’re testing for a rare disease. The probability of having the disease, P(D), is 0.1% (or 0.001). The test is highly sensitive: if you have the disease, the probability that the test will be positive, P(T∣D), is 99% (or 0.99). However, the test isn’t perfect. There’s a 5% chance that the test will be positive even if you don’t have the disease; this is the false positive rate, P(T∣-D) = 0.05.
Now, you take the test, and it comes back positive. What’s the probability that you actually have the disease?
First, we calculate P(T), the overall probability of testing positive. This takes into account both the true positives and the false positives:
P(T) = P(T∣D)×P(D)+P(T∣-D)×P(-D)
Substituting the given values:
P(T) = 0.99 × 0.001 + 0.05 × 0.999
P(T) = 0.00099 + 0.04995 = 0.05094
Next, we use Bayes’s Theorem to find P(D∣T), the probability of having the disease given that the test result is positive:
P(D∣T) = P(T∣D) × P(D) / P(T)
P(D∣T) = 0.99×0.001/0.05094
P(D∣T) ≈ 0.0194
So, even with a 99% sensitive test, the probability that you actually have the disease given a positive result is only about 1.94%. This might seem surprisingly low, but it’s a powerful illustration of how Bayes’s Theorem helps us understand the true implications of test results, especially in scenarios where the event (in this case, having the disease) is rare.
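The same arithmetic can be wrapped in a short helper (the function name bayes is my own, used only for illustration):

def bayes(p_b_given_a: float, p_a: float, p_b_given_not_a: float) -> float:
    """P(A | B) via Bayes's theorem, with P(B) expanded by total probability."""
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    return p_b_given_a * p_a / p_b

# The medical-testing numbers from the example above
print(bayes(p_b_given_a=0.99, p_a=0.001, p_b_given_not_a=0.05))   # ≈ 0.0194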
Random Variables: The Backbone of Probability Distributions
A random variable (RV) is a numerical representation of the outcomes of an event. These variables can take on a wide range of values depending on the nature of the event they are associated with.
Discrete vs. Continuous RVs
RVs can be either Discrete or Continuous:
Discrete RVs take on specific, countable values. For example: The numbers 1 to 6 in the event of rolling a die, or the values 0 (for tails) and 1 (for heads) in the event of flipping a coin, or the number of heads obtained when flipping a coin 10 times.
Continuous RVs take on an infinite number of possible values within a given range. For instance, measuring the exact time it takes for a chemical reaction to complete could result in a continuous variable.
These RVs are always associated with a probability distribution, which tells us the likelihood of each possible outcome.
Expected Values of RVs
One of the key concepts associated with RVs is the expected value. This is essentially the average value you would expect a RV to take on, considering the probabilities of all possible outcomes.
The formula for the expected value (E[X]) of a RV X is:
E[X] = ∑ ( X × P(X) )
In simple terms, it’s the sum of all possible values of the RV, each multiplied by its corresponding probability.
Let’s understand this with examples:
- The expected value of a RV in the event of flipping a coin is:
E[X] = 0 × 1/2 + 1 × 1/2 = 1/2
- The expected value of the numbers in the range from 0 to 9 (assuming equal probability) is:
E[X] = 1/10 × (0+1+2+…+9) = 0.1×45 = 4.5
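In code, the expected value is a one-line weighted sum, applied here to the two examples above:

def expected_value(values, probabilities) -> float:
    """E[X]: the sum of each possible value weighted by its probability."""
    return sum(x * p for x, p in zip(values, probabilities))

print(expected_value([0, 1], [0.5, 0.5]))        # coin flip -> 0.5
print(expected_value(range(10), [0.1] * 10))     # digits 0 to 9 -> 4.5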
Conditional Random Variables
RVs can also be conditioned based on prior knowledge or other events. For instance, consider the family with two children from the earlier example on predicting child gender:
- Let X be the RV representing the number of girls in the family. The possible values are {0, 1, 2}, with corresponding probabilities P(X)={1/4, 1/2, 1/4}.
- Now, if we know that at least one of the children is a girl, the RV Y (representing the number of girls, given that at least one is a girl) takes the values {1, 2}, with corresponding probabilities P(Y) = {2/3, 1/3}.
In most cases, as data scientists, we use RVs implicitly, relying on their properties without always focusing on their underlying structure.
Continuous Distributions: When Random Variables Flow Smoothly
In contrast to discrete RVs, which take on countable values, continuous RVs can assume an infinite number of values within a given range. These are the types of variables we often encounter when dealing with measurements or any situation where values are not restricted to discrete points.
Probability Density Function (PDF)
For continuous RVs, we can’t assign probabilities to specific values as we do with discrete variables. Instead, we define the probability in terms of the Probability Density Function (PDF). The PDF describes how the probabilities are distributed across different values of the RV.
Since the RV can take any value within a continuous range, the probability of it taking any exact value is technically zero. Instead, we find the probability over an interval by integrating the PDF across that interval.
For a small interval of width h, the probability that a continuous RV X lies between x and x + h is approximately:
P(x ≤ X ≤ x + h) ≈ h × PDF(x)
This approach allows us to handle the continuous nature of these variables effectively.
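A toy density makes this concrete. The sketch below uses f(x) = 2x on [0, 1] (an illustrative choice, not from the text), whose cumulative distribution is F(x) = x²:

def pdf(x: float) -> float:
    """An illustrative density on [0, 1]: f(x) = 2x, which integrates to 1."""
    return 2 * x if 0 <= x <= 1 else 0

def cdf(x: float) -> float:
    """The matching cumulative distribution: F(x) = x^2 on [0, 1]."""
    return min(max(x, 0.0), 1.0) ** 2

x, h = 0.5, 1e-4
exact = cdf(x + h) - cdf(x)   # P(x <= X <= x + h), computed exactly from the CDF
approx = h * pdf(x)           # the small-interval approximation h * PDF(x)
print(exact, approx)          # both are approximately 0.0001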
1. Uniform Distribution: Equal Probability Across the Board
One of the simplest continuous distributions is the uniform distribution. In a uniform distribution, every value within a certain range has the same probability of occurring.
Continuous Uniform Distribution: This distribution is defined by its minimum and maximum values. The PDF of a uniform distribution is constant, meaning that the likelihood of the RV taking any value within the specified range is the same.
For a uniform distribution over the interval [0,1], the PDF can be described as:
def uniform_pdf(x: float) -> float:
    return 1 if 0 <= x < 1 else 0
In this case, the probability is evenly spread out between 0 and 1. For any x within this interval, the PDF returns a constant value of 1, indicating equal likelihood across the entire range. It’s often used in simulations where each outcome is equally likely, e.g., random sampling.
Cumulative Distribution Function (CDF) of a Uniform Distribution
The CDF helps us understand the likelihood that a RV will take a value less than or equal to a specific point.
For a uniform distribution, where all values are equally likely, the CDF provides a straightforward way to calculate this probability. Let’s dive into how it works for a uniform distribution in the range [0, 1].
Mathematically, the CDF F(x) for a random variable X is defined as:
F(x) = P(X ≤ x).
This means it tells us the probability that X will be less than or equal to x.
For a uniform distribution over the range [0, 1], things are pretty simple. The CDF for such a distribution is defined as:
def uniform_cdf(x: float) -> float:
    """Returns the cumulative probability that a uniform random variable is <= x."""
    if x < 0:
        return 0
    elif x < 1:
        return x
    else:
        return 1
If x < 0: The function returns 0 because the cumulative probability is 0 for values below the minimum of the distribution range.
If 0 ≤ x < 1: The function returns x, which directly represents the CDF within the interval [0, 1].
If x ≥ 1: The function returns 1, representing that the cumulative probability has reached its maximum.
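A few example calls show the three cases in action:

print(uniform_cdf(-0.5))   # 0: no probability mass lies below 0
print(uniform_cdf(0.3))    # 0.3: 30% of the mass lies at or below 0.3
print(uniform_cdf(1.7))    # 1: the entire range [0, 1] lies at or below 1.7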
2. Normal Distribution
The normal distribution is characterised by its classic bell-shaped curve. It’s defined by just two parameters:
- Mean (μ): This is where the center of the curve is located. It represents the average value around which the data points are distributed.
- Standard Deviation (σ): This measures how spread out the values are. A larger standard deviation means a wider curve, while a smaller standard deviation results in a narrower curve.
Here’s how we can calculate the PDF of a normal distribution in Python:
import math
def normal_PDF(x: float, mu: float = 0, sigma: float = 1) -> float:
    """Returns the probability density function of a normal distribution."""
    SQRT_TWO_PI = math.sqrt(2 * math.pi)
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * SQRT_TWO_PI)
The function traces the bell-shaped curve of the normal distribution. Changes in standard deviation affect the width of the bell: a smaller σ results in a narrower and taller curve, while a larger σ results in a wider and flatter curve. Changing the mean, on the other hand, shifts the center of the curve along the x-axis.
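Even without a plot, a few evaluations at the mean (x = μ) show how σ controls the height of the peak, which equals 1/(σ√2π):

for sigma in [0.5, 1, 2]:
    # The peak of the bell sits at x = mu and equals 1 / (sigma * sqrt(2 * pi))
    print(sigma, round(normal_PDF(0, mu=0, sigma=sigma), 3))
# 0.5 -> 0.798 (narrow and tall), 1 -> 0.399, 2 -> 0.199 (wide and flat)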
Standard Normal Distribution
When the mean μ is 0 and the standard deviation σ is 1, we get the Standard Normal Distribution. This special case simplifies many calculations and is a common reference in statistics.
For any normal random variable X with mean μ and standard deviation σ, we can convert it to a standard normal variable Z using Z = (X − μ) / σ, or equivalently
X = σ Z + μ
This means that if we know the distribution parameters of a normal random variable, we can always standardise it to a standard normal distribution for easier analysis.
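A quick numeric check of the standardisation (the values of μ, σ, and x below are arbitrary illustrative choices):

mu, sigma = 10, 2          # illustrative parameters, not from the text
x = 13                     # a value from the N(mu, sigma) distribution

z = (x - mu) / sigma       # standardise: Z = (X - mu) / sigma
print(z)                   # 1.5

print(sigma * z + mu)      # the inverse transform X = sigma * Z + mu gives back 13.0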
Cumulative Distribution Function (CDF) for Normal Distribution
The CDF of a normal distribution tells us the probability that a random variable X will take on a value less than or equal to a specific value x. Python’s math library provides a function erf() that we can use to compute the CDF. Here’s how we can use it:
import math
def normal_CDF(x: float, mu: float = 0, sigma: float = 1) -> float:
    return (1 + math.erf((x - mu) / (sigma * math.sqrt(2)))) / 2
This function returns the cumulative probability for a given value x, mean μ, and standard deviation σ.
The CDF at a particular point x represents the probability that the random variable X is less than or equal to x. It’s the cumulative area under the PDF curve up to x. This cumulative area gives us insights into the distribution’s behaviour and is useful for calculating percentiles and quantiles.
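A couple of quick checks with normal_CDF:

print(normal_CDF(0))                       # 0.5: half of the mass lies below the mean
print(normal_CDF(1) - normal_CDF(-1))      # ≈ 0.683: the mass within one standard deviation
print(normal_CDF(1.96))                    # ≈ 0.975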
Inverse CDF: Finding the Value for a Given Probability
Sometimes, we need to find the value x that corresponds to a given cumulative probability p. This process involves computing the inverse of the CDF. Here’s how to do it —
The inverse CDF can be approximated using the binary search method, which involves:
- Defining a range within which the value might lie.
- Iteratively adjusting the range based on comparisons with the target probability.
- Narrowing down to the desired precision.
Here’s a Python function to perform this calculation:
def inverse_normal_cdf(p: float, mu: float = 0, sigma: float = 1, tolerance: float = 1e-5) -> float:
    """Find approximate inverse of the CDF using binary search."""
    # If it is not standard, compute for the standard normal and rescale
    if mu != 0 or sigma != 1:
        return mu + sigma * inverse_normal_cdf(p, tolerance=tolerance)
    lo_z = -10.0   # normal_CDF(-10) is (very close to) 0
    hi_z = 10.0    # normal_CDF(10) is (very close to) 1
    while hi_z - lo_z > tolerance:
        mid_z = (lo_z + hi_z) / 2   # find the midpoint of the current bracket
        mid_p = normal_CDF(mid_z)   # CDF value at the midpoint
        if mid_p < p:
            lo_z = mid_z            # midpoint is too low; search above it
        else:
            hi_z = mid_z            # midpoint is too high; search below it
    return mid_z
For the Standard Normal Distribution, when μ = 0 and σ = 1, we can use this function directly to find the Z-value corresponding to a given probability p.
For a General Normal Distribution, with any μ and σ, we adjust the result from the standard normal case using the formula X = σ⋅Z + μ (which is exactly what the recursive call in the function does).
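Here is what usage looks like (the results are approximate because of the binary-search tolerance):

print(inverse_normal_cdf(0.5))                    # ≈ 0.0: the median of the standard normal
print(inverse_normal_cdf(0.975))                  # ≈ 1.96
print(inverse_normal_cdf(0.975, mu=10, sigma=2))  # ≈ 13.92, i.e. 10 + 2 * 1.96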
Central Limit Theorem
The Central Limit Theorem (CLT) is a cornerstone of probability and statistics. It reveals an amazing property of sample means and helps us make sense of data, no matter its original distribution.
Assumptions:
- Independence and Identical Distribution (i.i.d.): The random variables we are dealing with must be independent of each other and follow the same probability distribution. This means that each variable in our sample space (e.g., X1, X2, X3,…) should have the same statistical properties and should not influence one another.
- Large Sample Size: For the CLT to hold true, the sample size should be sufficiently large, typically around 30 or more. The larger the sample, the closer the sample mean’s distribution will be to a normal distribution.
The Central Limit Theorem says that, no matter what the original distribution of the data looks like, if we take a large enough sample and compute the sample mean, the distribution of these means will approach a normal distribution. This normal distribution has the same mean as the original distribution but a smaller standard deviation, which shrinks as σ/√n, where n is the sample size.
Let’s understand it with some examples —
1. Rolling a Die
Imagine rolling a fair six-sided die twice. Let’s say X1 is the outcome of the first roll and X2 is the outcome of the second roll. Even though each roll is uniformly distributed (with each face having an equal chance of landing), the average outcome of multiple rolls will follow a normal distribution as the number of rolls increases. This is a direct application of the CLT in action.
You can find the detailed codes in my GitHub repository.
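As a minimal sketch of that simulation (the 30 rolls per sample and 10,000 samples below are arbitrary choices):

import random

random.seed(0)

def mean_of_rolls(n_rolls: int) -> float:
    """Average of n_rolls fair-die rolls: one sample mean."""
    return sum(random.randint(1, 6) for _ in range(n_rolls)) / n_rolls

# 10,000 sample means, each from 30 rolls; their histogram is roughly bell-shaped,
# centered near 3.5, the expected value of a single roll
sample_means = [mean_of_rolls(30) for _ in range(10_000)]
print(sum(sample_means) / len(sample_means))   # close to 3.5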
2. Bernoulli Trials and Binomial Distribution
To grasp the CLT more deeply, let’s explore Bernoulli trials and their extension, the binomial distribution.
A Bernoulli trial is a fundamental concept where a single trial results in a binary outcome, such as success or failure.
- Each trial has two possible outcomes, often labeled as “success” and “failure.”
- The outcome of one trial doesn’t affect others.
- The probability of success p and failure 1−p remains constant.
- All trials follow the same probability distribution.
How does a large sample of Bernoulli RVs form a normal distribution?
Suppose you flip a coin once. This is a Bernoulli trial with two outcomes: heads (success) and tails (failure). The distribution is simple and binary.
Now, imagine flipping the coin 10 times. The number of heads you get follows a binomial distribution. You can calculate probabilities of getting different numbers of heads using the binomial formula.
If you repeatedly conduct this experiment (e.g., flipping the coin 10 times) and calculate the average number of heads each time, you end up with a set of sample means. The CLT tells us that if you take a large number of these sample means, their distribution will approximate a normal distribution, even though the original data (number of heads in each set of 10 flips) is binomial.
We can define a Bernoulli trial and the binomial distribution in code as follows:
import random

# If the random value falls below the given p, we count the trial as a success
def bernoulli_trial(p: float) -> int:
    """
    Run a single Bernoulli trial with success probability p
    input: float
    output: int (1 for success, 0 for failure)
    """
    return 1 if random.random() < p else 0

def binomial(p: float, n: int) -> int:
    """
    Returns the sum of n Bernoulli trials, i.e. the number of successes
    """
    return sum(bernoulli_trial(p) for _ in range(n))

# Get a large sample of size 'num_points' of binomial RVs;
# each variable represents the number of successes in n trials with probability p
p, n, num_points = 0.75, 100, 10000   # the experiment described below
data = [binomial(p, n) for _ in range(num_points)]
When we plot a histogram of data based on 10,000 experiments, each consisting of 100 coin flips (where each flip has a 75% probability of being heads), we visualise how the distribution of the number of heads approaches a normal distribution.
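One way to draw that comparison, assuming matplotlib is installed and reusing binomial, normal_PDF, and the data sample defined above (the normal curve uses μ = np and σ = √(np(1−p))):

import math
from collections import Counter
import matplotlib.pyplot as plt

# Relative frequency of each head-count across the num_points experiments
counts = Counter(data)
plt.bar(list(counts.keys()), [c / num_points for c in counts.values()], label="binomial samples")

# Overlay the normal curve whose mean and standard deviation match the binomial
mu, sigma = n * p, math.sqrt(n * p * (1 - p))
xs = range(min(data), max(data) + 1)
plt.plot(list(xs), [normal_PDF(x, mu, sigma) for x in xs], color="red", label="normal approximation")
plt.legend()
plt.show()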
To sum it up, mastering probability is like having a superpower for navigating uncertainty. Whether you’re rolling dice, updating beliefs with Bayes’s Theorem, or embracing the Central Limit Theorem’s magic, you’re equipped to tackle predictions and data analysis with confidence. Just remember, understanding these concepts transforms you from a data novice into a seasoned pro.
For a deeper dive into these principles, complete with code and detailed notes, swing by my GitHub repository.
This article is inspired from the book Data Science from Scratch by Joel Grus.