Balancing Metrics for Informed Decisions in Hypothesis Testing

Tanu Khanuja PhD
9 min read · Sep 28, 2024


“Without data, you’re just another person with an opinion.” — W. Edwards Deming

As data scientists, one of our main jobs is to see if what we believe about the data is actually true. This is called hypothesis testing. It’s like a way of using numbers to test our ideas and see if they hold up when we look at real data.

Imagine we have a hospital dataset showing patients’ blood pressure before and after taking a new medicine. Our starting belief (called the null hypothesis, or H₀) is that the medicine doesn’t change blood pressure at all. But we want to check if it actually lowers blood pressure, which would be our alternative idea (H₁).

To find out, we analyse the data and look for patterns. If the change in blood pressure is big enough compared to what we’d expect by chance, we reject our starting belief (H₀) and say the medicine probably works. If not, we stick with our original idea. This process helps us decide if we should use the medicine more or if we need to do more testing. Hypothesis testing is a practical way to use data to make smart decisions in the real world.
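
To make this concrete, here is a minimal sketch of such a test in Python. The blood-pressure numbers are made up for illustration, and a paired t-test is just one reasonable choice for before/after data on the same patients:

```python
# A minimal sketch, assuming made-up data: paired t-test on before/after
# blood-pressure readings for the same 10 patients.
import numpy as np
from scipy import stats

before = np.array([150, 142, 138, 160, 155, 147, 152, 149, 158, 144])
after = np.array([144, 139, 136, 151, 150, 145, 147, 146, 152, 141])

# H0: the medicine does not change blood pressure (mean difference = 0)
# H1: the medicine lowers blood pressure (mean of before - after > 0)
t_stat, p_value = stats.ttest_rel(before, after, alternative="greater")

print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")
if p_value <= 0.05:
    print("Reject H0: the medicine probably lowers blood pressure.")
else:
    print("Fail to reject H0: not enough evidence of an effect.")
```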

Contents

  • Understanding Hypotheses in Statistical Testing
  • Types of Errors in Hypothesis Testing
  • Significance Level (α) and Type I Error
  • Type II Error (β), p-Value, and Decision Making
  • Power of a Test (1−β)
  • Confidence Intervals: An Intuitive Approach
  • Balancing the Metrics

Understanding Hypotheses in Statistical Testing

In hypothesis testing, we start with two competing hypotheses:

  • Null Hypothesis (H₀): This is the default assumption that there is no effect, difference, or relationship in the data. It’s like saying, “Nothing unusual is happening.”
  • Alternative Hypothesis (H₁): This hypothesis contradicts the null hypothesis. It represents what we are interested in proving, such as the presence of an effect, difference, or relationship.

The outcome of a hypothesis test can go in two directions:

  1. Fail to Reject H₀: We do not have strong evidence against the null hypothesis.
  2. Reject H₀: We find strong evidence in favor of the alternative hypothesis.

It’s important to note that hypothesis testing does not prove whether a hypothesis is true or false. Instead, it provides evidence to support or reject the hypotheses based on the data.

Types of Errors in Hypothesis Testing

When conducting hypothesis tests, we must be aware of the potential for errors:

  • Type I Error (False Positive): We incorrectly reject a true null hypothesis (H₀). In other words, we conclude that there is an effect when, in fact, there isn't one.
  • Type II Error (False Negative): We fail to reject a false null hypothesis. We miss a real effect or difference because the evidence wasn't strong enough.

To keep these errors in check, we rely on a few key metrics, discussed below.

Significance Level (α) and Type I Error

To manage the risk of a Type I Error, we set a significance level (α), which defines the threshold for deciding when to reject the null hypothesis. The significance level represents the probability of making a Type I error, and it’s typically set at common values like 0.05 (5%) or 0.01 (1%).

Example: Coin Flip and Type I Error

Let’s consider a simple example involving a fair coin flip:

  • Assume we flip a coin 1,000 times, and each flip has a 50% chance of landing heads (p=0.5).
  • We repeat this 1,000-flip experiment 10,000 times, recording the number of heads (successes) each time. This creates a binomial distribution around the mean of 500 heads.

If we suspect the coin might be biased, we would compare our observed number of heads to this distribution. A Type I Error would occur if we mistakenly conclude that the coin is biased when it is actually fair (p=0.5). The significance level α determines how much “unusual” behavior we are willing to accept before deciding that the coin is unfair.

The significance level α quantifies our tolerance for this type of error. For example, if we set α=0.05, we are willing to accept up to a 5% chance of erroneously declaring the coin unfair when it is actually fair. It’s crucial to note that α does not represent the error margin around the mean; it does not imply that values outside the range of 500 ± 25 automatically determine the truth of the null hypothesis. Contextual interpretation of the data is essential!
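
A small simulation makes this tangible. The sketch below (the random seed and α = 0.05 are illustrative choices) repeats the 1,000-flip experiment 10,000 times and checks how often a perfectly fair coin still lands in the critical region:

```python
# A sketch of the experiment above: 10,000 repetitions of 1,000 fair flips,
# with a two-sided critical region at alpha = 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_flips, n_experiments, alpha = 1_000, 10_000, 0.05

# Number of heads in each experiment: a binomial distribution around 500
heads = rng.binomial(n=n_flips, p=0.5, size=n_experiments)

# Two-sided critical values under H0 (fair coin)
lower = stats.binom.ppf(alpha / 2, n_flips, 0.5)      # ~469 heads
upper = stats.binom.ppf(1 - alpha / 2, n_flips, 0.5)  # ~531 heads

# Fair-coin experiments that land in the critical region are exactly
# the Type I errors; their rate should be close to (at most) alpha
type_i_rate = np.mean((heads < lower) | (heads > upper))
print(f"Critical region: fewer than {lower:.0f} or more than {upper:.0f} heads")
print(f"Empirical Type I error rate: {type_i_rate:.3f}")
```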

Here’s a graph of the distribution of heads:

  • The curve represents the normal distribution of the number of heads when the coin is fair.
  • The red shaded areas are the critical regions. If the result of a coin flip falls into these regions, we might incorrectly conclude that the coin is unfair — a Type I Error.

A lower α means less of the curve is shaded red, which reduces the chance of a Type I Error but also makes it harder to detect real effects, potentially increasing the chance of a Type II Error.

We can summarise it as follows:

The significance level (α) is, by definition, the probability of a Type I error when H₀ is true.

Lowering α reduces the risk of a Type I error (at the cost of a higher risk of a Type II error).

Type II Error (β), p-Value, and Decision Making

A Type II Error occurs when we fail to reject the null hypothesis (H₀) even though the alternative hypothesis (H₁) is true. The probability of making a Type II error is denoted by β, and it represents the risk of not detecting an effect when one actually exists.

The p-value measures the probability of observing data at least as extreme as what we actually observed, assuming H₀ is true. It tells us how compatible our data are with the null hypothesis.

  • High p-value: This means the probability of observing the current data or more extreme values, assuming H₀ is true, is high. Therefore, we do not reject H₀. This could result in a Type II error if H₁ is actually true because we might mistakenly conclude that there is no effect when there is one.
  • Low p-value (less than or equal to α): This means the probability of observing the current data or more extreme values, assuming H₀ is true, is very low. This provides evidence against H₀, prompting us to reject H₀ and consider H₁ as a plausible explanation.

It is important to note that the p-value helps us decide whether the evidence is strong enough to reject H₀. It does not directly minimize the risk of Type II errors (β); that risk is governed by the power of the test, which depends on sample size, effect size, and the significance level (α).
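
To see how this works in code, here is one way to compute a p-value for the coin example using SciPy's exact binomial test (the observed count of 530 heads is a made-up figure):

```python
# Two-sided p-value for an observed 530 heads in 1,000 flips of a
# supposedly fair coin (an illustrative, made-up observation).
from scipy.stats import binomtest

result = binomtest(k=530, n=1000, p=0.5, alternative="two-sided")
print(f"p-value = {result.pvalue:.4f}")

alpha = 0.05
if result.pvalue <= alpha:
    print("Reject H0: the coin looks biased.")
else:
    print("Fail to reject H0: not enough evidence of bias.")
```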

Example: Coin Flip and Type II Error

Continuing with the coin flip example, let’s compare the distribution of heads under two scenarios:

  1. When the coin is fair (H₀​).
  2. When the coin is biased (H₁​).

In the graph below:

  • The blue curve represents the distribution when the coin is fair (H₀​).
  • The green curve represents the distribution when the coin is biased (H₁).

If the observed result falls in the overlap between the two curves, we might fail to detect the true bias of the coin, leading to a Type II Error.
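
We can estimate β directly by simulation. The sketch below assumes, purely for illustration, that the coin's true bias is p = 0.53, and reuses the critical region computed earlier:

```python
# Estimating beta: how often a coin biased at p = 0.53 (an assumed value)
# still lands inside the fair coin's acceptance region.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_flips, n_experiments, alpha = 1_000, 10_000, 0.05

# Acceptance region computed under H0 (fair coin)
lower = stats.binom.ppf(alpha / 2, n_flips, 0.5)
upper = stats.binom.ppf(1 - alpha / 2, n_flips, 0.5)

# Flip the biased coin; each result inside the acceptance region is a
# Type II error, because we would fail to detect the bias
heads = rng.binomial(n=n_flips, p=0.53, size=n_experiments)
beta = np.mean((heads >= lower) & (heads <= upper))
print(f"Estimated beta: {beta:.3f}, power: {1 - beta:.3f}")
```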

We can summarise it as follows:

p-value > significance level (α): Fail to reject H₀. This does not mean we are accepting H₀; we are simply saying there isn’t enough evidence against it.

p-value ≤ significance level (α): Reject H₀ in favor of H₁.

The p-value does not minimize Type II errors (β); it reflects the strength of evidence against H₀.

Decision Rules: Use the p-value to determine whether to reject or fail to reject H₀ based on the significance level.

It’s important to remember that these rules are based on the context of hypothesis testing and the chosen significance level, which balances the risk of Type I and Type II errors.

Power of a Test (1−β)

The probability of making a Type II error is:

β = P(fail to reject H₀ | H₁ is true)

The power of a test is the probability of correctly rejecting the null hypothesis (H₀) when the alternative hypothesis (H₁) is true. It’s calculated as 1−β.

Higher power means a lower probability of making a Type II Error, reflecting the test’s ability to detect true effects.

Power is influenced by factors such as sample size, effect size (Δ = μ₀ − μ₁, the difference between the null hypothesis value μ₀ and the true value μ₁ under H₁), and the chosen significance level (α).
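
Using the normal approximation, we can compute power directly instead of simulating it; the assumed true bias of p = 0.53 is again an illustrative choice:

```python
# Power of the two-sided coin test under the normal approximation,
# assuming the coin's true bias is p1 = 0.53 (illustrative).
import numpy as np
from scipy.stats import norm

n, alpha = 1_000, 0.05
p0, p1 = 0.5, 0.53                  # null value and assumed true value

se0 = np.sqrt(p0 * (1 - p0) / n)    # standard error under H0
se1 = np.sqrt(p1 * (1 - p1) / n)    # standard error under H1
z_crit = norm.ppf(1 - alpha / 2)    # two-sided critical z (~1.96)

# Acceptance region for the sample proportion under H0
low, high = p0 - z_crit * se0, p0 + z_crit * se0

# beta = probability the sample proportion lands in that region when H1 is true
beta = norm.cdf(high, loc=p1, scale=se1) - norm.cdf(low, loc=p1, scale=se1)
print(f"beta = {beta:.3f}, power = {1 - beta:.3f}")
```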

Confidence Intervals: An Intuitive Approach

Confidence intervals provide a practical, intuitive way to assess a hypothesis. Using the sample proportion and a z-score, we compute a lower and an upper bound; hypothesized values that fall within these bounds are consistent with the null hypothesis, while values outside them are not.

Suppose we have a fair coin, which means each flip has an equal probability of landing heads or tails (p = 0.5). This is our null hypothesis (H₀). We flip the coin 1000 times and observe 525 heads.

To determine whether this result is consistent with our hypothesis of a fair coin, we use the concept of confidence intervals. A confidence interval gives us a range of plausible values for the true proportion of heads, based on our observed data. This range helps us understand how much our observed result (525 heads out of 1000 flips) could vary due to random chance.

From the Central Limit Theorem (CLT) and the properties of the binomial distribution, we can approximate the distribution of the sample proportion of heads. The confidence interval is then calculated as:

Confidence Interval = Sample proportion ± Z × SE

Sample proportion: This is calculated as the ratio of the number of successes (e.g., heads) to the total number of trials. For example, if you flip a coin 1000 times and observe 525 heads, the sample proportion is:

Number of Heads/Total Flips = 525/1000 = 0.525

Z-Score: This represents the number of standard deviations a data point is from the mean of a distribution. For a 95% confidence level, the Z-score is approximately 1.96.

Standard Error (SE): Describes the variability of the sample proportion as an estimate of the population proportion. It is calculated using the formula:

SE = sqrt[sample proportion * (1 − sample proportion)/n]

where n is the sample size.
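
Putting these pieces together, here is the computation for our 525 heads (plain Python, with the usual 1.96 z-score for 95% confidence):

```python
# 95% confidence interval for the proportion of heads, given 525 heads
# in 1,000 flips, using the normal approximation from the text.
import math

k, n = 525, 1_000
z = 1.96                                  # z-score for 95% confidence

p_hat = k / n                             # sample proportion = 0.525
se = math.sqrt(p_hat * (1 - p_hat) / n)   # standard error
low, high = p_hat - z * se, p_hat + z * se

print(f"95% CI: [{low:.3f}, {high:.3f}]")  # roughly [0.494, 0.556]
if low <= 0.5 <= high:
    print("0.5 lies inside the interval: the fair-coin hypothesis is plausible.")
```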

We set a 95% confidence level, which means we want a range that would capture the true proportion of heads in 95% of repeated experiments. We calculate the bounds of this interval around our observed proportion of 0.525. If the hypothesized proportion of 0.5 falls within this range, we conclude that the coin is likely fair (we fail to reject H₀). However, if 0.5 falls outside this range, we might question the fairness of the coin and consider alternative explanations.

Equivalently, we can centre the interval on the hypothesized proportion of 0.5 and check where the observed proportion lands. If it falls within the shaded orange area, as it does in this plot, we do not reject the null hypothesis (H₀) that the coin is fair. If it falls outside the shaded area, the observed result would be unusual under the assumption of a fair coin, potentially leading us to question the coin's fairness.

We can conclude the following:

A wider confidence interval corresponds to a higher confidence level but less precision.

A narrower interval provides more precision but may reduce the confidence level.

Balancing the Metrics

  • Significance Level (α): It is normally set at 1% or 5%. A lower significance level reduces the chance of a Type I error but increases the chance of a Type II error, so choose an acceptable risk of a Type I error based on the context of your research. In high-stakes situations (like medical trials), a lower α (~1%) might be more appropriate.
  • p-value: It should be considered alongside the significance level and confidence intervals, because a p-value on its own can be misleading. A small p-value (at most α) leads to rejecting the null hypothesis.
  • Power of the test: Higher power (80% is typically considered acceptable) means a lower chance of committing a Type II error (β). To achieve a desired power, you may need to conduct a power analysis before collecting data, as shown in the sketch after this list. Increasing the sample size generally increases both power and confidence in the estimates, although with very large samples even trivially small effects can reach statistical significance, so interpret results in context.
  • Confidence Intervals: The width of the interval is influenced by the sample size and the variability in the data. Always report confidence intervals alongside p-values to give the results context; this helps in understanding the range of plausible values for the population parameter.
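
As a rough sketch of such a power analysis, the snippet below uses statsmodels to estimate how many flips we would need to detect a hypothetical bias of p = 0.53 with 80% power at α = 0.05 (NormalIndPower is a two-sample z-test helper, used here only as an approximation for the one-sample coin setting):

```python
# A sketch of a pre-study power analysis: sample size needed to detect
# a coin biased at p = 0.53 (an assumed effect) with 80% power.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.53, 0.50)   # Cohen's h for the two proportions
n_required = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"Required sample size: about {n_required:.0f} flips")
```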

For a deeper dive into these principles and detailed notes, swing by my GitHub repository.

This article is inspired by the book Data Science from Scratch by Joel Grus.


Written by Tanu Khanuja PhD

PhD in brain injury biomechanics now diving into data science. Freelance consultant passionate about machine learning and data analysis.
