Gradient Descent Simplified: How Machines Learn from Data

Tanu Khanuja PhD
9 min read · Feb 12, 2025

In this blog, we’ll break down the core concepts of Gradient Descent, from optimizing models to reducing errors, making it easier for anyone, even beginners, to understand how machines learn over time. Whether you’re just starting out or looking to sharpen your skills, this guide will give you the foundational knowledge to better understand the power of Gradient Descent.

Why Do We Need Gradient Descent?

At its core, gradient descent is an optimization algorithm. But what does that mean? In Machine Learning, optimization is the process of improving a model so that it performs as well as possible on a given task. This usually involves minimizing errors or maximizing the likelihood of correct predictions.

For example, imagine you’re building a model to predict house prices based on features like size, location, and number of bedrooms. Your model will make predictions, but they won’t be perfect at first. The difference between the predicted price and the actual price is the error. The goal of optimization is to tweak the model’s parameters so that this error is as small as possible.

This is where gradient descent comes in. It’s the algorithm that helps us find the best set of parameters for our model by minimizing the error.

What Is a Model?

Before we go further, let’s clarify what we mean by a model. In Machine Learning, a model is a mathematical representation of the relationship between input variables (features) and output variables (predictions). For example, in a simple linear regression model, the relationship between input x and output y can be expressed as:

y = wx + b

Here,

  • w is the weight (or slope),
  • b is the bias (or intercept).

The goal is to find the values of w and b that best fit the data.
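As a minimal sketch, the model can be written as a one-line Python function. The name predict and the parameter values below are illustrative assumptions, not fitted values.

```python
def predict(x, w, b):
    """Return the linear model's prediction y = w*x + b for one input x."""
    return w * x + b


# Illustrative parameter values (assumed, not fitted to any data).
w, b = 2.0, 0.5
print(predict(3.0, w, b))  # 2.0 * 3.0 + 0.5 = 6.5
```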

A Simple Example

Let’s make this concrete with a simple dataset:

If we plot these points, we’ll see that they don’t form a perfect straight line.

However, for simplicity, let’s assume a linear relationship between x and y. Our goal is to find the best values for w and b so that the line y = wx + b fits the data as closely as possible.

Calculating Errors

To measure how well our model fits the data, we need to calculate the error for each data point. The error is simply the difference between the actual value y and the predicted value y_pred.

Let’s assume initial values for w and b. For example, start with w = 1 and b = 1. Using these values, we can calculate the predicted y values (using wx + b) and the corresponding errors (y - y_pred):
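The same calculation can be sketched in a few lines of Python. The five (x, y) pairs here are made up for illustration and are an assumption, not the dataset used in this post; only the starting values w = 1 and b = 1 come from the text.

```python
# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [3, 4, 7, 8, 11]

w, b = 1.0, 1.0  # initial guesses from the text

# Prediction and error (actual minus predicted) for every data point.
y_pred = [w * x + b for x in xs]
errors = [y - yp for y, yp in zip(ys, y_pred)]

print(y_pred)  # [2.0, 3.0, 4.0, 5.0, 6.0]
print(errors)  # [1.0, 1.0, 3.0, 3.0, 5.0]
```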

Mean Squared Error (MSE)

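If we simply take the average of these errors, the positive and negative errors might cancel each other out. To avoid this, we use the Mean Squared Error (MSE), which squares the errors before averaging them. The formula for MSE is:

MSE = (1/n) Σ (y_i - y_pred_i)²

Here n is the number of data points, y_pred_i = w x_i + b, and the sum runs over all data points.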

Plugging in our errors:

Our goal is to minimize this MSE value by finding the optimal values for w and b.
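A minimal sketch of the MSE calculation follows. The function name mse and the data points are illustrative assumptions, not the values from this post.

```python
def mse(xs, ys, w, b):
    """Mean squared error of the line y = w*x + b on the data (xs, ys)."""
    n = len(xs)
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / n


# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [3, 4, 7, 8, 11]

print(mse(xs, ys, w=1.0, b=1.0))  # (1 + 1 + 9 + 9 + 25) / 5 = 9.0
```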

Gradient Descent: The Optimization Process

To minimize the MSE, we need to understand how the error changes as we adjust w and b. This is where gradients come in. The gradient of the error function with respect to w and b tells us the direction in which we should adjust these parameters to reduce the error.

Calculating Gradients

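Taking the partial derivative of the MSE with respect to w gives the gradient for the weight:

∂MSE/∂w = -(2/n) Σ x_i (y_i - y_pred_i)

Similarly, the gradient of the MSE with respect to b is:

∂MSE/∂b = -(2/n) Σ (y_i - y_pred_i)

Here n is the number of data points, y_pred_i = w x_i + b, and each sum runs over all data points.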

Updating Parameters

These gradients tell us which direction to move the weight (w) and bias (b) to reduce the error. A negative gradient means that increasing the parameter lowers the error, so we increase the parameter to move toward the minimum. A positive gradient means that increasing the parameter raises the error, so we decrease it. The gradients alone don’t fix how far we should move in that direction, and that’s where the learning rate plays a crucial role.
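To make the sign rule concrete, here is a small sketch that evaluates both partial derivatives of the MSE, -(2/n) Σ x_i (y_i - y_pred_i) for w and -(2/n) Σ (y_i - y_pred_i) for b. The function name gradients and the data are assumptions for illustration, not the dataset from this post.

```python
def gradients(xs, ys, w, b):
    """Partial derivatives of the MSE with respect to w and b."""
    n = len(xs)
    errors = [y - (w * x + b) for x, y in zip(xs, ys)]
    grad_w = -2.0 / n * sum(x * e for x, e in zip(xs, errors))
    grad_b = -2.0 / n * sum(errors)
    return grad_w, grad_b


# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [3, 4, 7, 8, 11]

grad_w, grad_b = gradients(xs, ys, w=1.0, b=1.0)
print(grad_w, grad_b)  # roughly -19.6 and -5.2
```

Both gradients are negative here, so from this starting point a small increase in either parameter lowers the MSE.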

Using these gradients, we can update w and b iteratively using the following rules:

w_new = w - η · ∂MSE/∂w
b_new = b - η · ∂MSE/∂b

Here, η is the learning rate, which controls how big of a step we take in each iteration.
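A single update can be sketched as follows. The function name gd_step, the made-up data, and the learning rate of 0.05 are all assumptions chosen for illustration.

```python
def gd_step(xs, ys, w, b, lr):
    """One gradient descent update of w and b for the MSE loss."""
    n = len(xs)
    errors = [y - (w * x + b) for x, y in zip(xs, ys)]
    grad_w = -2.0 / n * sum(x * e for x, e in zip(xs, errors))
    grad_b = -2.0 / n * sum(errors)
    # Move against the gradient, scaled by the learning rate.
    return w - lr * grad_w, b - lr * grad_b


# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [3, 4, 7, 8, 11]

w, b = gd_step(xs, ys, w=1.0, b=1.0, lr=0.05)
print(w, b)  # roughly 1.98 and 1.26
```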

Learning Rate

The learning rate determines the size of the steps we take during each update. Without a learning rate, we might take steps that are too large or too small:

  • A higher learning rate means larger steps, which can lead to faster convergence. However, if the learning rate is too large, it risks overshooting the minimum or oscillating around it.
  • A lower learning rate means smaller steps, which can lead to more precise convergence. However, it might take a long time to reach the minimum. On non-convex error surfaces (common in neural networks), very small steps can also leave the algorithm stuck in a poor local minimum, whereas a somewhat larger rate can help it move past such points. The MSE of a linear regression model is convex, so it has a single global minimum and local minima are not a concern in our example.

The learning rate ensures that we take steps of an appropriate size, balancing speed and stability.
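The trade-off is easy to see by running the same number of steps with different learning rates. This is a rough sketch on made-up data; the three rates, the step count, and the helper name final_mse are assumptions for illustration.

```python
def final_mse(lr, steps=50):
    """Run gradient descent from w = 1, b = 1 and return the final MSE."""
    xs = [1, 2, 3, 4, 5]   # made-up data for illustration only
    ys = [3, 4, 7, 8, 11]
    n = len(xs)
    w, b = 1.0, 1.0
    for _ in range(steps):
        errors = [y - (w * x + b) for x, y in zip(xs, ys)]
        grad_w = -2.0 / n * sum(x * e for x, e in zip(xs, errors))
        grad_b = -2.0 / n * sum(errors)
        w -= lr * grad_w
        b -= lr * grad_b
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / n


results = {lr: final_mse(lr) for lr in (0.001, 0.05, 0.1)}
for lr, value in results.items():
    print(lr, value)
```

On this made-up data the middle rate ends up closest to the minimum, the smallest rate is still well short of it after 50 steps, and the largest rate overshoots so badly that the MSE grows instead of shrinking. Which rate counts as too large depends on the data.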

Applying Gradient Descent

Let’s apply this to our example. Starting with w=1 and b=1, and using a learning rate η = 0.1, we can compute the gradients and update w and b.

Iteration 1

  • Compute the gradients.

Gradient wrt w:

Gradient wrt b:

  • Update w and b.

Iteration 2

After updating the values of w and b, we repeat the process. With each iteration, the values of w and b will gradually move closer to the optimal values that minimize the MSE. This process continues until the values converge to the best possible values for the model. And that’s how gradient descent works.
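Putting the pieces together, the whole loop can be sketched like this. The data is made up, and the learning rate of 0.05 and 1000 epochs are assumptions chosen so that this particular example converges.

```python
def gradient_descent(xs, ys, w=1.0, b=1.0, lr=0.05, epochs=1000):
    """Repeat the gradient descent update for a fixed number of epochs."""
    n = len(xs)
    for _ in range(epochs):
        errors = [y - (w * x + b) for x, y in zip(xs, ys)]
        grad_w = -2.0 / n * sum(x * e for x, e in zip(xs, errors))
        grad_b = -2.0 / n * sum(errors)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [3, 4, 7, 8, 11]

w, b = gradient_descent(xs, ys)
print(round(w, 3), round(b, 3))  # 2.0 0.6
```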

Linear Regression Optimization with Sklearn

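Using the LinearRegression model from scikit-learn, I fitted our sample data to find the best values for the weight (w) and bias (b). Note that LinearRegression computes the least-squares solution directly rather than by gradient descent, so it gives the values that gradient descent should converge toward. After fitting the data to the model, the optimized parameter values were: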

  • w = 1.96
  • b = 0.499

Our linear model is optimized to:

y = 1.96 ⋅ x + 0.499

These optimized parameters result in the following predicted values for y:

y_pred = [2.46,4.42,6.38,8.34,10.3]
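For context, LinearRegression fits the line by ordinary least squares, and for a single feature that solution has a simple closed form. The sketch below computes it in plain Python on made-up data (an assumption for illustration, so the numbers differ from the w = 1.96 and b = 0.499 reported above). The function name least_squares_line is hypothetical.

```python
def least_squares_line(xs, ys):
    """Closed-form least-squares slope and intercept for one feature."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    w = s_xy / s_xx
    b = y_mean - w * x_mean
    return w, b


# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [3, 4, 7, 8, 11]

w, b = least_squares_line(xs, ys)
print(round(w, 3), round(b, 3))  # 2.0 0.6
```

With scikit-learn installed, fitting LinearRegression on a 2-D feature array exposes the same two quantities as coef_ and intercept_.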

To visualize how the model fits the data, I created a plot of the original y values and the predicted values.

  • The predicted values (red line) closely follow the trend of the actual data (blue dots), indicating that the linear regression model is capturing the general pattern of the data well.
  • However, there is a small difference between the actual values and the predicted values. This is expected in regression models because the model doesn’t perfectly fit every data point, especially if there is some natural variability or noise in the data.
  • The larger the difference between the actual and predicted values (known as the error), the less accurate the model is for those points.

Optimizing MSE with Each Iteration

To visualize the optimization process, I also plotted the Mean Squared Error (MSE) at each iteration (epoch) during the gradient descent steps. By running gradient descent on the linear regression model for 100 epochs, we can observe how the MSE decreases as the model iteratively updates the weight (w) and bias (b). In the plot, you can see that the MSE starts relatively high and gradually decreases, indicating that the model is learning and improving over time. As the epochs increase, the MSE approaches its minimum value, showcasing the model’s convergence to an optimal solution where the error is minimized.
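A curve like this can be produced by recording the MSE before every update. The sketch below does that for 100 epochs on made-up data with an assumed learning rate of 0.05; the list name history is hypothetical, and plotting is left out so the example needs no extra libraries.

```python
# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [3, 4, 7, 8, 11]
n = len(xs)

w, b, lr = 1.0, 1.0, 0.05
history = []  # MSE measured at the start of each epoch

for epoch in range(100):
    errors = [y - (w * x + b) for x, y in zip(xs, ys)]
    history.append(sum(e * e for e in errors) / n)
    grad_w = -2.0 / n * sum(x * e for x, e in zip(xs, errors))
    grad_b = -2.0 / n * sum(errors)
    w -= lr * grad_w
    b -= lr * grad_b

# Print every 20th value to see the downward trend.
for epoch in range(0, 100, 20):
    print(epoch, round(history[epoch], 4))
```

The recorded values start at 9.0 and fall toward the smallest MSE this made-up data allows, which is about 0.24.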

Other Optimization Methods

While Gradient Descent (GD) is popular, there are several other optimization methods that might work better in certain situations. Let’s take a look at a few alternatives:

1. Stochastic Gradient Descent (SGD)

Instead of using the whole dataset, SGD uses just one data point to update the model at a time. This makes it faster but can cause noisy updates.
When to use: Great for large datasets or real-time learning.
When not to use: Can be less stable because updates are based on just one data point.
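A minimal sketch of SGD on the same kind of linear model, updating after every single data point. The data, the fixed random seed, the learning rate of 0.01, and the 200 epochs are assumptions for illustration.

```python
import random

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [3, 4, 7, 8, 11]

rng = random.Random(0)  # fixed seed so the run is repeatable
w, b, lr = 1.0, 1.0, 0.01
indices = list(range(len(xs)))

for epoch in range(200):
    rng.shuffle(indices)  # visit the points in a new random order
    for i in indices:
        error = ys[i] - (w * xs[i] + b)
        # Gradient of this one point's squared error, not the full MSE.
        w += lr * 2.0 * xs[i] * error
        b += lr * 2.0 * error

final_mse = sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(w, b, final_mse)
```

Because each step sees only one point, w and b keep jittering around the best-fit values instead of settling exactly on them.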

2. Mini-Batch Gradient Descent

This method updates the model using a small group of data points, balancing speed and stability.
When to use: Often used in deep learning, especially with large datasets.
When not to use: If the batches are too small, it may still be noisy; if too large, it may lose speed.
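The sketch below shows the mini-batch variant, averaging the gradient over two points at a time. The data, the batch size of 2, the learning rate of 0.02, and the fixed seed are assumptions for illustration.

```python
import random

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [3, 4, 7, 8, 11]

rng = random.Random(0)  # fixed seed so the run is repeatable
w, b, lr, batch_size = 1.0, 1.0, 0.02, 2
indices = list(range(len(xs)))

for epoch in range(200):
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        batch = indices[start:start + batch_size]  # last batch may be smaller
        errors = [ys[i] - (w * xs[i] + b) for i in batch]
        # Average the gradient over the points in this batch only.
        grad_w = -2.0 / len(batch) * sum(xs[i] * e for i, e in zip(batch, errors))
        grad_b = -2.0 / len(batch) * sum(errors)
        w -= lr * grad_w
        b -= lr * grad_b

final_mse = sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(w, b, final_mse)
```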

3. RMSprop

RMSprop scales the learning rate for each parameter by a moving average of recent squared gradients, which makes it more stable than basic GD.
When to use: Best for problems with sequential data like time series or speech recognition.
When not to use: Needs careful tuning, or it can converge too fast or too slowly.
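A rough sketch of the RMSprop update applied to our linear model. The data is made up, and the learning rate of 0.01, decay of 0.9, and 2000 steps are assumed settings for illustration, not tuned recommendations.

```python
import math

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [3, 4, 7, 8, 11]
n = len(xs)

w, b = 1.0, 1.0
lr, decay, eps = 0.01, 0.9, 1e-8
avg_sq_w = avg_sq_b = 0.0  # moving averages of squared gradients

for step in range(2000):
    errors = [y - (w * x + b) for x, y in zip(xs, ys)]
    grad_w = -2.0 / n * sum(x * e for x, e in zip(xs, errors))
    grad_b = -2.0 / n * sum(errors)
    avg_sq_w = decay * avg_sq_w + (1 - decay) * grad_w ** 2
    avg_sq_b = decay * avg_sq_b + (1 - decay) * grad_b ** 2
    # Each parameter's step is divided by the root of its own average.
    w -= lr * grad_w / (math.sqrt(avg_sq_w) + eps)
    b -= lr * grad_b / (math.sqrt(avg_sq_b) + eps)

final_mse = sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / n
print(w, b, final_mse)
```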

4. Adam (Adaptive Moment Estimation)

Adam combines two techniques (Momentum and RMSprop) to adjust the learning rate for each parameter, helping the model learn faster and more effectively.
When to use: Works well for deep learning models and complex datasets.
When not to use: Can overfit or settle into a less-than-optimal solution if not tuned correctly.
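Here is a minimal sketch of the Adam update on the same kind of linear model, with a momentum term (m) and a squared-gradient term (v) per parameter. The data is made up; the learning rate of 0.05 and 2000 steps are assumptions, while 0.9 and 0.999 are the commonly used default decay values.

```python
import math

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [3, 4, 7, 8, 11]
n = len(xs)

params = [1.0, 1.0]  # [w, b]
m = [0.0, 0.0]       # moving average of gradients (momentum part)
v = [0.0, 0.0]       # moving average of squared gradients (RMSprop part)
lr, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8

for t in range(1, 2001):
    w, b = params
    errors = [y - (w * x + b) for x, y in zip(xs, ys)]
    grads = [-2.0 / n * sum(x * e for x, e in zip(xs, errors)),
             -2.0 / n * sum(errors)]
    for i in range(2):
        m[i] = beta1 * m[i] + (1 - beta1) * grads[i]
        v[i] = beta2 * v[i] + (1 - beta2) * grads[i] ** 2
        m_hat = m[i] / (1 - beta1 ** t)  # bias correction for early steps
        v_hat = v[i] / (1 - beta2 ** t)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)

w, b = params
final_mse = sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / n
print(w, b, final_mse)
```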

5. Adagrad

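Adagrad divides each parameter’s learning rate by the square root of its accumulated squared gradients, so frequently updated parameters take smaller steps while rarely updated ones keep larger steps.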
When to use: Useful for problems with sparse data (where some features are rarely used).
When not to use: The learning rate can decrease too quickly, limiting further improvement.
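A small sketch of the Adagrad update on our linear model follows. The data is made up, and the base learning rate of 0.5 and 2000 steps are assumptions for illustration.

```python
import math

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [3, 4, 7, 8, 11]
n = len(xs)

w, b = 1.0, 1.0
lr, eps = 0.5, 1e-8
sum_sq_w = sum_sq_b = 0.0  # running totals of squared gradients

for step in range(2000):
    errors = [y - (w * x + b) for x, y in zip(xs, ys)]
    grad_w = -2.0 / n * sum(x * e for x, e in zip(xs, errors))
    grad_b = -2.0 / n * sum(errors)
    sum_sq_w += grad_w ** 2
    sum_sq_b += grad_b ** 2
    # The totals never shrink, so the effective step size only decreases.
    w -= lr * grad_w / (math.sqrt(sum_sq_w) + eps)
    b -= lr * grad_b / (math.sqrt(sum_sq_b) + eps)

final_mse = sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / n
print(w, b, final_mse)
```

Because the totals only grow, the very first step is the largest one, which is the behavior behind the "decreases too quickly" caveat above.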

Conclusion

In this blog, we learned about Gradient Descent, a key technique used in Machine Learning to help models improve over time. It’s like a step-by-step way to find the best settings for a model so that it can make more accurate predictions.

We used Linear Regression as an example to show how Gradient Descent works. We saw that by adjusting the model’s parameters (like weight and bias) over multiple steps, the model can get better at predicting the correct values. We also plotted how the model’s error (called Mean Squared Error or MSE) decreases with each step, showing how the model learns and improves.

In the end, Gradient Descent is important because it helps models learn from data and make predictions that are as accurate as possible. By playing with things like the learning rate and number of iterations (epochs), you can make your models better and faster.

Final Thoughts

Gradient Descent is more than just a mathematical tool — it’s a fundamental concept that bridges theory and practice in machine learning. By understanding how it works, you gain insight into how models learn and improve over time. And while gradient descent is a powerful starting point, exploring other optimization methods can help you tackle more complex problems and achieve better results.

As you continue your machine learning journey, remember that optimization is an iterative process, much like the algorithms themselves. Keep experimenting, keep learning, and don’t be afraid to try new approaches. After all, the goal is not just to minimize errors but to build models that truly understand and generalize from data.

So, the next time you train a model, take a moment to appreciate the role of gradient descent and the optimization process. It’s the unsung hero behind every successful machine learning algorithm.

For a deeper dive into these principles and detailed notes, swing by my GitHub repository.

If you found this blog helpful, feel free to share it with others who are diving into the world of machine learning. And don’t forget to follow me for more insights into AI, data science, and optimization techniques!

Written by Tanu Khanuja PhD

PhD in brain injury biomechanics now diving into data science. Freelance consultant passionate about machine learning and data analysis.
