
Maths behind Machine Learning: An introduction to gradient-based methods

Feb 19, 2023 · 12 min read


Introduction:

Machine Learning is changing our world by connecting Math and Computer Science. But when trying to learn more about it, it is very easy to get lost in the enormous amount of information available on the internet. Discover the math behind the magic, and gain a deeper understanding of how these methods can transform raw data into powerful predictions.

The derivative and its importance:

From calculus lessons, you might recall what the derivative is. For those who don’t have it very clear, the derivative represents the instantaneous rate of change of a function at a certain point. In two dimensions the sign of the derivative tells us whether a function is increasing or decreasing: if the derivative of a function at a point is positive, the function is increasing there, and if it is negative, it is decreasing. If it is zero it does not give conclusive information, but we can check whether the point is a local maximum or minimum with some additional calculations. Nonetheless, the key point here is that the derivative tells us the direction of maximum increase, an idea that extends naturally to functions of more variables.

Here we can see a plot where we graph the tangent line at every point of the function. Because the slope of the tangent line is the derivative, you can visualize the relation between the sign of the derivative and the growth of the function. Source: http://mirror2.polsri.ac.id/wiki/images/224/22431.gif.htm

We can use this to decide in which direction we should move if we want to reach the closest minimum of the function. This direction is the opposite of the derivative’s sign. Take a moment to look at the following graphs and think about it.

f is a function (in red), and f’ is its derivative function (in blue).

If the function is increasing from left to right (implying a positive derivative), the lowest value in that neighborhood is reached to the left, so it makes sense that we should move in the negative direction, which is the opposite of the sign of the derivative. The same logic applies for a negative derivative: when a function is decreasing from left to right, we find the smallest value to the right.

For example, let’s consider the same function as before: f(x) = x², whose derivative is f’(x) = 2x. Suppose that we start at x = 1, where f(1) = 1.

f(x) in red and the point (1, 1) in blue.

If we would like to get closer to the minimum at x = 0 we should move to the left. Following the logic mentioned before, we can easily see that f’(1) = 2, so we should move in the direction of -2, which is in fact to the left. But the size of the step that we take should be smaller than 2, so to handle this in practice we use what is called the “learning rate”.
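To make this concrete, here is a minimal Python sketch of the idea. The function, derivative and starting point are the ones from the example above; the learning rate of 0.1 and the number of steps are arbitrary choices for this illustration.

```python
# Minimal sketch: gradient descent on f(x) = x**2, whose derivative is f'(x) = 2*x.
def f_prime(x):
    return 2 * x

x = 1.0               # starting point from the example above
learning_rate = 0.1   # an arbitrary small step size for this illustration

for _ in range(50):
    # move in the direction opposite to the sign of the derivative
    x = x - learning_rate * f_prime(x)

print(x)  # very close to 0, the minimum of f(x) = x**2
```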

Why would we want to find minima?

Well, machine learning models basically try to find a function that can accurately map inputs to outputs based on the data that we have and its intrinsic patterns, so that in the future it can make decisions in unseen scenarios. We can accomplish this by choosing a function and telling the machine to find the coefficients for that function that make its error as small as possible.

As an illustration, imagine we have a collection of data points plotted on a graph, and we aim to determine the linear function that provides the closest match to the data. This problem is called linear regression. Let’s denote the function that we want to learn as f(x) = a • x + b; this will be our model.

Then, to predict a value for a given “x”, we simply evaluate f(x), and the output of that evaluation is our prediction.

Source: https://www.javatpoint.com/linear-regression-in-machine-learning

Suppose that we want to predict the total cost of our supermarket purchase depending on the number of products that we buy. Our x values could be the number of products that we are buying and our y values would be the total money that we are spending. Let’s say we have the following data: if we buy 1 product, we spend 2 dollars; with 2 products, we spend 4 dollars; with 3 products, we spend 5.5 dollars; and with 4 products, we spend 6 dollars. The plot of this would look something like this:

Here, the black line is the one that fits the points as well as possible and the red one is a random line. But what makes it better than the red one? How can we measure whether a function is good?

Loss function: Mean Squared Error

To measure how well our model is doing we will define the loss function, which varies depending on the optimization problem you want to solve. Moreover, the loss function is a key factor in the training of the model, as it is what guides us towards the best solution. For this particular problem we will use one of the most popular ones: the Mean Squared Error.

Given a data point (x, y) of our dataset, the prediction that our model makes is f(x) and the real value that we had is y. So if we would like to find by how far our model missed the real value, we can use the Euclidean distance formula:
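Written out for two generic points (x₁, y₁) and (x₂, y₂), the distance is:

d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}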

Note that we are considering the distance between the points: (x, y) and (x, f(x))

But because the x’s are already equal, that first term goes to 0 and we only consider the distance between the predicted y and the real one.

So we want to optimize the loss function to get the minimum possible error (measured as a distance) across all our data points. But it is the same optimizing over the distance or the squared distance, as the square root is an increasing function. So the point where the distance is the smallest is the same that has the smallest squared distance, thats why we can forget about the square root in the equation.

So we can check the error that we have at each estimation (x, f(x)) of the point (x,y) with the function:
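For our model f(x) = a • x + b, that per-point error is:

Error(x, y) = (y - f(x))^2 = (y - (a x + b))^2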

The best line to approximate all the points would be the one that has the smallest difference at each (x, y) pair. Luckily for us, this amounts to computing the squared error for each data point, adding all these errors up and dividing by the number of points: that is, calculating the mean of the squared errors.
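With n data points (x_i, y_i), the loss is therefore:

Loss(a, b) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - (a x_i + b) \right)^2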

In mathematical terms, our objective is to minimize this “Loss” function across all our (x, y) points. And this is what makes the lines different: the best one has a smaller Loss. In fact, the black one has a mean squared error of 0.1438 and the red one has 0.3275.

Even with the machines we have today, solving the enormous systems of equations that come up when we have lots of data can be impossible, as it may require an amount of memory that is not available. Models can also be too complex for their equations to be solved directly. That’s why we use Gradient Descent, an iterative algorithm that we can use to find the minimum of a function.

Gradient Descent

The main idea of this algorithm is like going down a mountain on a foggy day, when you cannot see the whole landscape. Imagine that you are dropped somewhere on a mountain and your objective is to descend to your house at the bottom. While taking a direct path may seem like a good strategy for quickly reaching the bottom, it can be dangerous and unpredictable. Instead, a more intelligent strategy would be to take small steps down, always following the steepest direction that you find.

Just like when descending a mountain, the algorithm begins at a certain point of the function and takes small steps in the direction of the steepest descent. We use this to find the combination of coefficients that minimizes the chosen loss function.

To achieve this we can use THE DERIVATIVE. What if, instead of thinking of the loss as a function of the variables x and y, we think of it as a function of a and b, representing all the possible coefficients that could result in the best-fitting line?

As a first step, we can pick a and b values at random, for example a = 0 and b = 1. Then, to find better coefficients, we can use the fact that the derivative tells us the direction of maximum increase, and if we put a minus sign in front of it, we get the direction of maximum decrease (this can be proven, and it is especially important for multivariate problems, but for now we will stick with the intuition). So if we differentiate the loss function with respect to a, we find the direction in which a should move to get a smaller error. But how big a step should we take? Well, the answer is: very small, as small as 0.001. The step size that we take is known as the learning rate, and it is a parameter that the data scientist picks depending on many factors, such as the algorithm being applied or the dataset being used.

So to update the slope “a” we can use the following formula:
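This is the standard gradient descent step, written here for a single data point (x, y): move a against the derivative of the error, scaled by the learning rate.

a_{i+1} = a_i - lr \cdot \frac{\partial\, Error}{\partial a_i}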

Here, Error = (y - (a • x + b))² is the function we need to differentiate.

Now, we calculate the partial derivative of the error function with respect to aᵢ.
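Treating x and y as constants, this works out to:

\frac{\partial\, Error}{\partial a_i} = 2\,(y - (a_i x + b)) \cdot (-x) = -2x\,(y - (a_i x + b))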

By using the chain rule we can derive this formula. First we use the power rule for the square and then we multiply by the derivative of the expression inside, which is -x.

And now, we can proceed to find the formula for learning our coefficients.

Substituting this derivative into the update rule, the minus sign in front of the learning rate cancels with the minus sign coming from the -x factor, and we get:
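For a single data point (x, y), the update becomes:

a_{i+1} = a_i + 2 \cdot lr \cdot x\,(y - (a_i x + b))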

In English, this means that the new “a” we will use is equal to the previous “a”, nudged a little in the direction that reduces the error.

And the same goes for “b”, the bias, whose partial derivative is very similar:
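The inner derivative is now -1 instead of -x, so:

\frac{\partial\, Error}{\partial b_i} = 2\,(y - (a x + b_i)) \cdot (-1) = -2\,(y - (a x + b_i))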

partial derivative of the error function with respect to bᵢ.

So we run this process through all our data many times; each full pass through all our data points is called an epoch.

Step-by-step example

Let’s make some calculations to become familiar with this algorithm:

Our data will be the set of points from before: { (1.00, 2.00), (2.00, 4.00), (3.00, 5.50), (4.00, 6.00) }

We see that we cannot draw a line that goes through all the points, but we can find the one that approximates them the best.

Firstly, we initialize the a and b values. We will set a = 1 and b = 0.5, but this choice is completely arbitrary. We can easily check that the loss of this model is 2.1875.
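Indeed, with f(x) = 1 • x + 0.5 the predictions for x = 1, 2, 3, 4 are 1.5, 2.5, 3.5 and 4.5, so:

Loss = \frac{(2 - 1.5)^2 + (4 - 2.5)^2 + (5.5 - 3.5)^2 + (6 - 4.5)^2}{4} = \frac{0.25 + 2.25 + 4 + 2.25}{4} = 2.1875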

Graph of the data points and the initial model, f(x) = 1 * x + 0.5.

Now, we must update the coefficients to make better predictions. For this example we will use a learning rate (lr) of 0.01 to reach a good solution more quickly.

Following the previous equations we will now train our model for 2 epochs and see the results (some numbers will be rounded for the sake of simplicity, but I kept the majority of the true values so you can see the precision that machines handle).

Let’s recall the formulas for updating the coefficients:
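As derived above, applied to one data point (x, y) at a time, they are:

a_{i+1} = a_i + 2 \cdot lr \cdot x\,(y - (a_i x + b_i))

b_{i+1} = b_i + 2 \cdot lr \cdot (y - (a_i x + b_i))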

Epoch 1:
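The exact intermediate values depend on the order in which the points are visited; assuming they are processed in the order listed above (starting from a = 1, b = 0.5, with lr = 0.01), the four updates of this epoch work out to roughly:

Point (1, 2): prediction 1.5, error 0.5, so a becomes 1.01 and b becomes 0.51.

Point (2, 4): prediction 2.53, error 1.47, so a becomes about 1.0688 and b about 0.5394.

Point (3, 5.5): prediction about 3.746, error about 1.754, so a becomes about 1.1741 and b about 0.5745.

Point (4, 6): prediction about 5.271, error about 0.729, so a becomes about 1.2324 and b about 0.5891.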

With just 1 epoch we reduced the mean squared error in our predictions from 2.1875 to 0.6579!

The model that we get after epoch 1.

Let’s continue with this process

Epoch 2:

And now we have reached a loss of 0.2997, while the best possible is 0.1438. Here is a graph of our new model:

With just 2 more epochs we would get a solution with a loss of 0.1619. And if we continue with this process for a total of 10 epochs we would get a solution that achieves a mean squared error of 0.1587. Let’s visualize this:

The black line is the best one possible (0.1438 loss). The red one is the one that we got after the first epoch (0.6579 loss). The green one is the one that we get after the second epoch (0.2997 loss). Finally, the violet one is what we get after iterating through 10 epochs (0.1587 loss).

Now, let’s plot the progression of the Mean Squared Error (MSE) that we got step after step (a step being each time we updated our coefficients):

There are 40 steps because we trained for 10 epochs with 4 data points, so for each epoch we took 4 steps. Here we used a learning rate of 0.01 and reach a loss of 0.1587.
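As a minimal Python sketch of this run (a simplified version of the Colab linked at the end, assuming the coefficients are updated one data point at a time as in the formulas above), which should end with a loss close to the 0.1587 reported here, up to rounding:

```python
# Sketch of the training run above: linear model f(x) = a*x + b,
# per-point gradient descent updates on the mean squared error.

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 5.5), (4.0, 6.0)]

a, b = 1.0, 0.5   # initial coefficients from the example
lr = 0.01         # learning rate
epochs = 10

for epoch in range(epochs):
    for x, y in data:                # one pass over the data = one epoch (4 steps)
        error = y - (a * x + b)      # real value minus prediction
        a += 2 * lr * x * error      # update rules derived earlier
        b += 2 * lr * error

mse = sum((y - (a * x + b)) ** 2 for x, y in data) / len(data)
print(f"a = {a:.4f}, b = {b:.4f}, MSE = {mse:.4f}")
```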

But what happens if we choose a smaller learning rate?

Training for 10 epochs with a learning rate of 0.003. Here we reach a loss of 0.1935

But if we continue for more epochs:

Training for 800 epochs at a learning rate of 0.003. Here we reach, at epoch 731, the best solution possible (loss 0.1438). The coefficients of the best solution found are: a = 1.346527354299218 and b = 0.998589496087786.

If we train with a higher learning rate, we cannot get as close to the best solution because we would be taking big steps when updating our coefficients and potentially “skipping” the good solutions. It is like driving a car at high speed without being able to turn the steering wheel. You will be going fast, but you won’t be able to make any adjustments to stay on the right path.

Training for 800 epochs at a learning rate of 0.01. We get a loss of 0.14515801 with the parameters: a = 1.3387112256098237 and b = 0.9928847516872763.

Results

The best model that we found tells us that if we were to buy 7 products we would spend about 10.42 dollars, and if we were to buy 8, about 11.77.
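Using the coefficients found at epoch 731 (a ≈ 1.3465, b ≈ 0.9986):

f(7) = 1.3465 \cdot 7 + 0.9986 \approx 10.42

f(8) = 1.3465 \cdot 8 + 0.9986 \approx 11.77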

Bonus

I invite you to check the following Google Colab, where I did a simple implementation of what we went through. Here you can play with the different parameters such as the data points, the learning rate, the number of epochs and the initial values: https://colab.research.google.com/drive/1f5eBd6xi3y6hAmtow8hyniTQz8HSHAU1?usp=sharing

Conclusion

In this post we have learned the intuition behind the math (in its two-dimensional version) that is used to train most machine learning algorithms, including neural networks. The key concept we have acquired here is how the coefficients are updated with the gradient descent algorithm. We have also gone through a step-by-step process using a loss/cost function called the Mean Squared Error and visualized the results of this algorithm.

Although gradient descent may be computationally intensive and may require some manual hyperparameter tuning (such as choosing the learning rate, the number of epochs, etc.), it is a reliable and versatile algorithm that can be used in a wide range of applications. With the Python code we can see how easy it can be for a machine to do these calculations, while for us humans it may take much longer.

In one of my next posts I will expand this lesson to its multivariate form, which will enable us to make estimations with more complex models involving relationships between more than two variables, such as predicting a patient’s diagnosis based on lab results.
