Optimizing Your Neural Networks

Last week I posted an article about how to build simple neural networks, specifically multi-layer perceptrons. This article will dive deeper into the specifics of neural networks to discuss how we can maximize the performance of a neural network by tweaking its configurations.

How Long to Train Your Model For

When training a model, you might think that if you train your model enough, the model will become flawless. This may be true, but that only holds for the dataset it was trained on. In fact, if you give it another set of data where the values are different, the model could output completely incorrect predictions.

To understand this further, let's say you were practicing every single day for your driver's exam by driving in a straight line without moving the wheel. (Please don't do this.) While you would probably perform very well on the drag strip, if you were told to make a left turn on the actual exam, you might end up turning into a STOP sign instead.

This phenomenon is called overfitting. Your model can learn all the aspects and patterns of the data it's trained on, but if it fits the training dataset too closely, it will perform poorly when given a new dataset. At the same time, if you don't train your model enough, it won't learn the underlying patterns well enough to make good predictions on any dataset, including the one it was trained on. In this case, you would be underfitting.

An example of overfitting. The validation loss, represented by the orange line, is gradually increasing while the training loss, represented by the blue line, is decreasing.

In the example above, a great point to stop training your model would be right when the validation loss reaches its minimum. It's possible to do this with early stopping, which stops training once the validation loss has not improved for a chosen number of training cycles (epochs).
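To make the stopping rule concrete, here is a minimal sketch in plain Python. The function name `early_stopping`, the `patience` parameter, and the list of validation losses are all made up for illustration; in a real training loop the losses would be computed one epoch at a time rather than known in advance.

```python
def early_stopping(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember it and reset the counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch

    # Ran out of epochs without triggering the stopping rule.
    return len(val_losses) - 1, best_epoch


# Hypothetical validation losses: they fall, bottom out at epoch 3, then rise.
val_losses = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55, 0.70]
stop_epoch, best_epoch = early_stopping(val_losses, patience=3)
print(stop_epoch, best_epoch)  # 6 3
```

In practice you would also save the model's weights at `best_epoch` so that you can restore them after stopping.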

Training your model is all about finding a balance between overfitting and underfitting while utilizing early stopping if necessary. That's why your training dataset should be as representative as possible of your overall population so that your model can more accurately make predictions on data it hasn't seen.

Loss Functions

Perhaps one of the most important training configurations that can be tweaked is the loss function, which measures the "inaccuracy" between your model's predictions and their actual values. This "inaccuracy" can be represented mathematically in many different ways, one of the most common being mean squared error (MSE):

$$\text{MSE} = \frac{\sum_{i=1}^n (\bar{y_i} - y_i)^2}{n}$$

where $\bar{y_i}$ is the model's prediction and $y_i$ is the true value. There's a similar variant called mean absolute error (MAE):

$$\text{MAE} = \frac{\sum_{i=1}^n |\bar{y_i} - y_i|}{n}$$
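Both formulas translate almost directly into code. The sketch below uses plain Python, and the prediction and target values are invented purely for illustration.

```python
def mse(predictions, targets):
    """Mean squared error: the average of the squared differences."""
    n = len(targets)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n


def mae(predictions, targets):
    """Mean absolute error: the average of the absolute differences."""
    n = len(targets)
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / n


# Made-up example values: the errors are 1, 0, and 3.
predictions = [2.0, 4.0, 7.0]
targets = [1.0, 4.0, 4.0]

print(mse(predictions, targets))  # (1 + 0 + 9) / 3
print(mae(predictions, targets))  # (1 + 0 + 3) / 3
```

Notice that the single error of 3 contributes 9 to the MSE sum but only 3 to the MAE sum.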

What's the difference between these two and which one is better? The real answer is that it depends on a variety of factors. Let's consider a simple 2-dimensional linear regression example.

In many cases, there can be data points that act as outliers, points that are far away from other data points. In terms of linear regression, this means that there are a few points on the $xy$-plane that are far away from the rest of them. If you remember from your statistics classes, it's points like these that can significantly affect the linear regression line that's calculated.

A simple graph with points on (1, 1), (2, 2), (3, 3), and (4, 4)

If you wanted to think of a line that could cross all four points, then $y = x$ would be a great choice because this line would go through all the points.

A simple graph with points on (1, 1), (2, 2), (3, 3), and (4, 4) and the line $y = x$ going through it

However, let's say I decide to add another point at $(5, 1)$. Now what should the regression line be? Well, it turns out that it's completely different: $y = 0.2x + 1.6$.

A simple graph with points on (1, 1), (2, 2), (3, 3), (4, 4), and (5,1) with a linear regression line going through it.

Based on the first four points alone, we would expect $y$ to be 5 when $x = 5$. But because the regression line minimizes MSE, and MSE squares each error, the outlier's large error dominates the loss and the line is "pulled downwards" significantly to reduce it.
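If you'd like to check these two lines yourself, here is a small sketch that uses the standard closed-form least-squares formulas for a line. The helper name `fit_line` is my own and doesn't come from any library.

```python
def fit_line(points):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


original_points = [(1, 1), (2, 2), (3, 3), (4, 4)]
with_outlier = original_points + [(5, 1)]

print(fit_line(original_points))  # slope 1, intercept 0
print(fit_line(with_outlier))     # slope about 0.2, intercept about 1.6
```

A single extra point is enough to change the slope from 1 to roughly 0.2.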

This is just a simple example, but this poses a question that you, as a machine learning developer, need to stop and think about: How sensitive should my model be to outliers? If you want your model to be more sensitive to outliers, then you would choose a metric like MSE, because in that case, errors involving outliers are more pronounced due to the squaring and your model will adjust itself to minimize that. Otherwise, you would choose a metric like MAE, which doesn't care as much about outliers.
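As a rough illustration of that trade-off, the sketch below scores two candidate lines on the five points from the example: $y = x$, which ignores the outlier, and $y = 0.2x + 1.6$, which is pulled towards it. The helper names are my own.

```python
def mse(predictions, targets):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def mae(predictions, targets):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


xs = [1, 2, 3, 4, 5]
ys = [1, 2, 3, 4, 1]  # the last point is the outlier

ignores_outlier = [x for x in xs]           # predictions from y = x
pulled_line = [0.2 * x + 1.6 for x in xs]   # predictions from y = 0.2x + 1.6

print("MSE:", mse(ignores_outlier, ys), mse(pulled_line, ys))
print("MAE:", mae(ignores_outlier, ys), mae(pulled_line, ys))
```

Under MSE the pulled line scores better (about 1.28 versus 3.2), while under MAE the line $y = x$ scores better (0.8 versus about 0.96). The same data favors a different line depending on the loss you choose.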

Optimizers

In my previous post, I also discussed the concepts of backpropagation and gradient descent, and how they work together to minimize the loss of the model. The gradient is a vector that points in the direction in which the loss increases the fastest. A gradient descent algorithm will calculate this vector and take a step in the exact opposite direction, repeating this until it eventually reaches a minimum.

Most optimizers have a learning rate, commonly denoted as $\alpha$. Essentially, this represents how far the algorithm will move towards the minimum each time it calculates the gradient. Be careful of setting your learning rate to be too large! Your algorithm may never reach the minimum because its steps are so large that it repeatedly skips over it.

[TensorFlow's neural network playground](https://playground.tensorflow.org) showing what can happen if you set the learning rate to be too large. Notice how the testing and training loss are both `NaN`.
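You can reproduce this effect on a much smaller problem. The sketch below runs gradient descent on the toy function $f(w) = w^2$, which I picked only because its gradient, $2w$, is easy to write down; the function name and the two learning rates are likewise just illustrative choices.

```python
def gradient_descent(learning_rate, steps=20, start=1.0):
    """Minimize f(w) = w**2 and return every value of w that was visited."""
    w = start
    history = [w]
    for _ in range(steps):
        gradient = 2 * w                  # derivative of w**2
        w = w - learning_rate * gradient  # step against the gradient
        history.append(w)
    return history


small = gradient_descent(learning_rate=0.1)
large = gradient_descent(learning_rate=1.1)

print(abs(small[-1]))  # shrinks towards the minimum at w = 0
print(abs(large[-1]))  # grows with every step instead
```

With a learning rate of 0.1, each step multiplies $w$ by 0.8, so it steadily approaches 0. With a learning rate of 1.1, each step multiplies $w$ by -1.2, so it jumps back and forth over the minimum and gets further away each time. Left running long enough, values like these can overflow, which is one way a loss can end up as `NaN`.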

Going back to gradient descent, while it is effective in minimizing loss, it can significantly slow down the training process because the gradient is calculated on the entire dataset for every update. There are several alternatives to gradient descent that are more efficient but have their respective downsides.

Stochastic Gradient Descent

One of the most popular alternatives to standard gradient descent is a variant called stochastic gradient descent (SGD). As with gradient descent, SGD has a fixed learning rate. But rather than running through the entire dataset like gradient descent, SGD randomly selects a small sample and updates the weights of your neural network based on that sample instead. Eventually, the parameter values converge to a point that approximately (but not exactly) minimizes the loss function. This is one of the downsides of SGD, as it doesn't always reach the exact minimum. Additionally, similar to gradient descent, it remains sensitive to the learning rate that you set.
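Here is a minimal sketch of that idea in plain Python, fitting a line $y = wx + b$ with an MSE loss. The synthetic data (a noisy version of $y = 2x + 1$), the function name, and every hyperparameter value are assumptions chosen for illustration.

```python
import random


def sgd_fit_line(xs, ys, learning_rate=0.05, batch_size=4, steps=3000, seed=0):
    """Fit y = w * x + b with mini-batch stochastic gradient descent on MSE."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(steps):
        # Use a small random sample instead of the whole dataset.
        batch = rng.sample(indices, batch_size)
        grad_w = 0.0
        grad_b = 0.0
        for i in batch:
            error = (w * xs[i] + b) - ys[i]
            grad_w += 2 * error * xs[i] / batch_size
            grad_b += 2 * error / batch_size
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Synthetic data: y = 2x + 1 plus a little random noise.
data_rng = random.Random(42)
xs = [i / 10 for i in range(20)]
ys = [2 * x + 1 + data_rng.gauss(0, 0.1) for x in xs]

w, b = sgd_fit_line(xs, ys)
print(w, b)  # close to 2 and 1, but not exactly
```

Because each update only sees four points, the values of `w` and `b` keep jittering around the best fit rather than settling on it exactly.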

The Adam Optimizer

The name Adam is derived from adaptive moment estimation. It essentially combines two ideas from other variants of SGD. First, it adjusts the learning rate for each parameter individually based on the size of that parameter's recent gradients (adaptive learning rate). Second, it keeps track of past gradient calculations as a moving average to smooth out updates (momentum). However, because of its momentum characteristic, it can sometimes take longer to converge than other algorithms.
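For the curious, here is a sketch of the standard Adam update rule for a single parameter, written in plain Python. The hyperparameter values are the commonly used defaults apart from the learning rate, and the toy function $f(w) = (w - 3)^2$ is my own choice for illustration; real libraries apply the same update to every parameter of the network.

```python
import math


def adam_minimize(grad_fn, start, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  epsilon=1e-8, steps=500):
    """Minimize a one-parameter function with Adam, given its gradient."""
    w = start
    m = 0.0  # moving average of gradients (momentum)
    v = 0.0  # moving average of squared gradients (adaptive scaling)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Correct the bias caused by starting both averages at zero.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w -= learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return w


# Toy problem: f(w) = (w - 3)**2 has gradient 2 * (w - 3) and a minimum at w = 3.
result = adam_minimize(lambda w: 2 * (w - 3), start=0.0)
print(result)  # close to 3
```

Dividing by the square root of `v_hat` is what gives each parameter its own effective step size, while `m_hat` is the smoothed gradient that provides the momentum.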

Putting it All Together

Now for an example!

I've created an example walkthrough on Google Colab that uses PyTorch to create a neural network that learns a simple linear relationship.

If you're a bit new to Python, don't worry! I've included some explanations that discuss what's going on in each section.

Reflection

While this obviously doesn't cover everything about optimizing neural networks, I wanted to at least cover a few of the most important concepts that you can take advantage of while training your own models. Hopefully you've learned something this week and thanks for reading!
