Overfitting in Machine Learning

Understanding Overfitting Using Higher-Order Linear Regression

    This is a project done to understand overfitting using linear regression and polynomial models with scikit-learn. We will also see how to avoid overfitting using regularization techniques.

    Please refer to this Jupyter notebook for the whole code.

    First, we will understand overfitting...

Overfitting

    Overfitting is the phenomenon where the model fits the training data almost perfectly but has large errors on unseen data.

    To sum up: a small or zero training error but a very high validation error, meaning the model has memorized the training data.

    If this happens, the model will only work when the input lies in the training data, and it will not behave as expected on unseen validation data.

    This usually happens when the model is too complex or the training dataset is too small. If the model remains overly complex, simply supplying more training data may not be enough to stop it from overfitting.

Understanding overfitting

    To understand overfitting, we will take a simple sine curve with some noise added, train a model on one half of the data, and predict the outcome on both the training and the validation datasets. We do this with polynomial features of degrees 0, 1, 3 and 9.

    While we do so, we will see that increasing the complexity starts to overfit the model. This is clearest in the degree 9 polynomial features model: it reproduces the training data almost exactly, while the predictions on the validation data have large errors.
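    A minimal sketch of how such a dataset could be generated is shown below; the sample count and noise level here are assumptions for illustration, not the notebook's exact values:

        import numpy as np

        rng = np.random.RandomState(0)                 # fixed seed so the sketch is reproducible
        n_samples = 30                                 # assumed number of points
        X = np.sort(rng.rand(n_samples))               # inputs in [0, 1)
        y = np.sin(2 * np.pi * X) + rng.randn(n_samples) * 0.1   # sine curve plus Gaussian noise

        # train on one half of the data, validate on the other half
        X_train, X_val = X[::2], X[1::2]
        Y_train, Y_val = y[::2], y[1::2]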

    So our model is as follows:
        linear_regression = LinearRegression()
        pipeline = make_pipeline(PolynomialFeatures(degrees[i]), linear_regression)
The above is a pipeline of PolynomialFeatures of the respective degree followed by linear regression. We then fit the model to obtain the learned weights, which makes it fit the training dataset. After that, we can predict the outcome for both the training and the validation datasets.
    We then use the Root Mean Square Error (RMSE) to measure the loss between the predictions and the actual values. When we plot the curve, we can see that the difference between validation and training loss is very small for polynomials of degree 0, 1 and 3, but for degree 9 the validation loss is very high while the training loss is very close to 0 (for me it was validation loss = 281664.538268 and training loss = 3.261626e-10).
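    Putting these pieces together, the fit-and-evaluate loop might look like the following sketch (it assumes the X_train, Y_train, X_val and Y_val arrays from the data sketch above):

        import numpy as np
        from sklearn.linear_model import LinearRegression
        from sklearn.metrics import mean_squared_error
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import PolynomialFeatures

        degrees = [0, 1, 3, 9]
        for degree in degrees:
            pipeline = make_pipeline(PolynomialFeatures(degree), LinearRegression())
            pipeline.fit(X_train[:, np.newaxis], Y_train)

            # root mean square error on seen (training) and unseen (validation) data
            train_rmse = np.sqrt(mean_squared_error(Y_train, pipeline.predict(X_train[:, np.newaxis])))
            val_rmse = np.sqrt(mean_squared_error(Y_val, pipeline.predict(X_val[:, np.newaxis])))
            print(f"degree {degree}: training RMSE = {train_rmse:.6g}, validation RMSE = {val_rmse:.6g}")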

    That is the visual definition of overfitting: the higher-order function could not predict properly for unseen data, yet its loss on the seen data was close to 0, of the order of 10^-10.


The plot above shows the model's predictions on the training and validation data together with the actual function; for degree 9 the prediction coincides exactly with the training data (RMSE = 0.00), depicted by the yellow line.

How to avoid overfitting

    Now that we know what overfitting is, we have to avoid it. There are many ways to prevent overfitting, but the basic ones from what we have learnt so far are:
  • Increasing the number of data points.
    • If the dataset is sparse, the chances of overfitting are larger.
  • Decreasing the complexity of the model.
    • If the model is too complex, it will fit the training dataset perfectly and produce large errors on the unseen dataset.
But other techniques are also useful to reduce overfitting. One such technique is called Regularization.

Regularization

    The problem with overfitting is that, while training, the model takes extreme values for its coefficients, which lets it mimic the training dataset exactly as it is. What regularization does is add an additional term, called a penalty, which controls the coefficients and prevents the model from taking these extreme values.
    We have pre-existing APIs in scikit-learn that implement these for us. There are two of them (a minimal usage sketch follows the list below):
  • Lasso (L1 Regularization)
  • Ridge (L2 Regularization)
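    Both are drop-in estimators in scikit-learn; a minimal usage sketch (the alpha values are arbitrary placeholders, not tuned values):

        from sklearn.linear_model import Lasso, Ridge

        # alpha is the regularization strength (the λ in the formulas below)
        lasso = Lasso(alpha=0.01)   # L1: penalizes the sum of absolute values of the coefficients
        ridge = Ridge(alpha=0.01)   # L2: penalizes the sum of squares of the coefficients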
Let us first understand how a model trains itself.
    A model is a function y of the input x of some order, with weights w and a bias b. It would be something like the following:

    y = b + w1*x + w2*x^2 + ... + wn*x^n

    Here each weight scales one feature of the input, and training adjusts the weights and bias at each step to reduce the loss function.

    loss = ∑(y(x,w)-t)²
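    As a toy illustration of these two formulas, the prediction and the loss can be computed by hand for a tiny made-up example (all numbers are illustrative, not from the notebook):

        import numpy as np

        b, w = 0.5, np.array([1.0, -2.0])    # illustrative bias and weights
        x = np.array([0.0, 0.5, 1.0])        # three input points
        t = np.array([0.4, 0.6, -0.3])       # their target values

        # y = b + w1*x + w2*x^2 evaluated at every input point
        y = b + w[0] * x + w[1] * x ** 2
        loss = np.sum((y - t) ** 2)          # sum of squared errors, approx. 0.06
        print(y, loss)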

    The problem with overfitting is that, during training, the loss function gives values very close to 0 on the training data while the errors on the unseen validation dataset are large. What Ridge and Lasso do is reduce the effective model complexity and prevent it from overfitting.

Ridge

    In Ridge, the cost function is changed by adding a penalty equal to the sum of the squared magnitudes of the coefficients, scaled by λ.

    loss = ∑(y(x,w)-t)² + λ * ∑|w|²

Here λ is the multiplication factor. The lower the value of λ, the more the loss function behaves like plain linear regression, hence resulting in overfitting.
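    Continuing the toy numbers from the sketch above, the Ridge penalty simply adds λ times the sum of squared weights to that data loss (the value λ = 0.1 here is an arbitrary choice):

        import numpy as np

        data_loss = 0.06                     # sum of squared errors from the previous sketch
        w = np.array([1.0, -2.0])            # the same illustrative weights
        lam = 0.1                            # regularization strength λ

        ridge_loss = data_loss + lam * np.sum(w ** 2)
        print(ridge_loss)                    # 0.06 + 0.1 * 5.0 = 0.56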

    To check this out, we now take the 9th-order polynomial and a larger dataset, and examine the behaviour with varying λ. For this, we will be using a different model pipeline as follows:

    ridge = Ridge(alpha=alpha)
    pipeline = make_pipeline(PolynomialFeatures(degree), ridge)
    pipeline.fit(X_train[:, np.newaxis], Y_train)

This pipeline uses Ridge in place of LinearRegression.
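    To compare different regularization strengths, the same pipeline can be refit for several values of alpha (the λ of the cost function); the particular values below are assumptions for illustration:

        import numpy as np
        from sklearn.linear_model import Ridge
        from sklearn.metrics import mean_squared_error
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import PolynomialFeatures

        degree = 9
        for alpha in [1e-15, 1e-5, 1e-1, 1.0]:
            pipeline = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=alpha))
            pipeline.fit(X_train[:, np.newaxis], Y_train)
            val_mse = mean_squared_error(Y_val, pipeline.predict(X_val[:, np.newaxis]))
            print(f"alpha = {alpha:g}: validation MSE = {val_mse:.6g}")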

The model will now learn more effectively:



If we look at the MSE, the validation loss is lower for a higher value of λ and higher for a lower value of λ. This shows that the smaller the λ, the more the model overfits and the more it resembles the plain linear regression model.

Conclusion

Overfitting is the phenomenon where the model fits the training data almost perfectly but has large errors on unseen data.
Among the Ridge models tried, λ = 1 gives the best result, as it has the lowest validation loss.
