Training a Machine

Introduction

A day ago I learned how to train a machine to close in on a prediction.

I found it easy, as I just had to code the math that narrows down on a prediction from the loss, i.e. the inaccuracy.

The best thing was that I had thought it would be very complicated, but I understood it: I did the tedious math by hand and coded it myself.

The steps were as follows:

  1. Analyze and predict the function.
  2. Assume certain values for the bias, constants, and step.
  3. Plug the input into the prediction.
  4. Calculate the loss.
  5. Calculate the gradients for each of the constants and the bias.
  6. Update the constants and the bias by subtracting the respective gradients.
  7. Now repeat from step 3 until the loss is (close to) 0 (a minimal code sketch of this loop follows the list).
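To make the loop concrete, below is a minimal Python sketch of these seven steps for a simple linear model y = wx + b. The toy data, the starting values, and the step size are illustrative assumptions of mine, not values from this post; the gradient formulas used in steps 5 and 6 are the ones derived a little further down.

# A minimal sketch of the seven steps above for y = wx + b.
# The data, starting values, and step size here are made up for illustration.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]              # previously known outputs used for training

w, b = 0.0, 0.0                  # step 2: assume starting values
step = 0.01                      # step 2: assume a step size

for epoch in range(5000):
    # step 3: plug the inputs into the prediction
    preds = [w * xi + b for xi in x]
    # step 4: calculate the loss (sum of squared errors)
    loss = sum((p - yi) ** 2 for p, yi in zip(preds, y))
    # step 5: calculate the gradients for w and b
    grad_w = sum(2 * xi * (p - yi) for xi, p, yi in zip(x, preds, y))
    grad_b = sum(2 * (p - yi) for p, yi in zip(preds, y))
    # step 6: update by subtracting the respective gradients (scaled by the step)
    w = w - (step * grad_w)
    b = b - (step * grad_b)
    # step 7: repeat until the loss is (close to) zero
    if loss < 1e-6:
        break

print("w =", w, "b =", b, "loss =", loss)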

From the above steps, there are a few critical things that we need to note:

Analyzing and predicting the function of the data was done for me. The data was also linear, so the resulting function was linear and predictable with the known form
y = wx + b
where w is the weight constant and b is the bias constant. Not all functions that we encounter will be that way.

Also, the loss function was simple and was assumed to be an MSE (Mean Squared Error) loss function (taken here as just the sum of the squared errors, without dividing by the number of samples).

To calculate the gradient, I was taught to calculate the slope, and for that all we needed to do was take the derivative of the loss function w.r.t. the constant.

With that, we can update the constant by subtracting the product of the step and the respective gradient.

To explain,
assume the prediction function is y' = wx + b. Then,
the loss function, as per our previous statement, is the MSE, which is
∑(y' - y)² = ∑(wx + b - y)²
The gradient w.r.t. w is the derivative of the above w.r.t. w, which is
= ∑2(wx + b - y)(x)

= ∑2x(wx + b - y)

since all the other terms are constant while w varies. This is because we assume the weight and bias are independent of each other; that is, varying w doesn't affect the bias or y, so db/dw and dy/dw are 0.
Similarly, the gradient w.r.t. b is
= ∑2(wx + b - y)
Now that we have found the gradients, we need to assume a step and use the following formula to update the weight and bias:
w = w - ((gradient w.r.t. w) * step)
This will bring the predicted function closer to the actual one, meaning the loss is reduced. It's not just the weight constant: the bias can also be updated in a similar way.
b = b - ((gradient w.r.t. b) * step)
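As a quick sanity check of these update rules, here is a small Python sketch that performs one update for a single, made-up data point (x = 1, y = 3, starting from w = 0, b = 0 with a step of 0.1; none of these numbers come from the post):

# One hand-checkable gradient descent update for y' = wx + b
x, y = 1.0, 3.0                      # a single illustrative data point
w, b, step = 0.0, 0.0, 0.1

pred = w * x + b                     # 0.0
loss_before = (pred - y) ** 2        # 9.0
grad_w = 2 * x * (w * x + b - y)     # -6.0
grad_b = 2 * (w * x + b - y)         # -6.0

w = w - (step * grad_w)              # 0.6
b = b - (step * grad_b)              # 0.6
loss_after = (w * x + b - y) ** 2    # (1.2 - 3)^2 = 3.24, smaller than 9.0
print(loss_before, loss_after)

A single update already drops the loss from 9.0 to 3.24.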

Why do we subtract the gradients?

Fig 1: The loss vs weight curve

From the loss function, we can see that the loss is a polynomial in the weight, which gives a parabolic curve. In a parabola, the least value lies at the lowest point of the curve, which is exactly what we want: the least value of the loss. From Fig 1 we can see that as the weight w decreases, the loss decreases up to a certain point and then increases again from that point onwards. We need to bring the constant values to the minimum-loss position so that we can predict the values.
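To see this in numbers, here is a small sketch that sweeps w across the parabola for a single made-up data point (x = 1, y = 3, bias fixed at 0; these values are only for illustration):

# Sweep the weight to trace the loss-vs-weight parabola from Fig 1
x, y, b = 1.0, 3.0, 0.0

for w in [0, 1, 2, 3, 4, 5, 6]:
    loss = (w * x + b - y) ** 2
    grad = 2 * x * (w * x + b - y)
    print("w =", w, "loss =", loss, "gradient =", grad)

The loss falls until w = 3 (the bottom of the parabola) and rises after it. To the left of the minimum the gradient is negative, so subtracting it increases w; to the right it is positive, so subtracting it decreases w. Either way, subtracting the gradient moves w toward the minimum.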


The ∑ is there to get the sum over all the pairs of x and y values.
This is why we need the previously known values for the training.

The Problem that I solved

I did the problem of predicting real estate prices from the dataset on the page here [1]. It has a visual, step-by-step guide of how the process is done.
Below is the dataset that was given to us:
Area (sq ft) (x)    Price (y)
2,104               399,900
1,600               329,900
2,400               369,000
First, we have to form the input and output arrays, keeping the above as the training dataset; the prediction of the price must be made for the 2,000 sq ft data point. With the prices expressed in thousands, we get x and y as follows:
x = [2104, 1600, 2400]
y = [399.9, 329.9, 369]

and we have the prediction function y' as

y' = w₁x² + w₂x + b

From the above function, we know that there are 2 weight constants, w₁ and w₂, plus a bias b. Based on this we will have 3 gradients to vary w₁, w₂, and b respectively.

We'll find the MSE as,

MSE = (y' - y)² = (w₁x² + w₂x + b - y)²

Now we'll find the gradient w.r.t. w₁ as,

gradient w.r.t. w₁ = 2x²(w₁x² + w₂x + b - y)

Similarly, the gradient w.r.t. w₂ and b is as follows,

gradient w.r.t. w₂ = 2x(w₁x² + w₂x + b - y)

gradient w.r.t. b = 2(w₁x² + w₂x + b - y)

Now we have the loss function and the gradients for controlling the loss, so we can start coding the logic we explained before.

The Code

import math

x = [2104, 1600, 2400]
y = [399.9, 329.9, 369]

# Starting values for the weights and the bias
w1 = 1
w2 = 1
bias = 0

error_prev = 0

# Step size (learning rate); kept tiny because the x**2 terms are huge
decent = 1 * (10 ** -13.78)

x_input = 2000

total_epoch = 100


def fwd(x_param):
    # Prediction function: y' = w1*x^2 + w2*x + b
    return ((x_param ** 2) * w1) + (x_param * w2) + bias


print("Prediction before training:", fwd(x_input))


def loss(x_param, y_param):
    # Squared error for a single data point
    y_prediction = fwd(x_param)
    return (y_prediction - y_param) ** 2


def gradient(x_param, y_param, flag):
    # flag 0 -> gradient w.r.t. w1, flag 1 -> w.r.t. w2, flag 2 -> w.r.t. bias
    if flag == 0:
        return 2 * (x_param ** 2) * (((x_param ** 2) * w1) + (x_param * w2) + bias - y_param)
    elif flag == 1:
        return 2 * x_param * (((x_param ** 2) * w1) + (x_param * w2) + bias - y_param)
    elif flag == 2:
        return 2 * (((x_param ** 2) * w1) + (x_param * w2) + bias - y_param)
    return 0


for epoch in range(total_epoch):
    gw1 = 0
    gw2 = 0
    gb = 0
    error = 0

    # Sum the loss and the gradients over the whole dataset
    for x_val, y_val in zip(x, y):
        error += loss(x_val, y_val)
        gw1 += gradient(x_val, y_val, 0)
        gw2 += gradient(x_val, y_val, 1)
        gb += gradient(x_val, y_val, 2)

    print("Progress:", math.floor((epoch * 100) / total_epoch), "%", "When w1 =", w1, "When w2 =", w2, "Bias =", bias,
          "Prediction:", fwd(x_input), "Loss =", error)
    # Stop if the loss starts increasing (the step has overshot the minimum)
    if epoch != 0 and error_prev < error:
        break
    error_prev = error
    # Update the weights and the bias by subtracting the scaled gradients
    w1 = w1 - (decent * gw1)
    w2 = w2 - (decent * gw2)
    bias = bias - (decent * gb)

print("Prediction after training:", fwd(x_input))

Output

Prediction before training: 4002000
Progress: 0 % When w1 = 1 When w2 = 1 Bias = 0 Prediction: 4002000 Loss = 59372896046219.016
Progress: 1 % When w1 = -0.9699427250893544 When w2 = 0.9990957072634792 Bias = -4.232551101900951e-07 Prediction: -3877772.708943314 Loss = 55771450181443.69
Progress: 2 % When w1 = 0.9393189348203625 When w2 = 0.9999721337800483 Bias = -1.3048981114201942e-08 Prediction: 3759275.6835489976 Loss = 52388461107652.94
Progress: 3 % When w1 = -0.9111308470719042 When w2 = 0.9991226933697888 Bias = -4.1063059575297603e-07 Prediction: -3642525.142901288 Loss = 49210677655366.734
Progress: 4 % When w1 = 0.8823186665405012 When w2 = 0.9999459572019037 Bias = -2.5307129562263668e-08 Prediction: 3531274.5580763835 Loss = 46225652445602.03
...
Progress: 98 % When w1 = 0.04623610548493272 When w2 = 0.999561641043619 Bias = -2.0547985030724172e-07 Prediction: 186943.54522161264 Loss = 128917422585.07596
Progress: 99 % When w1 = -0.04555806447460185 When w2 = 0.9995194981090609 Bias = -2.2520787944047987e-07 Prediction: -180233.2189024145 Loss = 121097546155.75017
Prediction after training: 175633.22798353786

From the above output, we can see that the loss is decreasing and the prediction is getting closer, but it is still not accurate.

To increase the accuracy we may have to increase the step. But I found a few problems in doing so:
  1. We do not want to increase the step too much, as it will move far away from the desired point and we will start seeing the loss increase. For this, I have code to stop when the loss starts to increase.
  2. We don't have the resources to calculate numbers with more than 32-bit values.
Because of this, what I did was increase the number of epochs and also substitute the resulting w₁, w₂, and b back in and re-run the code, which continues the descent from where it left off instead of calculating from the start (a small sketch of automating this by saving the values to a file appears after the output below). Surprisingly, the prediction was close when I substituted the results after a long time of repeating this process, for the following weight and bias values:
w1 = -1.7639253531609754e-05
w2 = 0.21248294140374407
bias = 0.00021487994761540023

Prediction before training: 354.40908356099675

Progress: 0 % When w1 = -1.7639253531609754e-05 When w2 = 0.21248294140374407 Bias = 0.00021487994761540023 Prediction: 354.40908356099675 Loss = 3735.977235008907

Progress: 1 % When w1 = -1.7639253938957483e-05 When w2 = 0.21248294229112952 Bias = 0.00021487994849988702 Prediction: 354.4090837063776 Loss = 3735.9771875601227

Progress: 2 % When w1 = -1.7639254346305185e-05 When w2 = 0.21248294317851493 Bias = 0.0002148799493843738 Prediction: 354.40908385175857 Loss = 3735.9771401113376

Progress: 3 % When w1 = -1.763925475365291e-05 When w2 = 0.21248294406590035 Bias = 0.0002148799502688606 Prediction: 354.4090839971393 Loss = 3735.9770926625533

Progress: 4 % When w1 = -1.763925516100059e-05 When w2 = 0.21248294495328576 Bias = 0.0002148799511533474 Prediction: 354.4090841425203 Loss = 3735.9770452137677

Progress: 5 % When w1 = -1.7639255568348356e-05 When w2 = 0.21248294584067115 Bias = 0.00021487995203783416 Prediction: 354.4090842879009 Loss = 3735.9769977649876

...

Progress: 97 % When w1 = -1.7639293044322988e-05 When w2 = 0.21248302748009715 Bias = 0.0002148800334105857 Prediction: 354.4090976629358 Loss = 3735.9726324802295

Progress: 98 % When w1 = -1.7639293451670497e-05 When w2 = 0.21248302836748187 Bias = 0.00021488003429507178 Prediction: 354.40909780831606 Loss = 3735.9725850315153

Progress: 99 % When w1 = -1.7639293859017786e-05 When w2 = 0.2124830292548666 Bias = 0.00021488003517955786 Prediction: 354.4090979536972 Loss = 3735.972537582807

Prediction after training: 354.4090980990777
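One way to avoid copying the weights back in by hand is to save them to a file at the end of a run and load them at the start of the next one, so the descent continues from where it left off. This is a sketch of my own, not part of the original code; the file name weights.json and the JSON format are assumptions.

import json
import os

WEIGHTS_FILE = "weights.json"    # hypothetical file name

# Load the previously trained w1, w2, and bias if a saved file exists,
# otherwise start from the initial guesses.
if os.path.exists(WEIGHTS_FILE):
    with open(WEIGHTS_FILE) as f:
        saved = json.load(f)
    w1, w2, bias = saved["w1"], saved["w2"], saved["bias"]
else:
    w1, w2, bias = 1, 1, 0

# ... run the training loop from "The Code" section here ...

# Save the updated values so the next run continues the descent.
with open(WEIGHTS_FILE, "w") as f:
    json.dump({"w1": w1, "w2": w2, "bias": bias}, f)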

Bibliography

  1. http://jalammar.github.io/visual-interactive-guide-basics-neural-networks/
