Introduction to Neural Networks For Self Driving Cars (Foundational Concepts — Part 2)

Foundational concepts in the fields of Machine Learning, Deep Neural Networks and Self Driving Cars

Prateek Sawhney
7 min read · Sep 8, 2022

Welcome to this Medium Article. This article is an extended version of the Introduction to Neural Networks For Self Driving Cars (Foundational Concepts Part — 1) 😀

Image by ahmedgad on Pixabay

One-Hot Encoding

So, as we’ve seen so far, all our algorithms are numerical. This means we need to input numbers, such as a test score or a grade, but the input data will not always look like numbers.

Let’s say the model receives as input the fact that you got a gift or didn’t get a gift. How do we turn that into numbers? Well, that’s easy: if you got a gift, we’ll just say that the input variable is 1, and if you didn’t get a gift, we’ll say that the input variable is 0. But what if we have more than two classes? Let’s say our classes are Duck, Beaver and Walrus.

What variable do we input into the algorithm?

Maybe we can input 0, 1 and 2, but that would not work, because it would assume an ordering between the classes that doesn't exist. So this is what we do: we come up with one variable for each of the classes, that is, one variable for Duck, one for Beaver and one for Walrus. Now, if the input is a duck, then the variable for duck is 1 and the variables for beaver and walrus are 0, and similarly for the beaver and the walrus. We may end up with more columns of data, but at least there are no unnecessary dependencies. This process is called one-hot encoding, and it is used a lot when preparing data.
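
To make this concrete, here is a minimal sketch of one-hot encoding in plain NumPy (the class list and sample labels are made up just for this illustration):

```python
import numpy as np

# Hypothetical classes and a few sample labels, just for illustration
classes = ["Duck", "Beaver", "Walrus"]
labels = ["Duck", "Walrus", "Beaver", "Duck"]

# One column per class: put a 1 in the column of the true class, 0 everywhere else
class_to_index = {name: i for i, name in enumerate(classes)}
one_hot = np.zeros((len(labels), len(classes)), dtype=int)
for row, name in enumerate(labels):
    one_hot[row, class_to_index[name]] = 1

print(one_hot)
# [[1 0 0]    Duck
#  [0 0 1]    Walrus
#  [0 1 0]    Beaver
#  [1 0 0]]   Duck
```

In practice, helpers such as pandas' get_dummies or scikit-learn's OneHotEncoder do the same bookkeeping for you.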

Maximum Likelihood

So we’re still searching for an algorithm that will help us pick the model that separates our data best. Well, since we’re dealing with probabilities, let’s use them in our favor. Let’s say I’m a student and I have two models: one that tells me my probability of getting accepted is 80%, and one that tells me the probability is 55%. Which model looks more accurate?

A Sample Neural Network (Image by author)

Well, if I got accepted then I’d say the better model is probably the one that says 80%. What if I didn’t get accepted? Then the more accurate model is more likely the one that says 55 percent. But I’m just one person. What if it was me and a friend? Well, the best model would more likely be the one that gives the higher probabilities to the events that happened to us, whether it’s acceptance or rejection. This sounds pretty intuitive.

The method is called maximum likelihood.

What we will do is pick the model that gives the highest probability for the existing labels. Thus, by maximizing the probability, we can pick the best possible model.
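
As a rough sketch (the outcomes and probabilities below are made up), the likelihood of a model is just the product of the probabilities it assigns to what actually happened:

```python
import numpy as np

# Hypothetical outcomes for four students: 1 = accepted, 0 = rejected
outcomes = np.array([1, 0, 1, 1])

# Acceptance probabilities that two candidate models assign to those students
model_a = np.array([0.8, 0.1, 0.7, 0.9])
model_b = np.array([0.55, 0.6, 0.4, 0.5])

def likelihood(p_accept, y):
    # Probability the model gives to the event that actually happened:
    # p if the student was accepted, 1 - p if they were rejected
    p_event = np.where(y == 1, p_accept, 1 - p_accept)
    return np.prod(p_event)

print(likelihood(model_a, outcomes))  # ~0.45 -> higher likelihood, better model
print(likelihood(model_b, outcomes))  # ~0.04
```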

Cross-Entropy

Cross entropy really says the following.

If I have a bunch of events and a bunch of probabilities, how likely is it that those events happened based on those probabilities? If it's very likely, then we have a small cross entropy. If it's unlikely, then we have a large cross entropy. In fact, the cross entropy is just the negative of the logarithm of the likelihood we looked at above, so maximizing the likelihood is the same as minimizing the cross entropy.
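
In code, this can be sketched as the negative sum of the logs of the probabilities assigned to the events that happened (the numbers reuse the made-up models from the likelihood sketch above):

```python
import numpy as np

def cross_entropy(y, p):
    # y holds the events that happened (1 or 0), p the predicted probabilities
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

print(cross_entropy([1, 0, 1, 1], [0.8, 0.1, 0.7, 0.9]))   # ~0.79  (likely -> small)
print(cross_entropy([1, 0, 1, 1], [0.55, 0.6, 0.4, 0.5]))  # ~3.12  (unlikely -> large)
```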

Gradient Descent

So now let’s study gradient descent. We’re standing somewhere on Mount ABC and we need to go down. The inputs of the error function are W1 and W2, and the error function is E. Then the gradient of E is the vector formed by the partial derivatives of E with respect to W1 and W2.

This gradient actually tells us the direction we want to move if we want to increase the error function the most. Thus, if we take the negative of the gradient, this will tell us how to decrease the error function the most. And this is precisely what we’ll do.

At the point where we’re standing, we take the negative of the gradient of the error function at that point, and then we take a step in that direction. Once we take the step, we’ll be in a lower position. So we do it again, and again, and again, until we get to the bottom of the mountain. That is the whole idea of gradient descent.

What matters here is the gradient of the error function, which is precisely the vector formed by the partial derivatives of the error function with respect to the weights and the bias. Now, we take a step in the direction of the negative of the gradient. As before, we don’t want to make any dramatic changes, so we introduce a small learning rate alpha, for example 0.1, and multiply the gradient by it. Taking the step is then exactly the same thing as updating the weights and the bias as follows.

The weight Wi becomes Wi', given by Wi' = Wi - alpha * ∂E/∂Wi, and the bias b becomes b', given by b' = b - alpha * ∂E/∂b. This takes us to a prediction with a lower error. So we can conclude that the prediction we get with the new weights W' and bias b' is better than the one we had with W and b. This is precisely the gradient descent step.
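
Here is a minimal sketch of one such step, assuming a sigmoid output and the log-loss error, for which the partial derivatives work out to (Y_hat - Y) * Xi for the weights and (Y_hat - Y) for the bias, averaged over the points. The function names and the tiny data set are just for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_step(X, y, W, b, alpha=0.1):
    # One update: W -> W - alpha * dE/dW and b -> b - alpha * dE/db,
    # with the gradients averaged over all the points
    y_hat = sigmoid(X @ W + b)              # current predictions
    error = y_hat - y                       # (Y_hat - Y) for every point
    W_new = W - alpha * (X.T @ error) / len(y)
    b_new = b - alpha * np.mean(error)
    return W_new, b_new

# Tiny made-up data set: 4 points with 2 features each
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
W, b = np.zeros(2), 0.0
for _ in range(100):                        # repeat the step again and again
    W, b = gradient_descent_step(X, y, W, b)
```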

Gradient Descent vs The Perceptron Algorithm

So now let’s compare the Gradient Descent algorithm and the Perceptron algorithm. In the Gradient Descent algorithm, we take the weights and change them from Wi to Wi + alpha * (Y - Y_hat) * Xi.

In the Perceptron algorithm, not every point changes the weights, only the misclassified ones. Here, if a point is misclassified, we change the weights by adding Xi to Wi if the point's label is positive, and subtracting it if the label is negative. Now the question is, are these two things the same?

Well, let’s remember that in the Perceptron algorithm the labels are one and zero, and the predictions Y_hat are also one and zero. So, if the point is correctly classified, then Y - Y_hat = 0 because Y is equal to Y_hat.

Now, if the point is labeled blue, then Y = 1, and if it's misclassified, the prediction must be Y_hat = 0, so Y - Y_hat = 1 and the weights get a multiple of Xi added, just like in the perceptron. Similarly, if the point is labeled red, then Y = 0 and Y_hat = 1, so Y - Y_hat = -1 and the weights get a multiple of Xi subtracted.
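
A small sketch, assuming 0/1 labels and a learning rate alpha, makes the comparison concrete (the function names and numbers below are just for this illustration):

```python
import numpy as np

def perceptron_update(W, b, x, y, alpha=0.1):
    # Only misclassified points move the line
    y_hat = 1 if np.dot(W, x) + b >= 0 else 0
    if y_hat == y:
        return W, b                              # correctly classified: do nothing
    if y == 1:                                   # positive point predicted as 0: add
        return W + alpha * x, b + alpha
    return W - alpha * x, b - alpha              # negative point predicted as 1: subtract

def gradient_descent_update(W, b, x, y, alpha=0.1):
    # Every point moves the line a little, by alpha * (y - y_hat) * x
    y_hat = 1.0 / (1.0 + np.exp(-(np.dot(W, x) + b)))   # continuous prediction in (0, 1)
    return W + alpha * (y - y_hat) * x, b + alpha * (y - y_hat)

# A misclassified blue point (label 1 predicted negative): both rules add to the weights
W, b, x = np.array([-1.0, -1.0]), 0.0, np.array([1.0, 2.0])
print(perceptron_update(W, b, x, y=1))
print(gradient_descent_update(W, b, x, y=1))
```

The one difference is that the perceptron's prediction is a hard 0 or 1, so Y - Y_hat is exactly +1, -1 or 0, while in gradient descent the prediction is continuous, which is why even correctly classified points still nudge the line, as discussed next.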

But let’s study Gradient Descent even more carefully. Both in the Perceptron algorithm and the Gradient Descent algorithm, a point that is misclassified tells the line to come closer, because eventually it wants the line to move past it so that it can be on the correct side. Now, what happens if the point is correctly classified?

A Sample Illustration of weights (Image by author)

Well, the Perceptron algorithm says do absolutely nothing. In the Gradient Descent algorithm, you are changing the weights. But what is it doing? Well, if we look carefully, what the point is telling the line, is to go farther away. And this makes sense, right? Because if you’re correctly classified, say, if you’re a blue point in the blue region, you’d like to be even more into the blue region, so your prediction is even closer to one, and your error is even smaller. Similarly, for a red point in the red region. So it makes sense that the point tells the line to go farther away. And that’s precisely what the Gradient Descent algorithm actually does.

The misclassified points ask the line to come closer and the correctly classified points ask the line to go farther away. The line listens to all the points and takes steps in such a way that it eventually arrives at a pretty good solution.

Multiclass Classification

It seems that neural networks work really well when the problem consists of classifying between two classes. For example, if the model predicts the probability of receiving a gift or not, then the answer just comes as the output of the neural network. But what happens if we have more classes?

Say, we want the model to tell us if an image is a duck, a beaver, or a walrus. Well, one thing we can do is create a neural network to predict if the image is a duck, then another neural network to predict if the image is a beaver, and a third neural network to predict if the image is a walrus. Then we can just use Softmax or pick the answer that gives us the highest probability. But this seems like overkill, right?

The first layers of the neural network should be enough to tell us things about the image, and maybe just the last layer should tell us which animal it is. As a matter of fact, this is exactly the case. So what we need here is to add more nodes to the output layer, one for each class, and each node will give us a score for its animal. Then we take the scores and apply the Softmax function that was previously defined to obtain well-defined probabilities. This is how we get neural networks to do multi-class classification.
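
As a minimal sketch (with made-up scores), the three scores from the output layer are turned into probabilities like this:

```python
import numpy as np

def softmax(scores):
    # Turn raw output-layer scores into probabilities that sum to 1
    exp_scores = np.exp(scores - np.max(scores))   # shift for numerical stability
    return exp_scores / np.sum(exp_scores)

# Hypothetical scores from the three output nodes: duck, beaver, walrus
scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))        # ~[0.66, 0.24, 0.10] -> the image is most likely a duck
print(softmax(scores).sum())  # 1.0
```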

