Introduction to Convolutional Neural Networks

Introductory concepts in the field of Image Recognition using Convolutional Neural Networks

Prateek Sawhney
7 min read · Sep 3, 2022

One of the most popular ways to structure a neural network is called a Convolutional Neural Network. It was invented by Yann LeCun about 30 years ago, but it has become incredibly popular for things like image processing and working with large datasets. So, let’s talk about Convolutional Neural Networks.

Photo by Victor Grabarczyk on Unsplash

Statistical Invariance

Here’s an example. We have an image, and we want our network to say it’s an image with a cat in it. It doesn’t really matter where the cat is; it’s still an image with a cat. If our network has to learn about kittens in the left corner and about kittens in the right corner independently, that’s a lot of work for it to do. How about telling it explicitly, instead, that objects in images are largely the same whether they’re on the left or on the right of the picture?

That’s what’s called translation invariance. Different positions, same kitten. Here’s another example. Imagine we have a long text that talks about kittens. Does the meaning of kitten change depending on whether it’s in the first sentence or in the second one? Mostly not. So if we’re training a network on text, maybe we want the part of the network that learns what a kitten is to be reused every time we see the word kitten, rather than re-learned every time. The way we achieve this in our own networks is by using what’s called weight sharing. When we know that two inputs can contain the same kind of information, we share their weights and train them jointly for those inputs. It’s a very important idea. Statistical invariants, things that don’t change on average across time or space, are everywhere. For images, the idea of weight sharing will lead us to convolutional neural networks. For text and sequences in general, it will lead us to embeddings and recurrent neural networks.
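To make weight sharing concrete, here is a minimal NumPy sketch (my own illustration, not code from the article): the same 3 x 3 kernel scores a patch on the left of an image and a patch on the right, so whatever it learns at one position automatically applies at the other.

```python
import numpy as np

rng = np.random.default_rng(0)

image = rng.standard_normal((32, 32))        # a toy grayscale image
shared_kernel = rng.standard_normal((3, 3))  # one shared set of weights

def respond(img, kernel, row, col):
    """Score the 3x3 patch at (row, col) with the shared kernel."""
    patch = img[row:row + 3, col:col + 3]
    return float(np.sum(patch * kernel))

# Same weights, different positions: the detector is reused across space.
left_response = respond(image, shared_kernel, 10, 2)
right_response = respond(image, shared_kernel, 10, 27)
```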

Convolutional Neural Networks (CNN)

Let’s talk about Convolutional Networks, or ConvNets. ConvNets are neural networks that share their parameters across space. Imagine we have an image. It can be represented as a flat pancake: it has a width and a height, and because we typically have red, green, and blue channels, it also has a depth. In this instance, the depth is three. That’s our input. Now, imagine taking a small patch of this image and running a tiny neural network on it, one with K outputs.

Convolution Operation (Image by author)

Now, let’s slide that little neural network across the image without changing the weights. Just slide it across horizontally and vertically, like we’re painting it with a brush. On the output, we’ve drawn another image.

Patch over a dog image (Image by author)

It’s got a different width and a different height. And more importantly, it’s got a different depth. Instead of just R, G, and B, we now have an output with many color channels, K of them. This operation is called a convolution.

Shifted Patch over a dog image (Image by author)
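As a rough sketch of that sliding operation (my own NumPy illustration; the shapes are just examples), the tiny K-output network is a shared set of weights applied at every patch position:

```python
import numpy as np

H, W, depth = 64, 64, 3    # input image: height, width, RGB depth
patch, K = 5, 16           # patch (kernel) size and number of outputs

image = np.random.randn(H, W, depth)
weights = np.random.randn(patch, patch, depth, K)  # shared across all positions
bias = np.zeros(K)

out_h, out_w = H - patch + 1, W - patch + 1        # valid padding, stride 1
output = np.zeros((out_h, out_w, K))

for i in range(out_h):
    for j in range(out_w):
        window = image[i:i + patch, j:j + patch, :]  # one small patch
        # the "tiny neural network": a linear map from the patch to K outputs
        output[i, j, :] = np.tensordot(window, weights, axes=3) + bias

print(output.shape)  # (60, 60, 16): new width, new height, and K channels
```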

If our patch size were the size of the whole image, it would be no different from a regular layer of a neural network. But because we have this small patch instead, we have many fewer weights, and they are shared across space. A convolutional neural network is basically going to be a deep network where, instead of stacks of matrix-multiply layers, we have stacks of convolutions. The general idea is that they will form a pyramid. At the bottom, we have this big image, but it’s very shallow: just R, G, and B. We’re going to apply convolutions that progressively squeeze the spatial dimensions while increasing the depth, which corresponds roughly to the semantic complexity of our representation.

At the top, we can put our classifier. We have a representation where all the spatial information has been squeezed out, and only parameters that map to the content of the image remain. So that’s the general idea. If we’re going to implement this, there are lots of little details to get right, and a fair bit of lingo to get used to. We now know the concepts of Patch and Depth. Patches are sometimes called Kernels. Each pancake in our stack is called a feature map.

Another term that we need to know is stride. It’s the number of pixels by which we shift the filter each time we move it. A stride of one makes the output roughly the same size as the input. A stride of two means it’s about half the size. I say roughly because it depends a bit on what we do at the edge of our image.

Either we don’t go past the edge, which is often called valid padding for short. Or we go off the edge and pad with zeros in such a way that the output map is exactly the same size as the input map. That is often called same padding for short.
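To make “roughly the same size” precise, here is the standard output-size arithmetic as a small Python helper (a sketch using the usual formulas, not something specific to this article):

```python
import math

def output_size(input_size, filter_size, stride, padding):
    """Spatial output size of a convolution for 'valid' or 'same' padding."""
    if padding == "valid":
        # never go past the edge of the image
        return math.floor((input_size - filter_size) / stride) + 1
    elif padding == "same":
        # pad with zeros so that, at stride 1, output size == input size
        return math.ceil(input_size / stride)
    raise ValueError("padding must be 'valid' or 'same'")

print(output_size(28, 3, 1, "same"))   # 28: same size as the input
print(output_size(28, 3, 2, "same"))   # 14: about half the size
print(output_size(28, 3, 2, "valid"))  # 13: slightly smaller at the edges
```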

Hierarchy Diagram showing detection at various layers of a convolutional neural network (Image by author)

That’s it; we can build a simple convolutional neural network with just this. Stack up our convolutions, which thankfully we don’t have to implement ourselves, then use strides to reduce the dimensionality and increase the depth of our network, layer after layer. And once we have a deep and narrow representation, connect the whole thing to a few regular, fully connected layers, and we’re ready to train our classifier. You might wonder what happens to training, and to the chain rule in particular, when we use shared weights like this. Nothing really happens; the math just works. We just add up the derivatives for all the possible locations on the image.
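Here is a minimal sketch of such a network in PyTorch (my choice of framework, since the article doesn’t prescribe one): a few strided convolutions squeeze the spatial dimensions while growing the depth, and fully connected layers sit on top of the flattened representation.

```python
import torch
import torch.nn as nn

class SimpleConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # 3 x 32 x 32 -> 16 x 16 x 16: stride 2 halves the spatial size
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            # 16 x 16 x 16 -> 32 x 8 x 8
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            # 32 x 8 x 8 -> 64 x 4 x 4: a deep and narrow representation
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 4 * 4, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

logits = SimpleConvNet()(torch.randn(1, 3, 32, 32))
print(logits.shape)  # torch.Size([1, 10])
```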

Exploring the Design Space

Now that we’ve seen what a simple convolutional neural network looks like, there are many things that we can do to improve it. We’re going to talk about two of them:

  1. Pooling
  2. 1x1 Convolutions

The first improvement is a better way to reduce the spatial extent of our feature maps in the convolutional pyramid. Until now, we’ve used striding to shift the filters by a few pixels each time and reduce the feature map size. This is a very aggressive way to downsample an image; it removes a lot of information. What if, instead of skipping one in every two convolutions, we still ran with a very small stride, say one, but then took all the convolutions in a neighborhood and combined them somehow?

That operation is called pooling, and there are a few ways to go about it. The most common is max pooling. At every point on the feature map, look at a small neighborhood around that point and compute the maximum of all the responses around it. There are some advantages to using max pooling.

First, it doesn’t add to our number of parameters, so we don’t risk an increase in overfitting.

Max Pooling (Image by author)

Second, it often simply yields a more accurate model. However, since the convolutions below it now run at a lower stride, the model becomes a lot more expensive to compute. And now we have even more hyperparameters to worry about: the pooling region size and the pooling stride. And no, they don’t have to be the same. A very typical architecture for a CNN is a few layers alternating convolutions and max pooling, followed by a few fully connected layers at the top. The first famous model to use this architecture was LeNet-5, designed by Yann LeCun to do character recognition back in 1998. Modern convolutional networks, such as AlexNet, which famously won the competitive ImageNet object recognition challenge in 2012, used the same architecture with a few wrinkles.

Another notable form of pooling is average pooling. Instead of taking the max, we just take an average over the window of pixels around a specific location.
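As a hedged PyTorch sketch of both pooling variants: the convolution runs at stride one, and a separate pooling layer, with its own window size and stride, does the downsampling.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)  # a 16-channel feature map

# Convolution at stride 1 keeps the spatial size (with same-style padding) ...
conv = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1)

# ... and pooling does the downsampling. Window size and stride are separate
# hyperparameters; here both are 2, the most common choice.
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

features = conv(x)               # (1, 32, 32, 32)
print(max_pool(features).shape)  # (1, 32, 16, 16): max over each 2x2 window
print(avg_pool(features).shape)  # (1, 32, 16, 16): average over each 2x2 window
```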

1x1 Convolutions

Also, I want to introduce you to another idea: 1 x 1 convolutions. You might wonder why anyone would ever want to use 1 x 1 convolutions.

They’re not really looking at a patch of the image, just at that one pixel. But look at the classic convolution setting: it’s basically a small classifier for a patch of the image, and it’s only a linear classifier. If we add a 1 x 1 convolution in the middle, we suddenly have a mini neural network running over the patch instead of a linear classifier. Interspersing our convolutions with 1 x 1 convolutions is a very inexpensive way to make our models deeper and give them more parameters, without completely changing their structure.
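Here is a short PyTorch sketch of the idea (the channel counts are arbitrary): following a regular convolution with a 1 x 1 convolution and a nonlinearity turns the per-patch linear classifier into a small two-layer network.

```python
import torch
import torch.nn as nn

# A plain 3x3 convolution: a linear classifier over each patch.
plain = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# Intersperse a 1x1 convolution (plus a nonlinearity): each patch now goes
# through a mini two-layer network, while the spatial layout is unchanged.
mini_network = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=1),  # looks at a single pixel across channels
    nn.ReLU(),
)

x = torch.randn(1, 64, 16, 16)
print(plain(x).shape, mini_network(x).shape)  # both: torch.Size([1, 64, 16, 16])
```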

With this, we have come to the end of this article. Thanks for reading and following along. Hope you loved it!

My Linkedin :)

