Introduction to Object Detection for Self Driving Cars

Foundational concepts like Gradient features, Sliding window protocol, Color Histogram features and Training a Classifier

Prateek Sawhney
14 min read · Sep 13, 2021

In previous Medium articles (see the references), we've accomplished the task of finding what we're looking for in images, for the most part using colors and gradients. Now that we're familiar with some popular classifiers, we're going to go deeper into our exploration of colors and gradients to see how we can locate and classify objects in images. Image classification is tricky, and it becomes even trickier when we don't know exactly where in an image our objects of interest will appear, what size they'll be, or even how many of them we might find.

Image by Dan Gold on Unsplash

Coming up, we'll focus on the task of detecting vehicles in images taken from a camera mounted on the front of a car. But the same principles apply to pedestrian detection, traffic sign detection, or identifying any object we might be looking for in an image. In this Medium article, we'll practice searching for objects in images and explore which features are the most useful input for our classifier. With those tools, we can then work on tracking vehicles in a video stream.

Introduction

Knowing where the other vehicles are on the road, and being able to anticipate where they're heading next, is essential in a self-driving car, as is determining how far away they are, which way they're going, and how fast they're moving, much the same way we do with our own eyes as we drive. Object detection and tracking are core problems in advanced computer vision. We'll first explore what kind of visual features we can extract from images in order to reliably classify vehicles. Next, we'll look into searching an image for detections. And then we'll track those detections from frame to frame in a video stream.

Image by Evgeny Tchebotarev on Unsplash

With traditional computer vision approaches, we basically tune all the parameters by hand, which gives us a lot of intuition about what works and why. When it comes to things like image classification, a deep learning approach often works even better for the same tasks, but it can seem like a black box: we're not quite sure why it works. By learning both approaches, we can gain maximum insight and get the best performance from our algorithms.

Object Detection Overview

Recognizing what's in an image is the essence of computer vision. When we look at the world through our own eyes, we're constantly performing classification tasks with our brain. And in the case of self-driving cars, reliable object detection and classification are essential to giving the car the ability to see the world just like we do. We'll also cover methods for separating false positives from real detections, and for tracking the real detections from one frame to the next. Like any good machine learning problem, it all begins with feature extraction. So, first we'll look at how to compute features that we can use to reliably identify cars in our images.

Introduction to Features

We need to figure out what differentiates the objects of interest from the rest of the image. We’ve already seen that things like colors and gradients can be good differentiators but let’s give them an identity. All of these potential characteristics are features that we could use. What features are more important may depend upon the appearance of the objects in question. In most applications, we’ll end up using a combination of features that give us the best results.

Color Features

The simplest feature we can get from images consists of raw color values. For instance, here is an image which contains a car.

Sample Image (Image by author)

Let’s say you want to find out whether some region of a new test image contains a car or not. What do we do?

Well, using our known car image as is, we can simply compute the difference between the car image and the test region, and check whether that difference is small. This basically means subtracting the corresponding color values, aggregating the differences, and comparing the result with a threshold.

Alternatively, we could compute the correlation between the car image and test region and check if that is high. Either way, this general approach is known as template matching. Our known image is the template or model and we try to match it with regions of the test image. Template matching does work in limited circumstances but isn’t very helpful for our case. Can you guess why?
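
To make this concrete, here is a minimal template-matching sketch using OpenCV. The file names are placeholders, and the 0.8 match threshold is an arbitrary value for illustration, not a tuned parameter:

```python
import cv2

# File names are placeholders for a known 64x64 car crop and a test frame.
template = cv2.imread("car_template.png")
image = cv2.imread("test_frame.png")

# Normalized cross-correlation between the template and every image location.
result = cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(result)

h, w = template.shape[:2]
if max_val > 0.8:  # arbitrary similarity threshold for illustration
    top_left = max_loc
    bottom_right = (top_left[0] + w, top_left[1] + h)
    print("Possible car at", top_left, bottom_right, "score:", max_val)
```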

Color Histogram Features

In template matching, we depend on raw color values laid out in a specific order, and that can vary a lot, so we need to find some transformations that are robust to changes in appearance.

An image template is useful for detecting things that do not vary much in their appearance, for instance, icons or emojis on the screen. But for most real-world objects that appear in different forms, orientations, and sizes, this technique doesn't work very well.

One such transform is to compute the histogram of color values in an image. When we compare the histogram of a known object with regions of a test image, locations with similar color distributions will reveal a close match. Here, we have removed our dependence on structure; that is, we are no longer sensitive to a perfect arrangement of pixels. Therefore, objects that appear in slightly different aspects and orientations will still be matched. Variations in image size can also be accommodated by normalizing the histograms. However, note that we are now relying solely on the distribution of color values, which might match some unwanted regions.
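
As a rough sketch of such a feature extractor, we can histogram each color channel separately and concatenate the results. The function name, bin count, and 8-bit value range below are illustrative choices, not part of any particular library:

```python
import numpy as np

def color_hist(img, nbins=32, bins_range=(0, 256)):
    """Concatenate per-channel histograms of a 3-channel, 8-bit image."""
    ch1 = np.histogram(img[:, :, 0], bins=nbins, range=bins_range)[0]
    ch2 = np.histogram(img[:, :, 1], bins=nbins, range=bins_range)[0]
    ch3 = np.histogram(img[:, :, 2], bins=nbins, range=bins_range)[0]
    # Normalizing by the pixel count makes the feature robust to image size.
    return np.concatenate((ch1, ch2, ch3)).astype(np.float64) / img[:, :, 0].size
```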

Color Spaces

Whether we use raw colors directly or build a histogram of those values, we still haven't solved the problem of representing objects of the same class that can be of different colors. That is, we still haven't achieved color invariance. If we take an image of a car and analyze how its color values are distributed in the RGB color space, that is, using red, green, and blue intensities as three coordinates, the pixels of a red car and a blue car, for example, cluster into two separate groups. Although we could come up with a scheme to identify these groups using RGB values, it can get complicated very quickly as we try to accommodate different colors.
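
Other color spaces, such as HSV, can make this easier because the "color" of an object is concentrated in fewer channels. Here is a small sketch, assuming OpenCV is available and the image file name is a placeholder:

```python
import cv2

img = cv2.imread("car.png")                 # placeholder file name; OpenCV loads BGR
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)  # hue, saturation, value channels

# In HSV, most of the "color" of a car lives in the hue channel, so cars of
# different colors separate along one axis instead of all three RGB axes.
pixels = hsv.reshape(-1, 3)
print("mean H, S, V:", pixels.mean(axis=0))
```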

Gradient and HOG Features

So, far we’ve been manipulating and transforming color values. But they only capture one aspect of an object’s appearance. When we have a class of objects that can vary in color, structural ques like gradients or edges might give us a more robust presentation. Let’s look at this realistic image, for example, this 64 by 64 pixel image of a car. And let’s compute the gradient magnitudes and directions at each pixel.

64 x 64 Pixel Image of a car (Image by author)

Now, instead of using all these individual values, let’s group them up into small cells, say, of size 8 by 8 pixels each. Inside each cell is where the magic happens. We will compute a histogram of gradient directions or orientations from each of the 64 pixels within the cell. The gradient samples are distributed into, say, nine orientation bins, and summed up. We typically won’t need any more accuracy at this small scale. A better way to visualize the histogram for an individual cell would be to add up the contributions for samples in each orientation bin to get a sort of star with arms of different lengths. The direction with the longest arm is the dominant gradient direction in the cell. Note that the histogram is not strictly a count of the number of samples in each direction. Instead, we sum up the gradient magnitude of each sample. So stronger gradients contribute more weight to their orientation bin, and the effect of small random gradients due to noise, etc., is reduced.

Color L Channel and HOG Visualization (Image by author)

In other words, each pixel in the image gets a vote on which histogram bin it belongs in based on the gradient direction at that position. But the strength or weight of that vote depends on the gradient magnitude at that pixel. When we do this for all the cells, we begin to see a representation of the original structure emerge. As demonstrated with simpler shapes before, something like this can be used as a signature for a given shape. This is known as a histogram of oriented gradients, or HoG feature.

The main advantage now is that we have built in the ability to accept small variations in the shape, while keeping the signature distinct enough. How accommodating or sensitive the feature is can be tweaked by varying parameters such as the number of orientation bins, grid of cells, cell sizes, adding any overlap between cells, etc. A number of other enhancements are used in practice, including normalizing for intensity across small blocks of cells, etc.
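
One common way to compute such a feature in Python is scikit-image's hog function. The sketch below assumes a single-channel input and the parameter choices discussed above (9 orientation bins, 8 by 8 pixel cells); the wrapper name is our own:

```python
from skimage.feature import hog

def get_hog_features(gray_img, orientations=9, pix_per_cell=8, cell_per_block=2):
    """Return a flattened HOG feature vector for a single-channel image."""
    return hog(gray_img,
               orientations=orientations,
               pixels_per_cell=(pix_per_cell, pix_per_cell),
               cells_per_block=(cell_per_block, cell_per_block),
               block_norm="L2-Hys",   # per-block intensity normalization
               feature_vector=True)
```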

Combining Features

As noted before, it's not necessary to use only one kind of feature for object detection. We can combine both color-based and shape-based features. After all, they complement each other in the information they capture about a desired object. In fact, a variety of features can help us design a more robust detection system. However, we do need to be careful about how we use them. For example, assume that we are using HSV values as one input feature, with the flattened vector containing a elements, and HOG as the other feature, with b elements. The simplest way of combining them is to concatenate the two, HSV and HOG, into a long a + b element vector. If we visualize this vector as a simple bar plot, we might notice a difference in magnitude between the color-based and gradient-based features. This is because they represent different quantities. A normalization step may prevent one type from dominating the other in later stages. Also note that there might be a lot more elements of one type than the other. This may or may not be a problem in itself, but it's generally a good idea to see if there are any redundancies in the combined feature vector. For instance, we could use a decision tree to analyze the relative importance of features, and drop the ones that are not contributing much.
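
A minimal sketch of this concatenation and normalization step, assuming per-image color and HOG vectors produced by helpers like the ones sketched earlier, and using scikit-learn's StandardScaler for the column-wise normalization:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def combine_and_scale(color_features, hog_features):
    """Concatenate per-image color and HOG vectors, then normalize per column."""
    combined = [np.concatenate((c, h)) for c, h in zip(color_features, hog_features)]
    X = np.vstack(combined).astype(np.float64)   # one a + b element row per image
    scaler = StandardScaler().fit(X)             # keeps one feature type from dominating
    return scaler.transform(X), scaler
```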

Building a Classifier

A classic approach is to first design a classifier that can differentiate car images from non-car images, and then run that classifier across an entire frame, sampling small patches along the way. The patches that are classified as car are the desired detections. For this approach to work properly, we must train our classifier to distinguish car and non-car images.

To train any classifier, we need labeled data. Lots of it. In this case, the two classes we would like to distinguish are car and non-car images. So we need samples of both. If we only have full video frames available, we need to crop out regions and scale them to a fixed size.

A kind of Dataset needed (Image by author)

Ideally we want a balanced data set, that is, the number of car and non-car images should be roughly equal. If that's not the case, we run the risk of the classifier trying to predict everything as belonging to the majority class. There are some techniques for handling imbalanced data sets, for example, duplicating some samples from the smaller class to balance the counts. For vehicle classification, if we don't have enough non-car images we could simply extract more from video frames. Okay, once we have a sizable data set, we need to split it into two collections: a training set and a test set. As the names suggest, we will only use images from the training set when training our classifier, and then check how it performs on unseen examples from the test set. To avoid any possible ordering effects in the data, we should shuffle the data set randomly when splitting it for training and testing. Even within the training and test sets, we should aim for a balance between the number of car and non-car images. All of this preprocessing might seem like a lot of work, but machine learning algorithms work on a principle of garbage in, garbage out. So we need to be careful about what we feed them.
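
A hedged sketch of the shuffle-and-split step using scikit-learn's train_test_split. The feature matrix here is stubbed with random numbers purely so the snippet runs on its own; in practice it would come from the feature extraction steps above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in feature matrix: rows are car samples followed by non-car samples.
n_cars, n_notcars, n_features = 1000, 1000, 1764
X = np.random.rand(n_cars + n_notcars, n_features)
y = np.concatenate((np.ones(n_cars), np.zeros(n_notcars)))  # 1 = car, 0 = non-car

# Shuffle and split so ordering effects in the data don't leak into the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True)
```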

Training a Classifier

The training phase essentially consists of extracting features for each sample in the training set, and supplying these feature vectors to the training algorithm, along with corresponding labels. The training algorithm initializes a model, and then tweaks its parameters using the feature vectors and labels. Typically, this involves an iterative procedure where one or more samples are presented to the classifier at a time, which then predicts their labels. The error between these predicted labels and ground-truth is used as a signal to modify the parameters. When this error falls below a certain threshold, or when enough iterations have passed, we can consider the model to have been sufficiently trained. Now, we can verify how it performs on previously unseen examples using the test set.

The error on the test set is typically larger than that on the training set, which is expected. Also, both errors typically decrease the more we train our model. However, we have to be careful about one thing. If we keep training beyond a certain point, our training error may keep decreasing, but our test error will begin to increase again. This is known as overfitting. Our model fits the training data very well, but is unable to generalize to unseen examples. One thing we haven’t talked about yet is the choice of what classifier to use. That’s because it might require some experimentation to figure out what classifier works best for a given problem. In this case, we’re going to start with support vector machines. But we’re free to choose what classifier we ultimately use. It could even be a combination or ensemble of multiple classifiers.
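
As a small illustration, training and scoring a linear SVM with scikit-learn might look like the following, reusing the X_train/X_test split from the previous sketch:

```python
from sklearn.svm import LinearSVC

svc = LinearSVC(max_iter=10000)                 # linear SVM: fast to train and evaluate
svc.fit(X_train, y_train)                       # fit on the training split only
print("Test accuracy:", round(svc.score(X_test, y_test), 4))
```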

Sliding Windows Approach

Okay, so now we've decided which features to extract from each image, and we've trained a classifier using labeled data. Great work. The next step is to implement a method of searching for objects, in this case vehicles, in an image.

So how are we going to do that?

Well, we’ve seen that we can consider cutouts or subregions of an image, and run our classifier on each subregion to see if it contains the object we’re trying to detect. So what we’ll do next is implement a sliding window technique, where we’ll step across an image in a grid pattern and extract the same features we trained our classifier on in each window. We’ll run our classifier to give a prediction at each step. And with any luck, it will tell us which windows in our image contains cars.

Sliding window technique (Image by author)

In general, we won't know what size our object of interest will be in the image we're searching, so it makes sense to search at multiple scales. In this case, it's a good idea to establish a minimum and a maximum scale at which we expect the object to appear, and then a reasonable number of intermediate scales to scan as well. The thing to be careful about here is that the total number of windows we're searching can increase rapidly, which means our algorithm will run slower. We're looking for vehicles, so it makes sense to restrict our search to only the areas of the image where vehicles might appear. Furthermore, when it comes to scale, we know, for example, that vehicles that appear small will be near the horizon, so searching at small scales can be restricted to an even narrower strip across the image.
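
A minimal sliding-window generator restricted to a horizontal band of the image might look like this; the window sizes, overlap, and band limits below are illustrative values, not tuned ones:

```python
def slide_window(img_shape, y_start, y_stop, window=64, overlap=0.5):
    """Yield (x1, y1, x2, y2) windows over a horizontal band of the image."""
    height, width = img_shape[:2]
    step = int(window * (1 - overlap))
    for y in range(y_start, min(y_stop, height) - window + 1, step):
        for x in range(0, width - window + 1, step):
            yield (x, y, x + window, y + window)

# Small windows only near the horizon; larger windows lower in the frame.
small_windows = list(slide_window((720, 1280), y_start=400, y_stop=500, window=64))
large_windows = list(slide_window((720, 1280), y_start=400, y_stop=680, window=128))
```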

False Positives

We now have a scheme for searching across the image for possible detections, but we'll notice that our classifier is not perfect. In some cases, it will report multiple overlapping instances of the same car or even report cars where there are none. These are known as duplicates and false positives, and we'll need to filter them out. By correctly combining duplicate detections and rejecting false positives, we are performing the task of identifying where vehicles are on the road, but equally important, where they are not. False positives that are not properly filtered out can lead to taking actions like emergency braking when it's not necessary and, potentially, to causing an accident. In order to avoid running into another car, we would like to get the best estimate possible for the position and size of the cars we detect. That means whether it's a single detection or multiple detections on the same car, a tight bounding box for each car is what we're aiming for. These bounding boxes are ultimately going to be used by our path-planning or motion control algorithms to steer clear of the other vehicles.

Bounding boxes over the detections (Image by author)
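
One common way to combine duplicate detections and reject false positives is a heat map: every positive window adds "heat", weak regions are thresholded away, and the connected regions that remain become tight bounding boxes. A sketch, assuming the window coordinates come from the sliding-window search above:

```python
import numpy as np
from scipy.ndimage import label

def heatmap_boxes(image_shape, hot_windows, threshold=2):
    """Merge overlapping detections and reject weak ones via a thresholded heat map."""
    heat = np.zeros(image_shape[:2], dtype=np.float32)
    for (x1, y1, x2, y2) in hot_windows:   # each positive window adds heat
        heat[y1:y2, x1:x2] += 1
    heat[heat <= threshold] = 0            # drop regions with too few detections
    labels, n_cars = label(heat)           # connected regions = individual cars
    boxes = []
    for car in range(1, n_cars + 1):
        ys, xs = np.nonzero(labels == car)
        boxes.append((xs.min(), ys.min(), xs.max(), ys.max()))  # tight bounding box
    return boxes
```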

Tracking Pipeline

Okay, let’s summarize our overall tracking pipeline.

  1. In each frame of the video, we will run a search for vehicles using a sliding window technique.
  2. Wherever our classifier returns a positive detection, we’ll record the position of the window in which the detection was made.
  3. In some cases we might detect the same vehicle in overlapping windows or at different scales. In the case of overlapping detections we’re going to assign the position of the detection to the centroid of the overlapping windows.
  4. We also have false positives, which we'll filter out by determining which detections appear in one frame but not the next. Once we have a high-confidence detection, we can record how its centroid is moving from frame to frame and eventually estimate where it will appear in each subsequent frame (see the sketch after this list).
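
A compact sketch of this frame-to-frame filtering, assuming per-frame positive windows from the sliding-window search and SciPy's connected-component labeling; the class and its parameters are illustrative, not a reference implementation:

```python
import numpy as np
from scipy.ndimage import label

class VehicleTracker:
    """Keep detections that persist over several frames and report their centroids."""

    def __init__(self, history=5, min_hits=3):
        self.history = history      # number of recent frames to accumulate
        self.min_hits = min_hits    # accumulated heat required to count as a vehicle
        self.recent_heat = []

    def update(self, image_shape, hot_windows):
        heat = np.zeros(image_shape[:2], dtype=np.float32)
        for (x1, y1, x2, y2) in hot_windows:    # positive windows from this frame
            heat[y1:y2, x1:x2] += 1
        self.recent_heat = (self.recent_heat + [heat])[-self.history:]
        # Summing heat over recent frames: one-off false positives stay below the
        # threshold, while vehicles detected frame after frame accumulate heat.
        summed = np.sum(self.recent_heat, axis=0)
        summed[summed < self.min_hits] = 0
        labels, n_vehicles = label(summed)
        centroids = []
        for car in range(1, n_vehicles + 1):
            ys, xs = np.nonzero(labels == car)
            centroids.append((float(xs.mean()), float(ys.mean())))
        return centroids
```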

Summary

First we need to decide what features to use. We'll want to try some combination of color and gradient based features, but keep in mind that this might require some experimentation to decide what works best. Next, we'll need to choose and train a classifier. A linear SVM is probably our best bet for an ideal combination of speed and accuracy. Once we've chosen features and trained a classifier, we'll implement a sliding window technique to search for vehicles in some test images. We can try multiscale search or different tiling schemes to see what works best. But keep in mind, we'd like to minimize the number of search windows. So, for example, we probably don't need to search for cars in the sky and the treetops. Once we've got a working detection pipeline, we'll try it on a video stream, implement tracking to follow detected vehicles, and reject spurious detections.
