YOLO (You Only Look Once) is one of the most popular object-detector convolutional neural networks (CNNs). After Joseph Redmon et al. published the first YOLO paper in 2015, they published subsequent versions in 2016 and 2017, and Alexey Bochkovskiy et al. published a further version in 2020. This article is the first in a series that provides an overview of how the YOLO CNN has evolved from the first version to the latest.

1. YOLO v1 — Motivation:

Before the invention of YOLO, object-detector CNNs such as R-CNN first used region proposal methods (e.g., Selective Search) to generate bounding-box proposals on the input image, then ran a classifier on each proposed box, and finally applied post-processing to eliminate duplicate detections and refine the bounding boxes. The individual stages of the R-CNN pipeline had to be trained separately, which made the network hard to optimize as well as slow.

The creators of YOLO were motivated to design a single-stage CNN that could be trained end to end, was easy to optimize, and ran in real time.

2. YOLO v1 — Conceptual design:


Figure 1: YOLO version 1 conceptual design

As shown in the left image of figure 1, YOLO divides the input image into an S x S grid of cells. As shown in the middle top image, each grid cell predicts B bounding boxes and an “objectness” score P(Object) indicating whether the grid cell contains an object or not. As shown in the middle bottom image, each grid cell also predicts the conditional probability P(Class | Object) of the class that the object contained by the grid cell belongs to.
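The grid-cell assignment above can be sketched in a few lines. This is an illustrative helper, not the paper's code; the 448 x 448 input size matches YOLO v1's training resolution, and S = 7 matches its PASCAL VOC setting.

```python
# Sketch: map an object's center point to the grid cell responsible for it,
# assuming a 448x448 input image divided into an S x S grid (S = 7 in YOLO v1).
def grid_cell_for_center(cx, cy, image_size=448, S=7):
    """Return (row, col) of the grid cell containing the point (cx, cy)."""
    cell_size = image_size / S  # 64 pixels per cell for 448 / 7
    col = int(cx // cell_size)
    row = int(cy // cell_size)
    return row, col

# An object centered at pixel (100, 300) lands in row 300 // 64 = 4,
# column 100 // 64 = 1, so cell (4, 1) is responsible for detecting it.
print(grid_cell_for_center(100, 300))  # (4, 1)
```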

For each bounding box, YOLO predicts five parameters: x, y, w, h, and a confidence score. The coordinates (x, y) denote the center of the bounding box relative to the grid cell, so the values of x and y are bounded between 0 and 1. The width w and height h of the bounding box are predicted as fractions of the width and height of the whole image, so their values are also between 0 and 1. The confidence score indicates both whether the bounding box contains an object and how accurate the box is. If the bounding box does not contain an object, the confidence score should be zero. If it does, the confidence score should equal the Intersection over Union (IoU) of the predicted bounding box and the ground truth. Thus, for each grid cell, YOLO predicts B x 5 box parameters.

For each grid cell, YOLO predicts C class probabilities. These class probabilities are conditional based on an object existing in the grid cell. YOLO only predicts one set of C class probabilities per grid cell even though the grid cell has B bounding boxes. Thus for each grid cell, YOLO predicts C + B x 5 parameters.

Total prediction tensor for an image = S x S x (C + B x 5). For PASCAL VOC dataset, YOLO uses S = 7, B = 2 and C = 20. Thus final YOLO prediction for PASCAL VOC is a 7 x 7 x (20 + 5 x 2) = 7 x 7 x 30 tensor.
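The tensor arithmetic above can be verified directly. This is just the shape calculation from the text, not model code:

```python
# Sketch: size of YOLO v1's prediction tensor for the PASCAL VOC settings.
S, B, C = 7, 2, 20
per_cell = C + B * 5        # 20 class probs + 2 boxes x (x, y, w, h, confidence) = 30
print((S, S, per_cell))     # (7, 7, 30)
print(S * S * per_cell)     # 1470 numbers predicted per image
```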

Finally, YOLO version 1 applies Non-Maximum Suppression (NMS) and thresholding to report its final predictions, as shown in the right image of figure 1.
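A minimal sketch of greedy NMS, for illustration only (the paper does not specify this exact implementation): sort detections by score, repeatedly keep the highest-scoring box, and discard any remaining box that overlaps it beyond a threshold.

```python
# Sketch: greedy Non-Maximum Suppression over scored boxes.
# Each detection is (score, (x_min, y_min, x_max, y_max)).
def _iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(detections, iou_threshold=0.5):
    """Keep the best-scoring boxes; suppress boxes overlapping a kept box."""
    remaining = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining
                     if _iou(best[1], d[1]) < iou_threshold]
    return kept

# Two heavily overlapping boxes plus one separate box: the weaker duplicate
# (score 0.8) is suppressed, the distant box (score 0.7) survives.
dets = [(0.9, (0, 0, 10, 10)), (0.8, (1, 1, 11, 11)), (0.7, (50, 50, 60, 60))]
print([score for score, _ in nms(dets)])  # [0.9, 0.7]
```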


Evolution of YOLO — YOLO version 1