How to Implement a YOLO (v3) Object Detector using PyTorch (Part 4)

Part 4 of the tutorial series on how to implement a YOLO v3 object detector from scratch using PyTorch.

This is Part 4 of the tutorial on implementing a YOLO v3 detector from scratch. In the last part, we implemented the forward pass of our network. In this part, we threshold our detections by an object confidence followed by non-maximum suppression.

The code for this tutorial is designed to run on Python 3.5 and PyTorch 0.4. It can be found in its entirety at this GitHub repo.

This tutorial is broken into 5 parts:

Part 1 : Understanding How YOLO works
Part 2 : Creating the layers of the network architecture
Part 3 : Implementing the forward pass of the network
Part 4 (This one): Confidence Thresholding and Non-maximum Suppression
Part 5 : Designing the input and the output pipelines

Prerequisites

Parts 1-3 of the tutorial.
Basic working knowledge of PyTorch, including how to create custom architectures with nn.Module, nn.Sequential and torch.nn.parameter classes.
Basic knowledge of NumPy

In case you’re lacking on any front, there are links below the post for you to follow.

In the previous parts, we have built a model which outputs several object detections given an input image. To be precise, our output is a tensor of shape B x 10647 x 85. B is the number of images in a batch, 10647 is the number of bounding boxes predicted per image, and 85 is the number of bounding box attributes.
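As a quick sanity check on those numbers, the sketch below recomputes them. It assumes a 416 x 416 input, detection strides of 32, 16 and 8, and 3 anchors per grid cell, which is the standard YOLO v3 configuration; none of these values are restated in this part, so treat them as assumptions.

```python
# Assumed settings (standard YOLO v3 configuration, not stated in this part)
input_dim = 416          # network input height and width
strides = (32, 16, 8)    # one detection scale per stride
anchors_per_cell = 3     # boxes predicted by each grid cell
num_classes = 80         # class count used in this tutorial

# Each scale contributes (input_dim / stride)^2 cells, each with 3 boxes.
boxes_per_image = sum((input_dim // s) ** 2 for s in strides) * anchors_per_cell

# 4 box coordinates + 1 objectness score + one score per class.
attrs_per_box = 4 + 1 + num_classes

print(boxes_per_image, attrs_per_box)
```

With these assumptions the grid sizes are 13, 26 and 52, which gives the 10647 boxes and 85 attributes quoted above.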

However, as described in Part 1, we must subject our output to objectness score thresholding and non-maximum suppression to obtain what I will call, in the rest of this post, the true detections. To do that, we will create a function called write_results in the file util.py.

def write_results(prediction, confidence, num_classes, nms_conf = 0.4):

The function takes as input the prediction, confidence (objectness score threshold), num_classes (80, in our case) and nms_conf (the NMS IoU threshold).

Object Confidence Thresholding

Our prediction tensor contains information about B x 10647 bounding boxes. For each bounding box with an objectness score below the threshold, we set the value of every attribute (the entire row representing the bounding box) to zero.

    conf_mask = (prediction[:,:,4] > confidence).float().unsqueeze(2)
    prediction = prediction*conf_mask
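To see what the mask does, here is a small sketch on a toy tensor. The tensor values, the 6-attribute rows (instead of 85) and the 0.5 threshold are made up for illustration; only the two masking lines mirror the code above.

```python
import torch

# Toy prediction: 1 image, 3 boxes, 6 attributes each
# (x, y, w, h, objectness, one class score) instead of the real 85.
prediction = torch.tensor([[
    [10.0, 10.0, 4.0, 4.0, 0.9, 0.8],
    [20.0, 20.0, 6.0, 2.0, 0.3, 0.7],   # objectness below the threshold
    [30.0, 30.0, 2.0, 8.0, 0.6, 0.5],
]])
confidence = 0.5  # illustrative objectness threshold

# (B, N) comparison -> float -> (B, N, 1), so the mask broadcasts
# across every attribute of each box when multiplied.
conf_mask = (prediction[:, :, 4] > confidence).float().unsqueeze(2)
thresholded = prediction * conf_mask

print(thresholded)
```

The second row comes out as all zeros, while the other two rows are unchanged. The unsqueeze(2) is what lets a per-box mask zero out the whole row.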

Performing Non-maximum Suppression

Note: I assume you understand what IoU (Intersection over Union) and non-maximum suppression are. If that is not the case, refer to the links at the end of the post.

The bounding box attributes we have now are described by the center coordinates, as well as the width and height of the bounding box. However, it’s easier to calculate the IoU of two boxes using the coordinates of a pair of diagonal corners of each box. So, we transform the (center x, center y, width, height) attributes of our boxes to (top-left corner x, top-left corner y, bottom-right corner x, bottom-right corner y).

    box_corner = prediction.new(prediction.shape)
    box_corner[:,:,0] = (prediction[:,:,0] - prediction[:,:,2]/2)
    box_corner[:,:,1] = (prediction[:,:,1] - prediction[:,:,3]/2)
    box_corner[:,:,2] = (prediction[:,:,0] + prediction[:,:,2]/2) 
    box_corner[:,:,3] = (prediction[:,:,1] + prediction[:,:,3]/2)
    prediction[:,:,:4] = box_corner[:,:,:4]
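The same arithmetic can be checked by hand on a single box. The helper below is a hypothetical plain-Python illustration, not part of util.py; it treats the third attribute as the width and the fourth as the height, matching how the code above uses indices 2 and 3.

```python
def center_to_corner(cx, cy, w, h):
    """Convert (center x, center y, width, height) to corner coordinates."""
    x1 = cx - w / 2   # top-left corner x
    y1 = cy - h / 2   # top-left corner y
    x2 = cx + w / 2   # bottom-right corner x
    y2 = cy + h / 2   # bottom-right corner y
    return (x1, y1, x2, y2)

# A box centered at (50, 40) that is 20 wide and 10 tall.
corners = center_to_corner(50.0, 40.0, 20.0, 10.0)
print(corners)  # (40.0, 35.0, 60.0, 45.0)
```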

The number of true detections in every image may be different. For example, in a batch of size 3, images 1, 2 and 3 may have 5, 2 and 4 true detections respectively. Therefore, confidence thresholding and NMS have to be done one image at a time. This means we cannot vectorise the operations involved, and must loop over the first dimension of prediction (which indexes the images in a batch).

    batch_size = prediction.size(0)

    write = False

    for ind in range(batch_size):
        image_pred = prediction[ind]          #image Tensor
        #confidence thresholding
        #NMS

As described previously, the write flag is used to indicate that we haven’t initialized output, a tensor we will use to collect true detections across the entire batch.

Once inside the loop, let’s clean things up a bit. Notice that each bounding box row has 85 attributes, out of which 80 are the class scores. At this point, we’re only concerned with the class score having the maximum value. So, we remove the 80 class scores from each row, and instead add the index of the class having the maximum value, as well as the class score of that class.

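        #torch.max returns (values, indices): max_conf holds the best class score,
        #max_conf_score holds the index of that class
        max_conf, max_conf_score = torch.max(image_pred[:,5:5+ num_classes], 1)
        max_conf = max_conf.float().unsqueeze(1)
        max_conf_score = max_conf_score.float().unsqueeze(1)
        #each row becomes 7 values: 4 box coordinates, objectness, class score, class index
        seq = (image_pred[:,:5], max_conf, max_conf_score)
        image_pred = torch.cat(seq, 1)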
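The following sketch runs the same reduction on a toy tensor with 3 classes instead of 80. The values and the names best_score and best_class are made up for illustration. It shows that torch.max along dimension 1 returns the maximum values first and their indices second.

```python
import torch

num_classes = 3  # toy value; the tutorial uses 80

# Two boxes: 4 corner coordinates, objectness, then 3 class scores.
image_pred = torch.tensor([
    [0.0, 0.0, 4.0, 4.0, 0.9, 0.1, 0.7, 0.2],
    [5.0, 5.0, 9.0, 9.0, 0.8, 0.6, 0.3, 0.1],
])

# Values come first, indices second.
best_score, best_class = torch.max(image_pred[:, 5:5 + num_classes], 1)

# Replace the class scores with (best score, best class index).
reduced = torch.cat(
    (image_pred[:, :5],
     best_score.unsqueeze(1),
     best_class.float().unsqueeze(1)),
    1,
)

print(reduced.shape)
print(best_class)
```

Each row shrinks from 5 + num_classes columns to 7: four coordinates, objectness, best class score, and best class index.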

Remember that we set the bounding box rows having an object confidence below the threshold to zero? Let’s get rid of them.

        non_zero_ind =  (torch.nonzero(image_pred[:,4]))
        try:
            image_pred_ = image_pred[non_zero_ind.squeeze(),:].view(-1,7)
        except:
            continue

        #For PyTorch 0.4 compatibility
        #Since the above code will not raise an exception for no detections
        #as scalars are supported in PyTorch 0.4
        if image_pred_.shape[0] == 0:
            continue

The try-except block is there to handle situations where we get no detections. In that case, we use continue to skip the rest of the loop body for this image.
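As an aside, the same filtering can be sketched with a boolean mask, which returns an empty (0, 7) tensor when nothing survives instead of relying on an exception. This is an alternative shown for illustration, not the tutorial's method; drop_zeroed_rows is a hypothetical helper, and the sketch assumes a reasonably recent PyTorch version.

```python
import torch

def drop_zeroed_rows(image_pred):
    # Hypothetical helper, not part of the tutorial's util.py.
    # Keep only the rows whose objectness score (column 4) is non-zero.
    keep = image_pred[:, 4] != 0
    return image_pred[keep]

# Toy 7-column rows: 4 coordinates, objectness, class score, class index.
some = torch.tensor([
    [1.0, 1.0, 5.0, 5.0, 0.9, 0.7, 2.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],   # zeroed by thresholding
    [2.0, 2.0, 6.0, 6.0, 0.8, 0.6, 0.0],
])
none = torch.zeros(3, 7)  # an image where every box was zeroed

kept = drop_zeroed_rows(some)
empty = drop_zeroed_rows(none)

print(kept.shape, empty.shape)
```

A caller would still check for zero rows and continue to the next image, just as the shape check above does.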

blog.paperspace.com
