Part 2 of the tutorial series on how to implement your own YOLO v3 object detector from scratch in PyTorch.

This is Part 2 of the tutorial on implementing a YOLO v3 detector from scratch. In the last part, I explained how YOLO works, and in this part, we are going to implement the layers used by YOLO in PyTorch. In other words, this is the part where we create the building blocks of our model.

The code for this tutorial is designed to run on Python 3.5 and PyTorch 0.4. It can be found in its entirety at this Github repo.

This tutorial is broken into 5 parts:

  1. Part 1 : Understanding How YOLO works
  2. Part 2 (This one): Creating the layers of the network architecture
  3. Part 3 : Implementing the forward pass of the network
  4. Part 4 : Confidence Thresholding and Non-maximum Suppression
  5. Part 5 : Designing the input and the output pipelines

Prerequisites
  • Part 1 of the tutorial/knowledge of how YOLO works.
  • Basic working knowledge of PyTorch, including how to create custom architectures with nn.Module, nn.Sequential and torch.nn.parameter classes.

I assume you have had some experience with PyTorch before. If you’re just starting out, I’d recommend playing around with the framework a bit before returning to this post.

Getting Started

First, create a directory where the code for the detector will live.

Then, create a file darknet.py. Darknet is the name of the underlying architecture of YOLO. This file will contain the code that creates the YOLO network. We will supplement it with a file called util.py which will contain the code for various helper functions. Save both of these files in your detector folder. You can use git to keep track of the changes.

Configuration File

The official code (authored in C) uses a configuration file to build the network. The cfg file describes the layout of the network, block by block. If you’re coming from a Caffe background, it’s equivalent to the .prototxt file used to describe the network.

We will use the official cfg file, released by the author, to build our network. Download it from here and place it in a folder called cfg inside your detector directory. If you’re on Linux, cd into your detector directory and type:

mkdir cfg
cd cfg
wget https://raw.githubusercontent.com/pjreddie/darknet/master/cfg/yolov3.cfg

If you open the configuration file, you will see something like this.

[convolutional]
batch_normalize=1
filters=64
size=3
stride=2
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=32
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky

[shortcut]
from=-3
activation=linear

We see 4 blocks above. Three of them describe convolutional layers, followed by a shortcut layer. A shortcut layer is a skip connection, like the one used in ResNet. There are 5 types of layers used in YOLO:

Convolutional

[convolutional]
batch_normalize=1  
filters=64  
size=3  
stride=1  
pad=1  
activation=leaky

Shortcut

[shortcut]
from=-3  
activation=linear  

A shortcut layer is a skip connection, akin to the one used in ResNet. The from parameter is -3, which means the output of the shortcut layer is obtained by adding the feature maps of the previous layer and the 3rd layer backwards from the shortcut layer.
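
To make this concrete, here is a minimal sketch (not part of darknet.py) of what such a shortcut amounts to in a forward pass, assuming we keep each layer’s output in a hypothetical list called outputs:

import torch

# Hypothetical example: outputs[i] holds the feature map produced by layer i.
outputs = [torch.randn(1, 64, 52, 52) for _ in range(4)]   # dummy feature maps

i = 4          # index of the shortcut layer
from_ = -3     # the "from" value in the cfg

# Element-wise sum of the previous layer's map and the one 3 layers back.
# Both feature maps must have identical shapes for this to work.
x = outputs[i - 1] + outputs[i + from_]
print(x.shape)   # torch.Size([1, 64, 52, 52])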

Upsample

[upsample]
stride=2

Upsamples the feature map from the previous layer by a factor of stride using bilinear upsampling.
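
One way to get this behaviour in PyTorch is nn.Upsample; the snippet below only illustrates the effect, it is not the code we add to darknet.py:

import torch
import torch.nn as nn

# Illustration only: upsample a dummy feature map by a factor of 2.
upsample = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

x = torch.randn(1, 256, 13, 13)
print(upsample(x).shape)   # torch.Size([1, 256, 26, 26])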

Route

[route]
layers = -4

[route]
layers = -1, 61

The route layer deserves a bit of explanation. It has an attribute layers which can have either one, or two values.

When the layers attribute has only one value, it outputs the feature maps of the layer indexed by that value. In our example, it is -4, so the layer will output the feature map from the 4th layer backwards from the Route layer.

When layers has two values, it returns the concatenated feature maps of the layers indexed by its values. In our example it is -1, 61, so the layer will output feature maps from the previous layer (-1) and the 61st layer, concatenated along the depth dimension.
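
The sketch below, with hypothetical dummy feature maps, shows what the two-valued case boils down to:

import torch

# Dummy feature maps standing in for the outputs of layer -1 and layer 61.
map_prev = torch.randn(1, 256, 26, 26)
map_61   = torch.randn(1, 256, 26, 26)

# Concatenation along the depth (channel) dimension, which is dim 1 in PyTorch.
routed = torch.cat((map_prev, map_61), 1)
print(routed.shape)   # torch.Size([1, 512, 26, 26])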

YOLO

[yolo]
mask = 0,1,2
anchors = 10,13,  16,30,  33,23,  30,61,  62,45,  59,119,  116,90,  156,198,  373,326
classes=80
num=9
jitter=.3
ignore_thresh = .5
truth_thresh = 1
random=1

The YOLO layer corresponds to the Detection layer described in Part 1. The anchors attribute describes 9 anchors, but only the anchors indexed by the values of the mask attribute are used. Here, the value of mask is 0,1,2, which means the first, second and third anchors are used. This makes sense, since each cell of the detection layer predicts 3 boxes. In total, we have detection layers at 3 scales, making up a total of 9 anchors.
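
A quick sketch of how the mask picks its anchors (the variable names below are only illustrative):

# The 9 anchors listed in the cfg, as (width, height) pairs.
anchors = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
           (59, 119), (116, 90), (156, 198), (373, 326)]
mask = [0, 1, 2]

# Only the anchors indexed by the mask are used at this detection layer.
used_anchors = [anchors[i] for i in mask]
print(used_anchors)   # [(10, 13), (16, 30), (33, 23)]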

Net

[net]
# Testing
batch=1
subdivisions=1
# Training
# batch=64
# subdivisions=16
width=320
height=320
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

There’s another type of block called net in the cfg, but I wouldn’t call it a layer as it only describes information about the network input and training parameters. It isn’t used in the forward pass of YOLO. However, it does provide us with information like the network input size, which we use to adjust anchors in the forward pass.
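
For illustration, once the cfg has been parsed (see parse_cfg below), the net block ends up as an ordinary dictionary of strings, so the input size has to be cast before it can be used:

# Hypothetical parsed form of the [net] block shown above (values are strings).
net_block = {"type": "net", "batch": "1", "width": "320", "height": "320", "channels": "3"}

inp_dim = int(net_block["height"])   # network input height, e.g. 320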

Parsing the configuration file

Before we begin, add the necessary imports at the top of the darknet.py file.

from __future__ import division

import torch 
import torch.nn as nn
import torch.nn.functional as F 
from torch.autograd import Variable
import numpy as np

We define a function called parse_cfg, which takes the path of the configuration file as the input.

def parse_cfg(cfgfile):
    """
    Takes a configuration file

    Returns a list of blocks. Each block describes a block in the neural
    network to be built. A block is represented as a dictionary in the list

    """

The idea here is to parse the cfg, and store every block as a dict. The attributes of the blocks and their values are stored as key-value pairs in the dictionary. As we parse the cfg, we keep appending these dicts, denoted by the variable block in our code, to a list blocks. Our function will return this blocks list.
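
For example, the first convolutional block shown earlier would end up as a dictionary like the one below (every value is kept as a string):

{'type': 'convolutional', 'batch_normalize': '1', 'filters': '64',
 'size': '3', 'stride': '2', 'pad': '1', 'activation': 'leaky'}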

We begin by saving the content of the cfg file in a list of strings. The following code performs some preprocessing on this list.

file = open(cfgfile, 'r')
lines = file.read().split('\n')                        # store the lines in a list
lines = [x for x in lines if len(x) > 0]               # get rid of the empty lines
lines = [x for x in lines if x[0] != '#']              # get rid of comments
lines = [x.rstrip().lstrip() for x in lines]           # get rid of fringe whitespace

Then, we loop over the resultant list to get blocks.

block = {}
blocks = []

for line in lines:
    if line[0] == "[":               # This marks the start of a new block
        if len(block) != 0:          # If block is not empty, it is storing values of the previous block
            blocks.append(block)     # add it to the blocks list
            block = {}               # re-init the block
        block["type"] = line[1:-1].rstrip()     
    else:
        key,value = line.split("=") 
        block[key.rstrip()] = value.lstrip()
blocks.append(block)

return blocks
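
A quick way to sanity-check the function (assuming the cfg file was saved to cfg/yolov3.cfg as described above):

blocks = parse_cfg("cfg/yolov3.cfg")

print(len(blocks))         # number of parsed blocks; the first one is the [net] block
print(blocks[0]["type"])   # net
print(blocks[1]["type"])   # convolutional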
