Object detection is one of the central tasks in computer vision, and one with clear practical value. From autonomous driving to surveillance, a well-trained object detector is a core component of many real-world systems.

Recent advances in deep-learning-based computer vision, driven primarily by the Convolutional Neural Network (CNN) architecture and more recently by the Transformer architecture, have put a number of excellent object detectors at the disposal of the computer vision practitioner. Among CNN-based models, a series of two-stage detectors has been developed. These include Fast R-CNN and Faster R-CNN, two go-to designs for practitioners. As the name suggests, these designs process the image in two stages: in the first stage the network learns to propose good regions of interest (RoIs), and in the second stage the RoIs are linked to the objects to be detected.

As can be imagined, the two-stage design makes these models slower to train, and hence Single Shot Detectors (SSDs) were developed, which require only a single pass through the image. In these designs the regions of interest take the form of anchor boxes, and the network localizes and classifies objects simultaneously. Examples of this architecture include the original SSD, YOLO, RetinaNet and EfficientDet. While the initial single shot detectors were not as accurate as their two-stage counterparts, recent revisions have greatly improved their accuracy, and their faster training times make them highly desirable for practical applications.

The performance of deep learning architectures often depends on carefully chosen hyper-parameters, and, not surprisingly, single shot detectors are no exception. In particular, the anchor scales and anchor ratios are prime examples of such parameters. Together with the size and shape of the input image (such as 512x512 or 1024x1024), they determine the overall accuracy of the trained model. Let us look deeper into how we can determine the best values of these parameters for a task. For our example, we will work with the task of detecting the helmets of NFL players in images taken at different angles. This dataset was provided as part of the recent NFL 1st and Future Kaggle Challenge. We will use EfficientDet as the model under study. The results presented are for training with compound coefficient 0 (512x512 image) and batch size 4 (due to GPU restrictions).
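To make the role of anchor scales and ratios concrete, here is a minimal sketch of how they translate into anchor box shapes on a 512x512 input. It is not the EfficientDet implementation. The pyramid levels 3 to 7, the anchor scale of 4.0, the three octave scales and the three aspect ratios are assumed values chosen for illustration, and the helper names `anchor_shapes` and `best_shape_iou` are hypothetical. The convention that the ratio is width divided by height is also an assumption.

```python
import math


def anchor_shapes(stride, anchor_scale, octave_scales, aspect_ratios):
    """Return (width, height) pairs in pixels for one feature-map level."""
    shapes = []
    for octave in octave_scales:
        # Side length of the square anchor at this stride and octave.
        base = anchor_scale * stride * octave
        for ratio in aspect_ratios:  # assumed convention: ratio = width / height
            width = base * math.sqrt(ratio)
            height = base / math.sqrt(ratio)
            shapes.append((width, height))
    return shapes


def best_shape_iou(box_w, box_h, shapes):
    """Best IoU between a box and the anchor shapes, with centers aligned."""
    best = 0.0
    for width, height in shapes:
        inter = min(box_w, width) * min(box_h, height)
        union = box_w * box_h + width * height - inter
        best = max(best, inter / union)
    return best


# Assumed, illustrative settings (not taken from the article).
LEVELS = range(3, 8)                              # feature strides 8 .. 128
ANCHOR_SCALE = 4.0
OCTAVE_SCALES = [2 ** (i / 3) for i in range(3)]  # 1.0, ~1.26, ~1.59
ASPECT_RATIOS = [0.5, 1.0, 2.0]

all_shapes = []
for level in LEVELS:
    stride = 2 ** level
    all_shapes.extend(
        anchor_shapes(stride, ANCHOR_SCALE, OCTAVE_SCALES, ASPECT_RATIOS)
    )

# A hypothetical 12x12-pixel helmet box, compared against every anchor shape.
print(f"{len(all_shapes)} anchor shapes across all levels")
print(f"best IoU for a 12x12 box: {best_shape_iou(12, 12, all_shapes):.3f}")
```

Under these assumed settings the smallest square anchor is 32 pixels on a side, so a hypothetical 12x12 box reaches an IoU of only about 0.14 with its best anchor shape. Rerunning the sketch with a smaller `ANCHOR_SCALE` shows how strongly this one parameter affects whether small objects can be matched to any anchor at all.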

Tips for Single Shot Object Detectors