Welcome back, guys! Happy to see you again :) Last time, we saw how the idea of copy-and-paste can be embedded in CNNs for deep image inpainting. Did you get the main idea? If yes, good! If not, don't worry! Today, we are going to dive into a breakthrough in deep image inpainting, in which contextual attention is proposed. By using contextual attention, we can effectively borrow information from distant spatial locations to reconstruct the local missing pixels. This idea is, in fact, more or less the same as copy-and-paste. Let's see how it works!

Recall

In my previous post, I introduced the shift-connection layer, in which features from known regions act as references for the generated features inside the missing regions, allowing us to further refine those generated features for better inpainting results. Here, we assume that the generated features are reasonable estimates of the ground truth, and suitable references are selected according to the similarity between the features from the known regions and the generated features inside the missing regions.
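To make this "match by similarity" idea concrete, here is a minimal sketch (my own toy code, not the authors' implementation): every generated location inside the hole is compared with every known location by cosine similarity, and the known features are then combined according to the resulting attention scores. The function name, the scaling factor, and the tensors are placeholders for illustration only.

```python
# Toy sketch of matching generated features in the hole against known features.
import torch
import torch.nn.functional as F

def match_known_to_missing(features, mask, eps=1e-8):
    """features: (C, H, W) feature map; mask: (H, W), 1 = missing, 0 = known."""
    C, H, W = features.shape
    flat = features.reshape(C, -1)                        # (C, H*W)
    flat = flat / (flat.norm(dim=0, keepdim=True) + eps)  # unit-normalize each location
    missing = flat[:, mask.reshape(-1) > 0]               # generated features inside the hole
    known = flat[:, mask.reshape(-1) == 0]                # reference features outside the hole
    sim = known.t() @ missing                             # cosine similarity, (N_known, N_missing)
    attn = F.softmax(sim * 10.0, dim=0)                   # attention scores over known locations
    # Each missing location is rebuilt as a weighted sum of known features.
    return known @ attn                                   # (C, N_missing)

# Toy usage: a 16-channel 8x8 feature map with a 4x4 hole in the middle.
feats = torch.randn(16, 8, 8)
mask = torch.zeros(8, 8)
mask[2:6, 2:6] = 1
print(match_known_to_missing(feats, mask).shape)  # torch.Size([16, 16])
```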

Motivation

For the task of image inpainting, standard CNN architectures cannot effectively model the long-range correlations between the missing regions and information at distant spatial locations. If you are familiar with CNNs, you know that the kernel size and the dilation rate control the receptive field of a convolutional layer, and the network has to go deeper and deeper in order to see the entire input image. This means that if we want to capture the context of an image, we have to rely on deeper layers, but then we lose spatial information, as deeper layers always have feature maps of smaller spatial size. So, we have to find a way to borrow information from distant spatial locations (i.e. to understand the context of an image) without going too deep into the network.
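To get a feeling for the numbers, here is a back-of-the-envelope receptive-field calculation (using the standard formula, not anything from this paper) for plain stacked 3x3, stride-1 convolutions. It shows how slowly the receptive field grows with depth.

```python
# Receptive field of a stack of identical convolutional layers (standard formula).
def receptive_field(num_layers, kernel_size=3, stride=1):
    rf, jump = 1, 1
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump   # each layer widens the field by (k - 1) * jump
        jump *= stride                   # spacing between adjacent outputs in input space
    return rf

for n in (2, 6, 12):
    print(n, "layers ->", receptive_field(n), "x", receptive_field(n), "receptive field")
# 2 layers  ->  5 x 5
# 6 layers  -> 13 x 13
# 12 layers -> 25 x 25   (still tiny compared to a 256 x 256 input)
```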

If you remember what dilated convolution is (we covered it in a previous post), you will know that it is one way to enlarge the receptive field of early convolutional layers without adding extra parameters. However, dilated convolution has its limitations: it skips consecutive spatial locations in order to enlarge the receptive field, and those skipped locations are also crucial for filling in the missing regions.
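As a quick sanity check, here is a tiny PyTorch sketch (again, my own code, not the authors') showing that a dilated 3x3 convolution has exactly the same number of parameters as a plain 3x3 convolution while spanning a much wider area of the input; the price is that its taps skip the locations in between.

```python
import torch
import torch.nn as nn

plain   = nn.Conv2d(64, 64, kernel_size=3, padding=1)               # spans 3x3 input pixels
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=4, dilation=4)   # spans 9x9 input pixels

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(plain), n_params(dilated))   # identical parameter counts

x = torch.randn(1, 64, 32, 32)
print(plain(x).shape, dilated(x).shape)     # same output size, larger receptive field
```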

Introduction

This work shares a similar network architecture, loss function, and related techniques with the works we have covered before. For the architecture, the proposed framework consists of two generator networks and two discriminator networks. The two generators are fully convolutional networks with dilated convolutions: one performs a coarse reconstruction and the other refines it. This is the standard coarse-to-fine network structure. The two discriminators look at the completed images both globally and locally: the global discriminator takes the entire image as input, while the local discriminator takes only the filled region as input.
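For readers who like to see the shapes, below is a highly simplified skeleton of this coarse-to-fine layout. It is my own sketch under many assumptions (tiny toy generators, arbitrary channel counts, a fixed square hole) and is not the paper's implementation; it only illustrates how the two stages and the crop for the local discriminator fit together.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout, dilation=1):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=dilation, dilation=dilation),
        nn.ELU(),
    )

class TinyGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # Fully convolutional, with a couple of dilated layers in the middle.
        self.net = nn.Sequential(
            conv_block(4, 32),
            conv_block(32, 32, dilation=2),
            conv_block(32, 32, dilation=4),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, image, mask):
        # The mask is concatenated as an extra input channel.
        return self.net(torch.cat([image, mask], dim=1))

coarse_net, refine_net = TinyGenerator(), TinyGenerator()

masked_image = torch.randn(1, 3, 128, 128)   # image with the hole zeroed out
mask = torch.zeros(1, 1, 128, 128)
mask[..., 32:96, 32:96] = 1                  # 1 = missing, 0 = known

coarse = coarse_net(masked_image, mask)                                # stage 1: rough fill
refined = refine_net(coarse * mask + masked_image * (1 - mask), mask)  # stage 2: refinement
local_patch = refined[..., 32:96, 32:96]     # what the local discriminator would see
print(coarse.shape, refined.shape, local_patch.shape)
```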

For the loss functions, simply speaking, they also employ an adversarial loss (GAN loss) and an L1 loss (for pixel-wise reconstruction accuracy). For the L1 loss, they use a spatially discounted L1 loss, in which a weight is assigned to each pixel difference based on the distance of that pixel to its nearest known pixel. For the GAN loss, they use a WGAN-GP loss instead of the standard adversarial loss we introduced before. They claim that this WGAN adversarial loss is also based on the L1 distance measure, hence the network is easier to train and the training process is more stable.
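Here is a small sketch of how such a spatially discounted weight mask could be computed, assuming a weight of the form gamma raised to the distance to the nearest known pixel (gamma is a hyperparameter close to 1; the paper uses 0.99 if I recall correctly). The helper names are mine, not from the paper.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def discount_mask(mask, gamma=0.99):
    """mask: (H, W) array, 1 = missing, 0 = known."""
    # Distance (in pixels) from each missing pixel to the nearest known pixel;
    # known pixels get distance 0 and therefore weight 1.
    dist = distance_transform_edt(mask)
    return gamma ** dist   # the weight decays the deeper we go into the hole

def discounted_l1(pred, target, mask, gamma=0.99):
    weights = discount_mask(mask, gamma)
    return np.mean(weights * np.abs(pred - target))

# Toy usage on a 64x64 image with a 32x32 hole.
mask = np.zeros((64, 64))
mask[16:48, 16:48] = 1
pred, target = np.random.rand(64, 64), np.random.rand(64, 64)
print(discounted_l1(pred, target, mask))
```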

In this post, I would like to focus on the proposed contextual attention mechanism, so I have only briefly covered the coarse-to-fine network architecture, the WGAN-GP adversarial loss, and the weighted L1 loss above. Interested readers can refer to my previous posts and the original paper for further details.

