Nvidia filling in the blanks: A Partial Convolutions Research Paper

The Model Approach and Architecture

The proposed model uses stacked partial convolution operations and mask-updating steps to perform image inpainting. Let's start by defining the partial convolution and the mask-update mechanism.

For brevity, we refer to our partial convolution operation and mask update function jointly as the Partial Convolutional Layer.

Let W be the weights of the convolution filter and b its corresponding bias. X denotes the feature values (pixel values) for the current convolution (sliding) window, and M is the corresponding binary mask. The partial convolution at every location, defined similarly to prior work, is expressed as:

Partial Convolution mechanism
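For reference, the equation shown in the image above can be written out in LaTeX as follows, where sum(1) is the sum of an all-ones tensor the same shape as the mask window:

```latex
x' =
\begin{cases}
  W^{\top}(X \odot M)\,\dfrac{\operatorname{sum}(\mathbf{1})}{\operatorname{sum}(M)} + b, & \text{if } \operatorname{sum}(M) > 0 \\[4pt]
  0, & \text{otherwise}
\end{cases}
```

The scaling factor sum(1)/sum(M) applies appropriate re-normalisation, so windows that see only a few valid pixels are not systematically dimmer than fully valid ones.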

After each partial convolution operation, we then update our mask. Our unmasking rule is simple: if the convolution was able to condition its output on at least one valid input value, then we remove the mask for that location. This is expressed as:

Mask Update Scheme
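Written out, the update rule for each mask location is simply:

```latex
m' =
\begin{cases}
  1, & \text{if } \operatorname{sum}(M) > 0 \\
  0, & \text{otherwise}
\end{cases}
```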

and can easily be implemented in any deep learning framework as part of the forward pass. With sufficient successive applications of the partial convolution layer, any mask will eventually be all ones, if the input contained any valid pixels.
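To make this concrete, here is a minimal PyTorch sketch of such a layer. This is my own illustrative implementation of the two equations above, not the authors' released code; the class name PartialConv2d and the single-channel mask are assumptions made for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Module):
    """Sketch of a partial convolution layer with mask updating."""

    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              stride=stride, padding=padding, bias=True)
        # Fixed all-ones kernel used to compute sum(M) over each sliding window.
        self.register_buffer("mask_kernel",
                             torch.ones(1, 1, kernel_size, kernel_size))
        self.window_size = kernel_size * kernel_size  # sum(1) for one window

    def forward(self, x, mask):
        # mask: (N, 1, H, W) float tensor, 1 for valid pixels, 0 for holes.
        with torch.no_grad():
            mask_sum = F.conv2d(mask, self.mask_kernel,
                                stride=self.conv.stride,
                                padding=self.conv.padding)
        # Convolve only the valid (unmasked) inputs: W^T (X .* M) + b.
        out = self.conv(x * mask)
        bias = self.conv.bias.view(1, -1, 1, 1)
        # Re-normalise by sum(1)/sum(M); clamp avoids division by zero
        # for windows that saw no valid pixels.
        valid = mask_sum > 0
        scale = self.window_size / mask_sum.clamp(min=1.0)
        out = torch.where(valid, (out - bias) * scale + bias,
                          torch.zeros_like(out))
        # Mask update: a location becomes valid if it saw >= 1 valid input.
        new_mask = valid.float()
        return out, new_mask
```

Stacking layers like this one reproduces the behaviour described above: each application shrinks the hole in the mask, and with enough layers new_mask becomes all ones whenever the input contained any valid pixels.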

Network Design

The network design is largely based on UNet-like architectures with one minor change: all convolutional layers are replaced with partial convolutional ones.

The network architecture

Elaborating on the network architecture: PConv1 through PConv8 form the encoding network, while the subsequent layers, which use UpSampling and skip links, form the decoding network.

The BatchNorm column indicates whether PConv is followed by a Batch Normalization layer. The Non-linearity column shows whether and what non-linearity layer is used (following the BatchNorm if BatchNorm is used).
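As a rough sketch of how one encoder stage and one decoder stage fit together, reusing the PartialConv2d sketch from above (layer shapes, the LeakyReLU slope placement, and the single-channel mask handling are illustrative assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PConvEncoderStage(nn.Module):
    """One encoder stage: strided PConv -> BatchNorm -> ReLU."""
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        self.pconv = PartialConv2d(in_ch, out_ch, kernel_size,
                                   stride=2, padding=kernel_size // 2)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x, mask):
        x, mask = self.pconv(x, mask)
        return F.relu(self.bn(x)), mask

class PConvDecoderStage(nn.Module):
    """One decoder stage: upsample, concatenate the skip link,
    then PConv -> BatchNorm -> LeakyReLU."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.pconv = PartialConv2d(in_ch + skip_ch, out_ch, 3, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x, mask, skip_x, skip_mask):
        x = F.interpolate(x, scale_factor=2, mode="nearest")
        mask = F.interpolate(mask, scale_factor=2, mode="nearest")
        x = torch.cat([x, skip_x], dim=1)
        # Simplification: with a single-channel mask, treat a location as
        # valid if it is valid in either the upsampled or the skip mask.
        mask = torch.clamp(mask + skip_mask, max=1.0)
        x, mask = self.pconv(x, mask)
        return F.leaky_relu(self.bn(x), 0.2), mask
```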

Loss Functions

From the research paper:

Our loss functions target both per-pixel reconstruction accuracy as well as composition, i.e. how smoothly the predicted hole values transition into their surrounding context.

Given the input image with holes I_in, the initial binary mask M (0 for holes), the network prediction I_out, and the ground-truth image I_gt, we first define our per-pixel losses L_hole = ‖(1−M) ⊙ (I_out − I_gt)‖₁ and L_valid = ‖M ⊙ (I_out − I_gt)‖₁. These are the L1 losses on the network output for the hole and the non-hole pixels, respectively.
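These two terms are straightforward to compute. A minimal sketch, assuming i_out, i_gt, and mask are tensors of the same spatial size with mask broadcastable over channels:

```python
import torch

def per_pixel_losses(i_out, i_gt, mask):
    """L1 losses on hole and non-hole pixels.

    mask: 1 for valid (non-hole) pixels, 0 for holes.
    The paper additionally normalises by the number of elements;
    torch.mean plays that role here.
    """
    l_hole = torch.mean(torch.abs((1 - mask) * (i_out - i_gt)))
    l_valid = torch.mean(torch.abs(mask * (i_out - i_gt)))
    return l_hole, l_valid
```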

The perceptual loss is calculated using:

Perceptual Loss

where Ψ_n is the activation map of the n-th selected layer (the paper uses the pool1, pool2 and pool3 layers of an ImageNet-pretrained VGG-16).
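Written out, the perceptual loss compares activations for both the raw output I_out and the composited output I_comp, where I_comp is I_out with the non-hole pixels set directly to ground truth (my reconstruction of the paper's formula in LaTeX; N_{Ψ_n(I_gt)} denotes the number of elements in the activation map):

```latex
L_{perceptual} =
  \sum_{n} \frac{\left\lVert \Psi_n(I_{out}) - \Psi_n(I_{gt}) \right\rVert_1}{N_{\Psi_n(I_{gt})}}
+ \sum_{n} \frac{\left\lVert \Psi_n(I_{comp}) - \Psi_n(I_{gt}) \right\rVert_1}{N_{\Psi_n(I_{gt})}}
```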

The style losses are also taken into consideration:

Style Losses
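The style terms apply an auto-correlation (Gram matrix) to each activation map before taking the L1 distance. With each Ψ_n reshaped to shape (H_n W_n) × C_n and K_n = 1/(C_n H_n W_n) as a normalisation factor, the output-image term can be written as (again my LaTeX reconstruction):

```latex
L_{style\_out} =
  \sum_{n} \frac{1}{C_n^{2}} \left\lVert
    K_n \left( \Psi_n(I_{out})^{\top}\,\Psi_n(I_{out})
             - \Psi_n(I_{gt})^{\top}\,\Psi_n(I_{gt}) \right)
  \right\rVert_1
```

L_style_comp takes the same form with I_comp in place of I_out.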

Our final loss term is the total variation (TV) loss L_tv, a smoothing penalty on P, where P is the region of 1-pixel dilation of the hole region.

Smoothing Penalty
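Written out, the TV loss penalises differences between horizontally and vertically neighbouring pixels within P (my LaTeX reconstruction, with N_{I_comp} the number of elements in I_comp):

```latex
L_{tv} =
  \sum_{(i,j)\in P,\,(i,j+1)\in P}
    \frac{\left\lVert I_{comp}^{i,j+1} - I_{comp}^{i,j} \right\rVert_1}{N_{I_{comp}}}
+ \sum_{(i,j)\in P,\,(i+1,j)\in P}
    \frac{\left\lVert I_{comp}^{i+1,j} - I_{comp}^{i,j} \right\rVert_1}{N_{I_{comp}}}
```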

So the total loss, with coefficients determined by hyperparameter search, comes out to be:

Total Loss
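With the weights reported in the paper, the combined objective is:

```latex
L_{total} = L_{valid} + 6\,L_{hole} + 0.05\,L_{perceptual}
          + 120\,(L_{style\_out} + L_{style\_comp}) + 0.1\,L_{tv}
```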
