The Model Approach and Architecture
The proposed model uses stacked partial convolution operations and mask-update steps to perform image inpainting. Let's start by defining the convolution and the mask-update mechanism.
For brevity, we refer to our partial convolution operation and mask update function jointly as the Partial Convolutional Layer.
Let W be the weights of the convolution filter and b its corresponding bias. X denotes the feature values (pixel values) for the current convolution (sliding) window, and M is the corresponding binary mask. The partial convolution at every location is expressed as:
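The equation itself is missing here; reconstructing it from the original paper's definition (using the symbols above, with ⊙ denoting element-wise multiplication and 1 an all-ones matrix the same size as M):

x' = W^T (X ⊙ M) · sum(1)/sum(M) + b,   if sum(M) > 0
x' = 0,                                  otherwise

The scaling factor sum(1)/sum(M) adjusts for the varying number of valid (unmasked) inputs in each window, so the output depends only on the unmasked values.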
After each partial convolution operation, we then update our mask. Our unmasking rule is simple: if the convolution was able to condition its output on at least one valid input value, then we remove the mask for that location. This is expressed as:
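The update rule, reconstructed from the description above (m' is the new mask value at that location):

m' = 1,   if sum(M) > 0
m' = 0,   otherwise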
and can easily be implemented in any deep learning framework as part of the forward pass. With sufficient successive applications of the partial convolution layer, any mask will eventually be all ones, if the input contained any valid pixels.
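As a minimal sketch of how one sliding window of this layer could be implemented (plain NumPy over a single window, rather than a full framework layer; the function name and shapes are illustrative, not from the paper's code):

```python
import numpy as np

def partial_conv_window(X, M, W, b):
    """One partial-convolution window plus mask update (illustrative sketch).

    X: feature values in the current sliding window, shape (kh, kw)
    M: binary mask for the window (1 = valid pixel, 0 = hole)
    W: filter weights, shape (kh, kw); b: scalar bias
    Returns (output value, updated mask value).
    """
    if M.sum() > 0:
        # Scale by window size / number of valid pixels, so the output
        # is conditioned only on the unmasked inputs.
        scale = M.size / M.sum()
        out = (W * (X * M)).sum() * scale + b
        return out, 1.0   # at least one valid input: remove the mask here
    return 0.0, 0.0       # no valid inputs: zero output, location stays masked
```

A fully valid window behaves like an ordinary convolution (scale = 1), while a window with a single valid pixel is rescaled by the full window size.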
The network design is largely based on UNet-like architectures, with one minor change: all convolutional layers are replaced with partial convolutional ones.
Elaborating on the network architecture: PConv 1 through PConv 8 form the encoder, while the subsequent layers with UpSampling skip links form the decoder.
The BatchNorm column indicates whether PConv is followed by a Batch Normalization layer. The Non-linearity column shows whether and what non-linearity layer is used (following the BatchNorm if BatchNorm is used).
From the excerpts of the research paper:
Our loss functions target both per-pixel reconstruction accuracy as well as composition, i.e. how smoothly the predicted hole values transition into their surrounding context.
Given the input image with holes I_in, the initial binary mask M (0 for holes), the network prediction I_out, and the ground-truth image I_gt, we first define our per-pixel losses L_hole = ‖(1−M) ⊙ (I_out − I_gt)‖₁ and L_valid = ‖M ⊙ (I_out − I_gt)‖₁. These are the L1 losses on the network output for the hole and the non-hole pixels, respectively.
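These two per-pixel losses are straightforward to compute; a small NumPy sketch (the function name is illustrative):

```python
import numpy as np

def pixel_losses(I_out, I_gt, M):
    """L1 losses over hole vs. non-hole pixels.

    M is the binary mask: 1 = valid pixel, 0 = hole.
    Returns (L_hole, L_valid).
    """
    diff = I_out - I_gt
    l_hole = np.abs((1 - M) * diff).sum()   # error inside the holes
    l_valid = np.abs(M * diff).sum()        # error on the known pixels
    return l_hole, l_valid
```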
The perceptual loss has been calculated using:
where Ψn is the activation map of the nth selected layer.
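The equation is missing above; reconstructing it from the original paper (which also introduces I_comp, the raw output image I_out with the non-hole pixels set to the ground truth, and normalizes each term by N_Ψn, the number of elements in Ψn):

L_perceptual = Σ_n ‖Ψn(I_out) − Ψn(I_gt)‖₁ / N_Ψn + Σ_n ‖Ψn(I_comp) − Ψn(I_gt)‖₁ / N_Ψn

This penalizes differences between high-level feature representations of the prediction and the ground truth, rather than raw pixel differences alone.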
The style loss has also been taken into consideration and is used as:
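The equation is missing above; in the original paper the style loss compares the auto-correlation (Gram matrix) of each selected activation map between prediction and ground truth. A minimal NumPy sketch for a single layer (function names are illustrative; the full loss sums over the selected layers and is computed for both I_out and I_comp):

```python
import numpy as np

def gram_matrix(psi):
    """Gram matrix of an activation map psi of shape (C, H, W),
    normalized by C*H*W."""
    C, H, W = psi.shape
    feats = psi.reshape(C, H * W)
    return feats @ feats.T / (C * H * W)

def style_loss_layer(psi_out, psi_gt):
    """L1 distance between Gram matrices for one selected layer."""
    return np.abs(gram_matrix(psi_out) - gram_matrix(psi_gt)).sum()
```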
Our final loss term is the total variation (TV) loss L_tv, which is a smoothing penalty on P, where P is the region of 1-pixel dilation of the hole region.
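Reconstructing the TV term from the paper's definition (summing absolute differences between horizontally and vertically adjacent pixels of the composited image I_comp within the dilated hole region P, normalized by the number of elements N_comp):

L_tv = Σ_{(i,j)∈P, (i,j+1)∈P} ‖I_comp^{i,j+1} − I_comp^{i,j}‖₁ / N_comp + Σ_{(i,j)∈P, (i+1,j)∈P} ‖I_comp^{i+1,j} − I_comp^{i,j}‖₁ / N_comp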
So, the total loss (after coefficient hyperparameter tuning) comes out to be:
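The combined equation is missing above; using the coefficients reported in the original paper (found there via hyperparameter search on validation images):

L_total = L_valid + 6 L_hole + 0.05 L_perceptual + 120 (L_style_out + L_style_comp) + 0.1 L_tv

where L_style_out and L_style_comp denote the style loss evaluated on the raw output I_out and the composited image I_comp, respectively.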