SINGLE OBJECT DETECTION PART – 2

STEP 3:- FIND A SUITABLE ARCHITECTURE FOR MODELLING A BOUNDING BOX AROUND THE LARGEST OBJECT IN THE IMAGE.

  • How to Create a Bounding box?

For this, we just need to figure out two things:-

  1. Create a neural network with four activations that predicts four numbers, i.e. the bounding box edges (top-left and bottom-right coordinates) of the largest object. This is a regression problem with four outputs.
  2. Choose the loss function in such a way that when it is minimized, our four predicted numbers are pretty good. Let’s see how to do this.
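These two pieces can be sketched in plain NumPy (a toy illustration with made-up coordinates, not the fastai implementation): the model emits four numbers per image, and the L1 loss is the mean absolute difference between the predicted box and the ground-truth box.

```python
import numpy as np

# Hypothetical predicted and ground-truth boxes:
# (top-left row, top-left col, bottom-right row, bottom-right col).
pred   = np.array([40.0, 52.0, 180.0, 200.0])
target = np.array([50.0, 60.0, 170.0, 210.0])

# L1 loss (mean absolute error), the criterion we will set as nn.L1Loss() later.
l1_loss = np.abs(pred - target).mean()
print(l1_loss)  # -> 9.5
```

Minimizing this loss pushes all four predicted edges toward the ground-truth edges at once.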
BB_CSV = PATH/'tmp/bb.csv'
bb = np.array([training_largest_annotations[o][0] for o in training_ids])
# Pick the largest object's bounding box for each image.
bbs = [' '.join(str(p) for p in o) for o in bb]
# Join the four bounding box coordinates into a space-separated string.
df = pd.DataFrame({'fn': [training_filenames[o] for o in training_ids], 'bbox': bbs}, columns=['fn','bbox'])
# Put the image filename in the 'fn' column and the bounding box of the largest object in that image in the 'bbox' column.
df.to_csv(BB_CSV, index=False)
# Write it to a .csv file.
BB_CSV.open().readlines()[:5]
# Check how the data is stored by reading back the first few lines.
# This follows the multi-label classification CSV convention: labels within a row are space-separated, and the filename and label columns are comma-separated.

Note:- When we apply scaling or data augmentation to the images, the same transformations need to be applied to the bounding box coordinates as well.
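To see why, consider a horizontal flip. Here is a minimal NumPy sketch (with hypothetical coordinates, not fastai code): flipping the image mirrors the x axis, so the box's left and right edges must be recomputed from the image width.

```python
import numpy as np

W = 224  # assumed image width for this example

# Box as (x_left, y_top, x_right, y_bottom) -- toy values.
box = np.array([30.0, 40.0, 120.0, 160.0])

def flip_box_horizontal(box, width):
    """Mirror the box across the vertical centre line of the image."""
    x_left, y_top, x_right, y_bottom = box
    # The old right edge becomes the new left edge, measured from the right side.
    return np.array([width - x_right, y_top, width - x_left, y_bottom])

flipped = flip_box_horizontal(box, W)
print(flipped)  # -> [104.  40. 194. 160.]
```

If we flipped the image but kept `box` unchanged, the box would now sit on the wrong side of the image, no longer covering the object.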

But Why?

Earlier, in the classification case, we used to augment only the images in the dataset. There is a small change in the bounding box case: here we have both the images and the bounding box coordinates of the objects in them, so we have to augment the dependent variable, i.e. the bounding box coordinates, as well as the image. So let's see what happens if we augment the images only.

augs = [RandomFlip(),
        RandomRotate(30),
        RandomLighting(0.1, 0.1)]
tfms = tfms_from_model(f_model, sz, crop_type=CropType.NO, aug_tfms=augs)
md = ImageClassifierData.from_csv(PATH, JPEGS, BB_CSV, tfms=tfms, continuous=True, bs=4)
idx = 3
fig, axes = plt.subplots(3, 3, figsize=(9, 9))
for i, ax in enumerate(axes.flat):
    x, y = next(iter(md.aug_dl))
    ima = md.val_ds.denorm(to_np(x))[idx]
    b = bb_hw(to_np(y[idx]))
    print(b)
    show_img(ima, ax=ax)
    draw_rect(ax, b)

As we can see, when we augment the images but not the bounding box coordinates, the images get transformed while the bounding box stays where it was, which is not correct: the data is now wrong. In other words, while the augmented images change, the box that is supposed to mark the object in them remains the same. Hence we need to augment the dependent variable, i.e. the bounding box coordinates, since the two are tied together: the coordinates should go through exactly the same geometric transformations as the image. As can be seen in the code below, we use the tfm_y=TfmType.COORD parameter, which explicitly means that whatever geometric augmentation is applied to the image should be applied to the bounding box coordinates as well.

augs = [RandomFlip(tfm_y=TfmType.COORD),
        RandomRotate(3, p=0.5, tfm_y=TfmType.COORD),
        RandomLighting(0.1, 0.1, tfm_y=TfmType.COORD)]
# RandomRotate parameters: a maximum of 3 degrees of rotation; p=0.5 means rotate the image half of the time.
tfms = tfms_from_model(f_model, sz, crop_type=CropType.NO, tfm_y=TfmType.COORD, aug_tfms=augs)
# Adding tfm_y=TfmType.COORD here as well keeps the bounding box coordinates in sync when the transforms squeeze or zoom the image.
md = ImageClassifierData.from_csv(PATH, JPEGS, BB_CSV, tfms=tfms, continuous=True, bs=4)
# Note that we have to tell the transforms constructor that our labels are coordinates, so that it can handle the transforms correctly.
idx = 4
fig, axes = plt.subplots(3, 3, figsize=(9, 9))
for i, ax in enumerate(axes.flat):
    x, y = next(iter(md.aug_dl))
    ima = md.val_ds.denorm(to_np(x))[idx]
    b = bb_hw(to_np(y[idx]))
    print(b)
    show_img(ima, ax=ax)
    draw_rect(ax, b)

TfmType.COORD basically says that if we apply a flip transformation to the image, we need to change the bounding box coordinates accordingly. Hence we add TfmType.COORD to every transformation that is applied to the images.

Looking at the resulting images, it makes sense: the bounding box now changes along with the image and marks the object in the right spot.
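A flip only mirrors coordinates, but a rotation is the interesting case: the rotated box is no longer axis-aligned, so a COORD-style transform has to recompute an enclosing box from the rotated corners. A rough NumPy sketch of that idea (illustrative only, not the fastai implementation):

```python
import numpy as np

def rotate_box(box, degrees, centre):
    """Rotate the four corners of an axis-aligned box about `centre` and
    return the axis-aligned box that encloses the rotated corners."""
    x0, y0, x1, y1 = box
    corners = np.array([[x0, y0], [x1, y0], [x1, y1], [x0, y1]], dtype=float)
    theta = np.deg2rad(degrees)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    rotated = (corners - centre) @ rot.T + centre
    # Enclose the rotated corners in a new axis-aligned box.
    xmin, ymin = rotated.min(axis=0)
    xmax, ymax = rotated.max(axis=0)
    return np.array([xmin, ymin, xmax, ymax])

box = np.array([50.0, 50.0, 150.0, 100.0])  # toy coordinates
print(rotate_box(box, 30, centre=np.array([112.0, 112.0])))
```

Note that the enclosing box grows slightly with rotation, which is one reason large rotations are avoided for bounding box targets (the code above uses a maximum of only 3 degrees).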

Now we create a ConvNet based on resnet34, but with a twist. We don't want the standard set of fully connected layers at the end that would create a classifier; instead, we want to add a single layer with four outputs, since this is a regression problem.

head_reg4 = nn.Sequential(Flatten(), nn.Linear(25088,4))
# Append head_reg4 on top of the resnet34 model, which turns it into a regressor that predicts four values.
# This tiny head flattens the previous layer of dimensions 7*7*512 = 25088 and brings it down to 4 activations.
learn = ConvLearner.pretrained(f_model, md, custom_head=head_reg4)
learn.opt_fn = optim.Adam
# Use the Adam optimizer to minimize the loss function.
learn.crit = nn.L1Loss()
# The loss function here is L1 loss (mean absolute error).

The custom_head parameter in ConvLearner.pretrained(...) is added at the top of the model. It prevents the creation of the fully connected layers and the adaptive max-pooling layer that are added by default, and instead replaces them with whatever module we pass in. Here we want four activations representing the bounding box coordinates. We stick this custom head on top of the pretrained model and then train it for a while.
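As a sanity check on the shapes, here is a toy NumPy sketch of what Flatten() and nn.Linear(25088, 4) do (not the actual PyTorch modules): resnet34's final convolutional feature map has 512 channels of 7x7, which flattens to 25088 values per image, and a single linear layer maps those to 4 activations.

```python
import numpy as np

batch_size = 2
features = np.random.randn(batch_size, 512, 7, 7)  # resnet34's final conv output shape

# Flatten(): collapse everything after the batch dimension.
flat = features.reshape(batch_size, -1)
print(flat.shape)   # (2, 25088) -- since 512 * 7 * 7 = 25088

# nn.Linear(25088, 4): a weight matrix and a bias (randomly initialised here).
W = np.random.randn(25088, 4) * 0.01
b = np.zeros(4)
out = flat @ W + b
print(out.shape)    # (2, 4) -- four bounding box activations per image
```

This is why the custom head is declared as nn.Linear(25088, 4): the input size is fixed by the backbone's output geometry, and the output size by the four box coordinates.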

Check out the final layer:

learn.summary()

After this step, it's the usual routine: find a good learning rate and use it to train the model. Let's see how it is done:

learn.lr_find(1e-5,100)
learn.sched.plot(5)
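learn.lr_find(1e-5, 100) sweeps the learning rate geometrically from 1e-5 up to 100 across mini-batches, recording the loss at each step; learn.sched.plot(5) then lets us pick a rate just below the point where the loss drops fastest. A toy sketch of that geometric sweep (illustrative only; the step count of 50 is an arbitrary assumption, not the fastai internals):

```python
import numpy as np

start, end, n_steps = 1e-5, 100, 50   # n_steps is an arbitrary choice here

# Geometric sweep: each step multiplies the learning rate by a constant factor.
lrs = np.geomspace(start, end, n_steps)

print(lrs[0], lrs[-1])   # the sweep endpoints
print(lrs[1] / lrs[0])   # the constant per-step multiplier
```

The chosen lr = 2e-3 below sits comfortably inside such a sweep, below the point where training would diverge.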

lr = 2e-3
learn.fit(lr, 2, cycle_len=1, cycle_mult=2)
lrs = np.array([lr/100,lr/10,lr])
learn.freeze_to(-2)
lrf=learn.lr_find(lrs/1000)
learn.sched.plot(1)
learn.fit(lrs, 2, cycle_len=1, cycle_mult=2)
learn.freeze_to(-3)
learn.fit(lrs, 1, cycle_len=2)
learn.save('reg4')
learn.load('reg4')
x,y = next(iter(md.val_dl))
learn.model.eval()
preds = to_np(learn.model(VV(x)))
fig, axes = plt.subplots(3, 4, figsize=(12, 8))
for i, ax in enumerate(axes.flat):
    ima = md.val_ds.denorm(to_np(x))[i]
    b = bb_hw(preds[i])
    ax = show_img(ima, ax=ax)
    draw_rect(ax, b)
plt.tight_layout()

As seen in the predicted output above, it's doing a pretty good job, although it fails in the case of the peacock and the cows.

In our next blog post, we will combine Step 2 and Step 3. This will let us predict the class of the largest object in an image and the bounding box for that object at the same time.

There was a lot covered in this blog post, and you might feel overwhelmed.

But that's okay!!! I highly encourage you to go back to the previous part and check out the flow. I've marked the important points in bold, and they will help you understand the key intermediate steps.

Thanks for sticking by this part.

As mentioned, the next post combines Step 2 and Step 3 in a single go. Nothing new there from a computer vision perspective, but it's the beauty of PyTorch coding that we will dive into.

Until then, goodbye!

It is a really good feeling to be appreciated by Jeremy Howard. Check out what he has to say about my Fast.ai Part 1 blog; make sure to have a look at it.
