Is Object Detection a Done Deal? – Hacker Noon

There is a general conception that pooling (max pooling) provides scale and translation invariance. This is both true and false. What needs to be understood is that pooling helps in ‘learning’ invariance, and for that learning the network should be trained with suitable images. Also, CNNs are invariant to translation. I guess there are a few who think rotation is also a kind of translation; but translation here means shifting the position of the object left, right, up, or down (shown here clearly).

Max pooling helps here. This answer illustrates it lucidly. No data augmentation is needed here: assuming that a CNN is good at detecting a picture of a cat, it will detect the cat translated anywhere in the frame.
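To make this concrete, here is a toy NumPy sketch (my own illustration, not from any of the sources quoted): a one-step shift of an activation can leave the max-pooled output unchanged.

```python
import numpy as np

def max_pool_1d(x, size=2, stride=2):
    """Max-pool a 1-D signal: take the max over each window."""
    return np.array([x[i:i + size].max() for i in range(0, len(x) - size + 1, stride)])

# A "feature detection" at position 4, and the same feature shifted right by 1.
activation = np.zeros(8)
activation[4] = 1.0
shifted = np.roll(activation, 1)

pooled_a = max_pool_1d(activation)   # [0, 0, 1, 0]
pooled_b = max_pool_1d(shifted)      # [0, 0, 1, 0]
print(np.array_equal(pooled_a, pooled_b))  # True: small shift, same pooled output
```

A shift of two positions, by contrast, would move the peak into the next pooling window and change the output, which is why the book is careful to say “small translations”.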

Here is a quote from a very reputable source, the Deep Learning book by Ian Goodfellow, Yoshua Bengio, and Aaron Courville. “Along with Geoffrey Hinton and Yann LeCun, Bengio is considered one of the three people most responsible for the advancement of deep learning during the 1990s and 2000s.”

In all cases, pooling helps to make the representation approximately invariant to small translations of the input. Invariance to translation means that if we translate the input by a small amount, the values of most of the pooled outputs do not change. — Deep Learning book

But regarding scale invariance and rotation invariance, here is a quote from the same book:

Convolution is not naturally equivariant to some other transformations, such as changes in the scale or rotation of an image.

And there are other papers that have tested current networks and reported the same. Here is a quote from a Dec 2017 paper (2)

“We obtain the surprising result that architectural choices such as the number of pooling layers and the convolution filter size have only a secondary effect on the translation invariance of a network. Our analysis identifies training data augmentation as the most important factor in obtaining translation-invariant representations of images using convolutional neural networks.” From “Quantifying Translation-Invariance in Convolutional Neural Networks” (Eric Kauderer-Abrams, Stanford University)

And from another recent paper May 2018

Deep convolutional neural networks (CNNs) have revolutionized computer vision. Perhaps the most dramatic success is in the area of object recognition, where performance is now described as “superhuman” [20]. …

Despite the excellent performance of CNNs on object recognition, the vulnerability to adversarial attacks suggests that superficial changes can result in highly non-human shifts in prediction …

Obviously, not any data augmentation is sufficient for the networks to learn invariances. To understand the failure of data augmentation, it is again instructive to consider the subsampling factor. Since in modern networks the subsampling factor is approximately 45, then for a system to learn complete invariance to translation only, it would need to see 45² = 2025 augmented versions of each training example. If we also add invariance to rotations and scalings, the number grows exponentially with the number of irrelevant transformations.
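As a back-of-the-envelope check of the numbers in the quote (the rotation and scale granularities below are my own illustrative assumptions, not from the paper):

```python
# Back-of-the-envelope count of augmented examples implied by the quote above.
subsampling_factor = 45                  # approximate overall stride of a modern network
translations = subsampling_factor ** 2   # shifts in x and y
print(translations)                      # 2025

# Adding other transformations multiplies the count (hypothetical granularities):
rotations, scales = 36, 10               # e.g. every 10 degrees, 10 scale steps
print(translations * rotations * scales) # 729000
```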

From “Why do deep convolutional networks generalize so poorly to small image transformations?”, Yair Weiss, Aharon Azulay, ELSC, Hebrew University of Jerusalem

If that is the case, how has the Google API been able to recognize the inverted and rotated car in the tests that we showed earlier? (Notice that it scored “Car” pretty high, and only missed the other details like the brand, on which it may not have been trained as strongly.)

Data Augmentation

The key is data augmentation. Basically, the input image is used along with rotations, scalings, noise, etc. generated from it as additional images for training. A good explanation is here.
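As a rough sketch of what augmentation means in practice, here is a minimal NumPy version (real pipelines would use library transforms such as Keras’s ImageDataGenerator or torchvision’s transforms; the operations and parameters here are arbitrary):

```python
import numpy as np

def augment(image, rng):
    """Generate a few simple augmented copies of one training image."""
    variants = [image]
    variants.append(np.fliplr(image))                       # horizontal flip
    variants.append(np.rot90(image, k=rng.integers(1, 4)))  # random 90-degree rotation
    noisy = image + rng.normal(0, 0.05, image.shape)        # additive Gaussian noise
    variants.append(np.clip(noisy, 0.0, 1.0))
    return variants

rng = np.random.default_rng(0)
image = rng.random((32, 32))   # stand-in for a 32x32 grayscale image
batch = augment(image, rng)
print(len(batch))              # 4 training images from 1
```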

CNNs are scale invariant to some level if they are trained to be, as the pooling implementation will then be able to handle it. Rotational invariance likewise has to be trained in.

‘Learning’ Invariance to Rotation via Pooling

Let us look at rotational invariance first, and how a CNN can be trained for it, as it is a bit easier. Here is the illustration from the Deep Learning book.

source pg 338

Example of learned invariances. A pooling unit that pools over multiple features that are learned with separate parameters can learn to be invariant to transformations of the input. Here we show how a set of three learned filters and a max pooling unit can learn to become invariant to rotation. All three filters are intended to detect a handwritten 5. Each filter attempts to match a slightly different orientation of the 5. When a 5 appears in the input, the corresponding filter will match it and cause a large activation in a detector unit. The max pooling unit then has a large activation regardless of which detector unit was activated.


Basically, we need to either augment the training images by rotating them, or get a pool of images taken at different angles and use them for training the CNN.
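The book’s figure can be sketched in a few lines of NumPy: three hand-made “oriented” filters stand in for the learned ones, and a max over their detector activations plays the role of the pooling unit (a toy illustration, not a trained network):

```python
import numpy as np

# Three 3x3 "oriented" filters, as in the book's figure: each matches one
# orientation of the same pattern (here, a bright line).
filters = [
    np.array([[0, 0, 0], [1, 1, 1], [0, 0, 0]], float),  # horizontal
    np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]], float),  # vertical
    np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], float),  # diagonal
]

def detector_activations(patch):
    """One detector unit per filter: dot product with the input patch."""
    return [float((f * patch).sum()) for f in filters]

def pooled(patch):
    """Max-pool across the detector units (cross-channel pooling)."""
    return max(detector_activations(patch))

horizontal_line = filters[0]   # input patch containing a horizontal line
vertical_line = filters[1]     # the same pattern, rotated 90 degrees

# The pooled unit fires equally strongly for either orientation.
print(pooled(horizontal_line), pooled(vertical_line))  # 3.0 3.0
```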

Invariance to Scale (Size of the Object)

This is a little more complex. For real-time detection we use a class of CNNs called Single Shot Detectors. Single shot detectors sacrifice some accuracy for speed.

Here is one picture you may have seen from the YOLO home page.


Multi-object detection. Note: detection is different from, and more difficult than, classification in that it also needs to predict the bounding boxes in which the objects are present.

Here is the output of a previous version of YOLO (YOLO v2; the current v3 seems to have improved a lot) on a picture taken from a height.

YOLO v2 on an aerial picture: the image resolution problem

And if you think these types of pictures or use cases are rare, I beg to differ. Many industry use cases are similar to this, or are shot from a high lamp pole. The friendly cat-and-dog-on-a-porch type of picture is the rare one.

Most of the things that work very well in a demo are useless in production or for a particular customer’s use case; this is one of the reasons that prompted me to write this.

As I said, there are two reasons why the network may not detect small objects even though it is trained well on large ones.

Problem 1: Limit of Input Resolution / Scaling Down of Images

Here YOLO v2 scales images down; the input image was a frame from an HD video feed. Scaling it from the input size (1280×720) down to (416×416) immediately destroys a lot of features, especially of small objects. This is the first problem. Lesson: use a network and an implementation that can take higher-resolution images (RetinaNet), plus have a decent GPU with enough memory (a GTX 1080 should do).

If we cut the above frame into 4 frames, give each to YOLO v2 individually, and then stitch the results together, it performs well (a good solution at the time by one of my team mates). There is a cost involved here: one of speed, and then the complexity overhead of removing overlapping bounding boxes. A straight slicing may cut the objects themselves at the boundaries, so logic for overlapped cutting and then ignoring possible duplicates has to be added.
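A minimal sketch of that slicing idea, assuming a 1280×720 frame and an illustrative tile size and overlap (not the exact values we used):

```python
import numpy as np

def tile_with_overlap(frame, tile=416, overlap=64):
    """Slice a large frame into overlapping tiles so that objects cut at one
    boundary appear whole in a neighbouring tile. Returns (x, y, tile) tuples."""
    h, w = frame.shape[:2]
    step = tile - overlap
    tiles = []
    for y in range(0, max(h - overlap, 1), step):
        for x in range(0, max(w - overlap, 1), step):
            tiles.append((x, y, frame[y:y + tile, x:x + tile]))
    return tiles

frame = np.zeros((720, 1280, 3), dtype=np.uint8)  # stand-in for an HD frame
tiles = tile_with_overlap(frame)
print(len(tiles))  # 8 overlapping tiles for this frame / tile-size combination
```

Each tile’s detections would then be offset back by the tile’s (x, y) origin and passed through non-maximum suppression to drop the duplicates coming from the overlapped regions.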

obfuscated image for demo

Problem 2: CNN layers remove features; not good news for small-object detection with deep neural networks.

This is a bigger problem. Each convolution layer basically looks for certain patterns while losing some detail, so at some depth, the features of these small cars vanish completely.

SSD uses layers already deep down into the convolutional network to detect objects. If we redraw the diagram closer to scale, we should realize the spatial resolution has dropped significantly and may already miss the opportunity in locating small objects that are too hard to detect in low resolution. If such problem exists, we need to increase the resolution of the input image.
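A quick calculation shows how fast the spatial resolution drops through a typical stride-32 backbone (the 13-pixel car is a hypothetical example, chosen only for illustration):

```python
# How spatial resolution shrinks through a typical backbone: each stride-2
# stage halves the feature map, so a small object's footprint quickly vanishes.
input_size = 416
car_pixels = 13          # hypothetical small car, ~13 px wide in the input

size, car = input_size, car_pixels
for stage in range(1, 6):            # five stride-2 stages (stride 32 overall)
    size //= 2
    car /= 2
    print(f"stage {stage}: map {size}x{size}, car ~{car:.1f} cells")
# By stride 32 the 13-px car covers less than half a cell on the final map.
```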


Here is a slightly more technical explanation from a recently published paper:

Since feature maps of layers closer to the input are of higher resolution and often contain complementary information (w.r.t. conv5), these features are either combined with shallower layers (like conv4, conv3) [23, 31, 1, 31] or independent predictions are made at layers of different resolutions [36, 27, 3]. Methods like SDP [36], SSH [29] or MS-CNN [3], which make independent predictions at different layers, also ensure that smaller objects are trained on higher resolution layers (like conv3) while larger objects are trained on lower resolution layers (like conv5).

An Analysis of Scale Invariance in Object Detection — SNIP
Bharat Singh Larry S. Davis University of Maryland, College Park

Jonathan Hui writes excellent blogs; he explains here how YOLO v3 overcomes this problem with a Feature Pyramid, so this may not be much of a problem now; other networks like RetinaNet perform well too. But for quite a long time it was a surprise that took us some time to understand and work around. It shows how early a stage the industry and research are at now.

Here is from another paper April 2018

We provide an illustration of the motivation of the paper …. Pedestrian instances in the automotive images (e.g., Caltech dataset [11]) often have very small sizes….. Accurately localizing these small-size pedestrian instances is quite challenging due to the following difficulties. Firstly, most of the small-size instances appear with blurred boundaries and obscure appearance. It is difficult to distinguish them from the background clutters and other overlapped instances. Secondly, the large-size pedestrian instances typically exhibit dramatically different visual characteristics from the small-size ones


For instance, body skeletons of the large-size instances can provide rich information for pedestrian detection while skeletons of the small-size instances cannot be recognized so easily. Such differences can also be verified by comparing the generated feature maps for large-size and small-size pedestrians, as shown in Fig. 1.

From Scale-Aware Fast R-CNN for Pedestrian Detection, by Jianan Li, Xiaodan Liang, Shengmei Shen, Tingfa Xu, Jiashi Feng, Shuicheng Yan

Scale Invariance: Training It In

From my experience, current CNNs are not scale invariant. It may be due to the above two factors: the feature loss when the object is small, compounded with the feature loss in a deep neural network. However, we have found that if we can prepare a training data set that has both small and large objects, the current networks are able to detect the same class at different scales, as long as they can work on the input images without scaling them down much. So training the network has become something of a skill now.

Finally, the Elephant in the Room: the Need for a Large, Good-Quality, Human-Annotated Image Set for Training

Here is the most painful thing about CNNs today. You need thousands to literally hundreds of thousands of well-annotated images of an object for good training, that is, for generalization that is good enough without over-fitting.

The COCO image set is to object detection what ImageNet is to image classification.

However, there is a high chance that the object you want to detect is not one of the 80 classes in COCO.

Why is this so important? For this, we need to understand a bit about generalization.

The central challenge in machine learning is that our algorithm must perform well on new, previously unseen inputs — not just those on which our model was trained. The ability to perform well on previously unobserved inputs is called generalization. — Chapter 5.2, Deep Learning book

When a neural net trains, it uses the error on the training data to learn the correct weights via backpropagation. If there are only a few images to train on, the network will learn them too well (be too specific to the training data) and will perform worse on data in the wild. There are techniques to reduce this. Instead of just a train and a test set, a third set of images called the validation set is used; if the results start to diverge too much from the validation set, even as they match the training set ever more closely, it is an indication to do an ‘early stop’ of the training.
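Early stopping can be sketched in a few lines of plain Python (the loss curve and patience value below are made up for illustration):

```python
def early_stop(val_losses, patience=3):
    """Return the epoch at which to stop: when validation loss has not
    improved for `patience` consecutive epochs."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_losses) - 1

# Validation loss improves, then starts diverging while training loss
# would keep falling; we stop 3 epochs after the best point.
losses = [0.9, 0.7, 0.6, 0.55, 0.58, 0.62, 0.70, 0.80]
print(early_stop(losses))  # stops at epoch 6
```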

The other option is using dropout.

Simply put, dropout refers to ignoring units (i.e., neurons) during the training phase; the set of ignored neurons is chosen at random. By “ignoring”, I mean these units are not considered during a particular forward or backward pass. from
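Here is a minimal NumPy sketch of inverted dropout, the variant most frameworks implement, where the surviving units are rescaled so the expected activation stays the same:

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: randomly zero a fraction `rate` of units during
    training and scale the survivors so the expected activation is unchanged."""
    if not training:
        return activations  # at inference, dropout is a no-op
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

rng = np.random.default_rng(42)
layer_output = np.ones(10)
dropped = dropout(layer_output, rate=0.5, rng=rng)
print(dropped)  # some units are 0, the survivors are scaled to 2.0
```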

A lot of questions abound on the internet regarding how to prevent over-fitting.

Apart from the methods above, the basic requirement is to have enough data points to train with.

To prevent overfitting, the best solution is to use more training data. A model trained on more data will naturally generalize better. When that is no longer possible, the next best solution is to use techniques like regularization.

A word about Transfer Learning

At the recent Google NEXT event, AutoML was presented. For AutoML for Vision, it was claimed that ten to twenty images of leaves are all that is needed for training. I am not sure of the internals of AutoML, but my inference is that it could come from transfer learning (practically, for a network like the RetinaNet described here).

Here is the same sentiment from another source

The origin of the 1,000-image magic number comes from the original ImageNet classification challenge, where the dataset had 1,000 categories, each with a bit less than 1,000 images for each class (…). This was good enough to train the early generations of image classifiers like AlexNet, and so proves that around 1,000 images is enough.

Can you get away with less though? Anecdotally, based on my experience, you can in some cases but once you get into the low hundreds it seems to get trickier to train a model from scratch. The biggest exception is when you’re using transfer learning on an already-trained model. Because you’re using a network that has already seen a lot of images and learned to distinguish between the classes, you can usually teach it new classes in the same domain with as few as ten or twenty examples. From

But if we have to detect an object class that is not in the same domain as the images the network was trained on, this transfer learning will not work. To give a crude example: it is definitely possible to train a system to detect nails based on a few images of, say, nails; but then it will see everything as nails, literally. Basically, since CNNs are very deep neural networks, they need a lot of data (read: images) to generalize. This calls for a lot of work in collecting the required images, then annotating them, and then training the network carefully, and for as long as it takes, to get the optimal result while preventing underfitting or overfitting.
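To show the mechanics of transfer learning without a real pretrained network, here is a toy NumPy sketch: a frozen random projection stands in for the pretrained backbone, and only a small linear head is fit on about ten examples per class (everything here, including the data, is synthetic and only illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained backbone: a frozen feature extractor.
# (In reality this would be a CNN trained on a large dataset; here it is
# just a fixed random projection, enough to show the mechanics.)
W_frozen = rng.normal(size=(64, 16))

def features(x):
    return np.maximum(x @ W_frozen, 0)  # frozen weights, never updated

# A tiny "new class" dataset: ~10 examples per class, as in the claim above.
x_a = rng.normal(loc=+1.0, size=(10, 64))
x_b = rng.normal(loc=-1.0, size=(10, 64))
X = features(np.vstack([x_a, x_b]))
y = np.array([1] * 10 + [0] * 10)

# Only the small head is trained (least squares instead of SGD for brevity).
head, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)

preds = (np.c_[X, np.ones(len(X))] @ head > 0.5).astype(int)
print((preds == y).mean())  # high accuracy from only 20 examples
```

The point of the sketch is the split: the expensive, data-hungry part (the backbone) is reused as-is, and only a tiny head with a handful of parameters needs the new examples, which is why ten or twenty images can suffice within the same domain.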

The Future

If you can see a glimpse of light, you can already start imagining the sky; I guess very soon we will be out of this tunnel, going by research like CapsuleNet coming from stalwarts like Geoffrey Hinton.
