Although Computer Vision (CV) has only exploded recently (the breakthrough moment happened in 2012 when AlexNet won ImageNet), it certainly isn’t a new scientific field.
Computer scientists around the world have been trying to find ways to make machines extract meaning from visual data for about 60 years now, and the history of Computer Vision, which most people don’t know much about, is deeply fascinating.
In this article, I’ll try to shed some light on how modern CV systems, powered primarily by convolutional neural networks, came to be.
I’ll start with a work that came out in the late 1950s and has nothing to do with software engineering.
One of the most influential papers in Computer Vision was published by two neurophysiologists, David Hubel and Torsten Wiesel, in 1959. Their publication, entitled “Receptive fields of single neurones in the cat’s striate cortex”, described core response properties of visual cortical neurons as well as how a cat’s visual experience shapes its cortical architecture.
The duo ran some pretty elaborate experiments. They placed electrodes into the primary visual cortex area of an anesthetized cat’s brain and observed, or at least tried to, the neuronal activity in that region while showing the animal various images. Their first efforts were fruitless; they couldn’t get the nerve cells to respond to anything.
However, a few months into the research, they noticed, rather accidentally, that one neuron fired as they were slipping a new slide into the projector. This was one lucky accident! After some initial confusion, Hubel and Wiesel realized that what got the neuron excited was the movement of the line created by the shadow of the sharp edge of the glass slide.
The researchers established, through their experimentation, that there are simple and complex neurons in the primary visual cortex and that visual processing always starts with simple structures such as oriented edges.
Sound familiar? Well, yeah, this is essentially the core principle behind deep learning.
Another highlight from the same era was the invention of the first digital image scanner.
In 1957, Russell Kirsch and his colleagues developed an apparatus that transformed images into grids of numbers, the binary language machines could understand. It’s because of their work that we can now process digital images in various ways.
One of the first digitally scanned photos was the image of Russell’s infant son. It was just a grainy 5 cm by 5 cm photo captured as 30,976 pixels (a 176×176 array), but it has become so famous that the original image is now stored in the Portland Art Museum.
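To make “grids of numbers” a bit more concrete, here is a minimal sketch in plain Python. The 4×4 brightness values and the threshold of 128 are made up for illustration; they are not data from Kirsch’s scan, and the helper names `to_bits` and `pixel_count` are my own.

```python
# A tiny grayscale "image": every entry is one pixel's brightness
# (0 = black, 255 = white). The values are invented for illustration.
image = [
    [ 10,  30, 200, 220],
    [ 20,  40, 210, 230],
    [ 15, 180, 190,  25],
    [240, 250,  35,  45],
]


def to_bits(pixels, threshold=128):
    """Reduce every pixel to a single bit: 1 if bright, 0 if dark."""
    return [[1 if value >= threshold else 0 for value in row] for row in pixels]


def pixel_count(rows, cols):
    """Number of values needed to store a rows x cols grid."""
    return rows * cols


bits = to_bits(image)
for row in bits:
    print(row)

print(pixel_count(176, 176))  # 30976, the pixel count quoted above
```

Once a picture is stored this way, every later step in this story (edge detection, filtering, classification) is just arithmetic on the grid.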
Next, let’s discuss Lawrence Roberts’ “Machine perception of three-dimensional solids”, which was published in 1963 and is widely considered to be one of the precursors of modern Computer Vision.
In that Ph.D. thesis, Larry described the process of deriving 3D info about solid objects from 2D photographs. He basically reduced the visual world to simple geometric shapes.
The goal of the program he developed and described in the paper was to process 2D photographs into line drawings, then build up 3D representations from those lines and, finally, display 3D structures of objects with all the hidden lines removed.
Larry wrote that the processes of 2D to 3D construction, followed by 3D to 2D display, were a good starting point for future research on computer-aided 3D systems. He was emphatically right.
It should be noted that Lawrence didn’t stay in Computer Vision for long. Instead, he went on to join DARPA and is now known as one of the inventors of the Internet.
The 1960s was when AI became an academic discipline and some of the researchers, extremely optimistic about the field’s future, believed it would take no longer than 25 years to create a computer as intelligent as a human being. This was the period when Seymour Papert, a professor at MIT’s AI lab, decided to launch the Summer Vision Project and solve, in a few months, the machine vision problem.
He was of the opinion that a small group of MIT students had it in them to develop a significant part of a visual system in one summer. The students, coordinated by Seymour himself and Gerald Sussman, were to engineer a platform that could perform, automatically, background/foreground segmentation and extract non-overlapping objects from real-world images.
The project wasn’t a success. Fifty years later, we’re still nowhere near solving computer vision. However, that project was, according to many, the official birth of CV as a scientific field.
In 1982, David Marr, a British neuroscientist, published another influential work: the book “Vision: A computational investigation into the human representation and processing of visual information”.
Building on the ideas of Hubel and Wiesel (who discovered that vision processing doesn’t start with holistic objects), David gave us the next important insight: He established that vision is hierarchical. The vision system’s main function, he argued, is to create 3D representations of the environment so we can interact with it.
He introduced a framework for vision where low-level algorithms that detect edges, curves, corners, etc., are used as stepping stones towards a high-level understanding of visual data.
David Marr’s representational framework for vision includes:
- A Primal Sketch of an image, where edges, bars, boundaries, etc. are represented (this is clearly inspired by Hubel and Wiesel’s research);
- A 2½D sketch representation where surfaces and information about depth and discontinuities in an image are pieced together;
- A 3D model that is hierarchically organized in terms of surface and volumetric primitives.
David Marr’s work was groundbreaking at the time, but it was very abstract and high-level. It didn’t contain any information about the kinds of mathematical modeling that could be used in an artificial visual system, nor did it mention any kind of learning process.
Around the same time, a Japanese computer scientist, Kunihiko Fukushima, also deeply inspired by Hubel and Wiesel, built a self-organizing artificial network of simple and complex cells that could recognize patterns and was unaffected by position shifts. The network, Neocognitron, included several convolutional layers whose (typically rectangular) receptive fields had weight vectors (known as filters).
These filters’ function was to slide across 2D arrays of input values (such as image pixels) and, after performing certain calculations, produce activation events (2D arrays) that were to be used as inputs for subsequent layers of the network.
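The sliding-filter idea is easier to see in code. Below is a minimal sketch in plain Python, not Fukushima’s actual implementation: the image, the hand-picked vertical-edge filter, and the function name `slide_filter` are all assumptions for illustration. In the Neocognitron the filters were self-organized, and in modern CNNs they are learned; here they are simply hard-coded. Strictly speaking, the operation shown is cross-correlation (the filter is not flipped), which is what most deep learning libraries call “convolution”.

```python
def slide_filter(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return
    the resulting 2D activation map."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1

    activation = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            # Weighted sum of the patch currently under the filter.
            total = 0
            for di in range(k_rows):
                for dj in range(k_cols):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        activation.append(row)
    return activation


# Hypothetical input: dark on the left (0), bright on the right (1).
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]

# Hand-picked filter that responds to a dark-to-bright vertical edge.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

activation = slide_filter(image, vertical_edge)
for row in activation:
    print(row)  # each row is [0, 2, 0, 0]: the filter fires only at the edge
```

Notice that the filter responds to the edge no matter which row it sits in. Sharing one set of weights across all positions is what makes this kind of network tolerant of position shifts.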
Fukushima’s Neocognitron is arguably the first ever neural network to deserve the moniker deep; it is a grandfather of today’s convnets.
A few years later, in 1989, a young French scientist, Yann LeCun, applied a backprop-style learning algorithm to Fukushima’s convolutional neural network architecture. After working on the project for a few years, LeCun released LeNet-5, the first modern convnet, which introduced some of the essential ingredients we still use in CNNs today.
Like Fukushima before him, LeCun decided to apply his invention to character recognition and even released a commercial product for reading zip codes.
Besides that, his work resulted in the creation of the MNIST dataset of handwritten digits — perhaps the most famous benchmark dataset in machine learning.
In 1997, a Berkeley professor named Jitendra Malik (along with his student Jianbo Shi) released a paper in which they described their attempts to tackle perceptual grouping.
The researchers tried to get machines to carve images into sensible parts (to determine automatically which pixels in an image belong together and to distinguish objects from their surroundings) using a graph-theoretic algorithm.
They didn’t get very far; the problem of perceptual grouping is still something CV experts are struggling with.
In the late 1990s, Computer Vision, as a field, largely shifted its focus.
Around 1999, many researchers stopped trying to reconstruct objects by creating 3D models of them (the path proposed by Marr) and instead directed their efforts towards feature-based object recognition. David Lowe’s work “Object Recognition from Local Scale-Invariant Features” was particularly indicative of this.
The paper describes a visual recognition system that uses local features that are invariant to scale, rotation, and location and, partially, to changes in illumination. These features, according to Lowe, are somewhat similar to the properties of neurons found in the inferior temporal cortex that are involved in object detection processes in primate vision.
Soon after that, in 2001, the first face detection framework that worked in real-time was introduced by Paul Viola and Michael Jones. Though not based on deep learning, the algorithm still had a deep learning flavor to it as, while processing images, it learned which features (very simple, Haar-like features) could help localize faces.
The Viola/Jones face detector is still widely used. It is a strong binary classifier built out of several weak classifiers; during the learning phase, which is quite time-consuming in this case, the cascade of weak classifiers is trained using AdaBoost.
To find an object of interest (a face), the model partitions input images into rectangular patches and submits them all to the cascade of weak detectors. If a patch makes it through every stage of the cascade, it is classified as positive; if not, the algorithm rejects it immediately. This process is repeated many times over at various scales.
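The “reject early” logic of the cascade can be sketched in a few lines of plain Python. This is a toy illustration, not the Viola/Jones implementation: the two features, their thresholds, and all function names are hypothetical and hand-picked, whereas the real detector selects Haar-like features and thresholds with AdaBoost. I also skip the multi-scale search and scan the image at a single patch size.

```python
def mean(values):
    return sum(values) / len(values)


def contrast(patch):
    """Cheap first test: difference between brightest and darkest pixel."""
    flat = [p for row in patch for p in row]
    return max(flat) - min(flat)


def dark_over_bright(patch):
    """Toy Haar-like feature: mean of the bottom half minus mean of the top half."""
    half = len(patch) // 2
    top = [p for row in patch[:half] for p in row]
    bottom = [p for row in patch[half:] for p in row]
    return mean(bottom) - mean(top)


# Each stage is (feature, threshold); both are assumptions for this sketch.
CASCADE = [(contrast, 50), (dark_over_bright, 100)]


def passes_cascade(patch, cascade=CASCADE):
    """A patch is positive only if it survives every stage."""
    for feature, threshold in cascade:
        if feature(patch) < threshold:
            return False  # rejected immediately, later stages never run
    return True


def detect(image, size):
    """Scan all size x size patches and keep the (top, left) corners that pass."""
    hits = []
    for top in range(len(image) - size + 1):
        for left in range(len(image[0]) - size + 1):
            patch = [row[left:left + size] for row in image[top:top + size]]
            if passes_cascade(patch):
                hits.append((top, left))
    return hits


# Made-up image with one dark band sitting on top of a bright band.
image = [
    [100, 100, 100, 100],
    [100,  20,  20, 100],
    [100, 200, 200, 100],
    [100, 100, 100, 100],
]

print(detect(image, size=2))  # [(1, 1)]
```

Because most patches in a real photo fail the first cheap stage, the expensive later stages run on only a small fraction of them, which is where the real-time speed comes from.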
Five years after the paper was published, Fujifilm released a camera with a real-time face detection feature that relied on the Viola/Jones algorithm.
As the field of computer vision kept advancing, the community felt an acute need for a benchmark image dataset and standard evaluation metrics to compare their models’ performances.
In 2005, the PASCAL VOC project was launched. It provided a standardized dataset for object classification as well as a set of tools for accessing the dataset and its annotations. The founders also ran an annual competition, from 2005 to 2012, for evaluating the performance of different methods for object class recognition.
In 2009, another important feature-based model was developed by Pedro Felzenszwalb, David McAllester, and Deva Ramanan — the Deformable Part Model.
Essentially, it decomposed objects into collections of parts (based on the pictorial structures introduced by Fischler and Elschlager in the 1970s), enforced a set of geometric constraints between them, and modeled potential object centers that were taken as latent variables.
DPM showed great performance in object detection tasks (where bounding boxes were used to localize objects) and beat template matching and other object detection methods that were popular at the time.
The ImageNet Large Scale Visual Recognition Challenge (ILSVRC), which you’ve probably heard about, started in 2010. Following in PASCAL VOC’s footsteps, it is also run annually and includes a post-competition workshop where participants discuss what they’ve learned from the most innovative entries.
Unlike PASCAL VOC, which had only 20 object categories, the ImageNet dataset contains over a million manually cleaned images across 1,000 object classes.
Since its inception, the ImageNet challenge has become a benchmark in object category classification and object detection across a huge number of object categories.
In 2010 and 2011, the ILSVRC’s error rate in image classification hovered around 26%. But in 2012, a team from the University of Toronto entered a convolutional neural network model (AlexNet) into the competition and that changed everything. The model, similar in its architecture to Yann LeCun’s LeNet-5, achieved an error rate of 16.4%.
This was a breakthrough moment for CNNs.
In the following years, the error rates in image classification in ILSVRC fell to a few percent and the winners, ever since 2012, have always been convolutional neural networks.
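A quick note on what these error rates measure. ILSVRC classification results are conventionally reported as top-5 error: a prediction counts as correct if the true label appears anywhere among the model’s five best guesses. Assuming the figures quoted above are of that kind, here is a minimal sketch of the metric. The labels and predictions are invented for illustration, and `top_k_error` is my own helper name.

```python
def top_k_error(ranked_predictions, true_labels, k=5):
    """Fraction of examples whose true label is NOT among the first k guesses."""
    misses = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth not in ranked[:k]
    )
    return misses / len(true_labels)


# Made-up model output: five guesses per image, best guess first.
predictions = [
    ["cat", "dog", "fox", "wolf", "lynx"],
    ["car", "truck", "bus", "van", "tram"],
    ["apple", "pear", "plum", "fig", "lime"],
    ["owl", "hawk", "crow", "dove", "swan"],
]
truths = ["cat", "bus", "kiwi", "swan"]

print(top_k_error(predictions, truths, k=5))  # 0.25
print(top_k_error(predictions, truths, k=1))  # 0.75
```

The same model looks much worse under the stricter top-1 rule, which is worth keeping in mind when comparing numbers from different papers.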
As I’ve mentioned earlier, convolutional neural networks have been around since the 1980s. So why did it take so long for them to become popular?
Well, there are three factors we owe the current CNN explosion to:
- Thanks to Moore’s law, our machines are vastly faster and more powerful now than they were in the 1990s, when LeNet-5 was released.
- NVIDIA’s parallelizable Graphics Processing Units have helped us achieve significant progress in deep learning.
- Finally, today’s researchers have access to large, labeled, high-dimensional visual datasets (ImageNet, Pascal and so on). Therefore, they can train their deep learning models sufficiently and avoid overfitting.
Despite the recent progress, which has been impressive, we’re still not even close to solving computer vision. However, there are already multiple healthcare institutions and enterprises that have found ways to apply CV systems, powered by CNNs, to real-world problems. And this trend is not likely to stop anytime soon.