Self-Supervised Learning is getting attention because it has the potential to solve a significant limitation of supervised machine learning, viz. requiring lots of external training samples or supervisory data consisting of inputs and corresponding outputs. Yann LeCun¹ recently in a Science and Future Magazine interview presented self-supervised learning as a significant challenge of AI for the next decade.
Humans, for instance, can determine the semantic meaning of the word “orange” from context when it appears near “t-shirt,” “fridge,” “county,” or “mobile.” Similarly, in machine learning, the Word2Vec algorithm predicts the semantic context of a word based on surrounding words. The research behind self-supervised learning follows the same principle of automatically identifying, extracting and using supervisory signals.
What is Self-Supervised Learning?
Self-supervised learning is autonomous supervised learning. It is a representation learning approach that eliminates the pre-requisite requiring humans to label data. Self-supervised learning systems extract and use the naturally available relevant context and embedded metadata as supervisory signals.
Cats continue to play an essential role in everything significant in machine learning. Self-supervised research “Unsupervised Visual Representation Learning by Context Prediction” predicts the positional location of one rectangular section of an image relative to another by using spatial context as a supervisory signal for training a rich visual representation. For instance, the right ear of a cat would be in the top-right position relative to the eyes of a cat. This approach allows learning about cats, dogs, or buses without prior explicit semantic labeling.
Self-supervised learning, fortunately, is not limited to learning from visual cues or associated meta-data in cat images or videos and has use cases beyond computer vision.
Self-supervised vs. supervised learning
Self-supervised Learning is supervised Learning because its goal is to learn a function from pairs of inputs and labeled outputs. Explicit use of labeled input-outputs pairs in self-supervised learning is not needed. Instead, correlations, embedded metadata, or domain knowledge available within the input is implicitly and autonomously extracted from the data and used as supervisory signals. Like supervised learning, self-supervised learning has use cases in regression and classification.
Self-supervised vs. unsupervised learning
Self-supervised learning is like unsupervised Learning because the system learns without using explicitly-provided labels. It is different from unsupervised learning because we are not learning the inherent structure of data. Self-supervised learning, unlike unsupervised learning, is not centered around clustering and grouping, dimensionality reduction, recommendation engines, density estimation, or anomaly detection.
Self-Supervised vs. semi-supervised learning
Combination of labeled and unlabeled data is used to train semi-supervised learning algorithms, where smaller amounts of labeled data in conjunction with large amounts of unlabeled data can speed up learning tasks. Self-supervised learning is different as systems learn entirely without using explicitly-provided labels.
Why is self-supervised learning relevant?
Self-supervised learning is essential for many reasons but mainly because of shortcomings in both approach and scalability of supervised learning.
Supervised learning is an arduous process, requiring collecting massive amounts of data, cleaning it up, manually labeling it, training and perfecting a model purpose-built for the classification or regression use case you are solving for, and then using it to predict labels for unknown data. For instance, with images, we collect a large image data set, label the objects in images manually, learn the network and then use it for one specific use case. This way is very different from the approach of learning in humans.
Human learning is trial-based, perpetual, multi-sourced, and simultaneous for multiple tasks. We learn mostly in an unsupervised manner, using experiments and curiosity. We also learn in a supervised manner but we can learn from much fewer samples, and we generalize exceptionally well.
For supervised learning, we have spent years collecting and professionally annotating tens of millions of labeled bounding boxes or polygons and image level annotations, but these datasets Open Images, PASCAL Visual Object Classes, Image Net, and Microsoft COCO collectively pale in comparison to billions of images generated on a daily basis on social media, or millions of videos requiring object detection or depth perception in autonomous driving. Similar scalability arguments exist for common sense knowledge.
Self-Supervised Reinforcement Learning
A dog trainer can reward a dog for positive behavior and punish for negative behavior. Over time the dog figures out and learns actions it took to get a reward. Similarly, in Reinforcement learning, a navigating robot learns how to navigate a course when rewarded for staying on course and punished when it collides with something in the environment. In both cases, this reward and punishment feedback reinforces which actions to perform and which to avoid. Reinforcement Learning works well in the presence of a feedback system for rewards. It also requires a comprehensive set of training data and may be impractical when considering the cost, time, and the number of iterations required before succeeding.
In the absence of rewards-based feedback system, a dog or a navigating robot may learn on its own by curiously exploring the environment. Researchers from BAIR² created an “Intrinsic Curiosity Model,” a self-supervised reinforcement learning system that can work even in the absence of explicit feedback. It uses curiosity as a natural reward signal to enable the agent to explore its environment and learn skills for use later in its life. See reseach at: Curiosity-driven Exploration by Self-supervised Prediction and the accompanying writeup.