This is part of the “Machine Learning” series on introducing machine learning from the very beginning. More articles are coming!
Have you already heard about machine learning at conferences, at meetups, or in my first article, and want to learn more?
You’re in the right place. The next step for you is to discover the different kinds of machine learning. I’ll introduce them to you through several practical examples.
What you’ll know after reading
ML comes in two flavors: supervised and unsupervised learning. Don’t judge them by their names: they aren’t opposite learning methods.
The one you choose depends only on your purpose. When you want the model to produce a formula that predicts a value from your data, you’re using supervised learning.
Otherwise, if your goal isn’t to predict a value but rather to find patterns in your data, you’ll use unsupervised machine learning.
Here is a diagram which sums up the concepts and vocabulary I want to teach you in this article.
Don’t worry if classification and regression are unknown words to you. I’ll introduce them later with examples.
Anyway, try to guess the answer to this question:
Which kind of ML can help to find out if some items are often sold together?
Supervised learning or Unsupervised learning?
#1 Does my fruit taste sweet or acid?
I know, using machine learning to predict if the apple you’re going to eat is sweet or acid sounds like:
You wanted a banana but what you got was a gorilla holding the banana and the entire jungle.
— Joe Armstrong (Coders at work)
Still, the purpose of this example is to predict whether the fruit you’ll eat is sweet or acid. The goal is quite precise: you give a fruit to your ML model and it answers sweet or acid, whichever is the taste of your fruit.
Supervised learning with classification is what you’re looking for.
Let me explain. It’s supervised because you want to predict a specific value. It’s classification because this value isn’t a number: it can only take one of a defined set of values.
Here is the training set we’ll give to our model to begin its training. It looks like a history of the same question already asked about different fruits, but with the answer attached. That’s why it’s called a training set: you give both the question and the answer for training.
Your ML model will look for any kind of logic linking each question to its answer. This link is actually a function: the hypothesis function.
The hypothesis with this training set will be as follows:
Share of sweet fruits of this kind = number of sweet fruits of this kind / total number of fruits of this kind
Lemons: 0/2 = 0% are sweet and 100% are acid
Apples: 3/4 = 75% are sweet and 25% are acid
Let’s assume you’re a lemon eater. Even though you’re an expert on how lemons taste, you ask your ML model whether the lemon in your hand is sweet. Thanks to the hypothesis it built from the training data, it knows 0% are sweet and 100% are acid, so its answer will be acid. Indeed, as I explained before, the answer is a value, not a probability. My question wasn’t quite right; a better one would be: “How does this lemon taste?”
With an apple in your hand, the answer would be sweet because it knows 75% of apples have a sweet taste.
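The counting hypothesis above can be sketched in a few lines of Python. This is only an illustration: the training set is rebuilt from the counts quoted above (2 lemons, 4 apples), and the function names `train`, `sweet_ratio` and `predict` are made up for this example.

```python
from collections import Counter, defaultdict

# Training set rebuilt from the counts in the article:
# 2 lemons (both acid) and 4 apples (3 sweet, 1 acid).
training_set = [
    ("lemon", "acid"),
    ("lemon", "acid"),
    ("apple", "sweet"),
    ("apple", "sweet"),
    ("apple", "sweet"),
    ("apple", "acid"),
]


def train(examples):
    """Count how often each taste was seen for each fruit."""
    counts = defaultdict(Counter)
    for fruit, taste in examples:
        counts[fruit][taste] += 1
    return counts


def sweet_ratio(counts, fruit):
    """Share of the training examples of this fruit that were sweet."""
    total = sum(counts[fruit].values())
    return counts[fruit]["sweet"] / total


def predict(counts, fruit):
    """Answer with the taste seen most often for this fruit."""
    return counts[fruit].most_common(1)[0][0]


model = train(training_set)
print("lemon:", sweet_ratio(model, "lemon"), predict(model, "lemon"))
print("apple:", sweet_ratio(model, "apple"), predict(model, "apple"))
```

Note that this sketch only handles fruits that appear in the training set; asking about a fruit it has never seen would fail, which is one more reason a bigger training set helps.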
The training dataset we used to train our classification algorithm was simple. It could also include other pieces of information such as the current season and the fruit color.
With more information and more training examples, the model will be able to improve its accuracy.
#2 How much is this house worth?
For this example, you can forget about fruits. You’ve now moved up to a whole new level and will try to estimate house prices.
The purpose is clear: you want to sell or buy a house but you don’t know how much it’s worth. You’ve got property ads with plenty of details about properties. Those will be the training data for your ML model.
Let’s sum up: you have a house with its living area and you want your model to give you its price. The goal is clear: you’re asking for a value, and this value is a number that can be any price in a range.
You must use Supervised learning with regression.
Thus, the only difference between classification and regression is the kind of value you’re looking for. You’ll need classification for a discrete value and regression for a continuous value.
Now, let’s feed our model with a training dataset I got from Kaggle.
By the way, if you don’t know the Kaggle website, I highly recommend checking it out. You’ll find plenty of ML competitions with real-world datasets. Competitors also write articles to share their approaches and techniques for tackling the competition problems.
SalePrice: sale price in dollars
GrLivArea: living area in square feet
The first approach is to draw a straight line roughly through the middle of all the dots on the chart. As mathematics taught us, a straight line on a graph is nothing more than a formula which gives y from x:
y = f(x) = a * x + b
It represents the hypothesis, as in the previous example. It’s a formula to get the price of the house y from the house’s living area x.
Let’s say this morning you saw a property ad for a big property with a living area of 4,000 square feet. You want to check its price to see whether the owner overestimated their house.
No need to calculate a and b from the hypothesis; just check the graph. It estimates a price of around $400,000 for a 4,000 square foot property.
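If you would rather compute a and b than read them off a chart, here is a minimal sketch using ordinary least squares, one common way to fit a straight line. The article doesn’t say which method produced the line on the graph, so treat this as an assumption. The data points are made up for illustration and are not the Kaggle dataset; `fit_line` and `predict_price` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a * x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    a = covariance / variance
    b = mean_y - a * mean_x
    return a, b


def predict_price(a, b, living_area):
    """Apply the hypothesis y = a * x + b to a living area."""
    return a * living_area + b


# Made-up points for illustration only (NOT the Kaggle data).
living_area = [1000, 1500, 2000, 2500, 3000]          # GrLivArea, square feet
sale_price = [110_000, 160_000, 210_000, 260_000, 310_000]  # SalePrice, dollars

a, b = fit_line(living_area, sale_price)
print("a =", a, "b =", b)
print("estimated price for 4,000 sq ft:", predict_price(a, b, 4000))
```

With real data the dots don’t sit exactly on a line, so the fitted line is only the best straight-line compromise between them.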
With a different method for building the hypothesis, the model can be more precise and give a very different price. Now, for a 4,000 square foot property, the estimated price is around $480,000.
The method the ML model uses to learn from the training data and create a hypothesis is very important; it’s called the learning algorithm. The better this algorithm is chosen, the more accurate the answers will be.
#3 Throw away junk emails
Have you ever asked yourself how some emails are magically moved into the spam folder of your inbox? The magic part is actually machine learning figuring out what makes an email spam or not.
In this case, the goal is clear: we want to know whether an email is spam or not. This is another example of classification. Like the fruit example, there are only two possible results: spam or not spam. It’s a particular case of classification using only two classes (or categories). There are also variations with more than two classes.
Still, this is Supervised learning with classification.
Let’s have an overview of how a supervised machine learning model works:
The first step is the training step. As I explained before, you have to find a training set which contains both the question and its answer. In our example, it’s a training set with the emails and their classification (spam or not spam).
During its training, the ML algorithm will then build a function called the hypothesis to be able to classify any email.
The purpose of this training is to eventually answer the same question for an email the model has never seen.
The hypothesis function will provide the probability that the given email is spam. If this probability is less than 50%, the answer will be not spam; otherwise, it’s spam and it’ll be moved to the spam folder.
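The training step and the 50% rule can be sketched as follows. This is a deliberately naive toy, not how real spam filters work: the four training emails are invented, and the hypothesis simply averages how "spammy" each known word was during training. The names `train`, `spam_probability` and `classify` are made up for this example.

```python
from collections import defaultdict

# Tiny made-up training set: (email text, is it spam?).
training_set = [
    ("win a free prize now", True),
    ("free money click now", True),
    ("meeting agenda for monday", False),
    ("lunch on monday", False),
]


def train(examples):
    """For each word, compute the share of emails containing it that were spam."""
    spam_count = defaultdict(int)
    total_count = defaultdict(int)
    for text, is_spam in examples:
        for word in set(text.lower().split()):
            total_count[word] += 1
            if is_spam:
                spam_count[word] += 1
    return {word: spam_count[word] / total_count[word] for word in total_count}


def spam_probability(word_rates, text):
    """Toy hypothesis: average the spam rate of the words seen during training."""
    known = [word_rates[w] for w in text.lower().split() if w in word_rates]
    if not known:
        return 0.5  # no known word: the model has nothing to go on
    return sum(known) / len(known)


def classify(word_rates, text, threshold=0.5):
    """Below the threshold the email is not spam; otherwise it is spam."""
    if spam_probability(word_rates, text) < threshold:
        return "not spam"
    return "spam"


model = train(training_set)
print(classify(model, "free prize inside"))
print(classify(model, "agenda for lunch"))
```

The threshold is a parameter on purpose: moving it away from 50% trades missed spam against legitimate emails wrongly sent to the spam folder.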
#4 What do people buy diapers with?
This is one of the most famous stories in the machine learning domain. It’s about how Walmart used its customer data to get business insights.
Here is a recap of this project (including the diapers thing):
Some time ago, Wal-Mart decided to combine the data from its loyalty card system with that from its point of sale systems. The former provided Wal-Mart with demographic data about its customers, the latter told it where, when and what those customers bought. Once combined, the data was mined extensively and many correlations appeared. Some of these were obvious; people who buy gin are also likely to buy tonic. They often also buy lemons. However, one correlation stood out like a sore thumb because it was so unexpected.
On Friday afternoons, young American males who buy diapers (nappies) also have a predisposition to buy beer. No one had predicted that result, so no one would ever have even asked the question in the first place. Hence, this is an excellent example of the difference between data mining and querying.
Interesting story, isn’t it? Here the goal is unclear: you’re not trying to figure out some value. Actually, you don’t even have a question; your concern is just to find correlations in the data.
This is what Unsupervised learning is best at.
Sorry to disappoint you, but I have to tell you this story isn’t real, even though it’s used a lot at universities and in talks.
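Real or not, the idea is easy to try. Here is a minimal sketch that counts which pairs of items show up together most often, with no question or answer given in advance. The baskets are invented for illustration, and `pair_counts` is a hypothetical helper name; real data mining tools use more elaborate measures than a raw count.

```python
from collections import Counter
from itertools import combinations

# Made-up shopping baskets, for illustration only.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"gin", "tonic", "lemons"},
    {"gin", "tonic"},
    {"diapers", "beer", "gin"},
]


def pair_counts(baskets):
    """Count how many baskets contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sorting makes ("beer", "diapers") and ("diapers", "beer") the same pair.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


counts = pair_counts(baskets)
for pair, count in counts.most_common(2):
    print(pair, count)
```

Nobody told the code to look at beer and diapers: the pair simply comes out on top of the count, which is the spirit of unsupervised learning.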
#5 Explore your DNA
Machine learning isn’t only used to predict sales. Scientists also use it heavily in various domains. In this example, we’ve got DNA analyses from many people, and our goal is to gather those people into different groups.
For each individual, you’ll measure how much certain genes are expressed.
Don’t let this graph overwhelm you, we’ve almost finished with it. Each little block is a gene. A green colour means the individual has this gene and a red colour means there is a very low probability that they have it.
These colours show the degree to which different individuals do or do not have a specific gene.
The purpose of our ML model is to gather individuals into categories, using data that records the degree to which each person has each gene. Now, it’s time for the question:
Which kind of ML to use in this example?
Supervised learning or Unsupervised learning?
Be careful, the answer may not be what you’re expecting. Also, I’ll keep writing another sentence to hide the answer in the middle of a paragraph. If your answer is Supervised learning, I’m sorry, you fell into a bit of a trap. Don’t worry, I’ll explain why it’s not the right answer.
In reality, the model will group individuals into categories, but those categories are not defined beforehand in the training data.
Our question isn’t which category the person we’ve picked belongs to. The purpose is to find similar individuals and gather them into groups. That is Unsupervised learning.
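To make the grouping idea concrete, here is a minimal sketch using k-means, one common clustering algorithm. The article doesn’t say which algorithm is used on the DNA data, so this is an assumption. Each individual is reduced to two made-up gene measurements, and the starting centroids are naively taken from the first k individuals; real implementations pick them more carefully.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def closest(point, centroids):
    """Index of the centroid nearest to the point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def mean(points):
    """Coordinate-wise average of a list of points."""
    return tuple(sum(values) / len(points) for values in zip(*points))


def k_means(points, k, iterations=10):
    """Group points into k clusters; returns one group index per point."""
    centroids = list(points[:k])  # naive start: the first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Step 1: attach every point to its nearest centroid.
        labels = [closest(p, centroids) for p in points]
        # Step 2: move each centroid to the average of its members.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                centroids[i] = mean(members)
    return labels


# Made-up measurements (gene 1, gene 2) for six individuals.
individuals = [
    (0.90, 0.80), (0.10, 0.20), (0.80, 0.90),
    (0.20, 0.10), (0.85, 0.95), (0.15, 0.05),
]

groups = k_means(individuals, k=2)
print(groups)
```

Notice that the data contains no answers at all: the only thing you choose is how many groups to look for, and the algorithm decides who goes where.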
Before you go
Do you remember the diagram I showed you at the beginning? Maybe not, but you should now be able to understand all the words on it. Give it a try!
Let’s sum up one last time. The first step in machine learning is to train your model with a training set. Thanks to those examples, the model will build a hypothesis function to be able to answer our question.
The accuracy of the answer is closely related to the quality and size of the training set. The learning algorithm used to build the hypothesis is also very important.
If your training set is full of questions with answers, you’re doing Supervised learning. When the answer value is continuous, it’s Regression; otherwise, it’s Classification.
If you don’t know the question to ask, and so your training set doesn’t contain question/answer pairs, you’re doing Unsupervised learning.