Understanding The Importance Of Data For Machine Learning | Hacker Noon


M Shehzen Sidiq

I am a Postgrad student, researching and finding ways to explore AI.

Data is the most important, must-have food for machine learning. It can be any fact, text, symbol, image, or video, as long as it is in unprocessed form. Once data is processed, it is known as information. Machine learning without data is nothing but a bare machine with no soul and no mind. Data is what lets machines perform amazing tasks that we could not have imagined a few years ago.

Despite this importance, machines do not understand what data represents. They don’t understand why ‘a’ is ‘a’, why it is written the way it is, or why ‘this’ means what it means. Most of us don’t understand the food that we eat, either. We only know that we have to eat, and so we do, without caring about where the food came from. Data is food for machine learning in the same sense. The machine consumes it and learns the relations between different pieces of data rather than understanding the data itself.

So basically, all machines do is find the relations between different data. Stay with me to understand why data is important and how machines find relations in data without understanding it.

Data is crucial for machine learning; without data in one form or another, machine learning is not possible. We humans are similar. We need food to develop our minds, and we take in another kind of data by seeing, hearing, and so on, and gain experience from it. That data plays a vital role in the kind of person we become.

In the same way, data is important for machine learning to grow its experience and its ability to make decisions based on what it has been fed. For beginners, this data can be divided into two types:

The first type is numerical data: data in the form of numbers and only numbers. This is a good type of data, because all machine learning models work with it. All other data types need to be translated into this form before they are fed into the machine. Examples include plain values like 1 and 2, and fields such as age, salary, and experience.

The second type is categorical data. Data that contains characters, such as text and symbols, usually falls into this category. This type of data must first be converted to numerical form using an encoding technique. Until it is converted, the machine can’t take the data in and work out the relations between input and output. It is important to keep this point in mind when dealing with this type of data.

If you want to read in-depth about the types of data, read this amazing post by Alina Zhang.

Always convert the categorical data into numerical data
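To make this conversion concrete, here is a minimal sketch of two common encodings, label encoding and one-hot encoding, written in plain Python. The function names and the `colors` example are made up for illustration; in practice a library would usually do this for you.

```python
def label_encode(values):
    """Map each distinct category to an integer, in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return [mapping[value] for value in values], mapping


def one_hot_encode(values):
    """Turn each category into a 0/1 vector with a single 1 at its label index."""
    codes, mapping = label_encode(values)
    width = len(mapping)
    return [[1 if i == code else 0 for i in range(width)] for code in codes]


# Hypothetical categorical column
colors = ["red", "green", "red", "blue"]

codes, mapping = label_encode(colors)
one_hot = one_hot_encode(colors)

print(mapping)  # which integer each category received
print(codes)    # the column as integers
print(one_hot)  # the column as 0/1 vectors
```

Label encoding is compact but suggests an order between categories that may not exist. One-hot encoding avoids that at the cost of one extra column per category.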


The amount of data is also an essential point to consider in machine learning. We should have enough data that the machine does not die of starvation, but not so much that the machine becomes worthless.

A Very Important Point

Too little and too much are both bad, for machines as well as for humans and all beings.


Sometimes we have very little data, and in those cases we have to get more. There are various techniques for doing so.

The first step should be to collect more data manually, through surveys, questionnaires, and so on. If that is not possible, other techniques exist. Data augmentation is one of them. It produces artificial data from the examples we already have, and that data can then be used in machine learning.
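As a rough illustration of data augmentation, the sketch below makes extra copies of numerical rows with a small amount of random noise added. This is only one simple form of augmentation, and the function name, the noise scale, and the sample rows are assumptions made for this example. For images, the usual choices are flips, crops, and rotations.

```python
import random


def augment_with_noise(rows, copies=2, scale=0.05, seed=0):
    """Return the original rows plus `copies` noisy variants of each row.

    Each value is shifted by Gaussian noise whose standard deviation
    is `scale` times the size of the value.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    augmented = [list(row) for row in rows]
    for _ in range(copies):
        for row in rows:
            augmented.append([x + rng.gauss(0, scale * abs(x)) for x in row])
    return augmented


# Hypothetical rows of (age, salary)
original = [[25.0, 40000.0], [32.0, 52000.0]]
bigger = augment_with_noise(original)

print(len(original), "rows became", len(bigger))
```

The noise scale matters. If it is too large, the artificial rows no longer resemble real data and can hurt the model instead of helping it.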

At other times we have more data than we need. In those cases we can use only part of the data and leave the rest out, but this may lead to the loss of valuable data.

The other method uses all the data, but each training phase leaves some of it out and uses only the rest. The next training phase uses a different part and leaves a different part out.
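One way to read this rotating scheme is as k-fold cross-validation, which is an assumption on our part since the method is not named here. A minimal sketch in plain Python, with a hypothetical helper `k_fold_splits`:

```python
def k_fold_splits(items, k):
    """Yield (used, left_out) pairs so that every item is left out exactly once."""
    folds = [items[i::k] for i in range(k)]  # deal the items into k groups
    for i in range(k):
        left_out = folds[i]
        used = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield used, left_out


data = list(range(10))  # stand-in for ten data items
splits = list(k_fold_splits(data, k=5))

for round_number, (used, left_out) in enumerate(splits, start=1):
    print(f"round {round_number}: train on {used}, leave out {left_out}")
```

Across the five rounds every item is used for training four times and left out once, so no data is thrown away.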

When we train a machine learning model, we have to test it on some data, just as students are tested in exams. For this purpose, we set some data aside for testing and use the rest for training. The data used for testing is known as testing data, and the data used for training is known as training data.

The testing data should be kept hidden from the machine and used only in the testing phase, never before it. The whole purpose is to see how well the machine has learned.
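Here is a minimal sketch of such a split in plain Python. The 20% test fraction and the fixed seed are arbitrary choices for the example, not a rule.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=42):
    """Shuffle a copy of the items and set a fraction aside for testing."""
    rng = random.Random(seed)
    shuffled = list(items)  # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


samples = list(range(20))  # stand-in for twenty data items
train, test = train_test_split(samples)

print(len(train), "items for training,", len(test), "hidden for testing")
```

Shuffling before splitting keeps the test set from being, say, only the most recent rows of the dataset.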

When we train the model and give data as input to the machine, the machine goes through every item in the dataset. It learns from the data and remembers its patterns. When we finally ask the machine to make a prediction, it uses the knowledge and experience it collected from the data to do so.

Take weather prediction as an example. When the data is given to the machine, it maps the inputs to the outputs, that is, the causes to the effects. When we ask for a prediction and give it some data, such as the temperature, it uses the mapping learned from the training data and predicts the output for the given input.
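To show what "using the mapping of previous data" can look like, here is a toy sketch that predicts the weather by copying the outcome of the most similar past day, a technique known as nearest neighbor. The readings below are invented for illustration, and real models learn far richer mappings than this.

```python
def predict_nearest(training_data, query):
    """Return the label of the training example closest to the query."""
    def squared_distance(features):
        return sum((a - b) ** 2 for a, b in zip(features, query))

    _, label = min(training_data, key=lambda pair: squared_distance(pair[0]))
    return label


# Hypothetical training data: (temperature in °C, humidity in %) -> outcome
weather = [
    ((30.0, 40.0), "sunny"),
    ((22.0, 85.0), "rain"),
    ((18.0, 90.0), "rain"),
    ((28.0, 50.0), "sunny"),
]

forecast = predict_nearest(weather, (21.0, 80.0))
print(forecast)  # rain
```

The machine never learns what rain is. It only finds that the new reading sits closest to past readings that were labelled "rain".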

So, machines learn the mapping of data and make predictions based on experience. If a prediction is correct, the machine has learned; if it is wrong, there is room for improvement. Machines always make some mistakes, just as humans do. Learning aims to minimize these mistakes as far as possible and reduce the gap between the predicted and the actual values.
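The gap between prediction and actual can be put into a single number. Mean squared error is one common choice, used here as an assumption since no particular measure is named above. The temperatures in the sketch are invented.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared gaps between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical temperatures in °C
actual = [20.0, 22.0, 25.0]
rough_guess = [18.0, 25.0, 21.0]   # predictions from a poorly trained model
better_guess = [20.5, 21.5, 25.0]  # predictions after more learning

print(mean_squared_error(actual, rough_guess))
print(mean_squared_error(actual, better_guess))
```

The second number is smaller than the first, and driving this number down is what training tries to do.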

This article looked at the importance of data for machines and why there should be enough of it. Data is like food and experience for the machine, and it is this data that makes machine learning possible.

Before diving into a machine learning project, be sure you have sufficient data. If you liked this article, give it a thumbs up and share it.