To benefit from this blog:
- You should be familiar with Python.
- You should already have some understanding of what deep learning and neural networks are.
Prepare data for training
One of the hardest parts of solving a deep learning problem is having a well-prepared dataset. There are three general steps to preparing data:
1. Identify bias. Since models are trained on a predefined dataset, it is important to make sure you do not introduce bias.
In statistics, sampling bias is a bias in which a sample is collected in such a way that some members of the intended population are less likely to be included than others. It results in a biased sample, a non-random sample of a population (or non-human factors) in which all individuals, or instances, were not equally likely to have been selected. If this is not accounted for, results can be erroneously attributed to the phenomenon under study rather than to the method of sampling.
(Khan Academy provides great examples on how to identify bias in samples and surveys.)
2. Remove outliers from data.
Outliers are extreme values that deviate from the other observations in the data. They may indicate variability in a measurement, an experimental error, or a novelty. In other words, an outlier is an observation that diverges from the overall pattern of a sample.
3. Transform our dataset into a language a machine can understand: numbers. In this tutorial I will focus only on transforming datasets, as each of the other two points needs a blog of its own to be covered in full detail.
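Although outlier removal is out of scope here, a tiny example may make step 2 more concrete. The sketch below flags outliers with the interquartile range (IQR) rule. The function name `iqr_outlier_mask`, the factor `k=1.5`, and the sample measurements are all assumptions made for illustration; this is one common convention, not the only way to define an outlier.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask that is True where a value is flagged as an outlier.

    A value is flagged when it lies more than k * IQR below the first
    quartile or above the third quartile (k=1.5 is an assumed default).
    """
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Made-up measurements with one obviously extreme value
measurements = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 42.0])
mask = iqr_outlier_mask(measurements)
cleaned = measurements[~mask]  # keep only the values that were not flagged
print(cleaned)
```

Whether a flagged value should really be dropped depends on the problem: a measurement error is worth removing, while a genuine novelty may be exactly what you want the model to see.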
To prepare a dataset you must of course first have a dataset. We are going to use the Fashion-MNIST dataset because it is already optimized and labeled for a classification problem.
(Read more about the Fashion-MNIST dataset here.)
The Fashion-MNIST dataset has 70,000 grayscale images of 28x28 pixels each, separated into the following categories:
| Label | Description |
| --- | --- |
| 0 | T-shirt/top |
| 1 | Trouser |
| 2 | Pullover |
| 3 | Dress |
| 4 | Coat |
| 5 | Sandal |
| 6 | Shirt |
| 7 | Sneaker |
| 8 | Bag |
| 9 | Ankle boot |
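Since the labels are plain integers, it can be handy to keep the table above in code so that predictions can be turned back into readable names. This is a minimal sketch; `class_names` and `label_to_name` are hypothetical helper names, not part of Keras or the dataset.

```python
# Index in this list == integer label in the dataset (taken from the table above)
class_names = [
    "T-shirt/top",  # 0
    "Trouser",      # 1
    "Pullover",     # 2
    "Dress",        # 3
    "Coat",         # 4
    "Sandal",       # 5
    "Shirt",        # 6
    "Sneaker",      # 7
    "Bag",          # 8
    "Ankle boot",   # 9
]

def label_to_name(label):
    """Map an integer label (0-9) to its human-readable description."""
    return class_names[int(label)]

print(label_to_name(9))
```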
Fortunately, the majority of deep learning (DL) frameworks, including Keras, support the Fashion-MNIST dataset out of the box. To download the dataset yourself and see other examples, you can visit the GitHub repo here.
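from keras.datasets.fashion_mnist import load_data
# Load the Fashion-MNIST training and test data
(x_train, y_train), (x_test, y_test) = load_data()
# Print the array shapes to confirm what was loaded
print("x_train shape:", x_train.shape, "y_train shape:", y_train.shape)
print("x_test shape:", x_test.shape, "y_test shape:", y_test.shape)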
x_train shape: (60000, 28, 28) y_train shape: (60000,)
x_test shape: (10000, 28, 28) y_test shape: (10000,)
The load_data() function returns the training and testing datasets.
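Once the labels are loaded, a quick sanity check related to the sampling bias discussed earlier is to count how many examples each class has. The sketch below uses randomly generated stand-in labels so that it runs without downloading anything; `class_counts` is a hypothetical helper, and in the tutorial you would pass `y_train` to it instead of `fake_labels`.

```python
import numpy as np

def class_counts(labels, num_classes=10):
    """Count how many examples carry each integer label from 0 to num_classes - 1."""
    return np.bincount(np.asarray(labels), minlength=num_classes)

# Stand-in labels (an assumption for this sketch); replace with y_train
rng = np.random.default_rng(0)
fake_labels = rng.integers(0, 10, size=1000)

counts = class_counts(fake_labels)
for label, count in enumerate(counts):
    print(f"label {label}: {count} examples")
```

If one class turned out to have far fewer examples than the others, that would be a hint to look at how the data was collected before training on it.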
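It is essential to split your data into training and testing sets, and to hold out part of it for validation.
- Training data: used to train the neural network (NN).
- Validation data: used to validate and optimize the results of the neural network during the training phase, by tuning and readjusting the hyperparameters.
- Testing data: never used during training; it gives a final check of how well the trained model generalizes.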
Hyperparameters are parameters whose values are set before the learning process begins.
After training a neural network, we run the trained model against our validation dataset to make sure that the model generalizes and is not overfitting.
What is overfitting? :)
Overfitting means a model predicts the right result when it is tested against the training data, but fails to predict accurately on data it has not seen. If, on the other hand, a model predicts the wrong result even for the training data, this is called underfitting. For further explanation of overfitting and underfitting.
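As a rough illustration of these two definitions, the sketch below labels a model from its training and validation accuracy. The function `diagnose_fit` and both thresholds (`min_accuracy` and `max_gap`) are hypothetical values chosen for the example; sensible values depend entirely on the problem.

```python
def diagnose_fit(train_accuracy, val_accuracy, min_accuracy=0.8, max_gap=0.1):
    """Give a rough label for a model based on two accuracy scores.

    min_accuracy and max_gap are assumed thresholds, not standard values.
    """
    # Poor results even on the training data: underfitting
    if train_accuracy < min_accuracy:
        return "underfitting"
    # Good on training data but clearly worse on unseen data: overfitting
    if train_accuracy - val_accuracy > max_gap:
        return "overfitting"
    return "ok"

print(diagnose_fit(0.99, 0.80))
print(diagnose_fit(0.60, 0.58))
print(diagnose_fit(0.92, 0.90))
```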
Thus, we use the validation dataset to detect overfitting or underfitting. Most of the time, however, we train the model several times in order to get a higher score on both the training and validation datasets. Since we retrain the model based on the validation results, we can end up overfitting not only on the training dataset but also on the validation dataset. To avoid this, we use a third dataset that is never used during training: the testing dataset.
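One simple way to get the three datasets is to carve a validation set out of the training data and leave the testing data untouched. The sketch below shows the idea on small stand-in arrays; `split_train_validation`, the validation size, and the seed are assumptions for illustration, and with the real data you would pass `x_train` and `y_train` instead.

```python
import numpy as np

def split_train_validation(x, y, num_validation, seed=0):
    """Shuffle the examples and hold out num_validation of them for validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))
    val_idx = indices[:num_validation]
    train_idx = indices[num_validation:]
    return (x[train_idx], y[train_idx]), (x[val_idx], y[val_idx])

# Small stand-in arrays shaped like Fashion-MNIST images and labels
x_demo = np.zeros((600, 28, 28), dtype=np.uint8)
y_demo = np.arange(600) % 10

(x_tr, y_tr), (x_val, y_val) = split_train_validation(x_demo, y_demo, num_validation=50)
print("train:", x_tr.shape, "validation:", x_val.shape)
```

Shuffling before splitting avoids holding out a block of examples that happen to sit next to each other in the file, and fixing the seed keeps the split reproducible between runs.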