Tackle Bias and Other Problems/Solutions in Machine Learning Models

Predictive analytics models rely heavily on regression, classification and clustering methods. When analysing the effectiveness of a predictive model, the closer its predictions are to the actual data, the better the model. This article aims to be a one-stop reference to the major problems and their most popular and effective solutions, without diving into the details of execution.

A Linear Regression Plot
A clustering algorithm plot

Data selection and pruning happen primarily during the Data Preparation phase, where you take care to get rid of bad data in the first place. Even so, problems remain: the data may not be relevant to the ML model’s objectives during training, algorithms may be misapplied, and errors in the data can surface at any stage. The model is therefore tested for bias, variance, autocorrelation and similar errors before it is finalized, using defined tests that detect such errors in the data.

After running these tests, you go back to the model, make the corrections, and approve the model as fit, or ‘good’. The best in the industry have also figured out ways to avoid such errors in further iterations. A multitude of errors can occur, but let’s explore a few of them along with their well-defined and most effective tests and solutions:

Overfitting and Underfitting

Overfitting and underfitting can be explained through the bias-variance tradeoff:

The bias-variance tradeoff arises because reducing one of bias and variance tends to increase the other. Bias is the tendency of a model to lean consistently in one direction, favoring certain properties (parameters) of the data over others, so that its predictions are systematically off. Some bias may be present in the data itself, but it should be minimized to obtain a fair inference from that data.

Variance is the spread of the predictions: a model with high variance is overly sensitive to the particular data it was trained on, so its predictions scatter widely and give no concrete information. For a better understanding of how the data looks if variance is not taken care of, refer to the diagram above.

Low variance and low bias together make the best combination for an accurate model.

Bias and variance are therefore the contributing factors to the two most common phenomena in predictive analytics models.

How do bias and variance contribute to Overfitting and Underfitting?

To determine a good fit for the model, we look at how the data points were taken into account when fitting it. When parsing through millions of rows, it is possible to try to accommodate every data point, whether relevant or not, or to go too far in ignoring them. The crux is neither to fit every data point to perfection nor to neglect too many of them when you fit a curve.

High bias combined with high variance is the worst combination, producing results furthest from reality. When you reduce the bias alone, high variance remains, which produces Overfitting. When you take care of just the variance, high bias remains, which results in Underfitting. This is where the term ‘tradeoff’ comes from: reducing just the bias will not improve the model, and vice versa. The ‘sweet spot’ is a fit with both low bias and low variance. In other words, find the pattern without going to either extreme in a way that hurts accuracy. Most of the time, planning for and choosing this balance is the biggest challenge that data scientists and analysts face.

The best fit may not be the one that excludes outliers to the T, but is always a compromise
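The tradeoff described above can be seen on a tiny example. The sketch below is not the article's method: it swaps in a k-nearest-neighbour regressor as a convenient stand-in, and the toy data (a true relation y = x with alternating +1 / -1 noise on the training targets) is a made-up assumption chosen so the effect is easy to see.

```python
def knn_predict(x, train_x, train_y, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k


def mean_squared_error(xs, ys, train_x, train_y, k):
    """Mean squared error of the k-NN predictions on the points (xs, ys)."""
    errors = [(knn_predict(x, train_x, train_y, k) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


# Hypothetical toy data: the true relation is y = x, and the training
# targets carry alternating +1 / -1 noise.
train_x = [float(i) for i in range(10)]
train_y = [x + (1.0 if i % 2 == 0 else -1.0) for i, x in enumerate(train_x)]
test_x = [i + 0.4 for i in range(9)]
test_y = list(test_x)

for k in (1, 3, 10):
    train_err = mean_squared_error(train_x, train_y, train_x, train_y, k)
    test_err = mean_squared_error(test_x, test_y, train_x, train_y, k)
    print(f"k={k:2d}  train MSE={train_err:.2f}  test MSE={test_err:.2f}")
```

With k=1 the model reproduces every noisy training target exactly, so its training error is zero while its test error is higher than with k=3 (overfitting). With k=10 it predicts the overall mean everywhere and does poorly on both sets (underfitting).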

However, there are methods for testing the fit of a model. Some solutions for tackling these phenomena are:

Answer to Bias-Variance Tradeoff Problem:

Build A More Complex Model

The first and simplest solution to an underfitting problem is to train a more complex model. For an overfitting model, the usual remedies are to bring in more data and to apply regularization.

Cross Validation

In cross-validation, not all of the available or chosen data is used to train the model. The data is usually split into three subsets: the training data, the test data, and the validation data. You can use the training and test data alone, or use all three subsets.

[Training data = for model training

Validation data = for model hyperparameter tuning

Test data = for final accuracy estimation on data the model has not seen]

There are many ways to work with these subsets. Typically the training data is 60% of the total dataset, the test data is 20%, and the validation data comprises the remaining 20%.

The quality of the trained model is tested by first training the model on just the training data and then comparing its predictions on the held-out data with the actual values. In this manner, we can tell how well the model predicts data points it has not seen. There are many variations of cross-validation:

Hold-Out: The data is divided once into training data and test data, and the model trained on the former is evaluated on the latter. In the Hold-Out method, only one such split is used, with the test portion kept on hold during training.

For example, with 100 samples, 60 go to training, 20 to test, and 20 to the validation dataset. During training you calculate the accuracy of the model on the training data. The test data is used to measure accuracy after the model has been trained.
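The 60/20/20 split of 100 samples can be sketched as below. The function name, the fixed seed and the choice to shuffle before cutting are illustrative assumptions, not something the article prescribes.

```python
import random


def three_way_split(samples, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle once, then cut into training, test and validation subsets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)
    training = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # whatever remains
    return training, test, validation


training, test, validation = three_way_split(range(100))
print(len(training), len(test), len(validation))
```

Shuffling first matters when the rows are stored in some meaningful order (by date or by class, say); for time series you would usually cut chronologically instead.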

K-Fold Cross Validation: Here the data is divided into k sets, or folds (suppose k=10). In the first iteration, the first fold is held out as the validation set and the model is trained on the remaining k-1 folds (9 folds). In each subsequent iteration a different fold is held out for validation, until every fold has been used for validation exactly once, and the k scores are averaged. This method is effective, yet it requires roughly k times the computation of a single training run.

Example for k-fold cross validation with 10 folds

Leave-One-Out: This method is more painstaking, as a single data point is held out each time and the model is trained on the rest, repeated for all n data points.
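A minimal sketch of how the folds can be generated follows. The helper names and the toy "model" (it just predicts the mean of the training targets) are hypothetical and only there to make the example runnable; Leave-One-Out falls out as the special case where k equals the number of samples.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each fold is the validation set exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


def cross_validate(ys, k):
    """Average validation error of a toy model that predicts the training mean."""
    fold_errors = []
    for train_idx, val_idx in k_fold_indices(len(ys), k):
        prediction = sum(ys[i] for i in train_idx) / len(train_idx)
        errors = [(ys[i] - prediction) ** 2 for i in val_idx]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / len(fold_errors)


targets = [3.0, 5.0, 4.0, 6.0, 5.0, 7.0, 6.0, 8.0, 7.0, 9.0]
print(cross_validate(targets, k=5))              # 5-fold
print(cross_validate(targets, k=len(targets)))   # leave-one-out
```

In practice the data is usually shuffled before the folds are cut, and the toy mean predictor would be replaced by the model being evaluated.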

Dropout:

The dropout method is used when working with neural networks in deep learning. Dropout is a well-established technique that has been shown to help the accuracy of models by deactivating some activations in a layer (setting them to 0). We can choose what fraction of a layer’s activations to drop. Usually, this is in the range of 20 or 30 percent. Suppose we use 30% dropout: then the activations of a random 30% of the neurons in that particular layer are deactivated during each training step. The deactivated neurons’ outputs are not propagated to the next layer of the network. We do this to avoid overfitting, as the injected noise keeps the network from relying too heavily on any single neuron and makes the model more robust.

Dropout method: Here, some neurons have been deactivated (colored red, right). Suppose the activation is x; under dropout it is set to zero.

Intuitively, this forces the network to be accurate even in the absence of certain information. The dropout rate is decided before training.
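A minimal sketch of dropout on one layer's activations is shown below. It assumes the common "inverted dropout" variant, in which the surviving activations are scaled by 1/(1 - rate) so that the expected output stays the same; that scaling is an assumption added here and is not described in the article.

```python
import random


def dropout(activations, rate=0.3, training=True, rng=random):
    """Zero each activation with probability `rate`; scale survivors by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        return list(activations)  # dropout is switched off at prediction time
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(42)
layer_output = [1.0] * 10000
dropped = dropout(layer_output, rate=0.3, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_activation = sum(dropped) / len(dropped)
print(f"deactivated: {zero_fraction:.1%}, mean activation: {mean_activation:.3f}")
```

Roughly 30% of the activations end up at zero, while the mean activation stays close to the original value of 1 because of the rescaling.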

Gradient Noise:

This method involves adding noise to the gradients during training, which has been shown to increase the accuracy of a model. Refer to the paper Adding Gradient Noise Improves Learning for Very Deep Networks.

Adding noise sampled from Gaussian Distribution:
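As a rough sketch, Gaussian noise with a variance that decays over time can be added to each gradient before the weight update. The code assumes an annealed variance of the form eta / (1 + t)^gamma, as described in the paper cited above; the particular values eta = 0.3 and gamma = 0.55, the learning rate, and the toy objective (w - 3)^2 are illustrative assumptions.

```python
import random


def noise_std(step, eta=0.3, gamma=0.55):
    """Standard deviation of the noise at a training step; it shrinks as training proceeds."""
    return (eta / (1.0 + step) ** gamma) ** 0.5


def add_gradient_noise(gradients, step, eta=0.3, gamma=0.55, rng=random):
    """Return the gradients with zero-mean Gaussian noise added to each entry."""
    sigma = noise_std(step, eta, gamma)
    return [g + rng.gauss(0.0, sigma) for g in gradients]


# Toy use: minimise (w - 3)^2 with noisy gradient descent.
rng = random.Random(0)
w = 0.0
for step in range(500):
    gradient = [2.0 * (w - 3.0)]
    noisy = add_gradient_noise(gradient, step, rng=rng)
    w -= 0.1 * noisy[0]
print(w)  # ends close to the minimum at 3
```

Because the noise is large early on and small later, it encourages exploration at the start of training without preventing convergence at the end.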

Regularization:

Regularization is yet another popular method of reducing overfitting. Used to resolve a problem of high variance, the technique involves penalising large coefficients and weights, giving up a little accuracy on the training data in exchange for higher accuracy on the test data.

Here, w is a weight value, the red box represents the Regularization Term, and lambda is the Regularization Parameter, a hyperparameter that is chosen before training and usually tuned on validation data. The remaining term is the loss function, which computes the sum of the squared errors.
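The structure described here, a squared-error loss plus a penalty on the weights scaled by lambda, can be sketched as follows. The original equation is not reproduced in the text, so this assumes an L2 (ridge) penalty, lambda times the sum of squared weights, with the bias left unpenalized; the function name and the data are hypothetical.

```python
def ridge_loss(weights, bias, xs, ys, lam):
    """Sum of squared errors plus lam * (sum of squared weights)."""
    squared_error = 0.0
    for features, target in zip(xs, ys):
        prediction = bias + sum(w * f for w, f in zip(weights, features))
        squared_error += (target - prediction) ** 2
    penalty = lam * sum(w * w for w in weights)  # the regularization term
    return squared_error + penalty


xs = [[1.0], [2.0], [3.0]]
ys = [2.0, 4.0, 6.0]

print(ridge_loss([2.0], 0.0, xs, ys, lam=0.0))  # perfect fit, no penalty
print(ridge_loss([2.0], 0.0, xs, ys, lam=0.5))  # same fit, penalized for its weight
```

A larger lambda makes large weights more expensive, which pushes the optimizer toward smaller weights and a smoother, lower-variance model.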

Vanishing and Exploding Gradients Problems

When training a deep neural network using back-propagation, you may keep adding more hidden layers to the network. This builds a highly complex model but compromises on the speed of training. The problem of vanishing gradients occurs here when using a sigmoid or tanh activation function, two of the functions used to fire the neurons of a neural network, because their derivatives determine how the gradients behave as they pass back through the layers.

During training, the gradients of the weight matrices are calculated and then, scaled by the learning rate, subtracted from those same matrices. However, if the model has a lot of layers, some gradients eventually shrink to nearly zero, so the corresponding weight values stop changing and those layers stop learning. This effect of shrinking gradients grows as you backpropagate through the layers, so it is the earlier layers that stop learning first.

Gradient Descent and Vanishing/Exploding Gradients

To be more clear, consider back-propagation through a sigmoid activation function, whose outputs lie between 0 and 1. If the input to the function is a large positive value, the output saturates near 1, and during back-propagation the derivative is close to 0, so the differences between large values are lost. The same happens in the other direction: large negative inputs saturate near 0. To avoid such vanishing gradients, other activation functions such as ReLU, PReLU, SELU and ELU are used.

A Tanh function
A sigmoid function. Notice that the output stays nearly constant for inputs below -6 and above 6.
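The shrinking effect is easy to check numerically. The sketch below multiplies sigmoid derivatives along a chain of layers; it deliberately ignores the weights (treating them as 1), which is a simplifying assumption made only to isolate the contribution of the activation function.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def gradient_scale(pre_activations):
    """Product of sigmoid derivatives along a chain of layers (weights assumed to be 1)."""
    scale = 1.0
    for z in pre_activations:
        scale *= sigmoid_derivative(z)
    return scale


print(sigmoid_derivative(0.0))      # 0.25, the largest value the derivative can take
print(sigmoid_derivative(6.0))      # already tiny once the sigmoid has saturated
print(gradient_scale([0.0] * 10))   # 0.25 ** 10 after only ten layers
```

Even in the best case each sigmoid layer multiplies the gradient by at most 0.25, so after ten layers the signal reaching the first layer is below one millionth of its original size.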

Answer to Vanishing and Exploding Gradients Problem

Activation functions — ReLU, PReLU, RReLU, ELU

ReLU (Rectified Linear Unit): So that values greater than zero do not saturate, ReLU passes them through unchanged, f(x) = max(0, x), which is linear and unbounded for positive inputs. However, ReLU is faulted for setting values lower than zero to zero, which increases speed but is not so good in some cases, as it discards those values altogether. And when a neuron’s inputs stay below zero, its gradient is zero and ReLU prevents it from training at all.

ReLU

PReLU (Parametric Rectified Linear Unit): Considered an improvement on ReLU, PReLU does not deactivate values below zero, yet keeps the speed. It alleviates saturation by replacing the fixed negative slope of 0.01 used by Leaky ReLU with a learned parameter ‘α’.

RReLU (Randomized Leaky Rectified Linear Unit): RReLU is reported to outperform the above activation functions in some benchmarks. During training, RReLU draws the negative slope at random from a fixed range, which acts as a regularizer without compromising much on speed or accuracy.

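ELU (Exponential Linear Unit): ELU passes values above zero through unchanged and maps values below zero onto a smooth exponential curve that levels off at -α (commonly α = 1), so negative inputs still carry a gradient. Mostly employed for higher accuracy in classification, ELU speeds up training.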

Refer to the article here for equations and detailed explanation of these functions.
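For quick reference, the four functions can be written out as plain scalar functions. This is a sketch under stated assumptions: the RReLU slope range (1/8 to 1/3) and the use of the range midpoint outside training are common defaults chosen here for illustration, and in a real network the PReLU `alpha` would be learned rather than passed in by hand.

```python
import math
import random


def relu(x):
    """Positive inputs pass through unchanged; negative inputs become 0."""
    return max(0.0, x)


def prelu(x, alpha):
    """Like ReLU, but negative inputs keep a slope `alpha` (learned in practice)."""
    return x if x > 0 else alpha * x


def rrelu(x, lower=1 / 8, upper=1 / 3, training=True, rng=random):
    """Negative slope drawn at random during training, fixed to the midpoint otherwise."""
    if x > 0:
        return x
    slope = rng.uniform(lower, upper) if training else (lower + upper) / 2
    return slope * x


def elu(x, alpha=1.0):
    """Negative inputs follow alpha * (e^x - 1), which levels off at -alpha."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


for value in (-10.0, -1.0, 0.5, 3.0):
    print(value, relu(value), prelu(value, 0.25), rrelu(value, training=False), elu(value))
```

All four agree on positive inputs; they differ only in how much of a negative input, and therefore how much gradient, they let through.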

Normalization:

Normalization helps with overfitting, underfitting and vanishing gradient problems.

Batch Normalization: Batch normalization is used to improve the performance of back-propagation. It involves rescaling the inputs to a layer, using the mean and variance computed over the current batch, to prevent them from becoming too big or too small.

Instance Normalization: Instance normalization computes the normalization statistics from a single sample, instead of from a batch of samples as in batch normalization.
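The difference between the two comes down to which values the mean and variance are computed over. The sketch below works on a small batch stored as a list of rows (one row per sample); it omits the learnable scale and shift and the running statistics that real batch normalization layers keep, and it treats each row as a single channel, all of which are simplifying assumptions.

```python
def normalize(values, eps=1e-5):
    """Shift to zero mean and scale to (roughly) unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (variance + eps) ** 0.5 for v in values]


def batch_norm(batch):
    """Normalize each feature (column) using statistics taken across the batch."""
    columns = [normalize(list(column)) for column in zip(*batch)]
    return [list(row) for row in zip(*columns)]


def instance_norm(batch):
    """Normalize each sample (row) using only that sample's own values."""
    return [normalize(row) for row in batch]


batch = [
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 200.0],
    [3.0, 30.0, 300.0],
]
print(batch_norm(batch))     # every column now has mean 0
print(instance_norm(batch))  # every row now has mean 0
```

After batch normalization the three features, which originally differed by factors of ten, end up on the same scale.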

Multicollinearity

Multicollinearity occurs when predictor variables in a model are correlated with one another. Most practitioners are familiar with this phenomenon, and it is very common in regression models. Multicollinearity becomes a problem mainly when you need to know why a certain prediction happened, i.e., when the reason for the prediction is needed, because it distorts the explanations for the model’s predictions. Sometimes a heavily correlated column can seem to be the cause of certain outcomes when in reality it is only correlated with them.

Finding multicollinearity within a dataset can prevent seriously wrong conclusions about certain results. One example is pneumonia patients with asthma, who were thought to be more resistant to pneumonia because they had better outcomes. The truth was that asthma patients were given immediate care when they contracted pneumonia, as they are more prone to a fatal outcome without immediate treatment.

Credits: creativewisdom.com

Answer to Multicollinearity

Autocorrelation & Partial Autocorrelation Tests: These are tests that can detect correlation in the data a model is built on. They are usually used during Time Series Analysis and Forecasting. With these tests you can detect where correlation occurs and remove highly correlated columns.

Autocorrelation: It detects correlation, or the occurrence of repeated signals, in the data, mostly in Time-Series Analysis and Forecasting. It can happen between two dependent variables, x1 and x2.

Principal Components Analysis (PCA):

Principal Component Analysis is used to correct correlation errors. It builds a set of new predictor variables, each a combination of the original variables that are highly correlated. So instead of dropping those correlated variables, which have their own role in the model, the new variables retain the behavior of the otherwise redundant and correlated variables. It works through feature extraction.

Plot that analyses the Principal Components of a Dataset through Feature Extraction
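To make the idea concrete, the sketch below replaces two highly correlated predictors with a single new variable, the first principal component. It finds that component by power iteration on the covariance matrix, which is one simple way to do it and is swapped in here for brevity; library implementations typically use a singular value decomposition instead. The data is made up.

```python
def center(rows):
    """Subtract each column's mean."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in rows]


def covariance(centered):
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_component(cov, iterations=100):
    """Dominant eigenvector of `cov` by power iteration (start vector is arbitrary)."""
    d = len(cov)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


# Hypothetical predictors: the second is roughly twice the first.
rows = [[2.0, 4.1], [3.0, 5.9], [4.0, 8.2], [5.0, 9.8]]
centered = center(rows)
cov = covariance(centered)
pc1 = first_component(cov)

# One new variable per row replaces the two correlated originals.
scores = [sum(c * v for c, v in zip(row, pc1)) for row in centered]
explained = (sum(pc1[i] * cov[i][j] * pc1[j] for i in range(2) for j in range(2))
             / (cov[0][0] + cov[1][1]))
print(pc1, scores, explained)
```

Here nearly all of the variance in the two columns is captured by the single component, so the model can use `scores` in place of both original predictors.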

Linear Discriminant Analysis(LDA):

LDA is used in predictive analytics problems, specifically for classification.

Logistic Regressions are Classification Algorithms

It assumes that a set of new inputs will belong to the classes in the dataset collected up till now. When using logistic regression, certain limitations such as instability of the model can occur. Instead, we can use LDA for classification. This algorithm also uses the famous Bayes Theorem to calculate the probability that an input belongs to each output class.

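P(Y=k|X=x) = (PIk * fk(x)) / sum_l(PIl * fl(x))

Here PIk is the prior probability of class k, fk(x) is the probability density of the input x within class k, and the sum in the denominator runs over all classes l.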
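The Bayes formula above can be evaluated directly once a form is chosen for the class densities. The sketch assumes a single input feature and Gaussian densities that share one standard deviation across classes, which is the usual LDA assumption; the priors, means and standard deviation used are made-up numbers.

```python
import math


def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def lda_posteriors(x, priors, means, shared_std):
    """Posterior probability of each class: prior * density, divided by the sum over classes."""
    weighted = [p * gaussian_pdf(x, m, shared_std) for p, m in zip(priors, means)]
    total = sum(weighted)
    return [w / total for w in weighted]


priors = [0.5, 0.5]   # hypothetical class priors
means = [0.0, 2.0]    # hypothetical class means

print(lda_posteriors(1.0, priors, means, shared_std=1.0))  # halfway between the means
print(lda_posteriors(0.0, priors, means, shared_std=1.0))  # sitting on the first class mean
```

An input halfway between the two class means is assigned equal probability for both classes, while an input at the first class mean is assigned mostly to the first class.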

Pearson correlation coefficient:

The Pearson coefficient is used to find the correlation between two variables X and Y. It gives a value between -1 and 1 that describes a negative or positive correlation; a value of zero means there is no linear correlation.
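A minimal sketch of the coefficient, computed as the covariance of the two variables divided by the product of their standard deviations. The example values are made up.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return covariance / (spread_x * spread_y)


print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))        # rises together: close to 1
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))        # one rises, one falls: close to -1
print(pearson([1, 2, 3, 4, 5], [4, 1, 0, 1, 4]))  # U-shaped relation: 0
```

The last case is worth noting: the two variables are clearly related, but not linearly, so the coefficient is zero. A low Pearson value on its own does not prove that two predictors are unrelated.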

Autocorrelation & Partial Autocorrelation Tests:

Autocorrelation measures the degree of correlation between a series and earlier values of the same series. The AutoCorrelation Function (ACF) is used to calculate these correlations in a Time Series: the observations of the Time Series are correlated with previously collected observations of the same series, hence the name autocorrelation. The aim of the ACF is to compute, and plot, the correlation for each lag. The lag here is the number of time steps separating an observation from the earlier observation it is compared with. Partial autocorrelation measures the correlation at a given lag after removing the effect of the shorter lags in between.
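The sample autocorrelation at a given lag can be sketched as below, using the common estimator that divides the lagged covariance by the overall variance of the series. The repeating toy series is an assumption made for illustration.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` with itself shifted by `lag` steps."""
    n = len(series)
    mean = sum(series) / n
    variance_sum = sum((x - mean) ** 2 for x in series)
    lagged_sum = sum((series[t] - mean) * (series[t + lag] - mean)
                     for t in range(n - lag))
    return lagged_sum / variance_sum


# Hypothetical series that repeats every 4 steps.
series = [1.0, 0.0, -1.0, 0.0] * 6

for lag in range(5):
    print(lag, round(autocorrelation(series, lag), 3))
```

The value is 1 at lag 0, strongly negative at lag 2 (half a cycle) and strongly positive again at lag 4 (a full cycle), which is the kind of repeating pattern an ACF plot is used to reveal.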
