A Machine Learning Approach to IBM Employee Attrition and Performance

Predicting the Attrition of Valuable Employees

IT firms organise their engineering teams in several ways. Some firms, or particular departments or levels within them, follow the chief programmer structure: a “star” organisation built around a “chief” position, which is assigned to the Engineer who best understands the system requirements.

Chief Programmer Architecture

Others follow an egoless (democratic) structure, in which all the Engineers are at the same level and are assigned to different jobs such as Front-End Design, Back-End Coding and Software Testing. This structure is not followed by very big or Multi-National Software Giants, but all in all it is a successful structure that makes for a friendly working environment.

Egoless (Democratic) Architecture

The third type is the mixed structure, a combination of the two types above. It is the most widely followed structure and is very common among software giants.

Mixed Controlled Architecture

International Business Machines Corporation (IBM) probably follows either the egoless or the mixed structure. For its HR Department, an important responsibility is to measure Employee Attrition at regular intervals. The factors on which Employee Attrition depends include:

  1. Age of the Employee
  2. Monthly Income
  3. Overtime
  4. Monthly Rate
  5. Distance from Home
  6. Years at Company

and so on…

IBM also made its Employee Information publicly available, with the problem statement:

“Predict the Attrition of the Employees, i.e., will there be attrition of the employees or not, given the Employee Details, i.e., the factors responsible for attrition”

The Employee Dataset is made available at Kaggle:

One possible solution is to apply Machine Learning: develop a Predictive Model by training it on the available data, then validate it to analyse Model Performance.

Given below is a step-by-step procedure of Machine Learning Model Development using Python and Scikit-Learn Machine Learning Toolbox:

  1. Model Development:
#importing all the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import pylab as pl
from sklearn.metrics import roc_curve, auc
#loading the dataset using Pandas
df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')
df.head()# Output shown below

Pandas Dataframe Output of the Dataset
#checking whether the dataset contains any missing values...
df.shape == df.dropna().shape # Output shown below

Hence, there are no missing values present in the dataset.
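
The shape comparison above only tells us whether any row contains a missing value. A minimal sketch of a per-column count is given below. It uses a small made-up frame (`toy` is hypothetical, not the IBM data) so that it runs on its own:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the HR dataset (values are made up).
toy = pd.DataFrame({
    "Age": [41, 49, np.nan, 33],
    "OverTime": ["Yes", "No", "Yes", None],
    "MonthlyIncome": [5993, 5130, 2090, 2909],
})

# Number of missing entries in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Same check as the shape comparison: True only if no row has a missing value.
no_missing = toy.shape == toy.dropna().shape
print(no_missing)
```

On the real dataframe the same call would be `df.isna().sum()`.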

This is a Binary Classification Problem, so the distribution of instances between the 2 classes is visualized below:

#number of instances in each class
y_bar = np.array([df[df['Attrition']=='No'].shape[0],
                  df[df['Attrition']=='Yes'].shape[0]])
x_bar = ['No (0)', 'Yes (1)']
#Bar Visualization
plt.bar(x_bar, y_bar)
plt.xlabel('Labels/Classes')
plt.ylabel('Number of Instances')
plt.title('Distribution of Labels/Classes in the Dataset')
# Output shown below

Bar Visualization of the Class Distribution
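
The class distribution also gives a useful reference point for accuracy: a model that always predicts the most frequent class is right as often as that class occurs. A rough sketch with made-up labels (`labels` is hypothetical, not the real column):

```python
import pandas as pd

# Hypothetical labels (not the real data): 8 "No" and 2 "Yes".
labels = pd.Series(["No"] * 8 + ["Yes"] * 2, name="Attrition")

counts = labels.value_counts()                 # instances per class
shares = labels.value_counts(normalize=True)   # fraction per class

# Accuracy of a trivial model that always predicts the most frequent class.
majority_baseline = shares.max()
print(counts.to_dict(), round(majority_baseline, 2))
```

On the real data the same idea would apply to `df['Attrition']`.
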
#Label Encoding for Categorical/Non-Numeric Data
X = df.iloc[:,[0] + list(range(2,35))].values
y = df.iloc[:,1].values
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X_1 = LabelEncoder()
X[:,1] = labelencoder_X_1.fit_transform(X[:,1])
X[:,3] = labelencoder_X_1.fit_transform(X[:,3])
X[:,6] = labelencoder_X_1.fit_transform(X[:,6])
X[:,10] = labelencoder_X_1.fit_transform(X[:,10])
X[:,14] = labelencoder_X_1.fit_transform(X[:,14])
X[:,16] = labelencoder_X_1.fit_transform(X[:,16])
X[:,20] = labelencoder_X_1.fit_transform(X[:,20])
X[:,21] = labelencoder_X_1.fit_transform(X[:,21])
y = labelencoder_X_1.fit_transform(y)
#Feature Selection using Random Forest Classifier's Feature
#Importance Scores
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X,y) # Output shown below

list_importances = list(model.feature_importances_)
indices = sorted(range(len(list_importances)),
                 key=lambda k: list_importances[k])
#feature indices (into X) ordered from most to least important
feature_selected = list(reversed(indices))
X_selected = X[:, feature_selected[:17]]
#X leaves out the 'Attrition' column (df column 1), so the names are
#looked up in the same column subset that was used to build X
feature_names = df.columns[[0] + list(range(2, 35))]
l_features = np.array([feature_names[x] for x in feature_selected])
#Extracting 17 most important features among 34 features
l_features[:17] #Output shown below
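
The ranking step can also be written with `np.argsort`. The sketch below is only an illustration on synthetic data: `X_toy`, `y_toy` and the names `f0` to `f4` are made up, and only column 2 carries any signal, so it should come out on top:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
# Synthetic data: only column 2 determines the label; the others are noise.
X_toy = rng.normal(size=(300, 5))
y_toy = (X_toy[:, 2] > 0).astype(int)
names = np.array(["f0", "f1", "f2", "f3", "f4"])  # hypothetical feature names

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_toy, y_toy)

# argsort is ascending, so reverse it to rank from most to least important.
order = np.argsort(forest.feature_importances_)[::-1]
top_k = 2
print(names[order[:top_k]])

# Keep only the top_k columns, in order of importance.
X_top = X_toy[:, order[:top_k]]
```

Fixing `random_state` makes the ranking repeatable. Without it, the importance scores of a Random Forest change slightly from run to run.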

#Selecting the 17 most important features from the original
#dataframe, together with the 'Attrition' label
df_selected = df[['Age', 'MonthlyIncome', 'OverTime',
                  'EmployeeNumber', 'MonthlyRate',
                  'DistanceFromHome', 'YearsAtCompany',
                  'TotalWorkingYears', 'DailyRate',
                  'HourlyRate', 'NumCompaniesWorked',
                  'JobInvolvement', 'PercentSalaryHike',
                  'StockOptionLevel',
                  'YearsWithCurrManager',
                  'EnvironmentSatisfaction',
                  'EducationField', 'Attrition']]
df_selected.head() # Output shown below

Label encoding therefore has to be done again for the selected categorical features:

#Label Encoding for selected Non-Numeric Features:
X = df_selected.iloc[:,list(range(0,17))].values
y = df_selected.iloc[:,17].values
X[:,2] = labelencoder_X_1.fit_transform(X[:,2])
X[:,16] = labelencoder_X_1.fit_transform(X[:,16])
y = labelencoder_X_1.fit_transform(y)
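
To see what the encoder actually does, here is a small sketch on made-up values in the style of the OverTime column (the array `overtime` is hypothetical). Note that every `fit_transform` call refits the encoder, so an encoder object that is reused across columns only remembers the mapping of the last column it saw:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical column values in the style of the OverTime feature.
overtime = np.array(["Yes", "No", "No", "Yes"])

encoder = LabelEncoder()
codes = encoder.fit_transform(overtime)

# classes_ is sorted, so "No" maps to 0 and "Yes" maps to 1.
print(list(encoder.classes_), list(codes))

# inverse_transform recovers the original strings from the codes.
restored = encoder.inverse_transform(codes)
```

If the original category names are needed later, one option is to keep a separate encoder per column.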

Now the Data Pre-Processing Steps are over. Let’s move on to Model Training:

#80-20 splitting where 80% Data is for Training the Model
#and 20% Data is for Validation and Performance Analysis

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=1753)
# Using Logistic Regression Algorithm for Model Training
from sklearn.linear_model import LogisticRegression
clf= LogisticRegression(verbose = 3)
# Training the Model
clf_trained = clf.fit(X_train, y_train) #Output shown below

The output shows the optimization library (solver) that Logistic Regression used to fit its parameters
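
It may help to see how a trained binary Logistic Regression turns its linear score into a prediction. The sketch below uses synthetic data from `make_classification` (not the HR dataset), and the names `toy_clf`, `scores` and `probs` are only for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data (not the HR dataset).
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

toy_clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

scores = toy_clf.decision_function(X_toy)   # linear score w.x + b
probs = 1.0 / (1.0 + np.exp(-scores))       # sigmoid of the score

# predict() returns class 1 exactly when the score is positive,
# i.e. when the probability of class 1 is above 0.5.
manual_pred = (scores > 0).astype(int)

print(np.allclose(probs, toy_clf.predict_proba(X_toy)[:, 1]))
print(np.array_equal(manual_pred, toy_clf.predict(X_toy)))
```

The same `decision_function` scores are what the ROC code further down feeds to `roc_curve`.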

2. Model Performance Analysis:

=>Training Accuracy

clf_trained.score(X_train,y_train) # Output shown below

Training Accuracy of 84.44% is achieved by the model

=>Validation Accuracy

clf_trained.score(X_test,y_test) # Output shown below

Validation Accuracy of 89.12% is achieved by the model

=>Precision, Recall and F1-Score

#getting the predictions...
from sklearn.metrics import classification_report
predictions=clf_trained.predict(X_test)
print(classification_report(y_test,predictions))

Classification Report of the model

=>Confusion Matrix

#MODULE FOR CONFUSION MATRIX
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import itertools
from sklearn.metrics import confusion_matrix
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        #divide each row by its total so every row sums to 1
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]),
                                  range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
#Generating the Confusion Matrix
plt.figure()
#computed from the validation predictions (the original listing also
#hard-coded cm = np.array([[252, 1], [31, 10]]) but never used it)
cm = confusion_matrix(y_test, predictions)
plot_confusion_matrix(cm, classes=[0,1], normalize=True,
                      title='Normalized Confusion Matrix')
# Output shown below

Normalized Confusion Matrix
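
The per-class metrics can be worked out by hand from the raw counts. The sketch below assumes that the matrix hard-coded in the listing above, `[[252, 1], [31, 10]]`, holds the validation counts of this run, with rows as true classes and columns as predicted classes (the scikit-learn convention). The name `majority_baseline` is introduced here only for illustration:

```python
import numpy as np

# Counts hard-coded in the article's listing (assumed to be the validation
# counts); rows = true class, columns = predicted class.
cm = np.array([[252, 1], [31, 10]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)          # of predicted leavers, how many left
recall = tp / (tp + fn)             # of actual leavers, how many were found
f1 = 2 * precision * recall / (precision + recall)
majority_baseline = (tn + fp) / cm.sum()   # always predicting "No"

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f} baseline={majority_baseline:.4f}")
```

Under that assumption the accuracy works out to 262/294, which matches the 89.12% reported above. Recall on the attrition class is only 10/41 (about 24%), and always predicting "No" would already score 253/294 (about 86%). So accuracy alone should be read with some care on this dataset.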

=>Receiver Operating Characteristic Curve:

#Plotting the ROC Curve
y_roc = np.array(y_test)
fpr, tpr, thresholds = roc_curve(y_roc, clf_trained.decision_function(X_test))
roc_auc = auc(fpr, tpr)
pl.clf()
pl.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
pl.plot([0, 1], [0, 1], 'k--')
pl.xlim([0.0, 1.0])
pl.ylim([0.0, 1.0])
pl.xlabel('False Positive Rate')
pl.ylabel('True Positive Rate')
pl.legend(loc="lower right")
pl.show() # Output shown below

Receiver Operating Characteristic Curve (ROC Curve)
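
For readers who want to check the area under the curve on a small case, here is a sketch with four made-up labels and scores (`y_true` and `y_score` are toy values, not model output):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Toy labels and scores (made up for illustration).
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
area = auc(fpr, tpr)                      # trapezoidal area under the curve

# roc_auc_score computes the same quantity directly from labels and scores.
direct = roc_auc_score(y_true, y_score)
print(area, direct)
```

The area is the chance that a randomly picked positive example gets a higher score than a randomly picked negative one. Here 3 of the 4 positive-negative pairs are ordered correctly, which gives 0.75.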

According to the Performance Analysis, it can be concluded that the Machine Learning Predictive Model correctly classified 89.12% of the unseen (Validation Set) examples and has shown quite decent figures for the other performance metrics.

Hence, in this way an Employee Attrition Predictive Model can be developed using Data Analysis and Machine Learning.

I have deployed this model in a Web Application that uses PHP (PHP: Hypertext Preprocessor) as the back-end, with the help of PHP-ML. The link to the Web-App is given below:

For Personal Contacts regarding the article or the Web-App or discussions on Machine Learning or any department of Data Science, feel free to reach out to me on LinkedIn.
