How to use deep learning to make predictions on tabular data with TensorFlow Keras

Roy Lee
9 min read · Oct 5, 2020


It has been ages since I last wrote a post. My role at my job has changed and I no longer do engineering work, but I take every opportunity to learn new technical skills. This time, it’s deep learning!

Deep learning has many use cases, with image classification probably being the most popular. However, my approach to learning has always been to make sure what I learn applies to my “daily life”. With this in mind, I chose to work on tabular / structured data.

Story

Organizations often seek to discover the secret of keeping their best talent. However, this is not as straightforward as it seems.

The goal of this project is to use deep learning to predict the chance of employee attrition, based on the Employee Attrition dataset of employees taken from Kaggle.

Code-first learning (TL;DR)

For those of you who prefer to dive right into code, refer to my GitHub repository, Employee Attrition Predictor. The notebook was built on Google Colab.

THE FIRST MODEL

The First Model is based on Red Dragon AI’s class example

Steps

Install and import packages

pip -q install tf-nightly

%matplotlib inline
import matplotlib.pyplot as plt
import math
import tensorflow as tf
import numpy as np
from numpy import unique
import pandas as pd
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.callbacks import ModelCheckpoint, Callback, TensorBoard

Read Files

This example reads the files from my Google Drive, but if you prefer, here’s a copy on my GitHub.

# Training and validation dataset
file_url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vRTzmPbXWcC6mfBDE1MBg5HoHsYlvYtkZp8oJFHfIMNzqiG6P4cdGaceWsxW9JS6ip9vdJYCNrDEbOx/pub?gid=581336355&single=true&output=csv"
dataframe = pd.read_csv(file_url)
dataframe.shape

# Prediction dataset
pred_file_url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vRsxR0nTrEaCqfro4FvGNDn6ZYYdQS0e2Tev1SMtJ5jYIjU0WGp77hp6btdJYkMl3XAk4lA01hxE30o/pub?gid=2057317048&single=true&output=csv"
pred_dataframe = pd.read_csv(pred_file_url)
pred_dataframe.shape

Drop columns

Drop the dataframe columns that are not useful for training (the string 'Attrition' column is dropped because the binary 'AttritionBinary' column serves as the label).

dataframe = dataframe.drop(['id','Attrition','EmployeeCount','EmployeeNumber'], axis=1)

Preprocess data

Split the dataframe into training and validation sets, then convert the dataframes to tf.data datasets.

val_dataframe = dataframe.sample(frac=0.2, random_state=1337)
train_dataframe = dataframe.drop(val_dataframe.index)

print(
    "Using %d samples for training, %d for validation and %d for predicting"
    % (len(train_dataframe), len(val_dataframe), len(pred_dataframe))
)

def dataframe_to_dataset(dataframe):
    dataframe = dataframe.copy()
    labels = dataframe.pop("AttritionBinary")
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    ds = ds.shuffle(buffer_size=len(dataframe))
    return ds

def pred_dataframe_to_dataset(dataframe):
    dataframe = dataframe.copy()
    ds = tf.data.Dataset.from_tensor_slices(dict(dataframe))
    ds = ds.shuffle(buffer_size=len(dataframe))
    return ds

train_ds = dataframe_to_dataset(train_dataframe)
val_ds = dataframe_to_dataset(val_dataframe)
pred_ds = pred_dataframe_to_dataset(pred_dataframe)

for x, y in train_ds.take(1):
    print("Input:", x)
    print("Target:", y)

train_ds = train_ds.batch(32)
val_ds = val_ds.batch(32)

Normalize integers, convert strings, and do one-hot encoding

Feature preprocessing with Keras layers

The following features are categorical features encoded as integers:

  • Education
  • EnvironmentSatisfaction
  • JobInvolvement
  • JobLevel
  • JobSatisfaction
  • PerformanceRating
  • RelationshipSatisfaction
  • StandardHours
  • StockOptionLevel
  • WorkLifeBalance

We will one-hot encode these features using the CategoryEncoding() layer.

We also have some categorical features encoded as strings:

  • BusinessTravel
  • Department
  • EducationField
  • Gender
  • JobRole
  • MaritalStatus
  • Over18
  • OverTime

We will first build an index of all possible feature values using the StringLookup() layer, then we will one-hot encode the output indices using a CategoryEncoding() layer.
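As a minimal illustration of that two-step pipeline, here is a toy sketch on hypothetical values (not from the dataset), assuming the same tf-nightly experimental preprocessing API used in the utility functions below:

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import StringLookup, CategoryEncoding

departments = np.array([["Sales"], ["R&D"], ["HR"], ["R&D"]])  # toy values

index = StringLookup()          # builds a vocabulary from the strings it sees
index.adapt(departments)
indices = index(departments)    # strings -> integer indices

encoder = CategoryEncoding(output_mode="binary")
encoder.adapt(indices.numpy())  # learns how many index slots are needed
print(encoder(indices).numpy()) # one row per sample, with a 1 at each sample's index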

Finally, the following features are continuous numerical features:

  • Age
  • DailyRate
  • DistanceFromHome
  • HourlyRate
  • MonthlyIncome
  • MonthlyRate
  • NumCompaniesWorked
  • PercentSalaryHike
  • TotalWorkingYears
  • TrainingTimesLastYear
  • YearsAtCompany
  • YearsInCurrentRole
  • YearsSinceLastPromotion
  • YearsWithCurrManager

For each of these features, we will use a Normalization() layer to make sure the mean of each feature is 0 and its standard deviation is 1.
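To make this concrete, here is a minimal sketch (with made-up numbers) of what a Normalization() layer does once it has been adapted to a feature:

import numpy as np
from tensorflow.keras.layers.experimental.preprocessing import Normalization

ages = np.array([[25.0], [40.0], [55.0]])  # made-up values

normalizer = Normalization()
normalizer.adapt(ages)            # learns the feature's mean and variance
print(normalizer(ages).numpy())   # output now has mean ~0 and std ~1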

Below, we use 3 utility functions to do the operations (all 3 functions were written by fchollet):

  • encode_numerical_feature to apply featurewise normalization to numerical features.
  • encode_string_categorical_feature to first turn string inputs into integer indices, then one-hot encode these integer indices.
  • encode_integer_categorical_feature to one-hot encode integer categorical features.
from tensorflow.keras.layers.experimental.preprocessing import Normalization
from tensorflow.keras.layers.experimental.preprocessing import CategoryEncoding
from tensorflow.keras.layers.experimental.preprocessing import StringLookup

def encode_numerical_feature(feature, name, dataset):
    # Create a Normalization layer for our feature
    normalizer = Normalization()

    # Prepare a Dataset that only yields our feature
    feature_ds = dataset.map(lambda x, y: x[name])
    feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))

    # Learn the statistics of the data
    normalizer.adapt(feature_ds)

    # Normalize the input feature
    encoded_feature = normalizer(feature)
    return encoded_feature

def encode_string_categorical_feature(feature, name, dataset):
    # Create a StringLookup layer which will turn strings into integer indices
    index = StringLookup()

    # Prepare a Dataset that only yields our feature
    feature_ds = dataset.map(lambda x, y: x[name])
    feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))

    # Learn the set of possible string values and assign them a fixed integer index
    index.adapt(feature_ds)

    # Turn the string input into integer indices
    encoded_feature = index(feature)

    # Create a CategoryEncoding for our integer indices
    encoder = CategoryEncoding(output_mode="binary")

    # Prepare a dataset of indices
    feature_ds = feature_ds.map(index)

    # Learn the space of possible indices
    encoder.adapt(feature_ds)

    # Apply one-hot encoding to our indices
    encoded_feature = encoder(encoded_feature)
    return encoded_feature

def encode_integer_categorical_feature(feature, name, dataset):
    # Create a CategoryEncoding for our integer indices
    encoder = CategoryEncoding(output_mode="binary")

    # Prepare a Dataset that only yields our feature
    feature_ds = dataset.map(lambda x, y: x[name])
    feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))

    # Learn the space of possible indices
    encoder.adapt(feature_ds)

    # Apply one-hot encoding to our indices
    encoded_feature = encoder(feature)
    return encoded_feature

Build model

Note that the encoded inputs are concatenated before going into the deeper layers.

# Categorical features encoded as integers
Education = keras.Input(shape=(1,), name="Education", dtype="int64")
EnvironmentSatisfaction = keras.Input(shape=(1,), name="EnvironmentSatisfaction", dtype="int64")
JobInvolvement = keras.Input(shape=(1,), name="JobInvolvement", dtype="int64")
JobLevel = keras.Input(shape=(1,), name="JobLevel", dtype="int64")
JobSatisfaction = keras.Input(shape=(1,), name="JobSatisfaction", dtype="int64")
PerformanceRating = keras.Input(shape=(1,), name="PerformanceRating", dtype="int64")
RelationshipSatisfaction = keras.Input(shape=(1,), name="RelationshipSatisfaction", dtype="int64")
StandardHours = keras.Input(shape=(1,), name="StandardHours", dtype="int64")
StockOptionLevel = keras.Input(shape=(1,), name="StockOptionLevel", dtype="int64")
WorkLifeBalance = keras.Input(shape=(1,), name="WorkLifeBalance", dtype="int64")

# Categorical features encoded as strings
BusinessTravel = keras.Input(shape=(1,), name="BusinessTravel", dtype="string")
Department = keras.Input(shape=(1,), name="Department", dtype="string")
EducationField = keras.Input(shape=(1,), name="EducationField", dtype="string")
Gender = keras.Input(shape=(1,), name="Gender", dtype="string")
JobRole = keras.Input(shape=(1,), name="JobRole", dtype="string")
MaritalStatus = keras.Input(shape=(1,), name="MaritalStatus", dtype="string")
Over18 = keras.Input(shape=(1,), name="Over18", dtype="string")
OverTime = keras.Input(shape=(1,), name="OverTime", dtype="string")

# Numerical features
Age = keras.Input(shape=(1,), name="Age")
DailyRate = keras.Input(shape=(1,), name="DailyRate")
DistanceFromHome = keras.Input(shape=(1,), name="DistanceFromHome")
HourlyRate = keras.Input(shape=(1,), name="HourlyRate")
MonthlyIncome = keras.Input(shape=(1,), name="MonthlyIncome")
MonthlyRate = keras.Input(shape=(1,), name="MonthlyRate")
NumCompaniesWorked = keras.Input(shape=(1,), name="NumCompaniesWorked")
PercentSalaryHike = keras.Input(shape=(1,), name="PercentSalaryHike")
TotalWorkingYears = keras.Input(shape=(1,), name="TotalWorkingYears")
TrainingTimesLastYear = keras.Input(shape=(1,), name="TrainingTimesLastYear")
YearsAtCompany = keras.Input(shape=(1,), name="YearsAtCompany")
YearsInCurrentRole = keras.Input(shape=(1,), name="YearsInCurrentRole")
YearsSinceLastPromotion = keras.Input(shape=(1,), name="YearsSinceLastPromotion")
YearsWithCurrManager = keras.Input(shape=(1,), name="YearsWithCurrManager")

all_inputs = [
    Education, EnvironmentSatisfaction, JobInvolvement, JobLevel,
    JobSatisfaction, PerformanceRating, RelationshipSatisfaction,
    StandardHours, StockOptionLevel, WorkLifeBalance,
    BusinessTravel, Department, EducationField, Gender, JobRole,
    MaritalStatus, Over18, OverTime,
    Age, DailyRate, DistanceFromHome, HourlyRate, MonthlyIncome,
    MonthlyRate, NumCompaniesWorked, PercentSalaryHike, TotalWorkingYears,
    TrainingTimesLastYear, YearsAtCompany, YearsInCurrentRole,
    YearsSinceLastPromotion, YearsWithCurrManager,
]

# Integer categorical features
Education_encoded = encode_integer_categorical_feature(Education, "Education", train_ds)
EnvironmentSatisfaction_encoded = encode_integer_categorical_feature(EnvironmentSatisfaction, "EnvironmentSatisfaction", train_ds)
JobInvolvement_encoded = encode_integer_categorical_feature(JobInvolvement, "JobInvolvement", train_ds)
JobLevel_encoded = encode_integer_categorical_feature(JobLevel, "JobLevel", train_ds)
JobSatisfaction_encoded = encode_integer_categorical_feature(JobSatisfaction, "JobSatisfaction", train_ds)
PerformanceRating_encoded = encode_integer_categorical_feature(PerformanceRating, "PerformanceRating", train_ds)
RelationshipSatisfaction_encoded = encode_integer_categorical_feature(RelationshipSatisfaction, "RelationshipSatisfaction", train_ds)
StandardHours_encoded = encode_integer_categorical_feature(StandardHours, "StandardHours", train_ds)
StockOptionLevel_encoded = encode_integer_categorical_feature(StockOptionLevel, "StockOptionLevel", train_ds)
WorkLifeBalance_encoded = encode_integer_categorical_feature(WorkLifeBalance, "WorkLifeBalance", train_ds)

# String categorical features
BusinessTravel_encoded = encode_string_categorical_feature(BusinessTravel, "BusinessTravel", train_ds)
Department_encoded = encode_string_categorical_feature(Department, "Department", train_ds)
EducationField_encoded = encode_string_categorical_feature(EducationField, "EducationField", train_ds)
Gender_encoded = encode_string_categorical_feature(Gender, "Gender", train_ds)
JobRole_encoded = encode_string_categorical_feature(JobRole, "JobRole", train_ds)
MaritalStatus_encoded = encode_string_categorical_feature(MaritalStatus, "MaritalStatus", train_ds)
Over18_encoded = encode_string_categorical_feature(Over18, "Over18", train_ds)
OverTime_encoded = encode_string_categorical_feature(OverTime, "OverTime", train_ds)

# Numerical features
Age_encoded = encode_numerical_feature(Age, "Age", train_ds)
DailyRate_encoded = encode_numerical_feature(DailyRate, "DailyRate", train_ds)
DistanceFromHome_encoded = encode_numerical_feature(DistanceFromHome, "DistanceFromHome", train_ds)
HourlyRate_encoded = encode_numerical_feature(HourlyRate, "HourlyRate", train_ds)
MonthlyIncome_encoded = encode_numerical_feature(MonthlyIncome, "MonthlyIncome", train_ds)
MonthlyRate_encoded = encode_numerical_feature(MonthlyRate, "MonthlyRate", train_ds)
NumCompaniesWorked_encoded = encode_numerical_feature(NumCompaniesWorked, "NumCompaniesWorked", train_ds)
PercentSalaryHike_encoded = encode_numerical_feature(PercentSalaryHike, "PercentSalaryHike", train_ds)
TotalWorkingYears_encoded = encode_numerical_feature(TotalWorkingYears, "TotalWorkingYears", train_ds)
TrainingTimesLastYear_encoded = encode_numerical_feature(TrainingTimesLastYear, "TrainingTimesLastYear", train_ds)
YearsAtCompany_encoded = encode_numerical_feature(YearsAtCompany, "YearsAtCompany", train_ds)
YearsInCurrentRole_encoded = encode_numerical_feature(YearsInCurrentRole, "YearsInCurrentRole", train_ds)
YearsSinceLastPromotion_encoded = encode_numerical_feature(YearsSinceLastPromotion, "YearsSinceLastPromotion", train_ds)
YearsWithCurrManager_encoded = encode_numerical_feature(YearsWithCurrManager, "YearsWithCurrManager", train_ds)

all_features = layers.concatenate(
    [
        Education_encoded, EnvironmentSatisfaction_encoded, JobInvolvement_encoded,
        JobLevel_encoded, JobSatisfaction_encoded, PerformanceRating_encoded,
        RelationshipSatisfaction_encoded, StandardHours_encoded,
        StockOptionLevel_encoded, WorkLifeBalance_encoded,
        BusinessTravel_encoded, Department_encoded, EducationField_encoded,
        Gender_encoded, JobRole_encoded, MaritalStatus_encoded,
        Over18_encoded, OverTime_encoded,
        Age_encoded, DailyRate_encoded, DistanceFromHome_encoded,
        HourlyRate_encoded, MonthlyIncome_encoded, MonthlyRate_encoded,
        NumCompaniesWorked_encoded, PercentSalaryHike_encoded,
        TotalWorkingYears_encoded, TrainingTimesLastYear_encoded,
        YearsAtCompany_encoded, YearsInCurrentRole_encoded,
        YearsSinceLastPromotion_encoded, YearsWithCurrManager_encoded,
    ]
)

# Multiple layers with dropout (for higher accuracy)
x = layers.Dense(187, activation="tanh", name="Dense_1")(all_features)
x = layers.Dropout(0.2)(x)
x = layers.Dense(64, activation="relu", name="Dense_2")(x)
x = layers.Dropout(0.2)(x)
x = layers.Dense(32, activation="relu", name="Dense_3")(x)
output = layers.Dense(1, activation="sigmoid", name="Outputlayer")(x)

model = keras.Model(all_inputs, output)

# Use the Adam optimizer with a custom learning rate
opt = keras.optimizers.Adam(learning_rate=0.001)
model.compile(opt, "binary_crossentropy", metrics=["accuracy"])
model.summary()

View model

keras.utils.plot_model(model, show_shapes=True, rankdir="LR")

Add checkpoints

Checkpoints allow us to save the model from the epoch with the highest validation accuracy.

!mkdir checkpoints
!ls

checkpoint = ModelCheckpoint('./checkpoints/best_weights.tf',
                             monitor='val_accuracy',
                             verbose=1,
                             save_best_only=True,
                             mode='auto')

We assign the return value of model.fit to history so that we can plot the training curves later.

history = model.fit(train_ds, epochs=70, validation_data=val_ds, callbacks=[checkpoint])
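Because save_best_only=True keeps only the best-scoring epoch, we can optionally restore those weights before predicting instead of using the weights from the final epoch. A minimal sketch, assuming the checkpoint path used above (if the checkpoint was saved as a full SavedModel, keras.models.load_model on the same path is the alternative):

# Restore the weights from the epoch with the highest validation accuracy
model.load_weights('./checkpoints/best_weights.tf')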

Predict on sample data

sample = {
    "Education": 2,
    "EnvironmentSatisfaction": 4,
    "JobInvolvement": 4,
    "JobLevel": 2,
    "JobSatisfaction": 1,
    "PerformanceRating": 3,
    "RelationshipSatisfaction": 3,
    "StandardHours": 80,
    "StockOptionLevel": 2,
    "WorkLifeBalance": 3,
    "BusinessTravel": "Travel_Rarely",
    "Department": "Research & Development",
    "EducationField": "Medical",
    "Gender": "Female",
    "JobRole": "Manufacturing Director",
    "MaritalStatus": "Divorced",
    "Over18": "Y",
    "OverTime": "No",
    "Age": 53,
    "DailyRate": 1084,
    "DistanceFromHome": 13,
    "HourlyRate": 57,
    "MonthlyIncome": 4450,
    "MonthlyRate": 26250,
    "NumCompaniesWorked": 1,
    "PercentSalaryHike": 11,
    "TotalWorkingYears": 5,
    "TrainingTimesLastYear": 3,
    "YearsAtCompany": 4,
    "YearsInCurrentRole": 2,
    "YearsSinceLastPromotion": 1,
    "YearsWithCurrManager": 3,
}

input_dict = {name: tf.convert_to_tensor([value]) for name, value in sample.items()}
model.predict(input_dict)

This employee had a low chance of attrition!
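Since the output layer uses a sigmoid activation, the prediction is a probability between 0 and 1. A small sketch of turning it into a readable percentage (the wording of the message is my own):

probability = model.predict(input_dict)[0][0]  # sigmoid output in [0, 1]
print("This employee has a %.1f%% probability of leaving." % (100 * probability))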

Predict on 400+ employees and chart the prediction outcomes

predictions = []

for employee in pred_ds:
    input_dict = {name: tf.convert_to_tensor([value]) for name, value in employee.items()}
    probability = model.predict(input_dict)

    employee_number = tf.get_static_value(employee["EmployeeNumber"])
    employee_age = tf.get_static_value(employee["Age"])
    employee_monthly_income = tf.get_static_value(employee["MonthlyIncome"])
    employee_satisfaction = (tf.get_static_value(employee["JobSatisfaction"])
                             * tf.get_static_value(employee["RelationshipSatisfaction"]))

    if math.isnan(probability):
        # Test data has missing features that prevent the model from
        # determining this employee's attrition probability
        print("We do not have a good prediction as to whether Employee Number %d will leave"
              % (employee_number))
        predictions.append({"Age": employee_age,
                            "MonthlyIncome": employee_monthly_income,
                            "Satisfaction": employee_satisfaction,
                            "prediction": 0})
    elif probability < 0.5:
        # Employee has a low chance of leaving (less than 50%)
        print("Employee Number %d has a low chance of leaving (%f)"
              % (employee_number, probability))
        predictions.append({"Age": employee_age,
                            "MonthlyIncome": employee_monthly_income,
                            "Satisfaction": employee_satisfaction,
                            "prediction": 1})
    else:
        # Employee has a high chance of leaving (50% or more)
        print("Employee Number %d has a high chance of leaving (%f)"
              % (employee_number, probability))
        predictions.append({"Age": employee_age,
                            "MonthlyIncome": employee_monthly_income,
                            "Satisfaction": employee_satisfaction,
                            "prediction": 2})

Chart the attrition predictions against employee age (the image gallery below shows other ways of plotting, which also appear in the Jupyter notebook).

figure = plt.figure(figsize=(13, 8))
predictions_frame = pd.DataFrame(predictions)

plt.hist([predictions_frame[predictions_frame['prediction'] == 2]['Age'],
          predictions_frame[predictions_frame['prediction'] == 1]['Age'],
          predictions_frame[predictions_frame['prediction'] == 0]['Age']],
         stacked=True,
         color=['r', 'g', 'b'],
         label=['High chance of attrition', 'Likely to stay', 'Unknown'])

plt.xlabel('Age')
plt.ylabel('Number of employees')
plt.legend()
plt.title("Employee attrition probability by age")

Improving accuracy

In the First Model, we normalized the continuous integers and one-hot encoded the categorical integers and strings.

This loses some of the relationship information between and within the categories.

Therefore, we will try to use feature columns to allow for more flexibility in how we treat our columns and dictate how the model learns.
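For instance, an embedding column maps each category to a small, dense, trainable vector, so related categories can end up close together in that space, whereas one-hot encoding keeps every category equally distant from every other. A toy sketch with a hypothetical vocabulary:

from tensorflow import feature_column

# Hypothetical three-role vocabulary, for illustration only
job_role = feature_column.categorical_column_with_vocabulary_list(
    'JobRole', ['Manager', 'Research Scientist', 'Sales Executive'])

# One-hot: 3 categories -> 3 sparse columns, all mutually orthogonal
one_hot = feature_column.indicator_column(job_role)

# Embedding: 3 categories -> 4 trainable dense dimensions
embedded = feature_column.embedding_column(job_role, dimension=4)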

The Second Model

The Second Model uses TensorFlow’s feature columns to manage the columns as:

  1. Numerical columns
  2. Categorical columns
  3. Embedding columns
  4. Crossed feature columns
%pip install -q sklearn

from tensorflow import feature_column
from sklearn.model_selection import train_test_split

train, test = train_test_split(dataframe, test_size=0.2)
train, val = train_test_split(train, test_size=0.2)
print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')

# A utility method to create a tf.data dataset from a Pandas Dataframe
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    labels = dataframe.pop('AttritionBinary')
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    return ds

batch_size = 10
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)

for feature_batch, label_batch in train_ds.take(1):
    print('Every feature:', list(feature_batch.keys()))
    print('A batch of ages:', feature_batch['Age'])
    print('A batch of targets:', label_batch)

Choosing the feature columns

  • Numeric feature columns for continuous integers
  • Bucketized for age groups
  • Indicator columns for categorical data
  • Embedding columns for large categories of data that should not use one-hot encoding
  • Crossed column for data you want to be understood together
feature_columns = []

# numeric cols
for header in [
    'DailyRate', 'DistanceFromHome', 'HourlyRate', 'MonthlyIncome',
    'MonthlyRate', 'NumCompaniesWorked', 'PercentSalaryHike',
    'TotalWorkingYears', 'TrainingTimesLastYear', 'YearsAtCompany',
    'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager',
]:
    feature_columns.append(feature_column.numeric_column(header))

# bucketized cols
# We can bucketize Age into age groups
age = feature_column.numeric_column('Age')
age_buckets = feature_column.bucketized_column(age, boundaries=[1, 20, 35, 45, 70])
feature_columns.append(age_buckets)

# indicator columns
indicator_column_names = [
    'EducationField', 'Gender', 'MaritalStatus', 'Over18', 'OverTime',
    'Education', 'EnvironmentSatisfaction', 'JobInvolvement', 'JobLevel',
    'JobSatisfaction', 'PerformanceRating', 'RelationshipSatisfaction',
    'StandardHours', 'StockOptionLevel', 'WorkLifeBalance',
]
for col_name in indicator_column_names:
    categorical_column = feature_column.categorical_column_with_vocabulary_list(
        col_name, dataframe[col_name].unique())
    indicator_column = feature_column.indicator_column(categorical_column)
    feature_columns.append(indicator_column)

# embedding columns: BusinessTravel
businessTravel = feature_column.categorical_column_with_vocabulary_list(
    'BusinessTravel', dataframe.BusinessTravel.unique())
BusinessTravel_embedding = feature_column.embedding_column(businessTravel, dimension=8)
feature_columns.append(BusinessTravel_embedding)

# embedding columns: Department
department = feature_column.categorical_column_with_vocabulary_list(
    'Department', dataframe.Department.unique())
department_embedding = feature_column.embedding_column(department, dimension=8)
feature_columns.append(department_embedding)

# embedding columns: JobRole
jobRole = feature_column.categorical_column_with_vocabulary_list(
    'JobRole', dataframe.JobRole.unique())
JobRole_embedding = feature_column.embedding_column(jobRole, dimension=8)
feature_columns.append(JobRole_embedding)

# crossed columns
Department_Job_feature = feature_column.crossed_column([department, jobRole], hash_bucket_size=100)
feature_columns.append(feature_column.indicator_column(Department_Job_feature))

feature_layer = tf.keras.layers.DenseFeatures(feature_columns)

model_2 = tf.keras.Sequential([
    feature_layer,
    layers.Dropout(.2),
    layers.Dense(73, activation='relu'),
    layers.Dropout(.2),
    layers.Dense(73, activation='relu'),
    layers.Dense(1),
])

opt = keras.optimizers.Adam(learning_rate=0.001)
model_2.compile(optimizer=opt,
                loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                metrics=['accuracy'])

history_feature = model_2.fit(train_ds, validation_data=val_ds, epochs=70)

We will then compare the accuracy of the two models and draw a conclusion.

Comparing the training histories of both models

Green line: First Model Training accuracy
Red line: First Model Validation accuracy
Blue line: Second Model Training accuracy
Orange line: Second Model Validation accuracy
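The chart itself is in the notebook; here is a minimal sketch of how it can be reproduced, assuming the history and history_feature objects returned by the two model.fit() calls above:

plt.figure(figsize=(13, 8))
plt.plot(history.history['accuracy'], 'g', label='First Model training accuracy')
plt.plot(history.history['val_accuracy'], 'r', label='First Model validation accuracy')
plt.plot(history_feature.history['accuracy'], 'b', label='Second Model training accuracy')
plt.plot(history_feature.history['val_accuracy'], color='orange', label='Second Model validation accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()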

CONCLUSION

The First Model seemed to achieve slightly better accuracy than the Second Model, even though their validation accuracies were rather similar.

Could using feature columns differently improve the accuracy of the Second Model? Maybe…

Thanks for reading and hope the content was useful to you!

This notebook can also be found here: royleekiat/Employee_attrition_predictor

Originally published at http://royleekiat.com on October 5, 2020.
