CaseStudy2DDS

Introduction to the Project

DDSAnalytics that specializes in talent management solutions for Fortune 100 companies. This is a process of developing and retaining employees. It may include workforce planning, employee training programs, identifying high-potential employees and reducing/preventing voluntary employee turnover (attrition). To gain a competitive edge over its competition, DDSAnalytics is planning to leverage data science for talent management. The executive leadership has identified predicting employee turnover as its first application of data science for talent management. Before the business gree lights the project, they have tasked me to conduct an analysis of existing employee data.
I have been provided (CaseStudy2-data.csv) to do a data analysis to identify factors that lead to attrition. I need to identify the top three factors that contribute to turnover. The business is also interested in learning about any job role specific trends that may exist in the data set.
YouTube Presentation: https://www.youtube.com/watch?v=Te3EcaLsR48

Importing Data & Initial Inspection

There are 870 data entries
There aren’t any missing values
There are columns only have one unique value, and won’t be much use for our analysis, delete column to generate cleaner dataset

#Loading libraries needed
library(tidyverse)

## ── Attaching packages ───────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.0
## ✓ tidyr   1.1.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ── Conflicts ──────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(ggplot2)
#Loading in Employee data
employeeData = read.csv("/Users/mingyang/Desktop/SMU/DoingDS_Fall2020/CaseStudy2DDS/CaseStudy2-data.csv",header = TRUE)
summary(employeeData)

##        ID             Age         Attrition         BusinessTravel    
##  Min.   :  1.0   Min.   :18.00   Length:870         Length:870        
##  1st Qu.:218.2   1st Qu.:30.00   Class :character   Class :character  
##  Median :435.5   Median :35.00   Mode  :character   Mode  :character  
##  Mean   :435.5   Mean   :36.83                                        
##  3rd Qu.:652.8   3rd Qu.:43.00                                        
##  Max.   :870.0   Max.   :60.00                                        
##    DailyRate       Department        DistanceFromHome   Education    
##  Min.   : 103.0   Length:870         Min.   : 1.000   Min.   :1.000  
##  1st Qu.: 472.5   Class :character   1st Qu.: 2.000   1st Qu.:2.000  
##  Median : 817.5   Mode  :character   Median : 7.000   Median :3.000  
##  Mean   : 815.2                      Mean   : 9.339   Mean   :2.901  
##  3rd Qu.:1165.8                      3rd Qu.:14.000   3rd Qu.:4.000  
##  Max.   :1499.0                      Max.   :29.000   Max.   :5.000  
##  EducationField     EmployeeCount EmployeeNumber   EnvironmentSatisfaction
##  Length:870         Min.   :1     Min.   :   1.0   Min.   :1.000          
##  Class :character   1st Qu.:1     1st Qu.: 477.2   1st Qu.:2.000          
##  Mode  :character   Median :1     Median :1039.0   Median :3.000          
##                     Mean   :1     Mean   :1029.8   Mean   :2.701          
##                     3rd Qu.:1     3rd Qu.:1561.5   3rd Qu.:4.000          
##                     Max.   :1     Max.   :2064.0   Max.   :4.000          
##     Gender            HourlyRate     JobInvolvement     JobLevel    
##  Length:870         Min.   : 30.00   Min.   :1.000   Min.   :1.000  
##  Class :character   1st Qu.: 48.00   1st Qu.:2.000   1st Qu.:1.000  
##  Mode  :character   Median : 66.00   Median :3.000   Median :2.000  
##                     Mean   : 65.61   Mean   :2.723   Mean   :2.039  
##                     3rd Qu.: 83.00   3rd Qu.:3.000   3rd Qu.:3.000  
##                     Max.   :100.00   Max.   :4.000   Max.   :5.000  
##    JobRole          JobSatisfaction MaritalStatus      MonthlyIncome  
##  Length:870         Min.   :1.000   Length:870         Min.   : 1081  
##  Class :character   1st Qu.:2.000   Class :character   1st Qu.: 2840  
##  Mode  :character   Median :3.000   Mode  :character   Median : 4946  
##                     Mean   :2.709                      Mean   : 6390  
##                     3rd Qu.:4.000                      3rd Qu.: 8182  
##                     Max.   :4.000                      Max.   :19999  
##   MonthlyRate    NumCompaniesWorked    Over18            OverTime        
##  Min.   : 2094   Min.   :0.000      Length:870         Length:870        
##  1st Qu.: 8092   1st Qu.:1.000      Class :character   Class :character  
##  Median :14074   Median :2.000      Mode  :character   Mode  :character  
##  Mean   :14326   Mean   :2.728                                           
##  3rd Qu.:20456   3rd Qu.:4.000                                           
##  Max.   :26997   Max.   :9.000                                           
##  PercentSalaryHike PerformanceRating RelationshipSatisfaction StandardHours
##  Min.   :11.0      Min.   :3.000     Min.   :1.000            Min.   :80   
##  1st Qu.:12.0      1st Qu.:3.000     1st Qu.:2.000            1st Qu.:80   
##  Median :14.0      Median :3.000     Median :3.000            Median :80   
##  Mean   :15.2      Mean   :3.152     Mean   :2.707            Mean   :80   
##  3rd Qu.:18.0      3rd Qu.:3.000     3rd Qu.:4.000            3rd Qu.:80   
##  Max.   :25.0      Max.   :4.000     Max.   :4.000            Max.   :80   
##  StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
##  Min.   :0.0000   Min.   : 0.00     Min.   :0.000         Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.: 6.00     1st Qu.:2.000         1st Qu.:2.000  
##  Median :1.0000   Median :10.00     Median :3.000         Median :3.000  
##  Mean   :0.7839   Mean   :11.05     Mean   :2.832         Mean   :2.782  
##  3rd Qu.:1.0000   3rd Qu.:15.00     3rd Qu.:3.000         3rd Qu.:3.000  
##  Max.   :3.0000   Max.   :40.00     Max.   :6.000         Max.   :4.000  
##  YearsAtCompany   YearsInCurrentRole YearsSinceLastPromotion
##  Min.   : 0.000   Min.   : 0.000     Min.   : 0.000         
##  1st Qu.: 3.000   1st Qu.: 2.000     1st Qu.: 0.000         
##  Median : 5.000   Median : 3.000     Median : 1.000         
##  Mean   : 6.962   Mean   : 4.205     Mean   : 2.169         
##  3rd Qu.:10.000   3rd Qu.: 7.000     3rd Qu.: 3.000         
##  Max.   :40.000   Max.   :18.000     Max.   :15.000         
##  YearsWithCurrManager
##  Min.   : 0.00       
##  1st Qu.: 2.00       
##  Median : 3.00       
##  Mean   : 4.14       
##  3rd Qu.: 7.00       
##  Max.   :17.00

#See which column has only unique value
sapply(employeeData,function(col) length(unique(col)))

##                       ID                      Age                Attrition 
##                      870                       43                        2 
##           BusinessTravel                DailyRate               Department 
##                        3                      627                        3 
##         DistanceFromHome                Education           EducationField 
##                       29                        5                        6 
##            EmployeeCount           EmployeeNumber  EnvironmentSatisfaction 
##                        1                      870                        4 
##                   Gender               HourlyRate           JobInvolvement 
##                        2                       71                        4 
##                 JobLevel                  JobRole          JobSatisfaction 
##                        5                        9                        4 
##            MaritalStatus            MonthlyIncome              MonthlyRate 
##                        3                      826                      852 
##       NumCompaniesWorked                   Over18                 OverTime 
##                       10                        1                        2 
##        PercentSalaryHike        PerformanceRating RelationshipSatisfaction 
##                       15                        2                        4 
##            StandardHours         StockOptionLevel        TotalWorkingYears 
##                        1                        4                       39 
##    TrainingTimesLastYear          WorkLifeBalance           YearsAtCompany 
##                        7                        4                       32 
##       YearsInCurrentRole  YearsSinceLastPromotion     YearsWithCurrManager 
##                       19                       16                       17

#Delete column that has only one unique value
to.be.deleted = which(sapply(employeeData,function(col) length(unique(col))==1))
employeeData = employeeData[,-to.be.deleted]

#observe the cleaner set again
summary(employeeData)

##        ID             Age         Attrition         BusinessTravel    
##  Min.   :  1.0   Min.   :18.00   Length:870         Length:870        
##  1st Qu.:218.2   1st Qu.:30.00   Class :character   Class :character  
##  Median :435.5   Median :35.00   Mode  :character   Mode  :character  
##  Mean   :435.5   Mean   :36.83                                        
##  3rd Qu.:652.8   3rd Qu.:43.00                                        
##  Max.   :870.0   Max.   :60.00                                        
##    DailyRate       Department        DistanceFromHome   Education    
##  Min.   : 103.0   Length:870         Min.   : 1.000   Min.   :1.000  
##  1st Qu.: 472.5   Class :character   1st Qu.: 2.000   1st Qu.:2.000  
##  Median : 817.5   Mode  :character   Median : 7.000   Median :3.000  
##  Mean   : 815.2                      Mean   : 9.339   Mean   :2.901  
##  3rd Qu.:1165.8                      3rd Qu.:14.000   3rd Qu.:4.000  
##  Max.   :1499.0                      Max.   :29.000   Max.   :5.000  
##  EducationField     EmployeeNumber   EnvironmentSatisfaction    Gender         
##  Length:870         Min.   :   1.0   Min.   :1.000           Length:870        
##  Class :character   1st Qu.: 477.2   1st Qu.:2.000           Class :character  
##  Mode  :character   Median :1039.0   Median :3.000           Mode  :character  
##                     Mean   :1029.8   Mean   :2.701                             
##                     3rd Qu.:1561.5   3rd Qu.:4.000                             
##                     Max.   :2064.0   Max.   :4.000                             
##    HourlyRate     JobInvolvement     JobLevel       JobRole         
##  Min.   : 30.00   Min.   :1.000   Min.   :1.000   Length:870        
##  1st Qu.: 48.00   1st Qu.:2.000   1st Qu.:1.000   Class :character  
##  Median : 66.00   Median :3.000   Median :2.000   Mode  :character  
##  Mean   : 65.61   Mean   :2.723   Mean   :2.039                     
##  3rd Qu.: 83.00   3rd Qu.:3.000   3rd Qu.:3.000                     
##  Max.   :100.00   Max.   :4.000   Max.   :5.000                     
##  JobSatisfaction MaritalStatus      MonthlyIncome    MonthlyRate   
##  Min.   :1.000   Length:870         Min.   : 1081   Min.   : 2094  
##  1st Qu.:2.000   Class :character   1st Qu.: 2840   1st Qu.: 8092  
##  Median :3.000   Mode  :character   Median : 4946   Median :14074  
##  Mean   :2.709                      Mean   : 6390   Mean   :14326  
##  3rd Qu.:4.000                      3rd Qu.: 8182   3rd Qu.:20456  
##  Max.   :4.000                      Max.   :19999   Max.   :26997  
##  NumCompaniesWorked   OverTime         PercentSalaryHike PerformanceRating
##  Min.   :0.000      Length:870         Min.   :11.0      Min.   :3.000    
##  1st Qu.:1.000      Class :character   1st Qu.:12.0      1st Qu.:3.000    
##  Median :2.000      Mode  :character   Median :14.0      Median :3.000    
##  Mean   :2.728                         Mean   :15.2      Mean   :3.152    
##  3rd Qu.:4.000                         3rd Qu.:18.0      3rd Qu.:3.000    
##  Max.   :9.000                         Max.   :25.0      Max.   :4.000    
##  RelationshipSatisfaction StockOptionLevel TotalWorkingYears
##  Min.   :1.000            Min.   :0.0000   Min.   : 0.00    
##  1st Qu.:2.000            1st Qu.:0.0000   1st Qu.: 6.00    
##  Median :3.000            Median :1.0000   Median :10.00    
##  Mean   :2.707            Mean   :0.7839   Mean   :11.05    
##  3rd Qu.:4.000            3rd Qu.:1.0000   3rd Qu.:15.00    
##  Max.   :4.000            Max.   :3.0000   Max.   :40.00    
##  TrainingTimesLastYear WorkLifeBalance YearsAtCompany   YearsInCurrentRole
##  Min.   :0.000         Min.   :1.000   Min.   : 0.000   Min.   : 0.000    
##  1st Qu.:2.000         1st Qu.:2.000   1st Qu.: 3.000   1st Qu.: 2.000    
##  Median :3.000         Median :3.000   Median : 5.000   Median : 3.000    
##  Mean   :2.832         Mean   :2.782   Mean   : 6.962   Mean   : 4.205    
##  3rd Qu.:3.000         3rd Qu.:3.000   3rd Qu.:10.000   3rd Qu.: 7.000    
##  Max.   :6.000         Max.   :4.000   Max.   :40.000   Max.   :18.000    
##  YearsSinceLastPromotion YearsWithCurrManager
##  Min.   : 0.000          Min.   : 0.00       
##  1st Qu.: 0.000          1st Qu.: 2.00       
##  Median : 1.000          Median : 3.00       
##  Mean   : 2.169          Mean   : 4.14       
##  3rd Qu.: 3.000          3rd Qu.: 7.00       
##  Max.   :15.000          Max.   :17.00

#Convert some values into factors
cols.to.factor = c("Attrition","BusinessTravel","Department","EducationField","EnvironmentSatisfaction",
                   "Gender","JobInvolvement","JobLevel","JobRole","JobSatisfaction","MaritalStatus","NumCompaniesWorked",
                   "OverTime","PerformanceRating","RelationshipSatisfaction","StockOptionLevel","TrainingTimesLastYear",
                   "WorkLifeBalance")
employeeData[cols.to.factor] = lapply(employeeData[cols.to.factor],factor)
str(employeeData)

## 'data.frame':    870 obs. of  33 variables:
##  $ ID                      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age                     : int  32 40 35 32 24 27 41 37 34 34 ...
##  $ Attrition               : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ BusinessTravel          : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 3 2 3 2 2 3 3 3 2 ...
##  $ DailyRate               : int  117 1308 200 801 567 294 1283 309 1333 653 ...
##  $ Department              : Factor w/ 3 levels "Human Resources",..: 3 2 2 3 2 2 2 3 3 2 ...
##  $ DistanceFromHome        : int  13 14 18 1 2 10 5 10 10 10 ...
##  $ Education               : int  4 3 2 4 1 2 5 4 4 4 ...
##  $ EducationField          : Factor w/ 6 levels "Human Resources",..: 2 4 2 3 6 2 4 2 2 6 ...
##  $ EmployeeNumber          : int  859 1128 1412 2016 1646 733 1448 1105 1055 1597 ...
##  $ EnvironmentSatisfaction : Factor w/ 4 levels "1","2","3","4": 2 3 3 3 1 4 2 4 3 4 ...
##  $ Gender                  : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 2 2 1 1 2 ...
##  $ HourlyRate              : int  73 44 60 48 32 32 90 88 87 92 ...
##  $ JobInvolvement          : Factor w/ 4 levels "1","2","3","4": 3 2 3 3 3 3 4 2 3 2 ...
##  $ JobLevel                : Factor w/ 5 levels "1","2","3","4",..: 2 5 3 3 1 3 1 2 1 2 ...
##  $ JobRole                 : Factor w/ 9 levels "Healthcare Representative",..: 8 6 5 8 7 5 7 8 9 1 ...
##  $ JobSatisfaction         : Factor w/ 4 levels "1","2","3","4": 4 3 4 4 4 1 3 4 3 3 ...
##  $ MaritalStatus           : Factor w/ 3 levels "Divorced","Married",..: 1 3 3 2 3 1 2 1 2 2 ...
##  $ MonthlyIncome           : int  4403 19626 9362 10422 3760 8793 2127 6694 2220 5063 ...
##  $ MonthlyRate             : int  9250 17544 19944 24032 17218 4809 5561 24223 18410 15332 ...
##  $ NumCompaniesWorked      : Factor w/ 10 levels "0","1","2","3",..: 3 2 3 2 2 2 3 3 2 2 ...
##  $ OverTime                : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 2 2 2 1 ...
##  $ PercentSalaryHike       : int  11 14 11 19 13 21 12 14 19 14 ...
##  $ PerformanceRating       : Factor w/ 2 levels "3","4": 1 1 1 1 1 2 1 1 1 1 ...
##  $ RelationshipSatisfaction: Factor w/ 4 levels "1","2","3","4": 3 1 3 3 3 3 1 3 4 2 ...
##  $ StockOptionLevel        : Factor w/ 4 levels "0","1","2","3": 2 1 1 3 1 3 1 4 2 2 ...
##  $ TotalWorkingYears       : int  8 21 10 14 6 9 7 8 1 8 ...
##  $ TrainingTimesLastYear   : Factor w/ 7 levels "0","1","2","3",..: 4 3 3 4 3 5 6 6 3 4 ...
##  $ WorkLifeBalance         : Factor w/ 4 levels "1","2","3","4": 2 4 3 3 3 2 2 3 3 2 ...
##  $ YearsAtCompany          : int  5 20 2 14 6 9 4 1 1 8 ...
##  $ YearsInCurrentRole      : int  2 7 2 10 3 7 2 0 1 2 ...
##  $ YearsSinceLastPromotion : int  0 4 2 5 1 1 0 0 0 7 ...
##  $ YearsWithCurrManager    : int  3 9 2 7 3 7 3 0 0 7 ...

Analysis of factors that lead to attrition

16.1% of employees left, a graph comparison is drawn
Seems younger people have more numbers of Attrition than older folks, they also have an overall higher proportion of leaving
Sales has the highest Attrition rate, while Research & Development has the lowest
Distance from Home doesn’t seem to have a particular correlation with Attrition Rate
Low education Level has a higher Attrition Rate, while higher education level has lower Attrition Rate
Human Resources and Technical Degrees have higher Attrition Rate, while Life Sciences have the lowest Attrition Rate
Low environment satisfaction score of 1 has the highest attrition rate, while score of 2-4 doesn’t seem to have a big differences or change of attrition rate
Male has a slight higher attrition rate compared to female
Hourly Rate don’t seem to have a big effect on Attrition rate, it looks pretty random across the board
Job involvement levels have a great correlation with Attrition rate, the lower the Job Involvement, the higher Attrition Rate. The higher Job involvement level the lower of the attrition rate.
Job Level has some correlation with Attrition rate, but this relationship is not as strong as Job involvement
Sales Representation has a significant higher Attrition Rate, while Research Director and Manufacturing Director has significantly lower Attrition rate
There is a correlation between Attrition rate and Job Satisfaction score, the higher the Job Satisfaction score, the lower the Attrition rate
Single Marital Status has a higher Attrition Rate, while Divorced group has lowest Attrition Rate
Monthly Income does seem to have an overall trend of higher income associated with lower Attrition Rate. But this correlation isn’t perfectly strong
Monthly Rate doesn’t seem to have an overall trend of higher Monthly Rate associated with lower attrition rate.
Not super strong correlation, but it seems people have changed jobs many times, don’t have problems changing again.
It seems people with many overtime are associated with a higher attrition rate
It doesn’t seem percent of Salary Hike has a strong correlation with attrition rate. Even a high percentage of Salary Hike can still have a high Attrition rate
Performance Rating don’t have a big difference in terms of Attrition rate
Relationship Satisfaction have a sligh correlation with the attrition rate. However, the relationship isn’t super strong.
Stock option level of 0 and 3 has much higher attrition Rate compared to Stock option level of 1 and 2
In general, the more total working years, the less of a attrition rate it becomes, a lot of people start retiring after working for 30 years, almost 100% folks retire after working for 40 years
There seem to have an overall trend of more training in the last year, less like a person will leave, but this trend isn’t perfect
The higher the work Life Balance Score, the lower Attrition rate it becomes
Within 10 years of working with the company, there seem to have an overall trend of decreasing of Attrition rate with the year increase, however between 10-20 years, the attrition rate seem to fluctuate, while a lot of folks retire after year of 25.
Years In current role doesn’t seem to have a linear relationship but a quadratic relationship
There don’t seem to be any definitive relationship between years since last promotion and Attrition rate
There don’t seem to have a definitive relationship between attrition rate and years with current manager

# First discover the percentage of Attrition in the Dataset
table(employeeData$Attrition) # 730 No, 140 Yes

## 
##  No Yes 
## 730 140

attrition.percent = 140/870
attrition.percent

## [1] 0.1609195

# Visually look at percentage of Attrition
employeeData %>% ggplot() + geom_bar(aes(x=Attrition),fill="light blue") +
  ggtitle("Attrition Count comparison")+ylab("Number of Employee")

# Age with Attrition
employeeData %>% ggplot() + geom_bar(aes(x=Age,fill=Attrition)) + 
  ggtitle("Attrition with Age")+ylab("Number of Empllyee")

# Age with Attrition Proportion
employeeData %>% ggplot() + geom_bar(aes(x=Age,fill=Attrition),position="fill") + 
  scale_y_continuous(labels = scales::percent)+
  ggtitle("Attrition with Age")+ylab("Proportion of Attrition")

#Department
employeeData %>% ggplot()+geom_bar(aes(x=Department,fill=Attrition),position="fill")+
  scale_y_continuous(labels = scales::percent)+
  ggtitle("Attrition rate under different Department")+ylab("Attrition Rate")

# Distance from Home
employeeData %>% ggplot()+ geom_bar(aes(x=DistanceFromHome,fill=Attrition),position="fill")+
  scale_y_continuous(labels = scales::percent)+
  ggtitle("Attrition rate under different distance from Home")+ylab("Attrition Rate")

#Education
employeeData %>% ggplot()+ geom_bar(aes(x=Education,fill=Attrition),position="fill")+
  scale_y_continuous(labels = scales::percent)+
  ggtitle("Attrition rate under different Education Level")+ylab("Attrition Rate")+xlab("Education Level")

#Education Field
employeeData %>% ggplot()+ geom_bar(aes(x=EducationField,fill=Attrition),position="fill")+
  scale_y_continuous(labels = scales::percent)+
  ggtitle("Attrition rate under different Education Field")+ylab("Attrition Rate")+xlab("Education Field")

#Environment Satisfaction
employeeData %>% ggplot()+ geom_bar(aes(x=EnvironmentSatisfaction,fill=Attrition),position="fill")+
  scale_y_continuous(labels = scales::percent)+
  ggtitle("Attrition rate under different Environment Satisfaction Score")+ylab("Attrition Rate")+xlab("Environment Satisfaction")

# Gender
employeeData %>% ggplot()+ geom_bar(aes(x=Gender,fill=Attrition),position="fill")+
  scale_y_continuous(labels = scales::percent)+
  ggtitle("Attrition rate with Gender")+ylab("Attrition Rate")+xlab("Gender")

# Hourly Rate
employeeData %>% ggplot()+ geom_bar(aes(x=HourlyRate,fill=Attrition),position="fill")+
  scale_y_continuous(labels = scales::percent)+
  ggtitle("Attrition rate under different Hourly Rate")+ylab("Attrition Rate")+xlab("Hourly Rate")

# Job Involvement
employeeData %>% ggplot()+ geom_bar(aes(x=JobInvolvement,fill=Attrition),position="fill")+
  scale_y_continuous(labels = scales::percent)+
  ggtitle("Attrition rate under different Job Involvement Levels")+
  ylab("Attrition Rate")+xlab("Job Involvement Levels")

# Job Level
employeeData %>% ggplot()+ geom_bar(aes(x=JobLevel,fill=Attrition),position="fill")+
  scale_y_continuous(labels = scales::percent)+
  ggtitle("Attrition rate under different Job Level")+ylab("Attrition Rate")+xlab("Job Level")

#Job Role
employeeData %>% ggplot()+ geom_bar(aes(x=JobRole,fill=Attrition),position="fill")+
  scale_y_continuous(labels = scales::percent)+
  ggtitle("Attrition rate under different Job Roles")+ylab("Attrition Rate")+xlab("Job Roles")+coord_flip()

# Job Satisfaction
employeeData %>% ggplot()+ geom_bar(aes(x=JobSatisfaction,fill=Attrition),position="fill")+
  scale_y_continuous(labels = scales::percent)+
  ggtitle("Attrition rate under different Job Satisfaction Score")+ylab("Attrition Rate")+xlab("Job Satisfaction")

# Marital Status
employeeData %>% ggplot()+ geom_bar(aes(x=MaritalStatus,fill=Attrition),position="fill")+
  scale_y_continuous(labels = scales::percent)+
  ggtitle("Attrition rate under Marital Status")+ylab("Attrition Rate")+xlab("Marital Status")

#Monthly Income
employeeData %>% ggplot(aes(x=MonthlyIncome,fill=Attrition))+ geom_bar(color="black",stat="bin",binwidth=500)+
  ggtitle("Attrition count under different Monthly Income")+ylab("Attrition Count")+xlab("Monthly Income group")

#Monthly Rate
employeeData %>% ggplot(aes(x=MonthlyRate,fill=Attrition))+ geom_bar(color="black",stat="bin",binwidth=1000)+
  ggtitle("Attrition count under different Monthly Rate")+ylab("Attrition Count")+xlab("Monthly Rate")

#Numbers of Companies Worked
employeeData %>% ggplot()+ geom_bar(aes(x=NumCompaniesWorked,fill=Attrition),position="fill")+
  scale_y_continuous(labels = scales::percent)+
  ggtitle("Attrition rate under different Number of Companies Worked before")+ylab("Attrition Rate")+xlab("Number of Companies Worked")

# Over Time
employeeData %>% ggplot()+ geom_bar(aes(x=OverTime,fill=Attrition),position="fill")+
  scale_y_continuous(labels = scales::percent)+
  ggtitle("Attrition rate vs Overtime")+ylab("Attrition Rate")+xlab("OverTime Or Not")

#Percent Salary Hike
employeeData %>% ggplot()+ geom_bar(aes(x=PercentSalaryHike,fill=Attrition),position="fill")+
  scale_y_continuous(labels = scales::percent)+
  ggtitle("Attrition rate under different Percent of Salary Hike")+ylab("Attrition Rate")+xlab("Percent of Salary Hike")

#Performance Rating
employeeData %>% ggplot()+ geom_bar(aes(x=PerformanceRating,fill=Attrition),position="fill")+
  scale_y_continuous(labels = scales::percent)+
  ggtitle("Attrition rate under different Performance Rating")+ylab("Attrition Rate")+xlab("Performance Rating")

#Relationship Satisfaction
employeeData %>% ggplot()+ geom_bar(aes(x=RelationshipSatisfaction,fill=Attrition),position="fill")+
  scale_y_continuous(labels = scales::percent)+
  ggtitle("Attrition rate under different Relationship Satisfaction Score")+ylab("Attrition Rate")+xlab("Relationship Satisfaction")

#Stock Option Level
employeeData %>% ggplot()+ geom_bar(aes(x=StockOptionLevel,fill=Attrition),position="fill")+
  scale_y_continuous(labels = scales::percent)+
  ggtitle("Attrition rate under different Stock Option Level")+ylab("Attrition Rate")+xlab("Stock Option Level")

#Total Working Years
employeeData %>% ggplot()+ geom_bar(aes(x=TotalWorkingYears,fill=Attrition),position="fill")+
  scale_y_continuous(labels = scales::percent)+
  ggtitle("Attrition rate vs Total Working Years")+ylab("Attrition Rate")+xlab("Total Working Years")

#Training Times Last Year
employeeData %>% ggplot()+ geom_bar(aes(x=TrainingTimesLastYear,fill=Attrition),position="fill")+
  scale_y_continuous(labels = scales::percent)+
  ggtitle("Attrition rate vs Training Times Last Year")+ylab("Attrition Rate")+xlab("Training Times Last Year")

#Work Life Balance Score
employeeData %>% ggplot()+ geom_bar(aes(x=WorkLifeBalance,fill=Attrition),position="fill")+
  scale_y_continuous(labels = scales::percent)+
  ggtitle("Attrition rate vs Work Life Balance Score")+ylab("Attrition Rate")+xlab("Work Life Balance Score")

# Years At Company
employeeData %>% ggplot()+ geom_bar(aes(x=YearsAtCompany,fill=Attrition),position="fill")+
  scale_y_continuous(labels = scales::percent)+
  ggtitle("Attrition rate vs. Years At Company")+ylab("Attrition Rate")+xlab("Years At Company")

# Years in current Role
employeeData %>% ggplot()+ geom_bar(aes(x=YearsInCurrentRole,fill=Attrition),position="fill")+
  scale_y_continuous(labels = scales::percent)+
  ggtitle("Attrition rate vs Years in Current Role")+ylab("Attrition Rate")+xlab("Years in current role")

# Years Since Last Promotion
employeeData %>% ggplot()+ geom_bar(aes(x=YearsSinceLastPromotion,fill=Attrition),position="fill")+
  scale_y_continuous(labels = scales::percent)+
  ggtitle("Attrition rate vs. Years Since last promotion")+ylab("Attrition Rate")+xlab("Years Since Last Promotion")

# Years With Current Manager
employeeData %>% ggplot()+ geom_bar(aes(x=YearsWithCurrManager,fill=Attrition),position="fill")+
  scale_y_continuous(labels = scales::percent)+
  ggtitle("Attrition rate vs. Years With Current Manager")+ylab("Attrition Rate")+xlab("Years With Current Manager")

Manuel Variable Selection for predicting Attrition

Based on the results above, we have a basic understanding on what can affect attrition rate
Visually inspecting all relevance, Job Involvement Levels, Job Roles, Overtime seem to have the most direct/significant correlation with Attrition Rate
Due to the amount of categorical variables involved, I consider Naive Bayes Classifier would make the best model to predict Attrition
Manually picking out variables used to predict Attrition based on the relationship explored above are below:
Age, Department, Education Level, Job Involvement, Job Level, Job Role, Job Satisfaction, Marital Status, Monthly Income, Number of Companies Worked, OverTime, Stock Option Level, Total Working Years, Work Life Balance, Years at Company, Years in Current Role, Years With Current Manager

Run a automated variable selection using caret - compare with my visual inspection

As we can see the top three most important variable according to the automated variable selection using caret are: OverTime, MonthlyIncome, TotalWorkingYears. Apparently, visually, we only got Overtime right in that sense.
According to the importance of variables, we can use the following variables to construct our Naive Bayes Model: OverTime, MonthlyIncome, TotalWorkingYears, YearsAtCompany, StockOptionLevel, MaritalStatus, JobLevel, YearsInCurrentRole, YearsWithCurrManager, Age, JobInvolvement, JobSatisfaction, JobRole, Department, all of these variables are confirmed by the above visual inspection. In addition, I want to add Education, WorkLifeBalance, Environment Satisfaction since they were confirmed of importance visually.

#Load in libraries
library(lattice)
library(caret)

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

library(mlbench)
#prepare training scheme
control = trainControl(method="repeatedcv", number=10, repeats=3)
#train the model
model = train(Attrition~.,data=employeeData,method="lvq",preProcess="scale", trControl=control)
#estimate variable importance
importance = varImp(model,scale=FALSE)
#summarize importance
print(importance)

## ROC curve variable importance
## 
##   only 20 most important variables shown (out of 32)
## 
##                         Importance
## OverTime                    0.6679
## MonthlyIncome               0.6567
## TotalWorkingYears           0.6563
## YearsAtCompany              0.6470
## StockOptionLevel            0.6455
## MaritalStatus               0.6438
## JobLevel                    0.6406
## YearsInCurrentRole          0.6403
## YearsWithCurrManager        0.6291
## Age                         0.6265
## JobInvolvement              0.6159
## JobSatisfaction             0.5833
## JobRole                     0.5829
## Department                  0.5605
## DistanceFromHome            0.5586
## EnvironmentSatisfaction     0.5532
## WorkLifeBalance             0.5491
## TrainingTimesLastYear       0.5428
## Education                   0.5384
## ID                          0.5371

#plot importance
plot(importance)

Extract variables select to build my NB model

Initial test we got Sensitivity rate of 85% correct, while Specificity at 70% correct. Overall Accuracy is around 83%.
By running 100 iterations, we get an average Accuracy rate of 81%, Average Sensitivity of 85%, and average Specificity of 61%.
This tells us it is possible for certain random combination to cause Specificity go below 60%, however this is due to chance, the overall prediction rate should be slighly greater than 60%.

#Load NB libraries
library(e1071)
#select variables decided to predict Attrition
data.nb = employeeData %>% select(Attrition, OverTime, MonthlyIncome, TotalWorkingYears, YearsAtCompany, StockOptionLevel, MaritalStatus, JobLevel, YearsInCurrentRole, YearsWithCurrManager, Age, JobInvolvement, JobSatisfaction, JobRole, Department,Education, WorkLifeBalance, EnvironmentSatisfaction)

set.seed(12)
splitPercent = 0.80
trainIndex = sample(1:dim(data.nb)[1],round(splitPercent * dim(data.nb)[1]))
train.nb = data.nb[trainIndex,]
test.nb = data.nb[-trainIndex,]

model.nb = naiveBayes(Attrition~.,data=train.nb, laplace = 1)
predict.nb = predict(model.nb,test.nb)
table(predict.nb,test.nb$Attrition)

##           
## predict.nb  No Yes
##        No  128   7
##        Yes  23  16

confusionMatrix(predict.nb,test.nb$Attrition)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  128   7
##        Yes  23  16
##                                           
##                Accuracy : 0.8276          
##                  95% CI : (0.7631, 0.8805)
##     No Information Rate : 0.8678          
##     P-Value [Acc > NIR] : 0.94914         
##                                           
##                   Kappa : 0.4196          
##                                           
##  Mcnemar's Test P-Value : 0.00617         
##                                           
##             Sensitivity : 0.8477          
##             Specificity : 0.6957          
##          Pos Pred Value : 0.9481          
##          Neg Pred Value : 0.4103          
##              Prevalence : 0.8678          
##          Detection Rate : 0.7356          
##    Detection Prevalence : 0.7759          
##       Balanced Accuracy : 0.7717          
##                                           
##        'Positive' Class : No              
##

# Now I want to run 100 iterations to get an average accuracy, sensitivity, and specificity rate
set.seed(13)
iterations = 100
masterAcc = matrix(nrow = iterations,ncol=3)
splitPerc = .8 #Training / Test split Percentage
for(j in 1:iterations)
{
  
  trainIndices = sample(1:dim(data.nb)[1],round(splitPerc * dim(data.nb)[1]))
  train = data.nb[trainIndices,]
  test = data.nb[-trainIndices,]
  
  model = naiveBayes(Attrition~.,data=train,laplace = 1)
  CM = confusionMatrix(table(predict(model,test),test$Attrition))
  masterAcc[j,1] = CM$overall[1]
  masterAcc[j,2]=CM$byClass[1]
  masterAcc[j,3] = CM$byClass[2]
}

MeanAcc = colMeans(masterAcc)
#Mean Accuracy Sensitivity and Specificity
MeanAcc

## [1] 0.8111494 0.8493323 0.6122321

We will now read in CaseStudy2CompSet No Attrition.csv and get a prediction result under Case2PredictionsYU Attrition.csv

#read in data of no Attrition
noAttrition = read.csv("/Users/mingyang/Desktop/SMU/DoingDS_Fall2020/CaseStudy2DDS/CaseStudy2CompSet No Attrition.csv",header = TRUE)
head(noAttrition)

##     ID Age    BusinessTravel DailyRate             Department DistanceFromHome
## 1 1171  35     Travel_Rarely       750 Research & Development               28
## 2 1172  33     Travel_Rarely       147        Human Resources                2
## 3 1173  26     Travel_Rarely      1330 Research & Development               21
## 4 1174  55     Travel_Rarely      1311 Research & Development                2
## 5 1175  29     Travel_Rarely      1246                  Sales               19
## 6 1176  51 Travel_Frequently      1456 Research & Development                1
##   Education  EducationField EmployeeCount EmployeeNumber
## 1         3   Life Sciences             1           1596
## 2         3 Human Resources             1           1207
## 3         3         Medical             1           1107
## 4         3   Life Sciences             1            505
## 5         3   Life Sciences             1           1497
## 6         4         Medical             1            145
##   EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1                       2   Male         46              4        2
## 2                       2   Male         99              3        1
## 3                       1   Male         37              3        1
## 4                       3 Female         97              3        4
## 5                       3   Male         77              2        2
## 6                       1 Female         30              2        3
##                     JobRole JobSatisfaction MaritalStatus MonthlyIncome
## 1     Laboratory Technician               3       Married          3407
## 2           Human Resources               3       Married          3600
## 3     Laboratory Technician               3      Divorced          2377
## 4                   Manager               4        Single         16659
## 5           Sales Executive               3      Divorced          8620
## 6 Healthcare Representative               1        Single          7484
##   MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike
## 1       25348                  1      Y       No                17
## 2        8429                  1      Y       No                13
## 3       19373                  1      Y       No                20
## 4       23258                  2      Y      Yes                13
## 5       23757                  1      Y       No                14
## 6       25796                  3      Y       No                20
##   PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel
## 1                 3                        4            80                2
## 2                 3                        4            80                1
## 3                 4                        3            80                1
## 4                 3                        3            80                0
## 5                 3                        3            80                2
## 6                 4                        3            80                0
##   TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## 1                10                     3               2             10
## 2                 5                     2               3              5
## 3                 1                     0               2              1
## 4                30                     2               3              5
## 5                10                     3               3             10
## 6                23                     1               2             13
##   YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## 1                  9                       6                    8
## 2                  4                       1                    4
## 3                  1                       0                    0
## 4                  4                       1                    2
## 5                  7                       0                    4
## 6                 12                      12                    8

#Change certain column to factors
cols.to.factor2 = c("BusinessTravel","Department","EducationField","EnvironmentSatisfaction",
                   "Gender","JobInvolvement","JobLevel","JobRole","JobSatisfaction","MaritalStatus","NumCompaniesWorked",
                   "OverTime","PerformanceRating","RelationshipSatisfaction","StockOptionLevel","TrainingTimesLastYear",
                   "WorkLifeBalance")
noAttrition[cols.to.factor2] = lapply(noAttrition[cols.to.factor2],factor)
#Create prediction set
predictionSet = noAttrition %>% select(OverTime, MonthlyIncome, TotalWorkingYears, YearsAtCompany, StockOptionLevel, MaritalStatus, JobLevel, YearsInCurrentRole, YearsWithCurrManager, Age, JobInvolvement, JobSatisfaction, JobRole, Department,Education, WorkLifeBalance, EnvironmentSatisfaction)

#Retrain the NB model with entire dataset to improve prediction accuracy
model.nb = naiveBayes(Attrition~.,data=data.nb, laplace = 1)
prediction.NoAttrition = predict(model.nb,noAttrition)
noAttrition$Attrition = prediction.NoAttrition
results = noAttrition%>%select(ID,Attrition)
#write results into Case2PredictionsYU Attrition.csv
write.csv(results,"Case2PredictionsYU Attrition.csv",row.names = FALSE)

Now, we’d like to predict Montly Income with Multiple Linear Regression

I used internal 10-fold cross-validation accompanied with Stepwise selection using AIC to find the optimal multi-linear regression model
RMSE using 10-fold cross-validation is 1025.488
As we can see we got Adjusted R-squared of 0.9532 of our model, which is pretty good
As we have used this optimized model to generate predictions for Monthly Salary for CaseStudy2CompSet No Salary.csv, and exported to Case2PredictionsYU Salary.csv

#Load library to run stepwise regression method to choose an optimal simple model
library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

#Build the model with internel verfication 
set.seed(24)
train.control <- trainControl(method = "cv", number = 10)
step.model = train(MonthlyIncome~., data=employeeData,
                   method="lmStepAIC",
                   trControl = train.control,
                   trace=FALSE)
#Model Accuracy
step.model$results

##   parameter     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1      none 1023.433 0.9490945 784.3238 128.5067 0.01824883 86.87328

step.model$finalModel

## 
## Call:
## lm(formula = .outcome ~ ID + BusinessTravelTravel_Frequently + 
##     BusinessTravelTravel_Rarely + DailyRate + DepartmentSales + 
##     JobLevel2 + JobLevel3 + JobLevel4 + JobLevel5 + `JobRoleHuman Resources` + 
##     `JobRoleLaboratory Technician` + JobRoleManager + `JobRoleResearch Director` + 
##     `JobRoleResearch Scientist` + `JobRoleSales Executive` + 
##     `JobRoleSales Representative` + NumCompaniesWorked3 + TotalWorkingYears + 
##     TrainingTimesLastYear2 + TrainingTimesLastYear3 + TrainingTimesLastYear4, 
##     data = dat)
## 
## Coefficients:
##                     (Intercept)                               ID  
##                       3471.4383                          -0.2277  
## BusinessTravelTravel_Frequently      BusinessTravelTravel_Rarely  
##                        190.9083                         336.6109  
##                       DailyRate                  DepartmentSales  
##                          0.1751                        -546.2948  
##                       JobLevel2                        JobLevel3  
##                       1710.2424                        4946.8537  
##                       JobLevel4                        JobLevel5  
##                       8281.5586                       10910.9675  
##        `JobRoleHuman Resources`   `JobRoleLaboratory Technician`  
##                      -1127.4707                       -1270.4388  
##                  JobRoleManager       `JobRoleResearch Director`  
##                       3553.2458                        3458.5846  
##     `JobRoleResearch Scientist`         `JobRoleSales Executive`  
##                      -1071.0152                         482.7060  
##   `JobRoleSales Representative`              NumCompaniesWorked3  
##                       -751.8712                         248.7191  
##               TotalWorkingYears           TrainingTimesLastYear2  
##                         43.9571                        -209.1826  
##          TrainingTimesLastYear3           TrainingTimesLastYear4  
##                       -147.4277                        -210.0349

summary(step.model$finalModel)

## 
## Call:
## lm(formula = .outcome ~ ID + BusinessTravelTravel_Frequently + 
##     BusinessTravelTravel_Rarely + DailyRate + DepartmentSales + 
##     JobLevel2 + JobLevel3 + JobLevel4 + JobLevel5 + `JobRoleHuman Resources` + 
##     `JobRoleLaboratory Technician` + JobRoleManager + `JobRoleResearch Director` + 
##     `JobRoleResearch Scientist` + `JobRoleSales Executive` + 
##     `JobRoleSales Representative` + NumCompaniesWorked3 + TotalWorkingYears + 
##     TrainingTimesLastYear2 + TrainingTimesLastYear3 + TrainingTimesLastYear4, 
##     data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3108.3  -597.6   -60.2   575.3  4095.9 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      3.471e+03  2.153e+02  16.125  < 2e-16 ***
## ID                              -2.277e-01  1.360e-01  -1.674  0.09454 .  
## BusinessTravelTravel_Frequently  1.909e+02  1.315e+02   1.452  0.14689    
## BusinessTravelTravel_Rarely      3.366e+02  1.112e+02   3.026  0.00255 ** 
## DailyRate                        1.751e-01  8.481e-02   2.065  0.03925 *  
## DepartmentSales                 -5.463e+02  2.879e+02  -1.898  0.05807 .  
## JobLevel2                        1.710e+03  1.379e+02  12.401  < 2e-16 ***
## JobLevel3                        4.947e+03  1.853e+02  26.696  < 2e-16 ***
## JobLevel4                        8.282e+03  2.795e+02  29.633  < 2e-16 ***
## JobLevel5                        1.091e+04  3.303e+02  33.036  < 2e-16 ***
## `JobRoleHuman Resources`        -1.127e+03  2.366e+02  -4.765 2.23e-06 ***
## `JobRoleLaboratory Technician`  -1.270e+03  1.528e+02  -8.312 3.71e-16 ***
## JobRoleManager                   3.553e+03  2.511e+02  14.151  < 2e-16 ***
## `JobRoleResearch Director`       3.459e+03  1.924e+02  17.980  < 2e-16 ***
## `JobRoleResearch Scientist`     -1.071e+03  1.568e+02  -6.829 1.63e-11 ***
## `JobRoleSales Executive`         4.827e+02  3.071e+02   1.572  0.11633    
## `JobRoleSales Representative`   -7.519e+02  3.532e+02  -2.129  0.03356 *  
## NumCompaniesWorked3              2.487e+02  1.128e+02   2.204  0.02779 *  
## TotalWorkingYears                4.396e+01  7.720e+00   5.694 1.71e-08 ***
## TrainingTimesLastYear2          -2.092e+02  9.385e+01  -2.229  0.02608 *  
## TrainingTimesLastYear3          -1.474e+02  9.473e+01  -1.556  0.12001    
## TrainingTimesLastYear4          -2.100e+02  1.401e+02  -1.500  0.13408    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 994.8 on 848 degrees of freedom
## Multiple R-squared:  0.9543, Adjusted R-squared:  0.9532 
## F-statistic: 843.6 on 21 and 848 DF,  p-value: < 2.2e-16

#load in CaseStudy2CompSet No Salary.csv to predict monthly salary
noSalary = read.csv("/Users/mingyang/Desktop/SMU/DoingDS_Fall2020/CaseStudy2DDS/CaseStudy2CompSet No Salary.csv",header = TRUE)
cols.to.factor3 = c("Attrition","BusinessTravel","Department","EducationField","EnvironmentSatisfaction",
                   "Gender","JobInvolvement","JobLevel","JobRole","JobSatisfaction","MaritalStatus","NumCompaniesWorked",
                   "OverTime","PerformanceRating","RelationshipSatisfaction","StockOptionLevel","TrainingTimesLastYear",
                   "WorkLifeBalance")
noSalary[cols.to.factor] = lapply(noSalary[cols.to.factor],factor)
prediction.monthlyS = predict(step.model,noSalary)
prediction.monthlyS

##         1         2         3         4         5         6         7         8 
##  5667.412  2877.699 12719.802  2310.953  2726.669  5261.316  5192.232  2114.693 
##         9        10        11        12        13        14        15        16 
##  2851.347 13658.769  9787.549  2803.404  5723.346  5325.265  5695.491  5453.018 
##        17        18        19        20        21        22        23        24 
##  5779.578  5754.350  4701.013  2624.968  4521.582  9210.684  8853.339  5468.689 
##        25        26        27        28        29        30        31        32 
## 10188.107  9240.707  8798.383 15964.144  5778.377  2706.053  2552.283  5444.199 
##        33        34        35        36        37        38        39        40 
##  5917.314  2631.109 16288.177  5300.990  9437.522  5662.776  2522.258  2509.445 
##        41        42        43        44        45        46        47        48 
## 19269.761  2678.623  2745.314 12952.562  5757.764  4138.688  2725.817  5355.613 
##        49        50        51        52        53        54        55        56 
##  2544.940  2940.062  2400.917  2377.687  4401.471  4167.326 17365.697  3072.708 
##        57        58        59        60        61        62        63        64 
##  5578.954 13017.835  2020.920  2498.636  5658.842 12867.853  8822.231  2300.468 
##        65        66        67        68        69        70        71        72 
##  2182.097  2212.168  9254.189  9179.071  4835.846  2397.635  3202.394  9068.966 
##        73        74        75        76        77        78        79        80 
##  9811.282  5454.760  2227.742  5595.192  2249.817  2741.875  2535.885  5361.091 
##        81        82        83        84        85        86        87        88 
##  2934.359  5494.719  5775.978  2518.148  3957.118  7627.025 10261.168  2407.480 
##        89        90        91        92        93        94        95        96 
##  9646.801 19558.217  2898.433  4704.092  5659.831  4576.089  2620.209 19196.431 
##        97        98        99       100       101       102       103       104 
##  2671.183  8730.030 16784.576  2581.113  5708.391  5635.251  2319.499  2727.172 
##       105       106       107       108       109       110       111       112 
##  5727.285  2700.128  6179.544  2631.968  9176.392 19043.250  2795.791  9176.673 
##       113       114       115       116       117       118       119       120 
##  2661.801  2500.815  5569.048  2799.713  2630.520  5495.276  9263.432  2714.953 
##       121       122       123       124       125       126       127       128 
##  4769.642  2832.707  4700.187  5300.632  5181.913  4892.850  2739.799 16618.340 
##       129       130       131       132       133       134       135       136 
##  8936.622  2210.202  4822.534  4396.598  4619.543  6131.805  2994.393  5851.277 
##       137       138       139       140       141       142       143       144 
##  2420.909  5639.663 19648.311  4665.657  9214.549  4777.472  5689.295  4248.363 
##       145       146       147       148       149       150       151       152 
##  5532.011  2831.315  2582.705  8867.610  9012.098  2009.268  5568.672  5341.081 
##       153       154       155       156       157       158       159       160 
##  4459.104  9602.167 16127.775 18496.114  4819.892  5449.039  5370.181  8688.859 
##       161       162       163       164       165       166       167       168 
##  5335.752  2927.915 12273.590  2482.552  4314.764  2783.629  5636.974  8163.939 
##       169       170       171       172       173       174       175       176 
## 12444.541  5732.661  2263.217  3222.048  8781.776  6170.746  2428.014  2165.226 
##       177       178       179       180       181       182       183       184 
##  5536.563 12338.984  2664.315  5447.774  9906.418  2406.066  5740.726  2725.865 
##       185       186       187       188       189       190       191       192 
##  2718.369  2681.407  5478.332 15856.378  4991.833  2804.353  4638.894  5453.582 
##       193       194       195       196       197       198       199       200 
##  6056.415  2265.099  5325.803  2419.336  3373.154 19553.588  5151.517  2288.251 
##       201       202       203       204       205       206       207       208 
##  2501.870  5929.721  2258.844  5868.040  5578.961 16118.711 19247.014  4287.267 
##       209       210       211       212       213       214       215       216 
##  4520.611  4836.603  2864.896  2716.774  2125.861  5766.473  8630.384  9069.882 
##       217       218       219       220       221       222       223       224 
##  5499.176  2799.995  2663.808  5358.748  5330.041  5371.501 12374.608  5322.878 
##       225       226       227       228       229       230       231       232 
##  5105.139  5489.334  3203.456  2653.700  2614.991  5562.176 18908.699  5688.374 
##       233       234       235       236       237       238       239       240 
##  5919.508  9946.783  2870.029  2567.802  2926.337  4772.857  5372.521  5375.972 
##       241       242       243       244       245       246       247       248 
##  5624.512 12563.500 12728.545  5983.697  9659.183  2677.777  6074.552  4354.745 
##       249       250       251       252       253       254       255       256 
##  9630.426  2494.838  9239.601 16311.203  4682.472 16226.657  2799.968  2257.194 
##       257       258       259       260       261       262       263       264 
##  2414.401  4471.251 15655.874  2346.736 12290.851 16170.336  5598.793  2188.040 
##       265       266       267       268       269       270       271       272 
##  2690.483  2367.151  3039.869  5458.563  9200.099 15982.690  5745.099  5546.091 
##       273       274       275       276       277       278       279       280 
##  9771.368  5366.413  4557.821  5463.508  4212.996  2534.546  2784.174  5642.997 
##       281       282       283       284       285       286       287       288 
##  2417.278  5638.533  5119.399  2368.844  5831.761 13053.043  2378.533  4709.101 
##       289       290       291       292       293       294       295       296 
##  2683.833  5599.711  2718.513  8749.298  2563.737  9009.864  2623.892  2025.481 
##       297       298       299       300 
##  9365.880  5641.997  2874.366  2725.665

noSalary$MonthlyIncome = prediction.monthlyS
results1 = noSalary%>% dplyr::select(ID,MonthlyIncome)
#Write results1 to CaseStudy2CompSet No Salary.csv
write.csv(results1,"Case2PredictionsYU Salary.csv",row.names = FALSE)

Now we’d like to see some job role specific trend within the data

Healthcare Representation has the highest average Job Satisfaction Score
Research Director has the lowest average Job Satisfaction Score

#Look at all the job roles
unique(employeeData$JobRole)

## [1] Sales Executive           Research Director        
## [3] Manufacturing Director    Research Scientist       
## [5] Sales Representative      Healthcare Representative
## [7] Manager                   Human Resources          
## [9] Laboratory Technician    
## 9 Levels: Healthcare Representative Human Resources ... Sales Representative

#Loading in Employee data
employeeData = read.csv("/Users/mingyang/Desktop/SMU/DoingDS_Fall2020/CaseStudy2DDS/CaseStudy2-data.csv",header = TRUE)
#summary(employeeData)

#See which column has only unique value
#sapply(employeeData,function(col) length(unique(col)))

#Delete column that has only one unique value
to.be.deleted = which(sapply(employeeData,function(col) length(unique(col))==1))
employeeData = employeeData[,-to.be.deleted]
satisfaction.by.jobrole = employeeData %>% group_by(JobRole) %>% summarize(mean(JobSatisfaction))

## `summarise()` ungrouping output (override with `.groups` argument)

#Change Column name
names(satisfaction.by.jobrole)[2]="JobSatisfaction"
satisfaction.by.jobrole

## # A tibble: 9 x 2
##   JobRole                   JobSatisfaction
##   <chr>                               <dbl>
## 1 Healthcare Representative            2.83
## 2 Human Resources                      2.56
## 3 Laboratory Technician                2.69
## 4 Manager                              2.51
## 5 Manufacturing Director               2.72
## 6 Research Director                    2.49
## 7 Research Scientist                   2.80
## 8 Sales Executive                      2.72
## 9 Sales Representative                 2.70

satisfaction.by.jobrole%>% ggplot(aes(reorder(JobRole,JobSatisfaction),JobSatisfaction))+
  geom_bar(fill="blue",stat="identity") +
  ggtitle("Job role vs Average Job Satisfaction Score")+
  xlab("Job Roles")+ ylab("Average Job Satisfaction Score")+
  coord_flip()

This part I want to explore which Job Role has better Work Life Balance Score

According to our graph, HumanResources has the highest Work Life Balance Score
Healthcare Representative has the lowest Work Life Balance Score, however we know from above, they have the highest Job Satisfaction!

work.life.balance = employeeData %>% group_by(JobRole) %>% summarize(mean(WorkLifeBalance))

## `summarise()` ungrouping output (override with `.groups` argument)

names(work.life.balance)[2]="WorkLifeBalance"
work.life.balance

## # A tibble: 9 x 2
##   JobRole                   WorkLifeBalance
##   <chr>                               <dbl>
## 1 Healthcare Representative            2.67
## 2 Human Resources                      2.96
## 3 Laboratory Technician                2.76
## 4 Manager                              2.76
## 5 Manufacturing Director               2.85
## 6 Research Director                    2.86
## 7 Research Scientist                   2.69
## 8 Sales Executive                      2.82
## 9 Sales Representative                 2.87

work.life.balance%>% ggplot(aes(reorder(JobRole,WorkLifeBalance),WorkLifeBalance))+
  geom_bar(fill="blue",stat="identity") +
  ggtitle("Job role vs Average Work Life Balance Score")+
  xlab("Job Roles")+ ylab("Average Work Life Balance Score")+
  coord_flip()

This Section I want to explore if there are possible Gender Discrimination in the workplace based on Monthly salary

Null Hypothesis: There is no Gender Discrimination Mu(Male) = Mu(Female)
Alternative Hypothesis: There is significant evidence of gender discrimination, Mu(Male)!=Mu(Female)
Using alpha = 0.05, By doing a welch t-test, we have found no significant difference between population mean (p-value = 0.11), which means 0 is plausible value between the two mean differences with a 95% confidence interval.
The Female group has a higher mean of Monthly Income compared to Male
Scope: Not sure how these data is sampled, any inference to the population need to remain speculative

t.test(MonthlyIncome ~ Gender, data = employeeData, conf.level=0.95)

## 
##  Welch Two Sample t-test
## 
## data:  MonthlyIncome by Gender
## t = 1.6207, df = 747.5, p-value = 0.1055
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -109.0107 1140.8560
## sample estimates:
## mean in group Female   mean in group Male 
##             6696.260             6180.337

This Section I will explore if there are differences between Percentage Salary Hike between Gender

Null Hypothesis: There is no difference of percentage salary hike between Gender
Alternative Hypothesis: There are significant difference of percentage salary hike between Male and Female
Conclusion: by using alpha = 0.05 level of significance, we discovered that there is no significant difference (p-value=0.91) of percentage salary hike between Gender
Scope: Not sure how these data is sampled, any inference to the population need to remain speculative

t.test(PercentSalaryHike~Gender,data=employeeData,conf.level=0.95)

## 
##  Welch Two Sample t-test
## 
## data:  PercentSalaryHike by Gender
## t = -0.10789, df = 735.01, p-value = 0.9141
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.5302728  0.4750239
## sample estimates:
## mean in group Female   mean in group Male 
##             15.18362             15.21124

This Section to explore some relationship between continuous variables

By exploring the pairwise continuous variables, it doesn’t seem like any two variables has strongo correlation
For example: between MonthlyIncome and PercentSalaryHike, under any MonthlyIncome range, there are various PercentSalaryHike range available, so, maybe it is based on employee performance to determine a specific PercentSalaryHike
More years at the company will in general correlates to more Monthly Income in a positive direction. However, Job Roles seem to have a bigger effect on beginning Monthly Salary. Some High Paying Jobs have lower percentage increase compared to some other roles that start at a lower Monthly Income Salary, such as, Manager or Research Director have high start salary, but the percentage increase as the years at the company increases is low. While Healthcare Representative and Manufacturing Director have lower beggining monthly Income, but the percentage increase as the year goes by is significantly higher.

#import library 
library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

employeeData %>% dplyr::select(HourlyRate, MonthlyIncome, MonthlyRate,PercentSalaryHike)%>%
  ggpairs()

#Does more years in the company result in higher MonthlyIncome?
employeeData%>% dplyr::select(YearsAtCompany,MonthlyIncome,JobRole)%>%
  ggplot(aes(x=YearsAtCompany,y=MonthlyIncome,color=JobRole))+
  geom_point()+
  geom_smooth(method='lm',formula=y~x) + ggtitle("MonthlyIncome vs Years At Company across different Job Roles")+
  xlab("Years At Company")+ylab("Monthly Income")

Conclusion

By exploring importance of variables using automated software, we have found OverTime, MonthlyIncome, TotalWorkingYears, to have the highest relations with Attrition, visually however, we see more relations with Job Involvement Levels, Job Roles, Overtime. We were able to build a prediction model using Naive Bayes, by using variables that have stronger relations to predict Attrition, with an average Accuracy rate around 81%.
By using internal 10-fold cross-validation accompanied with Stepwise automated selection, I found the optimal multi-linear regression model to predict MonthlyIncome. The model has generated 0.9532 adjusted r-squared value, which tells me the correlation between MonthlyIncome and the predictors generated by stepwise selection are very good.
I further analyzed several factors of Job Roles that Frito Lay is interested in exploring and found Healthcare Representation has the highest average Job Satisfaction Score, while HumanResources has the highest Work Life Balance Score.
No gender discrimination has been found based on Monthly Income and Percentage Salary Hike according to my statistical analysis with the data that is provided.

CaseStudy2DDS

Mingyang Nick YU

11/14/2020

Introduction to the Project

Importing Data & Initial Inspection

Analysis of factors that lead to attrition

Manuel Variable Selection for predicting Attrition

Run a automated variable selection using caret - compare with my visual inspection

Extract variables select to build my NB model

We will now read in CaseStudy2CompSet No Attrition.csv and get a prediction result under Case2PredictionsYU Attrition.csv

Now, we’d like to predict Montly Income with Multiple Linear Regression

Now we’d like to see some job role specific trend within the data

This part I want to explore which Job Role has better Work Life Balance Score

This Section I want to explore if there are possible Gender Discrimination in the workplace based on Monthly salary

This Section I will explore if there are differences between Percentage Salary Hike between Gender

This Section to explore some relationship between continuous variables

Conclusion