Introduction to the Project
- DDSAnalytics that specializes in talent management solutions for Fortune 100 companies. This is a process of developing and retaining employees. It may include workforce planning, employee training programs, identifying high-potential employees and reducing/preventing voluntary employee turnover (attrition). To gain a competitive edge over its competition, DDSAnalytics is planning to leverage data science for talent management. The executive leadership has identified predicting employee turnover as its first application of data science for talent management. Before the business gree lights the project, they have tasked me to conduct an analysis of existing employee data.
- I have been provided (CaseStudy2-data.csv) to do a data analysis to identify factors that lead to attrition. I need to identify the top three factors that contribute to turnover. The business is also interested in learning about any job role specific trends that may exist in the data set.
- YouTube Presentation: https://www.youtube.com/watch?v=Te3EcaLsR48
Importing Data & Initial Inspection
- There are 870 data entries
- There aren’t any missing values
- There are columns only have one unique value, and won’t be much use for our analysis, delete column to generate cleaner dataset
#Loading libraries needed
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.0
## ✓ tidyr 1.1.0 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
#Loading in Employee data
employeeData = read.csv("/Users/mingyang/Desktop/SMU/DoingDS_Fall2020/CaseStudy2DDS/CaseStudy2-data.csv",header = TRUE)
summary(employeeData)
## ID Age Attrition BusinessTravel
## Min. : 1.0 Min. :18.00 Length:870 Length:870
## 1st Qu.:218.2 1st Qu.:30.00 Class :character Class :character
## Median :435.5 Median :35.00 Mode :character Mode :character
## Mean :435.5 Mean :36.83
## 3rd Qu.:652.8 3rd Qu.:43.00
## Max. :870.0 Max. :60.00
## DailyRate Department DistanceFromHome Education
## Min. : 103.0 Length:870 Min. : 1.000 Min. :1.000
## 1st Qu.: 472.5 Class :character 1st Qu.: 2.000 1st Qu.:2.000
## Median : 817.5 Mode :character Median : 7.000 Median :3.000
## Mean : 815.2 Mean : 9.339 Mean :2.901
## 3rd Qu.:1165.8 3rd Qu.:14.000 3rd Qu.:4.000
## Max. :1499.0 Max. :29.000 Max. :5.000
## EducationField EmployeeCount EmployeeNumber EnvironmentSatisfaction
## Length:870 Min. :1 Min. : 1.0 Min. :1.000
## Class :character 1st Qu.:1 1st Qu.: 477.2 1st Qu.:2.000
## Mode :character Median :1 Median :1039.0 Median :3.000
## Mean :1 Mean :1029.8 Mean :2.701
## 3rd Qu.:1 3rd Qu.:1561.5 3rd Qu.:4.000
## Max. :1 Max. :2064.0 Max. :4.000
## Gender HourlyRate JobInvolvement JobLevel
## Length:870 Min. : 30.00 Min. :1.000 Min. :1.000
## Class :character 1st Qu.: 48.00 1st Qu.:2.000 1st Qu.:1.000
## Mode :character Median : 66.00 Median :3.000 Median :2.000
## Mean : 65.61 Mean :2.723 Mean :2.039
## 3rd Qu.: 83.00 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :100.00 Max. :4.000 Max. :5.000
## JobRole JobSatisfaction MaritalStatus MonthlyIncome
## Length:870 Min. :1.000 Length:870 Min. : 1081
## Class :character 1st Qu.:2.000 Class :character 1st Qu.: 2840
## Mode :character Median :3.000 Mode :character Median : 4946
## Mean :2.709 Mean : 6390
## 3rd Qu.:4.000 3rd Qu.: 8182
## Max. :4.000 Max. :19999
## MonthlyRate NumCompaniesWorked Over18 OverTime
## Min. : 2094 Min. :0.000 Length:870 Length:870
## 1st Qu.: 8092 1st Qu.:1.000 Class :character Class :character
## Median :14074 Median :2.000 Mode :character Mode :character
## Mean :14326 Mean :2.728
## 3rd Qu.:20456 3rd Qu.:4.000
## Max. :26997 Max. :9.000
## PercentSalaryHike PerformanceRating RelationshipSatisfaction StandardHours
## Min. :11.0 Min. :3.000 Min. :1.000 Min. :80
## 1st Qu.:12.0 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:80
## Median :14.0 Median :3.000 Median :3.000 Median :80
## Mean :15.2 Mean :3.152 Mean :2.707 Mean :80
## 3rd Qu.:18.0 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:80
## Max. :25.0 Max. :4.000 Max. :4.000 Max. :80
## StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
## Min. :0.0000 Min. : 0.00 Min. :0.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.: 6.00 1st Qu.:2.000 1st Qu.:2.000
## Median :1.0000 Median :10.00 Median :3.000 Median :3.000
## Mean :0.7839 Mean :11.05 Mean :2.832 Mean :2.782
## 3rd Qu.:1.0000 3rd Qu.:15.00 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :3.0000 Max. :40.00 Max. :6.000 Max. :4.000
## YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 3.000 1st Qu.: 2.000 1st Qu.: 0.000
## Median : 5.000 Median : 3.000 Median : 1.000
## Mean : 6.962 Mean : 4.205 Mean : 2.169
## 3rd Qu.:10.000 3rd Qu.: 7.000 3rd Qu.: 3.000
## Max. :40.000 Max. :18.000 Max. :15.000
## YearsWithCurrManager
## Min. : 0.00
## 1st Qu.: 2.00
## Median : 3.00
## Mean : 4.14
## 3rd Qu.: 7.00
## Max. :17.00
#See which column has only unique value
sapply(employeeData,function(col) length(unique(col)))
## ID Age Attrition
## 870 43 2
## BusinessTravel DailyRate Department
## 3 627 3
## DistanceFromHome Education EducationField
## 29 5 6
## EmployeeCount EmployeeNumber EnvironmentSatisfaction
## 1 870 4
## Gender HourlyRate JobInvolvement
## 2 71 4
## JobLevel JobRole JobSatisfaction
## 5 9 4
## MaritalStatus MonthlyIncome MonthlyRate
## 3 826 852
## NumCompaniesWorked Over18 OverTime
## 10 1 2
## PercentSalaryHike PerformanceRating RelationshipSatisfaction
## 15 2 4
## StandardHours StockOptionLevel TotalWorkingYears
## 1 4 39
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## 7 4 32
## YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## 19 16 17
#Delete column that has only one unique value
to.be.deleted = which(sapply(employeeData,function(col) length(unique(col))==1))
employeeData = employeeData[,-to.be.deleted]
#observe the cleaner set again
summary(employeeData)
## ID Age Attrition BusinessTravel
## Min. : 1.0 Min. :18.00 Length:870 Length:870
## 1st Qu.:218.2 1st Qu.:30.00 Class :character Class :character
## Median :435.5 Median :35.00 Mode :character Mode :character
## Mean :435.5 Mean :36.83
## 3rd Qu.:652.8 3rd Qu.:43.00
## Max. :870.0 Max. :60.00
## DailyRate Department DistanceFromHome Education
## Min. : 103.0 Length:870 Min. : 1.000 Min. :1.000
## 1st Qu.: 472.5 Class :character 1st Qu.: 2.000 1st Qu.:2.000
## Median : 817.5 Mode :character Median : 7.000 Median :3.000
## Mean : 815.2 Mean : 9.339 Mean :2.901
## 3rd Qu.:1165.8 3rd Qu.:14.000 3rd Qu.:4.000
## Max. :1499.0 Max. :29.000 Max. :5.000
## EducationField EmployeeNumber EnvironmentSatisfaction Gender
## Length:870 Min. : 1.0 Min. :1.000 Length:870
## Class :character 1st Qu.: 477.2 1st Qu.:2.000 Class :character
## Mode :character Median :1039.0 Median :3.000 Mode :character
## Mean :1029.8 Mean :2.701
## 3rd Qu.:1561.5 3rd Qu.:4.000
## Max. :2064.0 Max. :4.000
## HourlyRate JobInvolvement JobLevel JobRole
## Min. : 30.00 Min. :1.000 Min. :1.000 Length:870
## 1st Qu.: 48.00 1st Qu.:2.000 1st Qu.:1.000 Class :character
## Median : 66.00 Median :3.000 Median :2.000 Mode :character
## Mean : 65.61 Mean :2.723 Mean :2.039
## 3rd Qu.: 83.00 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :100.00 Max. :4.000 Max. :5.000
## JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate
## Min. :1.000 Length:870 Min. : 1081 Min. : 2094
## 1st Qu.:2.000 Class :character 1st Qu.: 2840 1st Qu.: 8092
## Median :3.000 Mode :character Median : 4946 Median :14074
## Mean :2.709 Mean : 6390 Mean :14326
## 3rd Qu.:4.000 3rd Qu.: 8182 3rd Qu.:20456
## Max. :4.000 Max. :19999 Max. :26997
## NumCompaniesWorked OverTime PercentSalaryHike PerformanceRating
## Min. :0.000 Length:870 Min. :11.0 Min. :3.000
## 1st Qu.:1.000 Class :character 1st Qu.:12.0 1st Qu.:3.000
## Median :2.000 Mode :character Median :14.0 Median :3.000
## Mean :2.728 Mean :15.2 Mean :3.152
## 3rd Qu.:4.000 3rd Qu.:18.0 3rd Qu.:3.000
## Max. :9.000 Max. :25.0 Max. :4.000
## RelationshipSatisfaction StockOptionLevel TotalWorkingYears
## Min. :1.000 Min. :0.0000 Min. : 0.00
## 1st Qu.:2.000 1st Qu.:0.0000 1st Qu.: 6.00
## Median :3.000 Median :1.0000 Median :10.00
## Mean :2.707 Mean :0.7839 Mean :11.05
## 3rd Qu.:4.000 3rd Qu.:1.0000 3rd Qu.:15.00
## Max. :4.000 Max. :3.0000 Max. :40.00
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## Min. :0.000 Min. :1.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 3.000 1st Qu.: 2.000
## Median :3.000 Median :3.000 Median : 5.000 Median : 3.000
## Mean :2.832 Mean :2.782 Mean : 6.962 Mean : 4.205
## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:10.000 3rd Qu.: 7.000
## Max. :6.000 Max. :4.000 Max. :40.000 Max. :18.000
## YearsSinceLastPromotion YearsWithCurrManager
## Min. : 0.000 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 2.00
## Median : 1.000 Median : 3.00
## Mean : 2.169 Mean : 4.14
## 3rd Qu.: 3.000 3rd Qu.: 7.00
## Max. :15.000 Max. :17.00
#Convert some values into factors
cols.to.factor = c("Attrition","BusinessTravel","Department","EducationField","EnvironmentSatisfaction",
"Gender","JobInvolvement","JobLevel","JobRole","JobSatisfaction","MaritalStatus","NumCompaniesWorked",
"OverTime","PerformanceRating","RelationshipSatisfaction","StockOptionLevel","TrainingTimesLastYear",
"WorkLifeBalance")
employeeData[cols.to.factor] = lapply(employeeData[cols.to.factor],factor)
str(employeeData)
## 'data.frame': 870 obs. of 33 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : int 32 40 35 32 24 27 41 37 34 34 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 3 2 3 2 2 3 3 3 2 ...
## $ DailyRate : int 117 1308 200 801 567 294 1283 309 1333 653 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 3 2 2 2 3 3 2 ...
## $ DistanceFromHome : int 13 14 18 1 2 10 5 10 10 10 ...
## $ Education : int 4 3 2 4 1 2 5 4 4 4 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 4 2 3 6 2 4 2 2 6 ...
## $ EmployeeNumber : int 859 1128 1412 2016 1646 733 1448 1105 1055 1597 ...
## $ EnvironmentSatisfaction : Factor w/ 4 levels "1","2","3","4": 2 3 3 3 1 4 2 4 3 4 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 2 2 1 1 2 ...
## $ HourlyRate : int 73 44 60 48 32 32 90 88 87 92 ...
## $ JobInvolvement : Factor w/ 4 levels "1","2","3","4": 3 2 3 3 3 3 4 2 3 2 ...
## $ JobLevel : Factor w/ 5 levels "1","2","3","4",..: 2 5 3 3 1 3 1 2 1 2 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 6 5 8 7 5 7 8 9 1 ...
## $ JobSatisfaction : Factor w/ 4 levels "1","2","3","4": 4 3 4 4 4 1 3 4 3 3 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 1 3 3 2 3 1 2 1 2 2 ...
## $ MonthlyIncome : int 4403 19626 9362 10422 3760 8793 2127 6694 2220 5063 ...
## $ MonthlyRate : int 9250 17544 19944 24032 17218 4809 5561 24223 18410 15332 ...
## $ NumCompaniesWorked : Factor w/ 10 levels "0","1","2","3",..: 3 2 3 2 2 2 3 3 2 2 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 2 2 2 1 ...
## $ PercentSalaryHike : int 11 14 11 19 13 21 12 14 19 14 ...
## $ PerformanceRating : Factor w/ 2 levels "3","4": 1 1 1 1 1 2 1 1 1 1 ...
## $ RelationshipSatisfaction: Factor w/ 4 levels "1","2","3","4": 3 1 3 3 3 3 1 3 4 2 ...
## $ StockOptionLevel : Factor w/ 4 levels "0","1","2","3": 2 1 1 3 1 3 1 4 2 2 ...
## $ TotalWorkingYears : int 8 21 10 14 6 9 7 8 1 8 ...
## $ TrainingTimesLastYear : Factor w/ 7 levels "0","1","2","3",..: 4 3 3 4 3 5 6 6 3 4 ...
## $ WorkLifeBalance : Factor w/ 4 levels "1","2","3","4": 2 4 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 5 20 2 14 6 9 4 1 1 8 ...
## $ YearsInCurrentRole : int 2 7 2 10 3 7 2 0 1 2 ...
## $ YearsSinceLastPromotion : int 0 4 2 5 1 1 0 0 0 7 ...
## $ YearsWithCurrManager : int 3 9 2 7 3 7 3 0 0 7 ...
Analysis of factors that lead to attrition
- 16.1% of employees left, a graph comparison is drawn
- Seems younger people have more numbers of Attrition than older folks, they also have an overall higher proportion of leaving
- Sales has the highest Attrition rate, while Research & Development has the lowest
- Distance from Home doesn’t seem to have a particular correlation with Attrition Rate
- Low education Level has a higher Attrition Rate, while higher education level has lower Attrition Rate
- Human Resources and Technical Degrees have higher Attrition Rate, while Life Sciences have the lowest Attrition Rate
- Low environment satisfaction score of 1 has the highest attrition rate, while score of 2-4 doesn’t seem to have a big differences or change of attrition rate
- Male has a slight higher attrition rate compared to female
- Hourly Rate don’t seem to have a big effect on Attrition rate, it looks pretty random across the board
- Job involvement levels have a great correlation with Attrition rate, the lower the Job Involvement, the higher Attrition Rate. The higher Job involvement level the lower of the attrition rate.
- Job Level has some correlation with Attrition rate, but this relationship is not as strong as Job involvement
- Sales Representation has a significant higher Attrition Rate, while Research Director and Manufacturing Director has significantly lower Attrition rate
- There is a correlation between Attrition rate and Job Satisfaction score, the higher the Job Satisfaction score, the lower the Attrition rate
- Single Marital Status has a higher Attrition Rate, while Divorced group has lowest Attrition Rate
- Monthly Income does seem to have an overall trend of higher income associated with lower Attrition Rate. But this correlation isn’t perfectly strong
- Monthly Rate doesn’t seem to have an overall trend of higher Monthly Rate associated with lower attrition rate.
- Not super strong correlation, but it seems people have changed jobs many times, don’t have problems changing again.
- It seems people with many overtime are associated with a higher attrition rate
- It doesn’t seem percent of Salary Hike has a strong correlation with attrition rate. Even a high percentage of Salary Hike can still have a high Attrition rate
- Performance Rating don’t have a big difference in terms of Attrition rate
- Relationship Satisfaction have a sligh correlation with the attrition rate. However, the relationship isn’t super strong.
- Stock option level of 0 and 3 has much higher attrition Rate compared to Stock option level of 1 and 2
- In general, the more total working years, the less of a attrition rate it becomes, a lot of people start retiring after working for 30 years, almost 100% folks retire after working for 40 years
- There seem to have an overall trend of more training in the last year, less like a person will leave, but this trend isn’t perfect
- The higher the work Life Balance Score, the lower Attrition rate it becomes
- Within 10 years of working with the company, there seem to have an overall trend of decreasing of Attrition rate with the year increase, however between 10-20 years, the attrition rate seem to fluctuate, while a lot of folks retire after year of 25.
- Years In current role doesn’t seem to have a linear relationship but a quadratic relationship
- There don’t seem to be any definitive relationship between years since last promotion and Attrition rate
- There don’t seem to have a definitive relationship between attrition rate and years with current manager
# First discover the percentage of Attrition in the Dataset
table(employeeData$Attrition) # 730 No, 140 Yes
##
## No Yes
## 730 140
attrition.percent = 140/870
attrition.percent
## [1] 0.1609195
# Visually look at percentage of Attrition
employeeData %>% ggplot() + geom_bar(aes(x=Attrition),fill="light blue") +
ggtitle("Attrition Count comparison")+ylab("Number of Employee")

# Age with Attrition
employeeData %>% ggplot() + geom_bar(aes(x=Age,fill=Attrition)) +
ggtitle("Attrition with Age")+ylab("Number of Empllyee")

# Age with Attrition Proportion
employeeData %>% ggplot() + geom_bar(aes(x=Age,fill=Attrition),position="fill") +
scale_y_continuous(labels = scales::percent)+
ggtitle("Attrition with Age")+ylab("Proportion of Attrition")

#Department
employeeData %>% ggplot()+geom_bar(aes(x=Department,fill=Attrition),position="fill")+
scale_y_continuous(labels = scales::percent)+
ggtitle("Attrition rate under different Department")+ylab("Attrition Rate")

# Distance from Home
employeeData %>% ggplot()+ geom_bar(aes(x=DistanceFromHome,fill=Attrition),position="fill")+
scale_y_continuous(labels = scales::percent)+
ggtitle("Attrition rate under different distance from Home")+ylab("Attrition Rate")

#Education
employeeData %>% ggplot()+ geom_bar(aes(x=Education,fill=Attrition),position="fill")+
scale_y_continuous(labels = scales::percent)+
ggtitle("Attrition rate under different Education Level")+ylab("Attrition Rate")+xlab("Education Level")

#Education Field
employeeData %>% ggplot()+ geom_bar(aes(x=EducationField,fill=Attrition),position="fill")+
scale_y_continuous(labels = scales::percent)+
ggtitle("Attrition rate under different Education Field")+ylab("Attrition Rate")+xlab("Education Field")

#Environment Satisfaction
employeeData %>% ggplot()+ geom_bar(aes(x=EnvironmentSatisfaction,fill=Attrition),position="fill")+
scale_y_continuous(labels = scales::percent)+
ggtitle("Attrition rate under different Environment Satisfaction Score")+ylab("Attrition Rate")+xlab("Environment Satisfaction")

# Gender
employeeData %>% ggplot()+ geom_bar(aes(x=Gender,fill=Attrition),position="fill")+
scale_y_continuous(labels = scales::percent)+
ggtitle("Attrition rate with Gender")+ylab("Attrition Rate")+xlab("Gender")

# Hourly Rate
employeeData %>% ggplot()+ geom_bar(aes(x=HourlyRate,fill=Attrition),position="fill")+
scale_y_continuous(labels = scales::percent)+
ggtitle("Attrition rate under different Hourly Rate")+ylab("Attrition Rate")+xlab("Hourly Rate")

# Job Involvement
employeeData %>% ggplot()+ geom_bar(aes(x=JobInvolvement,fill=Attrition),position="fill")+
scale_y_continuous(labels = scales::percent)+
ggtitle("Attrition rate under different Job Involvement Levels")+
ylab("Attrition Rate")+xlab("Job Involvement Levels")

# Job Level
employeeData %>% ggplot()+ geom_bar(aes(x=JobLevel,fill=Attrition),position="fill")+
scale_y_continuous(labels = scales::percent)+
ggtitle("Attrition rate under different Job Level")+ylab("Attrition Rate")+xlab("Job Level")

#Job Role
employeeData %>% ggplot()+ geom_bar(aes(x=JobRole,fill=Attrition),position="fill")+
scale_y_continuous(labels = scales::percent)+
ggtitle("Attrition rate under different Job Roles")+ylab("Attrition Rate")+xlab("Job Roles")+coord_flip()

# Job Satisfaction
employeeData %>% ggplot()+ geom_bar(aes(x=JobSatisfaction,fill=Attrition),position="fill")+
scale_y_continuous(labels = scales::percent)+
ggtitle("Attrition rate under different Job Satisfaction Score")+ylab("Attrition Rate")+xlab("Job Satisfaction")

# Marital Status
employeeData %>% ggplot()+ geom_bar(aes(x=MaritalStatus,fill=Attrition),position="fill")+
scale_y_continuous(labels = scales::percent)+
ggtitle("Attrition rate under Marital Status")+ylab("Attrition Rate")+xlab("Marital Status")

#Monthly Income
employeeData %>% ggplot(aes(x=MonthlyIncome,fill=Attrition))+ geom_bar(color="black",stat="bin",binwidth=500)+
ggtitle("Attrition count under different Monthly Income")+ylab("Attrition Count")+xlab("Monthly Income group")

#Monthly Rate
employeeData %>% ggplot(aes(x=MonthlyRate,fill=Attrition))+ geom_bar(color="black",stat="bin",binwidth=1000)+
ggtitle("Attrition count under different Monthly Rate")+ylab("Attrition Count")+xlab("Monthly Rate")

#Numbers of Companies Worked
employeeData %>% ggplot()+ geom_bar(aes(x=NumCompaniesWorked,fill=Attrition),position="fill")+
scale_y_continuous(labels = scales::percent)+
ggtitle("Attrition rate under different Number of Companies Worked before")+ylab("Attrition Rate")+xlab("Number of Companies Worked")

# Over Time
employeeData %>% ggplot()+ geom_bar(aes(x=OverTime,fill=Attrition),position="fill")+
scale_y_continuous(labels = scales::percent)+
ggtitle("Attrition rate vs Overtime")+ylab("Attrition Rate")+xlab("OverTime Or Not")

#Percent Salary Hike
employeeData %>% ggplot()+ geom_bar(aes(x=PercentSalaryHike,fill=Attrition),position="fill")+
scale_y_continuous(labels = scales::percent)+
ggtitle("Attrition rate under different Percent of Salary Hike")+ylab("Attrition Rate")+xlab("Percent of Salary Hike")

#Performance Rating
employeeData %>% ggplot()+ geom_bar(aes(x=PerformanceRating,fill=Attrition),position="fill")+
scale_y_continuous(labels = scales::percent)+
ggtitle("Attrition rate under different Performance Rating")+ylab("Attrition Rate")+xlab("Performance Rating")

#Relationship Satisfaction
employeeData %>% ggplot()+ geom_bar(aes(x=RelationshipSatisfaction,fill=Attrition),position="fill")+
scale_y_continuous(labels = scales::percent)+
ggtitle("Attrition rate under different Relationship Satisfaction Score")+ylab("Attrition Rate")+xlab("Relationship Satisfaction")

#Stock Option Level
employeeData %>% ggplot()+ geom_bar(aes(x=StockOptionLevel,fill=Attrition),position="fill")+
scale_y_continuous(labels = scales::percent)+
ggtitle("Attrition rate under different Stock Option Level")+ylab("Attrition Rate")+xlab("Stock Option Level")

#Total Working Years
employeeData %>% ggplot()+ geom_bar(aes(x=TotalWorkingYears,fill=Attrition),position="fill")+
scale_y_continuous(labels = scales::percent)+
ggtitle("Attrition rate vs Total Working Years")+ylab("Attrition Rate")+xlab("Total Working Years")

#Training Times Last Year
employeeData %>% ggplot()+ geom_bar(aes(x=TrainingTimesLastYear,fill=Attrition),position="fill")+
scale_y_continuous(labels = scales::percent)+
ggtitle("Attrition rate vs Training Times Last Year")+ylab("Attrition Rate")+xlab("Training Times Last Year")

#Work Life Balance Score
employeeData %>% ggplot()+ geom_bar(aes(x=WorkLifeBalance,fill=Attrition),position="fill")+
scale_y_continuous(labels = scales::percent)+
ggtitle("Attrition rate vs Work Life Balance Score")+ylab("Attrition Rate")+xlab("Work Life Balance Score")

# Years At Company
employeeData %>% ggplot()+ geom_bar(aes(x=YearsAtCompany,fill=Attrition),position="fill")+
scale_y_continuous(labels = scales::percent)+
ggtitle("Attrition rate vs. Years At Company")+ylab("Attrition Rate")+xlab("Years At Company")

# Years in current Role
employeeData %>% ggplot()+ geom_bar(aes(x=YearsInCurrentRole,fill=Attrition),position="fill")+
scale_y_continuous(labels = scales::percent)+
ggtitle("Attrition rate vs Years in Current Role")+ylab("Attrition Rate")+xlab("Years in current role")

# Years Since Last Promotion
employeeData %>% ggplot()+ geom_bar(aes(x=YearsSinceLastPromotion,fill=Attrition),position="fill")+
scale_y_continuous(labels = scales::percent)+
ggtitle("Attrition rate vs. Years Since last promotion")+ylab("Attrition Rate")+xlab("Years Since Last Promotion")

# Years With Current Manager
employeeData %>% ggplot()+ geom_bar(aes(x=YearsWithCurrManager,fill=Attrition),position="fill")+
scale_y_continuous(labels = scales::percent)+
ggtitle("Attrition rate vs. Years With Current Manager")+ylab("Attrition Rate")+xlab("Years With Current Manager")

Manuel Variable Selection for predicting Attrition
- Based on the results above, we have a basic understanding on what can affect attrition rate
- Visually inspecting all relevance, Job Involvement Levels, Job Roles, Overtime seem to have the most direct/significant correlation with Attrition Rate
- Due to the amount of categorical variables involved, I consider Naive Bayes Classifier would make the best model to predict Attrition
- Manually picking out variables used to predict Attrition based on the relationship explored above are below:
- Age, Department, Education Level, Job Involvement, Job Level, Job Role, Job Satisfaction, Marital Status, Monthly Income, Number of Companies Worked, OverTime, Stock Option Level, Total Working Years, Work Life Balance, Years at Company, Years in Current Role, Years With Current Manager
Run a automated variable selection using caret - compare with my visual inspection
- As we can see the top three most important variable according to the automated variable selection using caret are: OverTime, MonthlyIncome, TotalWorkingYears. Apparently, visually, we only got Overtime right in that sense.
- According to the importance of variables, we can use the following variables to construct our Naive Bayes Model: OverTime, MonthlyIncome, TotalWorkingYears, YearsAtCompany, StockOptionLevel, MaritalStatus, JobLevel, YearsInCurrentRole, YearsWithCurrManager, Age, JobInvolvement, JobSatisfaction, JobRole, Department, all of these variables are confirmed by the above visual inspection. In addition, I want to add Education, WorkLifeBalance, Environment Satisfaction since they were confirmed of importance visually.
#Load in libraries
library(lattice)
library(caret)
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(mlbench)
#prepare training scheme
control = trainControl(method="repeatedcv", number=10, repeats=3)
#train the model
model = train(Attrition~.,data=employeeData,method="lvq",preProcess="scale", trControl=control)
#estimate variable importance
importance = varImp(model,scale=FALSE)
#summarize importance
print(importance)
## ROC curve variable importance
##
## only 20 most important variables shown (out of 32)
##
## Importance
## OverTime 0.6679
## MonthlyIncome 0.6567
## TotalWorkingYears 0.6563
## YearsAtCompany 0.6470
## StockOptionLevel 0.6455
## MaritalStatus 0.6438
## JobLevel 0.6406
## YearsInCurrentRole 0.6403
## YearsWithCurrManager 0.6291
## Age 0.6265
## JobInvolvement 0.6159
## JobSatisfaction 0.5833
## JobRole 0.5829
## Department 0.5605
## DistanceFromHome 0.5586
## EnvironmentSatisfaction 0.5532
## WorkLifeBalance 0.5491
## TrainingTimesLastYear 0.5428
## Education 0.5384
## ID 0.5371
#plot importance
plot(importance)

We will now read in CaseStudy2CompSet No Attrition.csv and get a prediction result under Case2PredictionsYU Attrition.csv
#read in data of no Attrition
noAttrition = read.csv("/Users/mingyang/Desktop/SMU/DoingDS_Fall2020/CaseStudy2DDS/CaseStudy2CompSet No Attrition.csv",header = TRUE)
head(noAttrition)
## ID Age BusinessTravel DailyRate Department DistanceFromHome
## 1 1171 35 Travel_Rarely 750 Research & Development 28
## 2 1172 33 Travel_Rarely 147 Human Resources 2
## 3 1173 26 Travel_Rarely 1330 Research & Development 21
## 4 1174 55 Travel_Rarely 1311 Research & Development 2
## 5 1175 29 Travel_Rarely 1246 Sales 19
## 6 1176 51 Travel_Frequently 1456 Research & Development 1
## Education EducationField EmployeeCount EmployeeNumber
## 1 3 Life Sciences 1 1596
## 2 3 Human Resources 1 1207
## 3 3 Medical 1 1107
## 4 3 Life Sciences 1 505
## 5 3 Life Sciences 1 1497
## 6 4 Medical 1 145
## EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1 2 Male 46 4 2
## 2 2 Male 99 3 1
## 3 1 Male 37 3 1
## 4 3 Female 97 3 4
## 5 3 Male 77 2 2
## 6 1 Female 30 2 3
## JobRole JobSatisfaction MaritalStatus MonthlyIncome
## 1 Laboratory Technician 3 Married 3407
## 2 Human Resources 3 Married 3600
## 3 Laboratory Technician 3 Divorced 2377
## 4 Manager 4 Single 16659
## 5 Sales Executive 3 Divorced 8620
## 6 Healthcare Representative 1 Single 7484
## MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike
## 1 25348 1 Y No 17
## 2 8429 1 Y No 13
## 3 19373 1 Y No 20
## 4 23258 2 Y Yes 13
## 5 23757 1 Y No 14
## 6 25796 3 Y No 20
## PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel
## 1 3 4 80 2
## 2 3 4 80 1
## 3 4 3 80 1
## 4 3 3 80 0
## 5 3 3 80 2
## 6 4 3 80 0
## TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## 1 10 3 2 10
## 2 5 2 3 5
## 3 1 0 2 1
## 4 30 2 3 5
## 5 10 3 3 10
## 6 23 1 2 13
## YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## 1 9 6 8
## 2 4 1 4
## 3 1 0 0
## 4 4 1 2
## 5 7 0 4
## 6 12 12 8
#Change certain column to factors
cols.to.factor2 = c("BusinessTravel","Department","EducationField","EnvironmentSatisfaction",
"Gender","JobInvolvement","JobLevel","JobRole","JobSatisfaction","MaritalStatus","NumCompaniesWorked",
"OverTime","PerformanceRating","RelationshipSatisfaction","StockOptionLevel","TrainingTimesLastYear",
"WorkLifeBalance")
noAttrition[cols.to.factor2] = lapply(noAttrition[cols.to.factor2],factor)
#Create prediction set
predictionSet = noAttrition %>% select(OverTime, MonthlyIncome, TotalWorkingYears, YearsAtCompany, StockOptionLevel, MaritalStatus, JobLevel, YearsInCurrentRole, YearsWithCurrManager, Age, JobInvolvement, JobSatisfaction, JobRole, Department,Education, WorkLifeBalance, EnvironmentSatisfaction)
#Retrain the NB model with entire dataset to improve prediction accuracy
model.nb = naiveBayes(Attrition~.,data=data.nb, laplace = 1)
prediction.NoAttrition = predict(model.nb,noAttrition)
noAttrition$Attrition = prediction.NoAttrition
results = noAttrition%>%select(ID,Attrition)
#write results into Case2PredictionsYU Attrition.csv
write.csv(results,"Case2PredictionsYU Attrition.csv",row.names = FALSE)
Now, we’d like to predict Montly Income with Multiple Linear Regression
- I used internal 10-fold cross-validation accompanied with Stepwise selection using AIC to find the optimal multi-linear regression model
- RMSE using 10-fold cross-validation is 1025.488
- As we can see we got Adjusted R-squared of 0.9532 of our model, which is pretty good
- As we have used this optimized model to generate predictions for Monthly Salary for CaseStudy2CompSet No Salary.csv, and exported to Case2PredictionsYU Salary.csv
#Load library to run stepwise regression method to choose an optimal simple model
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
#Build the model with internel verfication
set.seed(24)
train.control <- trainControl(method = "cv", number = 10)
step.model = train(MonthlyIncome~., data=employeeData,
method="lmStepAIC",
trControl = train.control,
trace=FALSE)
#Model Accuracy
step.model$results
## parameter RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 none 1023.433 0.9490945 784.3238 128.5067 0.01824883 86.87328
step.model$finalModel
##
## Call:
## lm(formula = .outcome ~ ID + BusinessTravelTravel_Frequently +
## BusinessTravelTravel_Rarely + DailyRate + DepartmentSales +
## JobLevel2 + JobLevel3 + JobLevel4 + JobLevel5 + `JobRoleHuman Resources` +
## `JobRoleLaboratory Technician` + JobRoleManager + `JobRoleResearch Director` +
## `JobRoleResearch Scientist` + `JobRoleSales Executive` +
## `JobRoleSales Representative` + NumCompaniesWorked3 + TotalWorkingYears +
## TrainingTimesLastYear2 + TrainingTimesLastYear3 + TrainingTimesLastYear4,
## data = dat)
##
## Coefficients:
## (Intercept) ID
## 3471.4383 -0.2277
## BusinessTravelTravel_Frequently BusinessTravelTravel_Rarely
## 190.9083 336.6109
## DailyRate DepartmentSales
## 0.1751 -546.2948
## JobLevel2 JobLevel3
## 1710.2424 4946.8537
## JobLevel4 JobLevel5
## 8281.5586 10910.9675
## `JobRoleHuman Resources` `JobRoleLaboratory Technician`
## -1127.4707 -1270.4388
## JobRoleManager `JobRoleResearch Director`
## 3553.2458 3458.5846
## `JobRoleResearch Scientist` `JobRoleSales Executive`
## -1071.0152 482.7060
## `JobRoleSales Representative` NumCompaniesWorked3
## -751.8712 248.7191
## TotalWorkingYears TrainingTimesLastYear2
## 43.9571 -209.1826
## TrainingTimesLastYear3 TrainingTimesLastYear4
## -147.4277 -210.0349
summary(step.model$finalModel)
##
## Call:
## lm(formula = .outcome ~ ID + BusinessTravelTravel_Frequently +
## BusinessTravelTravel_Rarely + DailyRate + DepartmentSales +
## JobLevel2 + JobLevel3 + JobLevel4 + JobLevel5 + `JobRoleHuman Resources` +
## `JobRoleLaboratory Technician` + JobRoleManager + `JobRoleResearch Director` +
## `JobRoleResearch Scientist` + `JobRoleSales Executive` +
## `JobRoleSales Representative` + NumCompaniesWorked3 + TotalWorkingYears +
## TrainingTimesLastYear2 + TrainingTimesLastYear3 + TrainingTimesLastYear4,
## data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3108.3 -597.6 -60.2 575.3 4095.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.471e+03 2.153e+02 16.125 < 2e-16 ***
## ID -2.277e-01 1.360e-01 -1.674 0.09454 .
## BusinessTravelTravel_Frequently 1.909e+02 1.315e+02 1.452 0.14689
## BusinessTravelTravel_Rarely 3.366e+02 1.112e+02 3.026 0.00255 **
## DailyRate 1.751e-01 8.481e-02 2.065 0.03925 *
## DepartmentSales -5.463e+02 2.879e+02 -1.898 0.05807 .
## JobLevel2 1.710e+03 1.379e+02 12.401 < 2e-16 ***
## JobLevel3 4.947e+03 1.853e+02 26.696 < 2e-16 ***
## JobLevel4 8.282e+03 2.795e+02 29.633 < 2e-16 ***
## JobLevel5 1.091e+04 3.303e+02 33.036 < 2e-16 ***
## `JobRoleHuman Resources` -1.127e+03 2.366e+02 -4.765 2.23e-06 ***
## `JobRoleLaboratory Technician` -1.270e+03 1.528e+02 -8.312 3.71e-16 ***
## JobRoleManager 3.553e+03 2.511e+02 14.151 < 2e-16 ***
## `JobRoleResearch Director` 3.459e+03 1.924e+02 17.980 < 2e-16 ***
## `JobRoleResearch Scientist` -1.071e+03 1.568e+02 -6.829 1.63e-11 ***
## `JobRoleSales Executive` 4.827e+02 3.071e+02 1.572 0.11633
## `JobRoleSales Representative` -7.519e+02 3.532e+02 -2.129 0.03356 *
## NumCompaniesWorked3 2.487e+02 1.128e+02 2.204 0.02779 *
## TotalWorkingYears 4.396e+01 7.720e+00 5.694 1.71e-08 ***
## TrainingTimesLastYear2 -2.092e+02 9.385e+01 -2.229 0.02608 *
## TrainingTimesLastYear3 -1.474e+02 9.473e+01 -1.556 0.12001
## TrainingTimesLastYear4 -2.100e+02 1.401e+02 -1.500 0.13408
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 994.8 on 848 degrees of freedom
## Multiple R-squared: 0.9543, Adjusted R-squared: 0.9532
## F-statistic: 843.6 on 21 and 848 DF, p-value: < 2.2e-16
#load in CaseStudy2CompSet No Salary.csv to predict monthly salary
noSalary = read.csv("/Users/mingyang/Desktop/SMU/DoingDS_Fall2020/CaseStudy2DDS/CaseStudy2CompSet No Salary.csv",header = TRUE)
cols.to.factor3 = c("Attrition","BusinessTravel","Department","EducationField","EnvironmentSatisfaction",
"Gender","JobInvolvement","JobLevel","JobRole","JobSatisfaction","MaritalStatus","NumCompaniesWorked",
"OverTime","PerformanceRating","RelationshipSatisfaction","StockOptionLevel","TrainingTimesLastYear",
"WorkLifeBalance")
noSalary[cols.to.factor] = lapply(noSalary[cols.to.factor],factor)
prediction.monthlyS = predict(step.model,noSalary)
prediction.monthlyS
## 1 2 3 4 5 6 7 8
## 5667.412 2877.699 12719.802 2310.953 2726.669 5261.316 5192.232 2114.693
## 9 10 11 12 13 14 15 16
## 2851.347 13658.769 9787.549 2803.404 5723.346 5325.265 5695.491 5453.018
## 17 18 19 20 21 22 23 24
## 5779.578 5754.350 4701.013 2624.968 4521.582 9210.684 8853.339 5468.689
## 25 26 27 28 29 30 31 32
## 10188.107 9240.707 8798.383 15964.144 5778.377 2706.053 2552.283 5444.199
## 33 34 35 36 37 38 39 40
## 5917.314 2631.109 16288.177 5300.990 9437.522 5662.776 2522.258 2509.445
## 41 42 43 44 45 46 47 48
## 19269.761 2678.623 2745.314 12952.562 5757.764 4138.688 2725.817 5355.613
## 49 50 51 52 53 54 55 56
## 2544.940 2940.062 2400.917 2377.687 4401.471 4167.326 17365.697 3072.708
## 57 58 59 60 61 62 63 64
## 5578.954 13017.835 2020.920 2498.636 5658.842 12867.853 8822.231 2300.468
## 65 66 67 68 69 70 71 72
## 2182.097 2212.168 9254.189 9179.071 4835.846 2397.635 3202.394 9068.966
## 73 74 75 76 77 78 79 80
## 9811.282 5454.760 2227.742 5595.192 2249.817 2741.875 2535.885 5361.091
## 81 82 83 84 85 86 87 88
## 2934.359 5494.719 5775.978 2518.148 3957.118 7627.025 10261.168 2407.480
## 89 90 91 92 93 94 95 96
## 9646.801 19558.217 2898.433 4704.092 5659.831 4576.089 2620.209 19196.431
## 97 98 99 100 101 102 103 104
## 2671.183 8730.030 16784.576 2581.113 5708.391 5635.251 2319.499 2727.172
## 105 106 107 108 109 110 111 112
## 5727.285 2700.128 6179.544 2631.968 9176.392 19043.250 2795.791 9176.673
## 113 114 115 116 117 118 119 120
## 2661.801 2500.815 5569.048 2799.713 2630.520 5495.276 9263.432 2714.953
## 121 122 123 124 125 126 127 128
## 4769.642 2832.707 4700.187 5300.632 5181.913 4892.850 2739.799 16618.340
## 129 130 131 132 133 134 135 136
## 8936.622 2210.202 4822.534 4396.598 4619.543 6131.805 2994.393 5851.277
## 137 138 139 140 141 142 143 144
## 2420.909 5639.663 19648.311 4665.657 9214.549 4777.472 5689.295 4248.363
## 145 146 147 148 149 150 151 152
## 5532.011 2831.315 2582.705 8867.610 9012.098 2009.268 5568.672 5341.081
## 153 154 155 156 157 158 159 160
## 4459.104 9602.167 16127.775 18496.114 4819.892 5449.039 5370.181 8688.859
## 161 162 163 164 165 166 167 168
## 5335.752 2927.915 12273.590 2482.552 4314.764 2783.629 5636.974 8163.939
## 169 170 171 172 173 174 175 176
## 12444.541 5732.661 2263.217 3222.048 8781.776 6170.746 2428.014 2165.226
## 177 178 179 180 181 182 183 184
## 5536.563 12338.984 2664.315 5447.774 9906.418 2406.066 5740.726 2725.865
## 185 186 187 188 189 190 191 192
## 2718.369 2681.407 5478.332 15856.378 4991.833 2804.353 4638.894 5453.582
## 193 194 195 196 197 198 199 200
## 6056.415 2265.099 5325.803 2419.336 3373.154 19553.588 5151.517 2288.251
## 201 202 203 204 205 206 207 208
## 2501.870 5929.721 2258.844 5868.040 5578.961 16118.711 19247.014 4287.267
## 209 210 211 212 213 214 215 216
## 4520.611 4836.603 2864.896 2716.774 2125.861 5766.473 8630.384 9069.882
## 217 218 219 220 221 222 223 224
## 5499.176 2799.995 2663.808 5358.748 5330.041 5371.501 12374.608 5322.878
## 225 226 227 228 229 230 231 232
## 5105.139 5489.334 3203.456 2653.700 2614.991 5562.176 18908.699 5688.374
## 233 234 235 236 237 238 239 240
## 5919.508 9946.783 2870.029 2567.802 2926.337 4772.857 5372.521 5375.972
## 241 242 243 244 245 246 247 248
## 5624.512 12563.500 12728.545 5983.697 9659.183 2677.777 6074.552 4354.745
## 249 250 251 252 253 254 255 256
## 9630.426 2494.838 9239.601 16311.203 4682.472 16226.657 2799.968 2257.194
## 257 258 259 260 261 262 263 264
## 2414.401 4471.251 15655.874 2346.736 12290.851 16170.336 5598.793 2188.040
## 265 266 267 268 269 270 271 272
## 2690.483 2367.151 3039.869 5458.563 9200.099 15982.690 5745.099 5546.091
## 273 274 275 276 277 278 279 280
## 9771.368 5366.413 4557.821 5463.508 4212.996 2534.546 2784.174 5642.997
## 281 282 283 284 285 286 287 288
## 2417.278 5638.533 5119.399 2368.844 5831.761 13053.043 2378.533 4709.101
## 289 290 291 292 293 294 295 296
## 2683.833 5599.711 2718.513 8749.298 2563.737 9009.864 2623.892 2025.481
## 297 298 299 300
## 9365.880 5641.997 2874.366 2725.665
noSalary$MonthlyIncome = prediction.monthlyS
results1 = noSalary%>% dplyr::select(ID,MonthlyIncome)
#Write results1 to CaseStudy2CompSet No Salary.csv
write.csv(results1,"Case2PredictionsYU Salary.csv",row.names = FALSE)
Now we’d like to see some job role specific trend within the data
- Healthcare Representation has the highest average Job Satisfaction Score
- Research Director has the lowest average Job Satisfaction Score
#Look at all the job roles
unique(employeeData$JobRole)
## [1] Sales Executive Research Director
## [3] Manufacturing Director Research Scientist
## [5] Sales Representative Healthcare Representative
## [7] Manager Human Resources
## [9] Laboratory Technician
## 9 Levels: Healthcare Representative Human Resources ... Sales Representative
#Loading in Employee data
employeeData = read.csv("/Users/mingyang/Desktop/SMU/DoingDS_Fall2020/CaseStudy2DDS/CaseStudy2-data.csv",header = TRUE)
#summary(employeeData)
#See which column has only unique value
#sapply(employeeData,function(col) length(unique(col)))
#Delete column that has only one unique value
to.be.deleted = which(sapply(employeeData,function(col) length(unique(col))==1))
employeeData = employeeData[,-to.be.deleted]
satisfaction.by.jobrole = employeeData %>% group_by(JobRole) %>% summarize(mean(JobSatisfaction))
## `summarise()` ungrouping output (override with `.groups` argument)
#Change Column name
names(satisfaction.by.jobrole)[2]="JobSatisfaction"
satisfaction.by.jobrole
## # A tibble: 9 x 2
## JobRole JobSatisfaction
## <chr> <dbl>
## 1 Healthcare Representative 2.83
## 2 Human Resources 2.56
## 3 Laboratory Technician 2.69
## 4 Manager 2.51
## 5 Manufacturing Director 2.72
## 6 Research Director 2.49
## 7 Research Scientist 2.80
## 8 Sales Executive 2.72
## 9 Sales Representative 2.70
satisfaction.by.jobrole%>% ggplot(aes(reorder(JobRole,JobSatisfaction),JobSatisfaction))+
geom_bar(fill="blue",stat="identity") +
ggtitle("Job role vs Average Job Satisfaction Score")+
xlab("Job Roles")+ ylab("Average Job Satisfaction Score")+
coord_flip()

This part I want to explore which Job Role has better Work Life Balance Score
- According to our graph, HumanResources has the highest Work Life Balance Score
- Healthcare Representative has the lowest Work Life Balance Score, however we know from above, they have the highest Job Satisfaction!
work.life.balance = employeeData %>% group_by(JobRole) %>% summarize(mean(WorkLifeBalance))
## `summarise()` ungrouping output (override with `.groups` argument)
names(work.life.balance)[2]="WorkLifeBalance"
work.life.balance
## # A tibble: 9 x 2
## JobRole WorkLifeBalance
## <chr> <dbl>
## 1 Healthcare Representative 2.67
## 2 Human Resources 2.96
## 3 Laboratory Technician 2.76
## 4 Manager 2.76
## 5 Manufacturing Director 2.85
## 6 Research Director 2.86
## 7 Research Scientist 2.69
## 8 Sales Executive 2.82
## 9 Sales Representative 2.87
work.life.balance%>% ggplot(aes(reorder(JobRole,WorkLifeBalance),WorkLifeBalance))+
geom_bar(fill="blue",stat="identity") +
ggtitle("Job role vs Average Work Life Balance Score")+
xlab("Job Roles")+ ylab("Average Work Life Balance Score")+
coord_flip()

This Section I want to explore if there are possible Gender Discrimination in the workplace based on Monthly salary
- Null Hypothesis: There is no Gender Discrimination Mu(Male) = Mu(Female)
- Alternative Hypothesis: There is significant evidence of gender discrimination, Mu(Male)!=Mu(Female)
- Using alpha = 0.05, By doing a welch t-test, we have found no significant difference between population mean (p-value = 0.11), which means 0 is plausible value between the two mean differences with a 95% confidence interval.
- The Female group has a higher mean of Monthly Income compared to Male
- Scope: Not sure how these data is sampled, any inference to the population need to remain speculative
t.test(MonthlyIncome ~ Gender, data = employeeData, conf.level=0.95)
##
## Welch Two Sample t-test
##
## data: MonthlyIncome by Gender
## t = 1.6207, df = 747.5, p-value = 0.1055
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -109.0107 1140.8560
## sample estimates:
## mean in group Female mean in group Male
## 6696.260 6180.337
This Section I will explore if there are differences between Percentage Salary Hike between Gender
- Null Hypothesis: There is no difference of percentage salary hike between Gender
- Alternative Hypothesis: There are significant difference of percentage salary hike between Male and Female
- Conclusion: by using alpha = 0.05 level of significance, we discovered that there is no significant difference (p-value=0.91) of percentage salary hike between Gender
- Scope: Not sure how these data is sampled, any inference to the population need to remain speculative
t.test(PercentSalaryHike~Gender,data=employeeData,conf.level=0.95)
##
## Welch Two Sample t-test
##
## data: PercentSalaryHike by Gender
## t = -0.10789, df = 735.01, p-value = 0.9141
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.5302728 0.4750239
## sample estimates:
## mean in group Female mean in group Male
## 15.18362 15.21124
This Section to explore some relationship between continuous variables
- By exploring the pairwise continuous variables, it doesn’t seem like any two variables has strongo correlation
- For example: between MonthlyIncome and PercentSalaryHike, under any MonthlyIncome range, there are various PercentSalaryHike range available, so, maybe it is based on employee performance to determine a specific PercentSalaryHike
- More years at the company will in general correlates to more Monthly Income in a positive direction. However, Job Roles seem to have a bigger effect on beginning Monthly Salary. Some High Paying Jobs have lower percentage increase compared to some other roles that start at a lower Monthly Income Salary, such as, Manager or Research Director have high start salary, but the percentage increase as the years at the company increases is low. While Healthcare Representative and Manufacturing Director have lower beggining monthly Income, but the percentage increase as the year goes by is significantly higher.
#import library
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
employeeData %>% dplyr::select(HourlyRate, MonthlyIncome, MonthlyRate,PercentSalaryHike)%>%
ggpairs()

#Does more years in the company result in higher MonthlyIncome?
employeeData%>% dplyr::select(YearsAtCompany,MonthlyIncome,JobRole)%>%
ggplot(aes(x=YearsAtCompany,y=MonthlyIncome,color=JobRole))+
geom_point()+
geom_smooth(method='lm',formula=y~x) + ggtitle("MonthlyIncome vs Years At Company across different Job Roles")+
xlab("Years At Company")+ylab("Monthly Income")

Conclusion
- By exploring importance of variables using automated software, we have found OverTime, MonthlyIncome, TotalWorkingYears, to have the highest relations with Attrition, visually however, we see more relations with Job Involvement Levels, Job Roles, Overtime. We were able to build a prediction model using Naive Bayes, by using variables that have stronger relations to predict Attrition, with an average Accuracy rate around 81%.
- By using internal 10-fold cross-validation accompanied with Stepwise automated selection, I found the optimal multi-linear regression model to predict MonthlyIncome. The model has generated 0.9532 adjusted r-squared value, which tells me the correlation between MonthlyIncome and the predictors generated by stepwise selection are very good.
- I further analyzed several factors of Job Roles that Frito Lay is interested in exploring and found Healthcare Representation has the highest average Job Satisfaction Score, while HumanResources has the highest Work Life Balance Score.
- No gender discrimination has been found based on Monthly Income and Percentage Salary Hike according to my statistical analysis with the data that is provided.