Case Study Two

Diabetes Readmission Study

David Grijalva, Nicole Norelli, & Mingyang Nick YU

9/17/2021

Abstract

The following deliverable investigated hospital readmission of diabetes patients and built a model to predict it. Different aspects of each patient's visit were collected, and logistic regression with regularization was the primary machine learning method used. Due to the nature of the domain, missing data and ethical concerns were addressed. Cross validation was used for model selection and for evaluating different tuning parameters. Feature importance, highlighting the features contributing most to each target category, as well as the results and benefits of the prediction model, are discussed.

1. Introduction

This case study focused on predicting whether an admitted diabetes patient would be readmitted, based on the characteristics of each patient and the outcomes of the hospital stay. Information was extracted based on several criteria, including whether it was a diabetic encounter, whether the stay at the hospital was between 1 and 14 days, whether lab tests were performed, and whether medications were administered. Details can be found here.

The data comes from 130 U.S. hospitals over the years 1999-2008. It includes over 100,000 hospital admissions and 47 attributes, including the target. The three target outcomes were "<30" (readmitted in less than 30 days), ">30" (readmitted after more than 30 days), and "NO" (no readmission). There were eight numeric variables, including time in hospital, number of lab procedures, and number of medications, and 42 categorical variables, including encounter id (unique to each encounter), patient number (unique to each patient), race, and gender.


Logistic Regression

Logistic Regression was utilized to perform the classification task for this case study. It applies the sigmoid function to a linear combination of the predictors, converting it into a probability between zero and one. Similar to Multiple Linear Regression, Logistic Regression can utilize both categorical and numeric variables as predictors. Another benefit of using Logistic Regression is the interpretability of variable importance: when the predictors are transformed to the same range, the coefficient of each variable indicates how much it contributes to predicting the relevant outcome.

For binary classification problems, Logistic Regression uses a two-part loss function: when the target $y$ is 1 the loss is $-\log(p)$, and when the target is 0 the loss is $-\log(1-p)$, which combine into the cross-entropy loss $-\left[y\log(p) + (1-y)\log(1-p)\right]$. Multi-class classification problems can be broken down into binary classification problems using a strategy such as One-vs-Rest (OvR): for each class, the model is trained to separate that class from all of the others. The Scikit-learn package used for this study supports this option. One potential downside of this method is that it can suffer from unbalanced negative examples when the class being compared has few observations.
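A minimal sketch of this loss in NumPy (not the code used in the study; the linear scores below are made up purely for illustration):

```python
import numpy as np

def sigmoid(z):
    # Map a linear combination of predictors to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, p, eps=1e-15):
    # Combined two-part loss: -[y*log(p) + (1-y)*log(1-p)]
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy example: two observations, one of each class
y = np.array([1, 0])
p = sigmoid(np.array([2.0, -1.5]))   # predicted probabilities of class 1
print(cross_entropy(y, p))
```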

Another Logistic Regression option for multi-class classification supplied by Scikit-learn is the multinomial solver. It is based on maximum likelihood, specifically the conditional likelihood of $G$ given $X$. Since $Pr(G|X)$ completely specifies the conditional distribution, the multinomial distribution is appropriate. Below is the log-likelihood for $N$ observations (see details in textbook section 4.4.1):

$\ell(\Theta) = \sum_{i=1}^{N} \log p_{g_{i}}(x_{i};\Theta)$


where $p_{k}(x_{i};\Theta) = Pr(G = k|X= x_{i};\Theta)$.
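As a toy numeric illustration of this log-likelihood (the scores below are invented and are not part of the study's code), the multinomial probabilities can be computed with a softmax and the log-probability of each observation's true class summed:

```python
import numpy as np

def softmax(scores):
    # Row-wise softmax: Pr(G = k | X = x_i; Theta)
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy linear scores for N = 3 observations and K = 3 classes
scores = np.array([[ 2.0, 0.5, -1.0],
                   [ 0.1, 1.5,  0.3],
                   [-0.5, 0.2,  2.2]])
g = np.array([0, 1, 2])              # observed class g_i of each observation

probs = softmax(scores)
log_likelihood = np.sum(np.log(probs[np.arange(len(g)), g]))
print(log_likelihood)
```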


Ridge Regularization (L2)

Ridge regularization, or L2, is the default penalty term for Logistic Regression in Scikit-learn. The penalty is the sum of the squared coefficients multiplied by $\lambda$, which controls the strength of the penalty. Unlike L1 regularization, L2 does not provide feature selection: all coefficients are shrunk toward zero, but none are driven exactly to zero. In general, L2 is the primary regularization method used to prevent overfitting the model.

Penalty term:

$\lambda \sum\limits_{j=1}^{k} m_j^2$

where $\lambda$ is the strength of the penalty applied to the $k$ coefficients. If $\lambda = 0$, no penalty is applied and the original coefficients are returned.
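In Scikit-learn, the strength of this penalty is set through the parameter C, which acts as the inverse of $\lambda$: a small C means a strong penalty, a large C a weak one. A small illustrative sketch on synthetic data (not part of the study's code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data just to illustrate the effect of the penalty strength
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(int)

# Small C = strong L2 penalty (coefficients shrink); large C = weak penalty
for C in (0.01, 30):
    model = LogisticRegression(penalty="l2", C=C, max_iter=1000).fit(X, y)
    print(C, np.round(model.coef_, 3))
```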


Target Class Imbalance

Target class imbalance was a potential concern for this study, with class "NO" being the majority class (around 54% of all encounters). To give each minority class a fair chance to be evaluated during the grid search for the best-performing parameters, scoring="roc_auc_ovr_weighted" (area under the curve using one-vs-rest comparison with average metrics of each label weighted by support) was utilized. The goal was to predict the target classes "<30", ">30", and "NO" equally well, instead of achieving a higher accuracy score simply by predicting the "NO" majority class for everything.
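The same metric can be computed directly with Scikit-learn's roc_auc_score; the labels and probabilities below are hypothetical and only illustrate the one-vs-rest, support-weighted averaging:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted class probabilities
classes = np.array(["<30", ">30", "NO"])       # lexicographic label order
y_true = np.array(["NO", "<30", ">30", "NO", "NO", ">30"])
y_proba = np.array([[0.1, 0.2, 0.7],
                    [0.6, 0.3, 0.1],
                    [0.2, 0.5, 0.3],
                    [0.2, 0.2, 0.6],
                    [0.3, 0.3, 0.4],
                    [0.1, 0.6, 0.3]])          # columns follow `classes`

# One-vs-rest AUC per class, averaged with weights equal to class support
auc = roc_auc_score(y_true, y_proba, multi_class="ovr",
                    average="weighted", labels=classes)
print(auc)
```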

2. Methods

Initial Data Observations

This dataset contained 49 features for 101,766 patient encounters. The feature "encounter_id" was unique to each encounter, and the feature "patient_nbr" identified each of the 71,518 unique patients. Of these patients, 54,745 had only one encounter, while the remaining patients had multiple encounters. Repeat encounters from the same individual violate the independence assumption of logistic regression; however, the information available in these repeat encounters was valuable. An initial exploration of repeat encounters showed the same individuals receiving different types of diagnoses, medications, and tests from different specialties, with different outcomes. Although it potentially biases the results of the study, the data from these repeat patients was retained.

The features "encounter_id" and "patient_nbr" were identifiers specific to each encounter and patient rather than informative predictors, so they were deleted from the dataset prior to analysis.

Consolidation of Categories

Four of the features (medical_specialty, diag_1, diag_2, and diag_3) had a large number of categories. The medical_specialty feature contained many categories with only a few encounters each. The 15 specialties with the largest number of encounters were retained, and all other (non-missing) categories were consolidated into a single category, "Others."
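A sketch of this consolidation in pandas, assuming the encounters DataFrame is named df and that the raw file marks missing values with "?" (both assumptions, not taken from the report's code):

```python
import pandas as pd

# The 15 most frequent non-missing specialties are kept as-is
top_15 = (df.loc[df["medical_specialty"] != "?", "medical_specialty"]
            .value_counts()
            .nlargest(15)
            .index)

# Keep the top 15 and the missing marker; collapse everything else into "Others"
df["medical_specialty"] = df["medical_specialty"].where(
    df["medical_specialty"].isin(top_15) | (df["medical_specialty"] == "?"),
    "Others")
```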

The "diag_1", "diag_2", and "diag_3" features had 717, 749, and 790 categories, respectively. Values in each of these diagnoses categories are International Classification of Diseases (ICD-9) codes. These codes can be grouped into general categories (example found here), and inspiration for this technique was taken from Strack et al. (article). For each of the three diagnoses features, values were consolidated into 19 categories plus one category for missing values. A diagnosis of diabetes mellitus was the only specific diagnosis to keep its own category, as it was the focus of the study.

Missing Data

Seven features contained missing values (Table 1). Because the feature "weight" was missing more than 96% of its values, imputation was not attempted and it was deleted from the dataset prior to analysis. For the remaining six features with missing data, two different imputation strategies were attempted, and the results were compared after fitting the logistic regression model.

The first imputation strategy involved replacing the missing values in each of the six features with a flag value ("NA") to mark values as missing.

The second imputation strategy used Scikit-learn's SimpleImputer to replace each of the missing values with the most frequent value in each feature, as only categorical data was missing.

Feature              Percentage Missing
race                 2.2336%
weight               96.8585%
payer_code           39.5574%
medical_specialty    49.0822%
diag_1               0.0206%
diag_2               0.3518%
diag_3               1.3983%

Table 1: Percentage of missing data for each variable
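The two imputation strategies described above could be sketched as follows (column and variable names are assumptions, as is the "?" missing marker; in the actual study the SimpleImputer was applied inside the pipeline, as described in the next section):

```python
from sklearn.impute import SimpleImputer

missing_cols = ["race", "payer_code", "medical_specialty",
                "diag_1", "diag_2", "diag_3"]

# Strategy 1: flag missing values with an explicit "NA" level
df_flag = df.copy()
df_flag[missing_cols] = df_flag[missing_cols].replace("?", "NA").fillna("NA")

# Strategy 2: mode (most frequent value) imputation, to be placed inside the
# second pipeline so the mode is learned from the training folds only
mode_imputer = SimpleImputer(strategy="most_frequent")
```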


Creating Models Using Pipeline

Each model was created using a pipeline. A pipeline allowed for streamlined scaling with MinMaxScaler(), one-hot encoding with OneHotEncoder(), and tuning of the logistic regression hyperparameters. Additionally, SimpleImputer() was incorporated into one pipeline to implement and compare the different imputation methods. Pipelines prevent data leakage when using grid search with 10-fold cross validation to narrow down the best regularization parameter. Because the outcome classes were imbalanced, stratified cross validation was used and the logistic regression class weight parameter was set to 'balanced'. A random state was also set for reproducibility.
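A sketch of such a pipeline, with assumed names for the feature matrix (X), the column selections, and the random state; the actual preprocessing details of the study may differ:

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# X is the feature DataFrame (assumed name)
numeric_cols = X.select_dtypes(include="number").columns
categorical_cols = X.select_dtypes(exclude="number").columns

# Scale numeric features to a common range and one-hot encode categoricals
preprocess = ColumnTransformer([
    ("num", MinMaxScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

pipeline_lr = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(class_weight="balanced",
                               max_iter=1000, random_state=42)),
])

# Stratified folds preserve the class proportions in each split
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
```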

First Model: Logistic Regression with Flag Value Imputation

To compare the results of two imputation methods, two pipelines were created. The first used flag values for each of the variables with missing data. A pipeline was created under the variable 'pipeline_lr' to tune the inverse regularization strength parameter (C) as well as the multiclass method ('ovr' or 'multinomial'), each with an appropriate solver algorithm ('liblinear' or 'lbfgs'). Values of [10, 15, 20, 25, 30] for C were tried, and stratified 10-fold cross validation was used. The best combination (highest value for area under the curve using one-vs-rest comparison with average metrics of each label weighted by support) was C = 30 using 'ovr.'
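A sketch of the corresponding grid search, reusing the pipeline_lr and cv objects from the sketch above; the parameter values follow the text, but the exact grid layout and data variable names (X, y) are assumptions:

```python
from sklearn.model_selection import GridSearchCV

# liblinear only supports one-vs-rest, while lbfgs supports multinomial,
# so the two combinations are listed as separate grids
param_grid = [
    {"clf__C": [10, 15, 20, 25, 30],
     "clf__multi_class": ["ovr"],
     "clf__solver": ["liblinear"]},
    {"clf__C": [10, 15, 20, 25, 30],
     "clf__multi_class": ["multinomial"],
     "clf__solver": ["lbfgs"]},
]

grid = GridSearchCV(pipeline_lr, param_grid,
                    scoring="roc_auc_ovr_weighted", cv=cv, n_jobs=-1)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```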

Second Model: Logistic Regression with SimpleImputer (Mode Imputation)

The second model used imputation of the most common value for each variable with missing data. This pipeline included Scikit-learn's SimpleImputer. The imputed values for the features "race", "payer_code", "diag_1", "diag_2", and "diag_3" were "Caucasian", "MC", "DOCS", "DOCS", and "DOCS", respectively. Of note, imputing "Caucasian" in the race feature raised some ethical concerns for this model. A pipeline was created under the variable 'pipeline_lr_impute' to tune the inverse regularization strength parameter (C) as well as the multiclass method ('ovr' or 'multinomial'), each with an appropriate solver algorithm ('liblinear' or 'lbfgs'). Values of [10, 15, 20, 25, 30] for C were tried, and stratified 10-fold cross validation was used. The best combination (highest value for area under the curve using one-vs-rest comparison with average metrics of each label weighted by support) was C = 30 using 'ovr.'

Alternative Model

Concerns over the use of a "race" feature resulted in the creation of an alternative model. Race was deleted from the dataset, and flag values were used to impute the remaining features with missing data. The pipeline from Model 1 was implemented. The best combination (highest value for area under the curve using one-vs-rest comparison with average metrics of each label weighted by support) was C = 30 using 'ovr.'

3. Results

Models

Table 2 compares the two imputation models built. The best-performing model was Model 1, which used the flag value imputation method described above. For both models, performance was assessed using the weighted AUC. This metric was chosen because of the target class imbalance in the dataset: the weighted AUC calculates the AUC for each class and then averages the values, weighted by the number of instances in each class. The best mean score was taken from the mean test score (weighted AUC) across the 10-fold cross-validation provided by the Scikit-learn GridSearchCV object. Model 1 had the best mean test score, 0.67998. Model 2 performed slightly, but not significantly, worse, with a weighted AUC of 0.67372.


Model      Weighted AUC
Model 1    0.67998
Model 2    0.67372
Table 2: Weighted AUC per model

The confusion matrices for Model 1 (Fig. 1) and Model 2 (Fig. 2) show the prediction details of each model when making predictions on the entire dataset. Precision and recall for each class label can be calculated from the matrix if needed. For example, under Model 1, precision for the "Not Readmitted" class can be calculated by:
$ Precision = 40829\div(40829+17530+5045) = 0.64 $


Recall for the "Not Readmitted" class can be calculated by:

$ Recall = 40829\div(40829+11149+2886) = 0.74 $
Figure 1: Confusion matrix for Model 1 (flag value imputation)
Figure 2: Confusion matrix for Model 2 (mode imputation)
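A quick check of these figures using the counts reported above (the row/column assignments follow the formulas in the text):

```python
# Counts taken from the Model 1 confusion matrix (Fig. 1)
predicted_no = [40829, 17530, 5045]   # column: encounters predicted "Not Readmitted"
actual_no = [40829, 11149, 2886]      # row: encounters that were actually "Not Readmitted"

precision_no = 40829 / sum(predicted_no)   # 0.64
recall_no = 40829 / sum(actual_no)         # 0.74
print(round(precision_no, 2), round(recall_no, 2))
```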

Feature Importance

There was a great deal of overlap between the important features contributing to each target outcome in the two models. For the "<30" days readmission target, Model 1 (Fig. 3) and Model 2 (Fig. 4) had the same top 10 features with slight differences in coefficients, suggesting that the imputation method did not significantly change which features the model considered important. For the ">30" days readmission target, Model 1 (Fig. 5) and Model 2 (Fig. 6) had the same top five features. Among the remaining five features there were some slight differences in order; for example, glyburide_metformin_Down was at position 10 in Model 1 and position 8 in Model 2. For the "No Readmission" target, the top 8 features were the same in Model 1 (Fig. 7) and Model 2 (Fig. 8), with some slight differences in coefficients.

Figure 3: Top ten most important features for Model 1 readmission under 30 days outcome
Figure 4: Top ten most important features for Model 2 readmission under 30 days outcome
Figure 5: Top ten most important features for Model 1 readmission over 30 days outcome
Figure 6: Top ten most important features for Model 2 readmission over 30 days outcome
Figure 7: Top ten most important features for Model 1 no readmission outcome
Figure 8: Top ten most important features for Model 2 no readmission outcome
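Per-class coefficients like those plotted above can be pulled from a fitted pipeline along these lines (a sketch assuming the step names and grid object from the earlier sketches; features are ranked here by absolute coefficient value):

```python
import pandas as pd

# `grid` is the fitted GridSearchCV from the grid-search sketch
best_pipe = grid.best_estimator_
feature_names = best_pipe.named_steps["preprocess"].get_feature_names_out()
coefs = best_pipe.named_steps["clf"].coef_       # one row of coefficients per class
classes = best_pipe.named_steps["clf"].classes_

# Top ten features for each target class
for label, row in zip(classes, coefs):
    top = pd.Series(row, index=feature_names).abs().nlargest(10)
    print(label)
    print(top, "\n")
```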

As can be seen in Tables 3 and 4, both models had the same top five features for the ">30" and "<30" target variables, while the "No Readmission" target had a somewhat different set of top five features. This was to be expected, as the ">30" and "<30" targets both correspond to patients who were actually readmitted. For the top five features, there appear to be no differences in important characteristics between the two readmitted targets.

Model 1                        Model 2
discharge_disposition_id_11    discharge_disposition_id_11
number_inpatient               number_inpatient
number_emergency               number_emergency
admission_type_id_7            admission_type_id_7
discharge_disposition_id_12    discharge_disposition_id_12

Table 3: Top 5 features per model for <30 and >30 targets

Model 1                        Model 2
number_emergency               number_emergency
discharge_disposition_id_11    discharge_disposition_id_11
number_inpatient               number_inpatient
admission_type_id_7            admission_type_id_7
number_outpatient              number_outpatient

Table 4: Top 5 features per model for No Readmission target

Alternative Model

To explore the necessity of retaining a race feature, an alternative model was built from Model 1, the best-performing model. The only difference between them was the deletion of the "race" feature in the alternative model. Results were very similar to those of Model 1, with a weighted AUC of 0.67904 for the alternative model versus 0.67998 for Model 1.

This study chose to retain the race feature, as the subject matter was medical and different demographics can be affected by medical issues differently. Also, retaining the race feature allowed for the possibility that patients of different races could have been given different qualities of medical care. An examination of feature importance in Models 1 and 2 showed that race was not one of the more important features. The very small effect its removal had on model results also supports this. While it may still contribute a small amount of useful information for prediction, a race feature does not appear to be essential. While the recommended model (Model 1) retains it, the alternative model is a viable option if there are ethical concerns about the implementation and use of the model.

4. Conclusion

Model 1 was the best model for predicting the hospital readmission of diabetes patients, although both models had very similar performance. This suggests that, for this problem and dataset, the imputation method contributed very little to a higher AUC score. No feature selection was performed in this study, so both models used the same number of variables.

Regarding the imputation methods, the AUC scores show very little advantage for either one. It is worth noting that the mode imputation used in Model 2 is the kind of step that could introduce data leakage, meaning that information from the test data leaks into the training process. Because the imputation was placed inside the pipeline, this was avoided: in each cross-validation split created by the GridSearchCV object, the data were divided into training and test groups, the imputation values were calculated from the training group only, and those same values were then applied to the test group, so no information from the test group was used to calculate the imputation values. Future improvements to increase the accuracy of this classification problem could include feature selection or trying different algorithms such as XGBoost, which can handle missing data without imputation.

Appendix - Code

Initial Data Read In and Conversion

Exploratory Data Analysis

Numerical Data

Categorical Data

Data Deletion, Imputation and Conversion & Fitting Logistic Regression Model - Missing Data Converted to Individual Levels

Data Deletion, Imputation and Conversion & Fitting Logistic Regression Model - Imputation Using SimpleImputer

Method 2 of imputation