Case Study Four

Financial Delinquency Project

David Grijalva, Nicole Norelli, & Mingyang Nick YU

10/15/2021

Abstract
This deliverable investigated and predicted company bankruptcy based on a variety of financial factors. The predictive models explored were Random Forest and XGBoost. A comparison of the models and their performance resulted in a recommendation of XGBoost over Random Forest.

1. Introduction

This case study explored predictive models in the finance and bankruptcy domain. Every year, hundreds of companies go bankrupt due to a variety of financial and macroeconomic factors. Certain key financial indicators could potentially predict the likelihood of bankruptcy. The goal of this study was to predict whether a particular company would go bankrupt based on the financial data available.

This dataset contains 43,405 data points describing 64 financial indicators as independent variables. The dependent variable the model predicted was whether the company went bankrupt within five years. It is important to note that this dataset is not meant to be a time series for each company, but rather a general overview of each company's financial performance within the five-year time frame. Because the dataset was built from yearly information about the same companies, there is high correlation between many of the variables.

Random Forest

A Random Forest is an ensemble statistical learning method composed of several decision trees (known as weak learners) whose individual predictions are combined to make a final overall prediction. Random Forest uses the concept of bootstrap aggregation, also called bagging. In simple terms, the bootstrap is sampling with replacement (the same instance can appear in multiple samples), meaning it randomly samples subsets of the training dataset. Bagging simply means training a weak learner (a Decision Tree) on each of the subsamples selected. Unlike plain bagging, Random Forest also randomly subsamples the features considered at each split. This produces a forest of weak learners that are less correlated with one another and usually generates better prediction results than bagging alone.

To understand how Random Forest works, it is important to understand how Decision Trees work. Decision Trees are a supervised method that can be used for regression or classification. The key concept is that, for every feature available, the Decision Tree splits the data into two branches and measures the information gain of each split. The algorithm recursively repeats this operation until a stopping criterion is met.

There are two main criteria used to measure information gain: Gini impurity and Entropy.

Gini Impurity

Gini Impurity is a measurement of the probability of incorrect classification for a new data point. The lower the Gini impurity, the more information is gained.

$$ G = \sum \limits _{i=1} ^{C} P(i) \cdot (1-P(i)) $$

Entropy

Entropy is a mathematical measure of randomness. The lower the Entropy, the more information is gained, because there is less randomness.

$$ H(X) = -\sum \limits _{i=1} ^{n} P(x_{i}) \log_{b} P(x_{i}) $$
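For illustration, here is a minimal sketch (not part of the original analysis) of how Gini impurity and entropy could be computed for a vector of class labels using NumPy:

```python
import numpy as np

def gini_impurity(labels):
    """Probability of misclassifying a randomly drawn label from this node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * (1 - p))

def entropy(labels, base=2):
    """Shannon entropy of the label distribution in this node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * (np.log(p) / np.log(base)))

# Example: a node containing 3 "not bankrupt" (0) and 1 "bankrupt" (1) label
node = np.array([0, 0, 0, 1])
print(gini_impurity(node))  # 0.375
print(entropy(node))        # ~0.811
```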

Generally, due to their simplicity, Random Forests are a great starting point as a baseline model which can be used to compare prediction performance with more complex models such as XGBoost.

XGBoost

XGBoost is a popular algorithm created by Tianqi Chen and released in 2016. The algorithm can be found in many production applications used by major companies. There are three key aspects that make XGBoost an effective learner:
1) The use of boosting
2) The use of gradients
3) L2 regularization

Similar to Random Forest, boosting is an ensemble method that uses weak learners to generate a strong one. Unlike Random Forest, which builds multiple uncorrelated weak learners in parallel, boosting builds weak learners in sequential rounds, where each round tries to improve on the previous prediction. Each iteration learns from the previous errors by using the gradient of the loss function: after the first iteration, the weak learners no longer fit the dependent variable directly but rather the residuals. These iterations continue until a stopping criterion is met.

To account for prediction errors, XGBoost uses an approximation of the loss function based on its first and second-order partial derivatives.

$$ J = \sum \limits _{i} l(p_{i}, y_{i}) + \sum \limits _{k} \Omega (f_{k}) $$

The first equation displays the overall objective: a general loss function measuring the residuals, or the difference between the dependent variable and the prediction, plus a regularization term.

$$ \Omega (f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2 $$

The second equation displays the penalty imposed to reduce complexity in the terminal leaves, where $T$ is the number of leaves and $w$ the vector of leaf weights.
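For reference, and following the notation of the XGBoost paper (not used elsewhere in this report), the objective at boosting iteration $t$ is approximated with a second-order Taylor expansion:

$$ J^{(t)} \approx \sum \limits _{i=1} ^{n} \left[ g_{i} f_{t}(x_{i}) + \frac{1}{2} h_{i} f_{t}(x_{i})^{2} \right] + \Omega (f_{t}) $$

where $g_{i}$ and $h_{i}$ are the first and second derivatives of the loss $l$ with respect to the prediction from the previous iteration.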

Another key feature of XGBoost is that it applies L2 regularization by default. The L2 penalty is the sum of the squared coefficients multiplied by lambda, which controls the strength of the penalty. Unlike L1 regularization, L2 does not provide feature selection: all coefficients are penalized uniformly and shrunk toward zero, but they never reach exactly zero. In general, L2 is the primary regularization method used to prevent overfitting the model.

$$ \lambda \sum \limits _{j=0} ^{k} m_{j}^{2} $$


Where $\lambda$ is the strength of the penalty. If $\lambda = 0$, no penalty is applied.
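As a minimal illustration, the L2 penalty strength is exposed as reg_lambda in the XGBoost Scikit-learn wrapper; the parameter values below are arbitrary and not those used later in this study:

```python
from xgboost import XGBClassifier

# reg_lambda is the L2 penalty strength (lambda above); the library default is 1.0.
# Setting it to 0 removes the penalty; larger values shrink the leaf weights harder.
model = XGBClassifier(
    objective="binary:logistic",
    reg_lambda=10.0,     # illustrative value only
    n_estimators=200,
    learning_rate=0.1,
)
```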

Model Selection

Each learning algorithm has an array of hyperparameters that are used to tune it in order to prevent over- and underfitting of the data. Model selection is the practice of automating this fine-tuning to select the model with the best parameters. There are several ways to do this; in this study, two of the most commonly used automated hyperparameter search techniques were explored. It is very common to combine hyperparameter search methods with cross-validation in order to obtain less biased performance estimates.

Grid Search is an exhaustive hyperparameter search method in which every combination of the hyperparameter values passed is considered. Because grid search attempts every possible combination, it is a very expensive method in both time and resources. On the upside, grid search guarantees that the best parameter combination within the specified grid will be found.

Randomized Search only considers a sample of possible hyperparameter combinations from the grid space, drawn without replacement (the same parameter combination will not be chosen twice). This makes it a much more efficient method in both time and computing resources because it does not have to fit every possible combination of hyperparameters. The downside of a randomized search is that it is not guaranteed to find the best possible combination, only an approximation. In most cases, the approximation is good enough, with marginal differences in performance.

A good strategy is to use both randomized and grid searches together. One can narrow the hyperparameter combinations’ possibilities by using randomized search and then further tune the model using grid search with a much narrower parameter space.
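A generic sketch of this two-stage strategy follows; the estimator and parameter values are placeholders, not the ones used later in this study, and X_train and y_train are assumed to exist:

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Stage 1: broad randomized search over a wide parameter space.
wide_space = {"n_estimators": randint(100, 1000), "max_depth": randint(3, 20)}
random_search = RandomizedSearchCV(
    RandomForestClassifier(), wide_space, n_iter=10,
    scoring="roc_auc", cv=cv, random_state=42,
)
# random_search.fit(X_train, y_train)

# Stage 2: exhaustive grid search in a narrow region around the best result.
narrow_grid = {"n_estimators": [300, 350, 400], "max_depth": [10, 15, 20]}
grid_search = GridSearchCV(
    RandomForestClassifier(), narrow_grid, scoring="roc_auc", cv=cv,
)
# grid_search.fit(X_train, y_train)
```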

2. Methods

Data Retrieval

Five years of bankruptcy data were provided in five separate files in the .arff format, one file per year. In order to read the files and preserve the defined data formats, the arff.loadarff() function from the scipy.io package was used. pandas.concat() was then utilized to vertically concatenate all five years of data. The data was concatenated in order from the first to the fifth year; to accomplish this, the file names were sorted before being read in. The initial target class labels were in an inconvenient format, so the method convert_target was created to convert the target class to the int data type, where 0 represents "Not Bankrupt" and 1 represents "Bankrupt".
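A sketch of this retrieval step is shown below; the file pattern, the target column name, and the exact body of convert_target are assumptions based on the description above:

```python
import glob
import pandas as pd
from scipy.io import arff

def convert_target(value):
    # The raw .arff target loads as bytes (e.g. b'0' / b'1');
    # convert it to int: 0 = "Not Bankrupt", 1 = "Bankrupt".
    return int(value.decode("utf-8"))

frames = []
for path in sorted(glob.glob("*.arff")):      # sorted so year 1 is read first
    data, meta = arff.loadarff(path)          # preserves the defined data formats
    frames.append(pd.DataFrame(data))

df = pd.concat(frames, ignore_index=True)     # vertically stack the five years
df["class"] = df["class"].apply(convert_target)   # assumed target column name
```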

EDA & Data Preparation

The entire dataset consisted of 43,405 entries (companies) and 64 columns (attributes), not including the target class column. Initial observation of the summary statistics indicated that the mean and median differed for most attributes. This may be due to the nature of financial data, which can be skewed. This observation promoted the median as a more reliable imputation strategy than the mean, since the mean is easily influenced by outliers in a distribution. It was also observed that the target class was heavily imbalanced, with 2,091 bankrupt companies to 41,314 non-bankrupt companies, almost a 1-to-20 ratio. Measures such as stratified splits on the target class and methods to balance class weights during prediction were therefore considered.

Many columns had missing data. In particular, two attributes had over 10% missing data: Attr21 (sales (n) / sales (n-1)) was 13.4869% missing, and Attr37 ((current assets - inventories) / long-term liabilities) was 43.7369% missing. The rest of the attributes had much smaller percentages of missing data, mostly within one to two percent. Additionally, many attributes had strong correlations with each other (>0.99), such as Attr14 ((gross profit + interest) / total assets) and Attr18 (gross profit / total assets), or Attr18 and Attr7 (EBIT / total assets). Considering this was company financial data, it makes intuitive sense that many of the variables were extremely highly correlated.

To address some of the missing data issues and the very high correlations between variables at the same time, the variables were first ordered in ascending order of missing-data percentage. An algorithm was then deployed to delete, for every pair of variables with correlation greater than 0.95, the variable with more missing data, retaining the one with less. Correlation thresholds of 0.99 and 0.95 were both explored; prediction performance was better with the 0.95 threshold.
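A sketch of this correlation-based elimination procedure is shown below; the function name is illustrative, and the original implementation may differ in detail:

```python
import pandas as pd

def drop_correlated(df, threshold=0.95):
    """Drop one column of each highly correlated pair, keeping the one with less missing data."""
    # Order columns so those with the least missing data come first.
    ordered = df.isna().mean().sort_values().index.tolist()
    corr = df[ordered].corr().abs()
    to_drop = set()
    for i, keep in enumerate(ordered):
        if keep in to_drop:
            continue
        for other in ordered[i + 1:]:
            # 'other' has at least as much missing data as 'keep', so drop it.
            if other not in to_drop and corr.loc[keep, other] > threshold:
                to_drop.add(other)
    return df.drop(columns=list(to_drop))

# X_reduced = drop_correlated(X, threshold=0.95)
```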


Helper Functions & Data Split

A few helper functions were created to avoid repetition because many operations were performed repeatedly. get_acc_score retrieves the model's default scoring using model.score(). plot_roc_curve_custom plots the receiver operating characteristic (ROC) curve on the test data. get_classification_report produces a comprehensive overall report: it prints the training accuracy, the test accuracy, and the Scikit-learn classification_report on the test data (which includes each target class's precision, recall, and F1 score), and plots the ROC curve and the confusion matrix for the test data. cv_common and cv_summary assist in printing either GridSearchCV or RandomizedSearchCV results in a pandas data frame for easier comparison.
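A rough sketch of what get_classification_report might look like follows; the original notebook's implementation is not reproduced here, and this version relies on Scikit-learn's display utilities:

```python
from sklearn.metrics import (ConfusionMatrixDisplay, RocCurveDisplay,
                             classification_report)

def get_classification_report(model, X_train, y_train, X_test, y_test):
    # Training and test accuracy from the model's default scorer.
    print("Train accuracy:", model.score(X_train, y_train))
    print("Test accuracy: ", model.score(X_test, y_test))
    # Per-class precision, recall, and F1 on the test data.
    print(classification_report(y_test, model.predict(X_test)))
    # ROC curve and confusion matrix on the test data.
    RocCurveDisplay.from_estimator(model, X_test, y_test)
    ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
```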

Initial exploration indicated that predictions on the training data for Random Forest or XGBoost generated very high scores that were unrealistic for future predictions. In order to evaluate performance more objectively, a train/test split was created. The entire dataset was first shuffled because the data were in time order. A stratified split based on the target class was also utilized due to the heavy target class imbalance. Twenty percent of the data was saved as X_test and y_test, and 80 percent as X_train and y_train. The random state was set to ensure reproducible work.

A data pipeline called preprocessing was created. It used a median strategy to impute missing data for each attribute and then used the MinMaxScaler provided by Scikit-learn to scale variables into a range of zero to one.
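A sketch of the split and the preprocessing pipeline follows; variable names mirror the description above, while the random_state value and X and y are assumptions:

```python
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# X holds the remaining attributes, y the bankruptcy target built earlier.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, shuffle=True, stratify=y, random_state=42)

preprocessing = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # median imputation per attribute
    ("scale", MinMaxScaler()),                     # scale each variable to [0, 1]
])
```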


First Model - Random Forest

Random Forest was utilized as a baseline model. A pipeline incorporating both the preprocessing pipeline and the RandomForestClassifier was created under the variable rf_pipeline. RandomizedSearchCV was utilized due to the many different parameter combinations for tuning. Ten sets of parameters were sampled from the defined grid without replacement, and StratifiedKFold was used for 10-fold cross validation due to the heavy target class imbalance. roc_auc (area under the ROC curve) was used as the scoring metric to evaluate both classes more fairly during the search and to extract the best parameter set, because the area under the curve considers the trade-off between the True Positive Rate and the False Positive Rate across thresholds. max_features was set to "auto," which uses the square root of the number of features and has empirically generated good results for the Random Forest Classifier. class_weight was set to "balanced" to automatically adjust weights inversely proportional to class frequencies, helping address the target class imbalance. The other parameters sampled by RandomizedSearchCV are displayed under the variable params_rf:
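The original params_rf grid is not reproduced here; the sketch below shows the general structure of the search, with illustrative grid values chosen around the best combination reported next:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

rf_pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("rf", RandomForestClassifier(max_features="auto", class_weight="balanced",
                                  random_state=42)),
])

# Illustrative search space; the exact values in params_rf may have differed.
params_rf = {
    "rf__n_estimators": [150, 250, 350, 450],
    "rf__criterion": ["gini", "entropy"],
    "rf__max_depth": [5, 10, 15, 20],
    "rf__min_samples_split": [2, 5, 9, 12],
}

rf_search = RandomizedSearchCV(
    rf_pipeline, params_rf, n_iter=10, scoring="roc_auc",
    cv=StratifiedKFold(n_splits=10), random_state=42, n_jobs=-1)
# rf_search.fit(X_train, y_train)
```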

The best parameter combination was 350 estimators, the entropy criterion, a maximum depth of 15 for each individual tree in the forest, and a minimum samples split of 9.

Explore XGBoost early stopping round

Before tuning XGBoost using the Scikit-learn wrapper and RandomizedSearchCV to search for a near-best parameter combination, early stopping rounds were explored to estimate the time for each run. xgb.cv, the cross-validation method from the XGBoost library, was utilized for this exploration. To utilize xgb.cv, X_train was first fit_transformed with the previously defined preprocessing pipeline into the variable X_train_xg, and X_test was transformed into the variable X_test_xg. The dtrain and dtest variables were generated with xgb.DMatrix and built into the evaluation list under the variable evallist.

An initial 1000 rounds were tried, with a max_depth of 10, the objective function set to 'binary:logistic' (as this was a binary classification problem), the eval_metric parameter set to 'logloss', and a learning rate of 0.1. The early stopping criterion was set to five rounds, meaning that when the test set performance stops improving for five consecutive rounds, early stopping is triggered. Stratified five-fold cross validation was utilized due to the target class imbalance. See Fig. 1 for a plot of the train/test error versus the number of rounds. The train and test error initially dropped, with the drop slowing at around 15 rounds, and early stopping triggered at around 70 rounds. The 5-fold cross validation took 15.1 seconds to run, so training on this dataset was not expected to take very long for each set of parameters.

Figure 1: Training and Test Error versus Number of Rounds for XGBoost Exploration Model
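A sketch of the xgb.cv early stopping exploration described above follows; evallist is built as described even though xgb.cv performs its own internal folds, and the seed value is an assumption:

```python
import xgboost as xgb

X_train_xg = preprocessing.fit_transform(X_train)
X_test_xg = preprocessing.transform(X_test)

dtrain = xgb.DMatrix(X_train_xg, label=y_train)
dtest = xgb.DMatrix(X_test_xg, label=y_test)
evallist = [(dtrain, "train"), (dtest, "test")]

params = {
    "max_depth": 10,
    "objective": "binary:logistic",
    "eval_metric": "logloss",
    "eta": 0.1,               # learning rate
}

# Stratified 5-fold cross validation, stopping after 5 non-improving rounds.
cv_results = xgb.cv(
    params, dtrain, num_boost_round=1000, nfold=5, stratified=True,
    early_stopping_rounds=5, seed=42)
```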

Second Model - XGBoost

A pipeline incorporating both the preprocessing pipeline and the XGBClassifier (the XGBoost Scikit-learn wrapper) was created under the variable xgb_pipeline to be used for the search. RandomizedSearchCV was utilized due to the many different parameter combinations needed for tuning. Ten sets of parameters were sampled from the defined grid without replacement, and StratifiedKFold was used for 10-fold cross validation due to the heavy target class imbalance. As with the Random Forest search, roc_auc was used as the scoring metric to evaluate both classes more fairly and to extract the best parameter set. The objective function for XGBoost was set to "binary:logistic" because predicted probabilities were needed to calculate roc_auc. n_estimators for the XGBClassifier was set to 1000. This did not prolong the running time because early_stopping_rounds was set during the fit, monitoring the eval_set (X_test_xg, y_test) to stop the fitting process early once the validation score stops improving for the specified number of rounds. A value of 20 was chosen for early_stopping_rounds after some experimentation, as it allowed continued improvement in the cross validation scores compared to smaller values. The eval_metric was set to "auc" to keep the evaluation consistent with the cross validation scoring.

After trial and error exploration, the final parameters sampled by RandomizedSearchCV are displayed under the variable search_space:
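As above, the exact search_space is not reproduced; the sketch below mirrors the described setup with illustrative grid values, and the early-stopping fit arguments assume an XGBoost version that accepts them through fit:

```python
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

xgb_pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("xgb", XGBClassifier(objective="binary:logistic", n_estimators=1000,
                          random_state=42)),
])

# Illustrative search space; the exact values in search_space may have differed.
search_space = {
    "xgb__learning_rate": [0.05, 0.1, 0.2],
    "xgb__max_depth": [4, 6, 8, 10],
    "xgb__subsample": [0.7, 0.8, 0.9, 1.0],
    "xgb__gamma": [0, 0.2, 0.4, 0.6],
}

xgb_search = RandomizedSearchCV(
    xgb_pipeline, search_space, n_iter=10, scoring="roc_auc",
    cv=StratifiedKFold(n_splits=10), random_state=42, n_jobs=-1)

# Early stopping is passed through to the XGBClassifier step at fit time.
# xgb_search.fit(X_train, y_train,
#                xgb__early_stopping_rounds=20,
#                xgb__eval_set=[(X_test_xg, y_test)],
#                xgb__eval_metric="auc",
#                xgb__verbose=False)
```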

The best parameter combination was a learning_rate of 0.1, max_depth (maximum tree depth for base learners) of 8, subsample (subsample ratio of the training instances) of 90%, and gamma (minimum loss reduction required to make a further partition on a leaf node of the tree) of 0.4. The ten-fold cross validation score was similar to the final test set score, as discussed in the Results section.

The final XGBoost model's best iteration occurred at 178 boosting rounds with an early stopping criterion of 20 rounds, meaning the validation score did not improve during the 20 rounds following round 178.

3. Results

The Random Forest and XGBoost models were assessed using the test set held out at the beginning of the case study. This test set was composed of 20% of the original data, selected after a shuffle to avoid any time order effects and then stratified so the test set contained a similar ratio of bankrupt to not bankrupt data as the training set. After tuning the hyperparameters, the optimal parameter combination for each model was used to make predictions on the test set. The ten-fold cross validation auc score for the best Random Forest model was 0.86, and the test score was 0.85. The ten-fold cross validation auc score for the best XGBoost Model was 0.94, and the test score was 0.96. As expected, validation and test scores were very similar.

A Random Forest model was constructed as a baseline, with the expectation that a properly tuned XGBoost model would outperform it. A comparison of the two models can be seen in Table 1.


XGBoost vs Random Forest Models
Model            Accuracy   Precision (not bankrupt)   Recall (not bankrupt)   Precision (bankrupt)   Recall (bankrupt)   AUC
XGBoost          0.97       0.97                       1.00                    0.88                   0.45                0.95585
Random Forest    0.94       0.97                       0.97                    0.34                   0.32                0.85353

Table 1: Comparison of XGBoost and Random Forest Models


The XGBoost model performed better than, or equal to, the Random Forest model on all metrics. Although overall accuracy was better with the XGBoost model, the large improvement in precision and recall for the bankrupt category is more important. Bankruptcy prediction was the essential function of the model, so improvement in these specific metrics was most relevant for assessing the best model. There was a slight improvement in predictions of non-bankrupt businesses with the XGBoost model as well. Of the 418 bankrupt businesses in the test set, the XGBoost model identified 190 (Fig. 2), a recall of 0.45. This was superior to the Random Forest model (Fig. 3), which identified only 133 (recall 0.32). The difference in precision between the two models was even greater. Of the 215 companies the XGBoost model predicted to be bankrupt, 190 were correct. The Random Forest model was much less precise, predicting 395 companies to be bankrupt when only 133 were correct.

Figure 2: Confusion Matrix for XGBoost Model
Figure 3: Confusion Matrix for Random Forest Model

4. Conclusion

The final XGBoost model is recommended to the finance department for the financial delinquency project. It provides the best classification predictions of all models explored; in particular, it correctly identifies more bankrupt companies than the Random Forest model. Several methods could be attempted to improve results further based on additional feedback from the finance department. Expanding the search grid to include more parameters to tune, in combination with GridSearchCV, which attempts all parameter combinations rather than a random selection, could result in a slightly better model. However, it is worth noting that this would require more time and company resources, and the improvement over the current result may not be significant. Another, more efficient approach to fine-tuning the XGBoost model would involve trade-offs and guidance from the finance department regarding preferences for balancing risk: the cut-off for predicting bankruptcy can be adjusted to identify more true positives at the cost of more false positives, if so desired.

The final model to be deployed could be retrained on the entire available dataset once the finance department agrees with the current estimated performance. Future models could also incorporate time series techniques to account for the time series nature of financial data. This option could be explored with the goal of further improving bankruptcy predictions.

Appendix - Code

Helper Function Models

Building Random Forest Model using MinMaxScaling and SimpleImputer

Explore XGBoost early stopping round

Search for best parameters for XGBoost using SKLearn wrapper

Early Stopping at 10

Early Stopping at 20