Case Study One

Superconductor Analysis and Prediction

David Grijalva, Nicole Norelli, & Mingyang Nick YU

9/6/2021

Abstract

This deliverable investigated and predicted critical superconducting temperatures for a variety of materials. Linear regression with regularization was the primary statistical method used. A comparison of different models and their performance resulted in a recommendation of a combined model that used L1 regularization for feature selection followed by L2 regularization to fit the model. The features contributing most to the critical superconducting temperature were then discussed.

1. Introduction

This case study explores linear predictive models and feature importance in the superconductor domain. Superconductors are materials that conduct electrical current with little or no resistance. These materials are used in a variety of applications, such as providing high-speed connections between computer microchips. The goal of this study is to predict the temperature at which a material becomes a superconductor, as well as to identify the features that contribute most to this critical temperature.

The dataset contains 21,263 data points, with independent variables describing the physical properties and elemental composition of each material. The dependent variable the models predicted was the critical superconducting temperature.

Linear regression with two types of regularization, L1 and L2, was used to build models for this case study. Linear regression is a widely used, efficient, and highly interpretable method for modeling the relationship between dependent and independent variables. Typically, linear regression uses mean squared error (MSE) as its loss function; the loss function quantifies the prediction error that the fitting procedure iteratively attempts to minimize. This case study reports negative mean squared error in order to maintain consistency with the Scikit-learn application programming interface (API), in which scorers treat larger values as better.
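To illustrate this scoring convention, the hypothetical snippet below (run on synthetic data, not the superconductor dataset) shows how Scikit-learn exposes MSE as "neg_mean_squared_error", so larger (less negative) scores indicate lower error:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; the study itself uses the superconductor features.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Scikit-learn scorers follow a "higher is better" convention, so MSE is
# reported as its negation via the "neg_mean_squared_error" scorer.
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=10,
                         scoring="neg_mean_squared_error")
print(scores.mean())  # negative value; closer to zero means lower error
```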

The following is the formula used by a typical multiple regression (1), where $ m_0 $ represents the intercept and $ m_1, \dots, m_n $ represent the slopes of the independent variables $ x_1, \dots, x_n $:

$y = m_0 + m_1 \cdot x_1 + m_2 \cdot x_2 + m_3 \cdot x_3 + \dots + m_n \cdot x_n \qquad (1)$


The dataset contains 158 independent variables, meaning that without any feature selection there would be 158 slopes. In simple terms, regularization adds a penalty to the regression coefficients, or slopes. The purpose of this penalty is to prevent overfitting, which happens when the model fits the noise in the training data rather than the underlying relationship.


Lasso Regression (L1)

The first regularization type is Lasso, or L1. The penalty is the sum of the absolute values of the coefficients multiplied by lambda, which controls the strength of the penalty. L1 regularization can also be used as a feature selection tool: the penalty can be strong enough to shrink a regression coefficient to exactly zero. A regression coefficient of zero indicates that the feature was not important to the relationship between the independent and dependent variables.

Penalty Term
$\lambda \sum\limits_{j=1}^n \mid m_j \mid $
Where $\lambda $ is the strength of the penalty. If $\lambda = 0 $, no penalty is applied and the ordinary least squares coefficients are returned. The intercept $ m_0 $ is not penalized.

Complete Formula
The penalty is added to the loss that is minimized during fitting, not to the prediction equation itself:
$\text{Loss} = \sum\limits_{i} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum\limits_{j=1}^n \mid m_j \mid$
where $\hat{y}_i $ is the prediction from equation (1).


Ridge Regression (L2)

The second regularization type is Ridge, or L2. The penalty is the sum of the squared coefficients multiplied by lambda, which controls the strength of the penalty. Unlike L1 regularization, L2 does not provide feature selection: all coefficients are shrunk toward zero, but none are driven exactly to zero. In general, L2 is the primary regularization method used to prevent overfitting the model.

Penalty Term
$\lambda \sum\limits_{j=1}^n m_j^2 $
Where $\lambda $ is the strength of the penalty. If $\lambda = 0 $, no penalty is applied and the ordinary least squares coefficients are returned. The intercept $ m_0 $ is not penalized.

Complete Formula
As with L1, the penalty is added to the loss that is minimized during fitting:
$\text{Loss} = \sum\limits_{i} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum\limits_{j=1}^n m_j^2 $

2. Methods

Initial Data Observations

There was no missing data in this study.

Within the dataset, 86 features represented elements. Each recorded the amount of a particular element in the composition of each superconductor. Nine of these element features contained only zeros (Table 1), indicating that those elements appeared in none of the superconductors in the dataset. These elements ['He', 'Ne', 'Ar', 'Kr', 'Xe', 'Pm', 'Po', 'At', 'Rn'] were deleted from the dataset prior to analysis.

Table 1: Variable Names and Statistics for Deleted Features

Data Merging

The dataset included two files. The first described the physical properties of each material. The second represented the material composition of each superconductor, with each column indicating the amount of a particular element in each superconductor. The rows and indexes of both files referred to the same superconductor material, so the two files were merged by index. Next, the redundant "critical_temp" variable was dropped, as it was duplicated between the files. Finally, the "material" column from the second file was also dropped because the information in the material name was already represented in the material composition features.


Exploratory Data Analysis

High correlations can be problematic when interpreting feature importance, which was one of the main goals of this study. When two variables are very highly correlated, it is likely that one is derived from the other. An algorithm was employed to detect each pair of variables with a correlation greater than 0.95 and delete one variable of the pair. This approach was adapted from an online recipe (cited in the Appendix). Without detailed knowledge of each variable, which variable of a pair to delete was determined by the order of the variable columns. As a result, 23 highly correlated variables were deleted from the dataset.

Variables Deleted: ['wtd_gmean_atomic_mass', 'std_atomic_mass', 'gmean_fie', 'wtd_gmean_fie', 'entropy_fie', 'std_fie', 'wtd_gmean_atomic_radius', 'entropy_atomic_radius', 'wtd_entropy_atomic_radius', 'std_atomic_radius', 'wtd_std_atomic_radius', 'wtd_gmean_Density', 'std_Density', 'std_ElectronAffinity', 'wtd_gmean_FusionHeat', 'std_FusionHeat', 'std_ThermalConductivity', 'wtd_std_ThermalConductivity', 'gmean_Valence', 'wtd_gmean_Valence', 'entropy_Valence', 'wtd_entropy_Valence', 'std_Valence']


An exploration of the distributions of the remaining 136 variables highlighted a lack of normality for many variables. For example, Fig. 1 shows the bimodal distribution of "range_atomic_mass", the left-skewed distribution of "range_atomic_radius", and the right-skewed distribution of "gmean_Density". This implies that a unit variance transformation could produce many outliers and make feature importance more difficult to interpret. Instead, scaling all variables to the range (0 to 1) using MinMaxScaler from Scikit-learn was chosen to make feature importance comparisons more manageable.

Figure 1: Examples of variable distributions lacking normality

Creating Models using Pipeline

Each model was created using a pipeline. A pipeline allowed for the streamlining of scaling with MinMaxScaler() and fitting the linear models with either L1 or L2 regularization. Pipelines also prevent data leakage when using grid search with 10-fold cross validation to narrow down the best regularization parameter (alpha) for either L1 or L2, because the scaler is fit only on the training portion of each split (a sketch of such a pipeline appears in the Appendix). During the initial trials, there was evidence that the data was structured or ordered. To address this, the data was shuffled before 10-fold cross validation to introduce more randomness into each cross-validation split. This shuffle reduced the differences between test scores across the cross-validation splits. A random state was also set for reproducibility.

First Model: Linear Model with LASSO (L1 regularization)

A pipeline was created for LASSO under the variable 'pipe_lasso' to tune the alpha hyperparameter and assess predictions. An initial, wider grid of alpha values [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 1, 1.5, 2, 2.5, 5] was tried first using 10-fold cross validation. The best alpha value (the highest mean test score, i.e., negative mean squared error) during the first trial was alpha = 0.1 (Fig. 2).

To further fine-tune alpha, this process was repeated with more refined grids. Grids of alpha values [0.075, 0.085, 0.095, 0.1, 0.105, 0.11, 0.115] and [0.085, 0.086, 0.087, 0.088, 0.089, 0.09] were tried. The final alpha value was 0.089, with a mean test score (negative MSE) of -379.84.

As previously mentioned, LASSO can be used as a feature selection method. To this end, the features with non-zero coefficients under the best performing alpha value of 0.089 were recorded and used to create the third model introduced below.

Figure 2: Lasso learning curve

Second Model: Linear Model with Ridge (L2 regularization)

A pipeline was created for Ridge (L2) under the variable 'pipe_ridge' to tune the alpha hyperparameter and assess predictions. An initial, wider grid of alpha values [100, 10, 1, 0.1, 0.01, 0.001] was tried first using 10-fold cross validation. The best alpha value (the highest mean test score, i.e., negative mean squared error) during the first trial was alpha = 100 (Fig. 3).

To further fine-tune alpha, this process was repeated with more refined grids. Subsequent grids of alpha values [75, 100, 125, 150, 175, 200, 225, 250], [50, 60, 70, 80, 90], and [78, 79, 80, 81, 82, 83, 84] were tried, and the best performing alpha was 82, with a mean test score (negative MSE) of -394.62.

Figure 3: Ridge learning curve

Third Model: Ridge with LASSO as feature selection

The third model used a combination of LASSO and Ridge regression: LASSO provided feature selection, and the selected subset of features was then fit using Ridge regression. Using the LASSO model with the best alpha (0.089), feature selection was conducted by identifying the features with non-zero coefficients. A reduced dataset was created from these 24 features.

A pipeline was created for Ridge with feature selection named 'pipe_ridge_mm'. An initial, wider grid of alpha values [0.01, 0.1, 0.5, 1, 1.25, 1.5, 5, 25, 50, 75] was tried first using 10-fold cross validation. The best alpha value (the highest mean test score, i.e., negative mean squared error) during the first trial was alpha = 0.1 (Fig. 4). This alpha value was drastically different from the alpha value obtained for the second model using Ridge regression alone; it appears that a much smaller penalty is required when only the important features are retained.

To further fine-tune alpha, this process was repeated with a more refined grid of alpha values [0.05, 0.07, 0.09, 0.1, 0.11, 0.12]. The best alpha value was alpha = 0.11, with a mean test score (negative MSE) of -335.33.

Since this model generated the best score during 10-fold cross validation, the coefficients of the variables were then extracted to be analyzed for feature importance.

Figure 4: Ridge with Lasso feature selection learning curve

3. Results

Models

A comparison of the three models (Table 2) shows that the best performing model was Model 3, which used LASSO for feature selection followed by Ridge to fit the model. Model performance was assessed using negative mean squared error, with the best mean test score defined as the mean test score (negative MSE) across the 10-fold cross validation. Model 3 had the best mean test score of -335.33. The LASSO model was next, with a mean test score of -379.84, and Ridge (without feature selection) performed the worst, with a mean test score of -394.62. Model 3 also had the lowest test score standard deviation, indicating that performance across the cross-validation splits was consistent.

Model                                   Best Alpha   Best Mean Test Score   Test Score Standard Deviation
Model 3: Ridge with Feature Selection   0.11         -335.33                15.73
Model 1: LASSO                          0.089        -379.84                55.15
Model 2: Ridge                          82           -394.62                55.15

Table 2: Comparison of model results


In addition to achieving the best performance, Model 3 has the advantage of using significantly fewer features. This model was built with only 24 features, whereas the other two models contained 135 features. A smaller number of features could be a benefit when using the model, as less data may be necessary to make predictions of critical superconducting temperatures. Ultimately, Model 3 provides the best predictions of new superconductors and the temperatures at which they become superconducting, with the added benefit of requiring fewer inputs to make these predictions.

Feature Importance

The importance of the features contributing to the critical superconducting temperature is displayed in Fig. 5. The most important variable is "Ba", the element barium. This element has a positive relationship with the outcome, meaning that as the amount of Ba increases, so too does the critical temperature. The second most important variable is the weighted mean thermal conductivity, which also has a positive relationship with the outcome. Next is the weighted geometric mean thermal conductivity, which has an inverse relationship with the outcome: as the weighted geometric mean thermal conductivity increases, the critical superconducting temperature decreases. The importance of the remaining features and the direction of their relationships to the critical superconducting temperature can be examined below.

Figure 5: Importance of features contributing to critical temperature

A comparison of feature importance among the three models (Table 3) provides additional confidence in the assessment of the two most important features. It is not surprising that the top four features are identical and in the same order for Model 3 and LASSO, as the optimal LASSO model was used for feature selection in Model 3. Notably, barium (Ba) is the most important feature in every model tested, and weighted mean thermal conductivity is among the top three most important variables in every model.

Top Five Features per Model

Model 3                         LASSO                           Ridge
Ba                              Ba                              Ba
wtd_mean_ThermalConductivity    wtd_mean_ThermalConductivity    wtd_std_Valence
wtd_gmean_ThermalConductivity   wtd_gmean_ThermalConductivity   wtd_entropy_atomic_mass
Bi                              Bi                              wtd_mean_ThermalConductivity
wtd_std_atomic_mass             wtd_gmean_ElectronAffinity      range_atomic_mass

Table 3: Top five most important features in each model


4. Conclusion

Model 3 is the best model for predicting new superconductors and the temperatures at which they become superconducting. It produced the most accurate estimates of critical temperature of all the models tested, and it also showed the lowest standard deviation in cross-validation test scores, indicating the least overfitting and the best generalization to new data. Model 3 also requires fewer attributes, and therefore less data, to make predictions. It is interpretable: because all features were scaled to the same range, direct comparisons can be made among the magnitudes of the coefficients. The larger the magnitude of a coefficient, the more important the variable. If a coefficient is positive, its relationship with the outcome is positive; if it is negative, its relationship with the outcome is inverse. The most important features were barium and weighted mean thermal conductivity.

This model does have a limitation, however. Because none of the sample data included nine of the elements, these elements were excluded from the model, and the model would not apply to any new superconductor containing them. Future improvements could include consultation with a subject matter expert to further refine the removal of highly correlated variables.

Appendix - Code

Delete All-Zero Columns
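A minimal sketch of this step, assuming the composition file is named unique_m.csv (the file name is an assumption, not stated in the report):

```python
import pandas as pd

# Assumed file name; this file lists the amount of each element per
# superconductor along with "critical_temp" and "material" columns.
elements = pd.read_csv("unique_m.csv")

# Element columns that are zero in every row correspond to elements that
# appear in none of the superconductors.
numeric = elements.select_dtypes("number")
all_zero_cols = [col for col in numeric.columns if (numeric[col] == 0).all()]
print(all_zero_cols)  # expected: He, Ne, Ar, Kr, Xe, Pm, Po, At, Rn

elements = elements.drop(columns=all_zero_cols)
```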

Merge Two Data Files
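A minimal sketch of the merge described in the Methods section, assuming the two files are named train.csv (physical properties) and unique_m.csv (elemental composition); the file names are assumptions:

```python
import pandas as pd

properties = pd.read_csv("train.csv")      # assumed name: physical properties
composition = pd.read_csv("unique_m.csv")  # assumed name: elemental composition

# Rows of both files describe the same superconductor, so join on the index.
# The duplicated target column from the second file receives the "_m" suffix.
df = properties.join(composition, rsuffix="_m")

# Drop the duplicated target and the free-text material name, which is
# already encoded in the element columns.
df = df.drop(columns=["critical_temp_m", "material"])
```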

Some EDA

Resource: The approach for creating an upper-triangle correlation matrix and for dropping columns that are too highly correlated with existing columns is from: https://www.dezyre.com/recipes/drop-out-highly-correlated-features-in-python
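A sketch of the correlation-based pruning in the spirit of that recipe, continuing from the merged DataFrame df above (the 0.95 threshold is taken from the Methods section):

```python
import numpy as np

# Absolute correlation matrix of the predictors, keeping only the upper
# triangle so each pair of variables is considered once.
corr = df.drop(columns=["critical_temp"]).corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of every pair correlated above 0.95.
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df = df.drop(columns=to_drop)
```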

Getting Data Ready for training
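A sketch of separating the predictors from the target and shuffling the rows before cross validation, as described in the Methods section; the random_state value is illustrative:

```python
from sklearn.utils import shuffle

# Separate predictors and target, then shuffle the rows so the 10-fold
# cross-validation splits are not affected by the original ordering.
X = df.drop(columns=["critical_temp"])
y = df["critical_temp"]
X, y = shuffle(X, y, random_state=42)  # fixed seed for reproducibility
```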

Create Model with LASSO
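A sketch of the 'pipe_lasso' pipeline and grid search; the alpha grid shown is the initial one reported above, and max_iter is an assumed setting to help convergence:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Scaling happens inside the pipeline, so each fold is scaled using only its
# own training portion, which prevents data leakage during grid search.
pipe_lasso = Pipeline([
    ("scaler", MinMaxScaler()),
    ("lasso", Lasso(max_iter=10000)),
])

param_grid = {"lasso__alpha": [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 1, 1.5, 2, 2.5, 5]}
grid_lasso = GridSearchCV(pipe_lasso, param_grid, cv=10,
                          scoring="neg_mean_squared_error")
grid_lasso.fit(X, y)
print(grid_lasso.best_params_, grid_lasso.best_score_)
```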

Features selected by LASSO under best model through cross validation
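A sketch of extracting the features with non-zero coefficients from the best LASSO pipeline found above:

```python
import numpy as np

# GridSearchCV refits the best pipeline on the full data, so its LASSO step
# carries the final coefficients.
best_lasso = grid_lasso.best_estimator_.named_steps["lasso"]
selected_features = np.array(X.columns)[best_lasso.coef_ != 0]
print(len(selected_features), selected_features)
```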

Create Model with Ridge
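A sketch of the 'pipe_ridge' pipeline, mirroring the LASSO pipeline with the initial alpha grid reported above:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

pipe_ridge = Pipeline([
    ("scaler", MinMaxScaler()),
    ("ridge", Ridge()),
])

param_grid = {"ridge__alpha": [100, 10, 1, 0.1, 0.01, 0.001]}
grid_ridge = GridSearchCV(pipe_ridge, param_grid, cv=10,
                          scoring="neg_mean_squared_error")
grid_ridge.fit(X, y)
print(grid_ridge.best_params_, grid_ridge.best_score_)
```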

Using the LASSO Result for Feature Selection, Then Fitting Ridge
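A sketch of the 'pipe_ridge_mm' model, restricting the predictors to the LASSO-selected features before tuning Ridge (reusing the imports and objects defined in the sketches above):

```python
# Restrict the design matrix to the LASSO-selected features.
X_selected = X[selected_features]

pipe_ridge_mm = Pipeline([
    ("scaler", MinMaxScaler()),
    ("ridge", Ridge()),
])

param_grid = {"ridge__alpha": [0.01, 0.1, 0.5, 1, 1.25, 1.5, 5, 25, 50, 75]}
grid_ridge_mm = GridSearchCV(pipe_ridge_mm, param_grid, cv=10,
                             scoring="neg_mean_squared_error")
grid_ridge_mm.fit(X_selected, y)
print(grid_ridge_mm.best_params_, grid_ridge_mm.best_score_)
```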

Feature Importance with the Best Model
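A sketch of ranking the coefficients of the best Ridge-with-feature-selection model; because all inputs share the same (0, 1) scale, coefficient magnitude can be read as relative importance:

```python
import pandas as pd

best_ridge = grid_ridge_mm.best_estimator_.named_steps["ridge"]

# Coefficients paired with their feature names, sorted by absolute magnitude;
# the sign gives the direction of the relationship with critical temperature.
importance = (pd.Series(best_ridge.coef_, index=selected_features)
              .sort_values(key=abs, ascending=False))
print(importance)
```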