R: Variable Selection in Logistic Regression Modeling

10 minute read

Background

The objective of this project was to determine an optimal logistic regression model for predicting whether a patient had adult-onset diabetes based on certain diagnostic measurements. This dataset was downloaded from Kaggle and originally comes from the National Institute of Diabetes and Digestive and Kidney Diseases. All patients in the data set are females of Pima Indian heritage and at least 21 years old. I was assisted on this project by Dr. Qingning Zhou from the Mathematics and Statistics Department at UNC Charlotte.

Exploratory Data Analysis

We must first import the dataset from Kaggle. Three of the field names in the original data set were rather lengthy, so we rename them with shorter aliases.

library(readr)
dia <- read_csv("~/Diabetes Research/diabetes.csv")
names(dia)[c(3,4,7)] = c("Bpsi", "Skin", "DPF")

Pregnancies records the number of pregnancies the patient has had. Glucose is her measured glucose level, Bpsi her blood pressure measurement, Skin her skin thickness measurement, Insulin the insulin level detected in her blood, and BMI her body mass index. DPF is a numeric score produced by a diabetes pedigree function, which estimates the risk of type 2 diabetes from family history (higher values indicate higher risk). Age is the patient's age in years. Outcome is a binary variable indicating whether the patient has type 2 diabetes (1) or not (0).

Next, we can take a look at the distributions of each field in the dataset.

summary(dia)
dia.df = as.data.frame(dia) # convert the tibble returned by read_csv into a plain data frame
par(mfrow=c(3,3))
for(i in 1:9){
  hist(dia.df[,i], xlab=colnames(dia.df[i]), main=paste("Histogram of", colnames(dia.df[i])), col="lightgreen", breaks=30)
  }
##  Pregnancies        Glucose           Bpsi             Skin          Insulin           BMI             DPF              Age           Outcome     
## Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00   Min.   :  0.0   Min.   : 0.00   Min.   :0.0780   Min.   :21.00   Min.   :0.000  
## 1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00   1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437   1st Qu.:24.00   1st Qu.:0.000  
## Median : 3.000   Median :117.0   Median : 72.00   Median :23.00   Median : 30.5   Median :32.00   Median :0.3725   Median :29.00   Median :0.000  
## Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54   Mean   : 79.8   Mean   :31.99   Mean   :0.4719   Mean   :33.24   Mean   :0.349  
## 3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00   3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262   3rd Qu.:41.00   3rd Qu.:1.000  
## Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00   Max.   :846.0   Max.   :67.10   Max.   :2.4200   Max.   :81.00   Max.   :1.000  

We observe that Glucose, Bpsi, Skin, Insulin, and BMI have minimum values of 0.0 in the dataset. A value of zero is impossible for these diagnostic measurements, so the zeros must represent missing data.

sum(dia$Glucose==0) / length(dia$Glucose) # proportion of missing glucose entries
sum(dia$Bpsi==0) / length(dia$Bpsi) # proportion of missing blood pressure entries
sum(dia$Skin==0) / length(dia$Skin) # proportion of missing skin thickness entries
sum(dia$Insulin==0) / length(dia$Insulin) # proportion of missing insulin entries
sum(dia$BMI==0) / length(dia$BMI) # proportion of missing body mass index entries
##  > sum(dia$Glucose==0) / length(dia$Glucose) # proportion of missing glucose entries
##  [1] 0.006510417
##  > sum(dia$Bpsi==0) / length(dia$Bpsi) # proportion of missing blood pressure entries
##  [1] 0.04557292
##  > sum(dia$Skin==0) / length(dia$Skin) # proportion of missing skin thickness entries
##  [1] 0.2955729
##  > sum(dia$Insulin==0) / length(dia$Insulin) # proportion of missing insulin entries
##  [1] 0.4869792
##  > sum(dia$BMI==0) / length(dia$BMI) # proportion of missing body mass index entries
##  [1] 0.01432292

Approximately 30% of Skin values are missing, as indicated by a value of 0. Similarly, nearly 49% of Insulin values, under 1% of Glucose values, about 5% of Bpsi values, and about 1% of BMI values are missing from the data set. We will recode these zeros as NA.

for (j in 2:6){ # columns for Glucose, Bpsi, Skin, Insulin, BMI
  dia.df[dia.df[,j]==0, j] = NA # a zero is an impossible measurement, so treat it as missing
}

Rather than trying to fill in these missing diagnostic measurements, we will continue the analysis using only the patients with complete records.

dia.df.subset = na.omit(dia.df)
dim(dia.df.subset)
## [1] 392   9

This complete-case subset leaves us with only 392 of the 768 original entries, roughly 51%. While this is not ideal, it avoids imputation and other data manipulation when constructing the final predictive model.

Next, we can create a Pearson correlation matrix for this updated subset. We also zero out the diagonal so that the trivial self-correlations of 1 do not dominate the heat map and sorting that follow.

cor.dia.df.subset = cor(dia.df.subset) # correlation matrix for updated subset
diag(cor.dia.df.subset)=0
cor.dia.df.subset
##             Pregnancies   Glucose       Bpsi      Skin    Insulin         BMI          DPF        Age   Outcome
## Pregnancies  0.000000000 0.1982910  0.2133548 0.0932094 0.07898363 -0.02534728  0.007562116 0.67960847 0.2565660
## Glucose      0.198291043 0.0000000  0.2100266 0.1988558 0.58122301  0.20951592  0.140180180 0.34364150 0.5157027
## Bpsi         0.213354775 0.2100266  0.0000000 0.2325712 0.09851150  0.30440337 -0.015971104 0.30003895 0.1926733
## Skin         0.093209397 0.1988558  0.2325712 0.0000000 0.18219906  0.66435487  0.160498526 0.16776114 0.2559357
## Insulin      0.078983625 0.5812230  0.0985115 0.1821991 0.00000000  0.22639652  0.135905781 0.21708199 0.3014292
## BMI         -0.025347276 0.2095159  0.3044034 0.6643549 0.22639652  0.00000000  0.158771043 0.06981380 0.2701184
## DPF          0.007562116 0.1401802 -0.0159711 0.1604985 0.13590578  0.15877104  0.000000000 0.08502911 0.2093295
## Age          0.679608470 0.3436415  0.3000389 0.1677611 0.21708199  0.06981380  0.085029106 0.00000000 0.3508038
## Outcome      0.256565956 0.5157027  0.1926733 0.2559357 0.30142922  0.27011841  0.209329511 0.35080380 0.0000000

To better visualize this correlation matrix, we can plot the results as a heat map.

library(fields) # package for heat map
clockwise90 = function(a) { t(a[nrow(a):1,]) } # function to rotate heat map
image.plot(clockwise90(cor.dia.df.subset), col=heat.colors(12), axes=FALSE) # heat map of NA-removed correlations
par(cex.axis=.65)
axis(3, at=seq(0,1, length=9), labels=abbreviate(colnames(dia.df)), lwd=0, pos=1.1)
axis(2, at=seq(1,0, length=9), labels=abbreviate(colnames(dia.df)), lwd=0, pos=-0.1)

It appears that Age and Pregnancies have the largest correlation in the heat map. This makes sense, as one might expect the number of pregnancies to increase with a woman's age. However, this is not a correlation of interest; we are primarily looking for variables that are highly correlated with Outcome.

# Each correlation appears twice in the symmetric matrix, so entries 1, 3, 5, and 7 give the four largest distinct pairs
sorted.cor.subset = sort(cor.dia.df.subset, decreasing = TRUE)
sorted.cor.subset[1] # largest correlation
sorted.cor.subset[3] # second-largest correlation
sorted.cor.subset[5] # third-largest correlation
sorted.cor.subset[7] # fourth-largest correlation
## > sorted.cor.subset[1] # largest correlation
## [1] 0.6796085
## > sorted.cor.subset[3] # second-largest correlation
## [1] 0.6643549
## > sorted.cor.subset[5] # third-largest correlation
## [1] 0.581223
## > sorted.cor.subset[7] # fourth-largest correlation
## [1] 0.5157027

We can confirm that Age and Pregnancies have the largest correlation at 0.6796085. Skin and BMI place second with a correlation of 0.6643549, and Glucose and Insulin come in third at 0.581223. The fourth-largest correlation, 0.5157027, is between Glucose and Outcome. This is a promising result: intuition tells us that glucose level and the outcome of diabetes should share a positive correlation.
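
Since Outcome is the variable we ultimately want to predict, we could also rank the predictors directly by the absolute value of their correlation with Outcome. Here is a small sketch that reuses the correlation matrix computed above (the outcome.cor name is just for illustration):

# Rank the predictors by the absolute value of their correlation with Outcome
outcome.cor = cor.dia.df.subset[, "Outcome"]
sort(abs(outcome.cor[names(outcome.cor) != "Outcome"]), decreasing = TRUE)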

Next, we can compare the distributions of each variable for patients with and without diabetes. Each of the following box plots depicts the distribution of one variable for patients who have diabetes (Outcome=1) and those who do not (Outcome=0).

par(mfrow=c(2,4)) # Create plots for Outcome vs. Variables
for(j in 1:8){
  plot(factor(dia.df.subset$Outcome), dia.df.subset[,j], main=paste("Outcome vs", colnames(dia.df.subset[j])), xlab="Outcome (0=no, 1=yes)", ylab=colnames(dia.df.subset[j]))
}

The box plots again show a noticeable difference in the Glucose distributions between the two Outcome groups. We can perform a t-test to check whether the difference in mean glucose between patients with and without diabetes is statistically significant.

t.test(dia.df.subset$Glucose[dia.df.subset$Outcome==0], dia.df.subset$Glucose[dia.df.subset$Outcome==1]) # t-test using data subset
##      Welch Two Sample t-test

## data:  dia.df.subset$Glucose[dia.df.subset$Outcome == 0] and dia.df.subset$Glucose[dia.df.subset$Outcome == 1]
## t = -11.151, df = 218.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -39.72817 -27.79385
## sample estimates:
## mean of x mean of y 
##  111.4313  145.1923 

The t-test for Glucose has a highly significant p-value (< 0.01), which indicates Glucose could be a reasonable variable to include in the final logistic regression model. However, similar t-tests between Outcome and every remaining variable were also significant, so we need additional criteria for selecting our model variables.
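
To verify that claim about the remaining variables, we could loop the same Welch two-sample t-test over every predictor. Here is a brief sketch using the formula interface of t.test on the complete-case subset:

# p-value of a Welch two-sample t-test for each predictor, split by the two Outcome groups
sapply(colnames(dia.df.subset)[1:8], function(v)
  t.test(dia.df.subset[[v]] ~ dia.df.subset$Outcome)$p.value)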

Backward selection is one common variable selection method for logistic regression. The method relies on the Akaike information criterion (AIC), a model performance metric that estimates prediction error: AIC = 2k - 2*log(L), where k is the number of estimated parameters and L is the maximized likelihood, so a lower AIC indicates a better model. The procedure begins with all variables in the model. At each step, we either drop the single variable whose removal lowers the AIC the most, or we stop. The process ends when every remaining deletion would increase the model's AIC. Here is an example of this selection method in R.

fit = glm(Outcome ~ ., data = dia.df.subset, family="binomial")

# Variable selection using backward selection
stepwise = step(fit, direction = "backward")
summary(stepwise)
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BMI + DPF + Age, 
##    family = "binomial", data = dia.df.subset)
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.992080   1.086866  -9.193  < 2e-16 ***
## Pregnancies  0.083953   0.055031   1.526 0.127117    
## Glucose      0.036458   0.004978   7.324 2.41e-13 ***
## BMI          0.078139   0.020605   3.792 0.000149 ***
## DPF          1.150913   0.424242   2.713 0.006670 ** 
## Age          0.034360   0.017810   1.929 0.053692 .  
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 498.10  on 391  degrees of freedom
## Residual deviance: 344.89  on 386  degrees of freedom
## AIC: 356.89
##
## Number of Fisher Scoring iterations: 5

The backward selection method selected the variables Pregnancies, Glucose, BMI, DPF, and Age to predict Outcome. The resulting model has an AIC value of 356.89.
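
As a quick sanity check on that value, note that for a binomial GLM with binary responses the residual deviance equals -2 times the maximized log-likelihood, so the AIC is simply the residual deviance plus 2k, where k counts the estimated coefficients. A short sketch using the stepwise object fitted above:

k = length(coef(stepwise)) # 6 estimated coefficients, including the intercept
deviance(stepwise) + 2*k # 344.89 + 12 = 356.89, matching the summary output
AIC(stepwise) # built-in calculation for comparison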

LASSO selection is another variable selection method used in logistic regression. LASSO stands for Least Absolute Shrinkage and Selection Operator and originated as an extension of ordinary least squares regression. The penalty is formed by multiplying a tuning parameter λ by the sum of the absolute values of the non-intercept beta coefficients; in the linear setting this penalty is added to the residual sum of squares (RSS), and in the logistic setting it is added to the negative log-likelihood. Larger values of λ increase the intensity of the penalty and shrink more coefficients exactly to zero, which is what performs the variable selection. Here is an example of this selection method in R, with λ chosen by cross-validation.

library(glmnet)
x = model.matrix(Outcome~., data=dia.df.subset)[,-1] # predictor matrix for the complete-case subset
y = dia.df.subset$Outcome
cv.lasso = cv.glmnet(x, y, family="binomial", alpha=1) # CV folds are random, so lambda.min can vary slightly between runs
best.lambda = cv.lasso$lambda.min
best_model <- glmnet(x, y, family = "binomial", alpha = 1, lambda = best.lambda)
coef(best_model) # coefficients shrunk exactly to zero are dropped from the model
# Refit an unpenalized logistic model on the LASSO-selected variables to obtain an AIC
glm.res = glm(Outcome~Pregnancies+Glucose+Skin+BMI+DPF+Age, data = dia.df.subset, family="binomial")
summary(glm.res)
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + Skin + BMI + 
##     DPF + Age, family = "binomial", data = dia)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.914004   1.090033  -9.095  < 2e-16 ***
## Pregnancies  0.083559   0.055160   1.515   0.1298    
## Glucose      0.036485   0.004988   7.314 2.59e-13 ***
## Skin         0.011590   0.017058   0.679   0.4969    
## BMI          0.067041   0.026122   2.566   0.0103 *  
## DPF          1.130919   0.425725   2.656   0.0079 ** 
## Age          0.032892   0.017978   1.830   0.0673 .  
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 498.10  on 391  degrees of freedom
## Residual deviance: 344.42  on 385  degrees of freedom
## AIC: 358.42
## 
## Number of Fisher Scoring iterations: 5

LASSO selects all variables except Bpsi and Insulin. This is similar to the backward selection result, with the addition of the variable Skin. Refitting an unpenalized logistic model on the LASSO-selected variables gives an AIC of 358.42, which is marginally worse than our previous model.
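
To see how the choice of λ affects the fit, we could also plot the cross-validation curve and the full coefficient paths from glmnet. Here is a brief sketch using the cv.lasso, x, and y objects created above:

par(mfrow=c(1,2))
plot(cv.lasso) # cross-validated deviance as a function of log(lambda)
plot(glmnet(x, y, family="binomial", alpha=1), xvar="lambda", label=TRUE) # coefficient paths as lambda grows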

Exhaustive selection is another variable selection method for logistic regression. This procedure fits a regression of the dependent variable on every possible subset of the explanatory variables. The method is typically infeasible when there are many explanatory variables, but since we have only 8 of them here (2^8 = 256 candidate models), we can afford this type of selection.

library(bestglm)
library(leaps)
# bestglm expects a data frame with the response in the last column, which is Outcome here
best.logit_AIC = bestglm(dia.df.subset, IC = "AIC", family = binomial, method = "exhaustive")
best.logit_AIC$Subsets
##    Intercept Pregnancies Glucose  Bpsi  Skin Insulin   BMI   DPF   Age logLikelihood      AIC
## 0       TRUE       FALSE   FALSE FALSE FALSE   FALSE FALSE FALSE FALSE     -249.0489 498.0978
## 1       TRUE       FALSE    TRUE FALSE FALSE   FALSE FALSE FALSE FALSE     -193.3330 388.6660
## 2       TRUE       FALSE    TRUE FALSE FALSE   FALSE FALSE FALSE  TRUE     -185.3448 374.6897
## 3       TRUE       FALSE    TRUE FALSE FALSE   FALSE  TRUE FALSE  TRUE     -177.1828 360.3656
## 4       TRUE       FALSE    TRUE FALSE FALSE   FALSE  TRUE  TRUE  TRUE     -173.6175 355.2350
## 5*      TRUE        TRUE    TRUE FALSE FALSE   FALSE  TRUE  TRUE  TRUE     -172.4426 354.8851
## 6       TRUE        TRUE    TRUE FALSE  TRUE   FALSE  TRUE  TRUE  TRUE     -172.2121 356.4242
## 7       TRUE        TRUE    TRUE FALSE  TRUE    TRUE  TRUE  TRUE  TRUE     -172.0178 358.0356
## 8       TRUE        TRUE    TRUE  TRUE  TRUE    TRUE  TRUE  TRUE  TRUE     -172.0106 360.0212

Model 5, marked with an asterisk, has the best AIC and selects the variables Pregnancies, Glucose, BMI, DPF, and Age. This selection agrees with the backward selection method. Its value of 354.8851 is the lowest AIC among all candidate models; note that the AIC column reported by bestglm does not count the intercept as a parameter (row 0 equals exactly -2 times the log-likelihood), so its values sit a constant 2 below the AICs reported by glm, which is why the same model showed an AIC of 356.89 earlier. The ranking of models is unaffected.
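
The bestglm result also stores the fitted glm for this best subset, so we can inspect it directly rather than refitting by hand. A short sketch, assuming the BestModel component behaves as documented in the bestglm package:

summary(best.logit_AIC$BestModel) # glm fit for the AIC-best subset, the same model found by backward selection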

While there are many methods to consider when choosing variables for a logistic regression model, AIC-based selection and two-sample t-tests served as insightful and effective criteria for this study.