Logistic regression 2: Testing and variable selection

Logistic regression is a popular tool for quantifying the relationship between a binary outcome variable and one or more predictor variables. The outcome variable is often called the dependent variable, and the predictor variables are often called independent variables. The predictor variables can be binary, categorical, or continuous. Here we will learn how to use Stata's logistic command to fit and compare candidate models, test groups of coefficients, and check model fit.

Let's begin by opening the nhanes2l dataset. Then let's describe the variables highbp, age, sex, weight, and hlthstat.

. webuse nhanes2l
(Second National Health and Nutrition Examination Survey)

. describe highbp age sex weight hlthstat

Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------
highbp           byte    %8.0g     *          High blood pressure
age              byte    %9.0g                Age (years)
sex              byte    %9.0g     sex        Sex
weight          float    %9.0g                Weight (kg)
hlthstat         byte    %20.0g    hlth       Health status

The description tells us that highbp contains information about “High blood pressure”, age and sex record each participant's age in years and sex, weight records weight in kilograms, and hlthstat contains information about “Health status”.

Let's tabulate our outcome variable, highbp, to learn more about it.

. tabulate highbp

 High blood |
   pressure |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      5,975       57.72       57.72
          1 |      4,376       42.28      100.00
------------+-----------------------------------
      Total |     10,351      100.00

The table shows that 57.7% of the sample is coded as 0 and 42.3% is coded as 1. Let's label 0 as “Normotensive” and 1 as “Hypertensive”, and then let's type tabulate highbp again to check our work.

. label define highbp 0 "Normotensive" 1 "Hypertensive"
. label values highbp highbp

. tabulate highbp

   High blood |
     pressure |      Freq.     Percent        Cum.
--------------+-----------------------------------
 Normotensive |      5,975       57.72       57.72
 Hypertensive |      4,376       42.28      100.00
--------------+-----------------------------------
        Total |     10,351      100.00

Next let's use dtable to explore our predictor variables for each category of highbp. You can type help dtable to learn more about dtable.

. dtable age weight i.sex i.hlthstat,                           ///
        by(highbp, tests)                                       ///
        continuous(, statistics(mean sd))                       ///
        nformat(%9.1f mean sd)                                  ///
        sformat((%s) semean)                                    ///
        factor(, statistics(fvfrequency fvpercent))             ///
        nformat(%16.0fc fvfrequency)                            ///
        nformat(%9.1f fvpercent)                                ///
        title(Table 1: Descriptive statistics by hypertension status)  ///
        note("Note: Mean (SD) for continuous variables, frequency (%) for categorical variables")  ///
        export(table1.docx, replace)

note: using test regress across levels of highbp for age and weight.
note: using test pearson across levels of highbp for sex and hlthstat.

Table 1: Descriptive statistics by hypertension status
---------------------------------------------------------------------------
                                High blood pressure
                 Normotensive    Hypertensive    Total             Test
---------------------------------------------------------------------------
N                5,975 (57.7%)   4,376 (42.3%)   10,351 (100.0%)
Age (years)      42.2 (16.8)     55.0 (14.9)     47.6 (17.2)       <0.001
Weight (kg)      68.3 (13.6)     76.9 (16.2)     71.9 (15.4)       <0.001
Sex
  Male           2,611 (43.7%)   2,304 (52.7%)   4,915 (47.5%)     <0.001
  Female         3,364 (56.3%)   2,072 (47.3%)   5,436 (52.5%)
Health status
  Excellent      1,649 (27.7%)   758 (17.3%)     2,407 (23.3%)     <0.001
  Very good      1,666 (27.9%)   925 (21.2%)     2,591 (25.1%)
  Good           1,572 (26.4%)   1,366 (31.2%)   2,938 (28.4%)
  Fair           766 (12.8%)     904 (20.7%)     1,670 (16.2%)
  Poor           310 (5.2%)      419 (9.6%)      729 (7.1%)
---------------------------------------------------------------------------
Note: Mean (SD) for continuous variables, frequency (%) for categorical variables

(collection DTable exported to file table1.docx)

All the predictor variables have small p-values in Table 1, which suggests that each is worth considering in our regression models.

Let's fit a model that includes each predictor variable with no interactions.

. logistic highbp c.age i.sex c.weight i.hlthstat

Logistic regression                                    Number of obs =  10,335
                                                       LR chi2(7)    = 2352.02
                                                       Prob > chi2   =  0.0000
Log likelihood = -5864.7164                            Pseudo R2     =  0.1670

------------------------------------------------------------------------------
      highbp | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   1.049891   .0015677    32.61   0.000     1.046823    1.052969
         sex |
     Female  |   1.021781   .0494017     0.45   0.656     .9294014    1.123342
      weight |   1.044387   .0017632    25.73   0.000     1.040937    1.047849
    hlthstat |
  Very good  |   1.007905   .0675317     0.12   0.906      .883868    1.149349
       Good  |   1.308293   .0845285     4.16   0.000     1.152681    1.484913
       Fair  |   1.275262   .0960363     3.23   0.001     1.100266     1.47809
       Poor  |   1.250367    .121646     2.30   0.022     1.033298    1.513037
       _cons |   .0025079   .0004114   -36.51   0.000     .0018185    .0034589
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
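
Each odds ratio is the multiplicative change in the odds of high blood pressure for a one-unit increase in the predictor, holding the other predictors constant. For example, the odds ratio for age is about 1.05, so each additional year of age multiplies the odds of high blood pressure by 1.05. If you would like the effect of, say, a 10-year increase in age, you can compute it yourself with display; this quick check is ours, not part of the output above:

. display 1.049891^10

which yields roughly 1.63, that is, about a 63% increase in the odds.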

Then let's use estat classification to estimate the sensitivity, specificity, and percent correctly classified for our model.

. estat classification

                -------- True --------
Classified |         D            ~D  |      Total
-----------+--------------------------+-----------
     +     |      2687          1457  |       4144
     -     |      1685          4506  |       6191
-----------+--------------------------+-----------
   Total   |      4372          5963  |      10335

Classified + if predicted Pr(D) >= .5
True D defined as highbp != 0
--------------------------------------------------
Sensitivity                     Pr( +| D)   61.46%
Specificity                     Pr( -|~D)   75.57%
Positive predictive value       Pr( D| +)   64.84%
Negative predictive value       Pr(~D| -)   72.78%
--------------------------------------------------
False + rate for true ~D        Pr( +|~D)   24.43%
False - rate for true D         Pr( -| D)   38.54%
False + rate for classified +   Pr(~D| +)   35.16%
False - rate for classified -   Pr( D| -)   27.22%
--------------------------------------------------
Correctly classified                        69.60%
--------------------------------------------------
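
These percentages come directly from the classification table. Sensitivity is the share of observed hypertensives classified as positive (2687/4372), specificity is the share of observed normotensives classified as negative (4506/5963), and the percent correctly classified combines the two. You can verify the arithmetic with display:

. display 2687/4372*100           // sensitivity, about 61.46
. display 4506/5963*100           // specificity, about 75.57
. display (2687+4506)/10335*100   // correctly classified, about 69.60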

Next let's use etable to gather our results. You can type help etable to learn more about etable. We can include sensitivity, specificity, and percent correctly classified in our table by using the scalars returned by estat classification. You can view all of these scalars by typing return list.

. return list

scalars:
               r(P_1n) =  27.21692779841706
               r(P_0p) =  35.15926640926641
               r(P_0n) =  72.78307220158294
               r(P_1p) =  64.84073359073359
               r(P_n1) =  38.54071363220494
               r(P_p0) =  24.43400972664766
               r(P_n0) =  75.56599027335234
               r(P_p1) =  61.45928636779507
             r(P_corr) =  69.5984518626028

matrices:
             r(ctable) :  3 x 3

. etable, keep(age i.sex weight i.hlthstat)                       ///
        showstars showstarsnote                                   ///
        cstat(_r_b, nformat(%9.2f))                               ///
        mstat(N, nformat(%9.0fc) label("Observations"))           ///
        mstat(chi2, nformat(%9.2f))                               ///
        mstat(df_m, nformat(%9.0f))                               ///
        mstat(p, nformat(%9.4f))                                  ///
        mstat(r2_p, nformat(%9.4f))                               ///
        mstat(correct=(r(P_corr)),                                ///
              nformat(%6.2f) label("% correctly classified"))     ///
        mstat(Sensitivity=(r(P_p1)), nformat(%6.2f))              ///
        mstat(Specificity=(r(P_n0)), nformat(%6.2f))              ///
        mstat(aic, nformat(%6.0fc))                               ///
        mstat(bic, nformat(%6.0fc))

-----------------------------------------
                                   highbp
-----------------------------------------
Age (years)                       1.05 **
Sex
  Female                          1.02
Weight (kg)                       1.04 **
Health status
  Very good                       1.01
  Good                            1.31 **
  Fair                            1.28 **
  Poor                            1.25 *
Observations                       10,335
χ²                                2352.02
Model DF                                7
Model test p-value                 0.0000
Pseudo R-squared                   0.1670
% correctly classified              69.60
Sensitivity                         61.46
Specificity                         75.57
AIC                                11,745
BIC                                11,803
-----------------------------------------
** p<.01, * p<.05

Let's consider another model that includes the interaction of age and sex. We can again use estat classification to estimate the sensitivity, specificity, and percent correctly classified for our new model. Then we can type etable, append to append the new model to our table.

. logistic highbp c.age##i.sex c.weight i.hlthstat

Logistic regression                                    Number of obs =  10,335
                                                       LR chi2(8)    = 2444.84
                                                       Prob > chi2   =  0.0000
Log likelihood = -5818.3051                            Pseudo R2     =  0.1736

-------------------------------------------------------------------------------
       highbp | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
--------------+----------------------------------------------------------------
          age |   1.036594   .0020228    18.42   0.000     1.032637    1.040567
          sex |
      Female  |   .2559988   .0392532    -8.89   0.000     .1895484     .345745
    sex#c.age |
      Female  |   1.027598   .0029315     9.54   0.000     1.021869     1.03336
       weight |   1.044088   .0017719    25.42   0.000     1.040621    1.047567
     hlthstat |
   Very good  |   1.019104    .068511     0.28   0.778     .8932953    1.162631
        Good  |   1.337438   .0867651     4.48   0.000     1.177748    1.518779
        Fair  |   1.300784   .0982347     3.48   0.000     1.121819      1.5083
        Poor  |   1.309155   .1276285     2.76   0.006     1.081454    1.584798
        _cons |    .004704   .0008202   -30.74   0.000     .0033424    .0066203
-------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

. estat classification

                -------- True --------
Classified |         D            ~D  |      Total
-----------+--------------------------+-----------
     +     |      2724          1506  |       4230
     -     |      1648          4457  |       6105
-----------+--------------------------+-----------
   Total   |      4372          5963  |      10335

Classified + if predicted Pr(D) >= .5
True D defined as highbp != 0
--------------------------------------------------
Sensitivity                     Pr( +| D)   62.31%
Specificity                     Pr( -|~D)   74.74%
Positive predictive value       Pr( D| +)   64.40%
Negative predictive value       Pr(~D| -)   73.01%
--------------------------------------------------
False + rate for true ~D        Pr( +|~D)   25.26%
False - rate for true D         Pr( -| D)   37.69%
False + rate for classified +   Pr(~D| +)   35.60%
False - rate for classified -   Pr( D| -)   26.99%
--------------------------------------------------
Correctly classified                        69.48%
--------------------------------------------------
. etable, append

------------------------------------------------
                             highbp      highbp
------------------------------------------------
Age (years)                 1.05 **     1.04 **
Sex
  Female                    1.02        0.26 **
Weight (kg)                 1.04 **     1.04 **
Health status
  Very good                 1.01        1.02
  Good                      1.31 **     1.34 **
  Fair                      1.28 **     1.30 **
  Poor                      1.25 *      1.31 **
Observations                 10,335      10,335
χ²                          2352.02     2444.84
Model DF                          7           8
Model test p-value           0.0000      0.0000
Pseudo R-squared             0.1670      0.1736
% correctly classified        69.60       69.48
Sensitivity                   61.46       62.31
Specificity                   75.57       74.74
AIC                          11,745      11,655
BIC                          11,803      11,720
------------------------------------------------
** p<.01, * p<.05
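
As an aside, the AIC and BIC reported by etable come from the standard formulas AIC = -2*lnL + 2*k and BIC = -2*lnL + k*ln(N), where k is the number of estimated parameters including the constant. For the first model, lnL = -5864.7164 and k = 8, so you can verify the reported values with display:

. display -2*(-5864.7164) + 2*8           // AIC, about 11,745
. display -2*(-5864.7164) + 8*ln(10335)   // BIC, about 11,803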

We can use the same strategy to fit other candidate models and append the results to our table. Note that the quietly block keeps the output from being displayed in the Results window.

. quietly {
    logistic highbp c.age##c.weight i.sex##i.hlthstat
    estat classification
    etable, append

    logistic highbp c.age c.weight i.sex##i.hlthstat
    estat classification
    etable, append

    logistic highbp c.age##i.sex c.age##c.weight i.sex##i.hlthstat
    estat classification
}

Let's add some features to our final table.

. etable, append keep(age i.sex weight i.hlthstat                      ///
                      c.age##i.sex c.age##c.weight i.sex##i.hlthstat)  ///
        column(index)                                                  ///
        showstars showstarsnote                                        ///
        cstat(_r_b, nformat(%9.2f))                                    ///
        mstat(N, nformat(%9.0fc) label("Observations"))                ///
        mstat(chi2, nformat(%9.2f))                                    ///
        mstat(df_m, nformat(%9.0f))                                    ///
        mstat(p, nformat(%9.4f))                                       ///
        mstat(r2_p, nformat(%9.4f))                                    ///
        mstat(correct=(r(P_corr)),                                     ///
              nformat(%6.2f) label("% correctly classified"))          ///
        mstat(Sensitivity=(r(P_p1)), nformat(%6.2f))                   ///
        mstat(Specificity=(r(P_n0)), nformat(%6.2f))                   ///
        mstat(aic, nformat(%6.0fc))                                    ///
        mstat(bic, nformat(%6.0fc))                                    ///
        title("Table 2: Candidate logistic models for hypertension")

Table 2: Candidate logistic models for hypertension
------------------------------------------------------------------------------------
                                  1          2          3          4          5
------------------------------------------------------------------------------------
Age (years)                    1.05 **    1.04 **    1.10 **    1.05 **    1.08 **
Sex
  Female                       1.02       0.26 **    0.76 **    0.72 **    0.31 **
Weight (kg)                    1.04 **    1.04 **    1.08 **    1.04 **    1.07 **
Health status
  Very good                    1.01       1.02       0.94       0.93       0.97
  Good                         1.31 **    1.34 **    1.20 *     1.16       1.28 **
  Fair                         1.28 **    1.30 **    0.90       0.85       1.02
  Poor                         1.25 *     1.31 **    0.90       0.86       1.06
Age (years)                    1.05 **    1.04 **    1.10 **    1.05 **    1.08 **
Sex
  Female                       1.02       0.26 **    0.76 **    0.72 **    0.31 **
Sex # Age (years)
  Female                                  1.03 **                          1.02 **
Age (years)                    1.05 **    1.04 **    1.10 **    1.05 **    1.08 **
Weight (kg)                    1.04 **    1.04 **    1.08 **    1.04 **    1.07 **
Age (years) # Weight (kg)                            1.00 **               1.00 **
Sex
  Female                       1.02       0.26 **    0.76 **    0.72 **    0.31 **
Health status
  Very good                    1.01       1.02       0.94       0.93       0.97
  Good                         1.31 **    1.34 **    1.20 *     1.16       1.28 **
  Fair                         1.28 **    1.30 **    0.90       0.85       1.02
  Poor                         1.25 *     1.31 **    0.90       0.86       1.06
Sex # Health status
  Female # Very good                                 1.19       1.23       1.13
  Female # Good                                      1.25       1.33 *     1.12
  Female # Fair                                      2.00 **    2.21 **    1.60 **
  Female # Poor                                      2.07 **    2.28 **    1.57 *
Observations                   10,335     10,335     10,335     10,335     10,335
χ²                             2352.02    2444.84    2438.90    2392.57    2477.47
Model DF                       7          8          12         11         13
Model test p-value             0.0000     0.0000     0.0000     0.0000     0.0000
Pseudo R-squared               0.1670     0.1736     0.1732     0.1699     0.1759
% correctly classified         69.60      69.48      69.96      69.79      69.62
Sensitivity                    61.46      62.31      65.81      62.24      64.89
Specificity                    75.57      74.74      73.00      75.33      73.08
AIC                            11,745     11,655     11,669     11,713     11,632
BIC                            11,803     11,720     11,763     11,800     11,733
------------------------------------------------------------------------------------
** p<.01, * p<.05

Normally, we would use the export() option to export our table to a Microsoft Word document. Here, let's first use the layout(autofitcontents) option of collect style putdocx, which adjusts the column widths to fit their contents, and then export the table with collect export.

. collect style putdocx, layout(autofitcontents)

. collect export table2.docx, as(docx) replace
(collection ETable exported to file table2.docx)

Likelihood-ratio test

Just to be thorough, let's use a likelihood-ratio test to compare model 5 with model 1.

. quietly logistic highbp c.age i.sex c.weight i.hlthstat
. estimates store reduced
. quietly logistic highbp c.age##i.sex c.age##c.weight i.sex##i.hlthstat
. estimates store full
. lrtest full reduced

Likelihood-ratio test
Assumption: reduced nested within full

 LR chi2(6) = 125.45
Prob > chi2 = 0.0000

The small p-value suggests that the full model that includes interactions fits the data better than the reduced model that does not include interactions.
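
If you are curious, you can reproduce the test statistic by hand from the two log likelihoods reported above (-5864.7164 for the reduced model and -5801.9907 for the full model). The degrees of freedom are the difference in the number of model parameters, 13 - 7 = 6. This is just a quick check with display and the chi2tail() function:

. display 2*(-5801.9907 - (-5864.7164))
. display chi2tail(6, 125.45)

The first line reproduces the LR chi-squared of about 125.45, and the second returns its p-value, which is essentially 0.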

Let's explore the results of the full model further.

. logistic highbp c.age##i.sex c.age##c.weight i.sex##i.hlthstat
note: age omitted because of collinearity.

Logistic regression                                    Number of obs =  10,335
                                                       LR chi2(13)   = 2477.47
                                                       Prob > chi2   =  0.0000
Log likelihood = -5801.9907                            Pseudo R2     =  0.1759

--------------------------------------------------------------------------------
        highbp | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
---------------+----------------------------------------------------------------
           age |   1.075828   .0089424     8.79   0.000     1.058443    1.093498
           sex |
       Female  |   .3095554    .054965    -6.60   0.000     .2185738    .4384082
     sex#c.age |
       Female  |   1.019876   .0032379     6.20   0.000      1.01355    1.026242
           age |          1  (omitted)
        weight |   1.067281   .0057565    12.07   0.000     1.056058    1.078623
        c.age# |
      c.weight |   .9995402   .0001042    -4.41   0.000     .9993359    .9997446
      hlthstat |
    Very good  |   .9721125   .0863504    -0.32   0.750     .8167823    1.156982
         Good  |   1.282753   .1120339     2.85   0.004     1.080937     1.52225
         Fair  |   1.023111   .1083771     0.22   0.829      .831296    1.259185
         Poor  |    1.06005    .138177     0.45   0.655     .8210553    1.368611
  sex#hlthstat |
      Female # |
    Very good  |   1.129374   .1548714     0.89   0.375     .8632015    1.477622
  Female#Good  |   1.116131   .1472965     0.83   0.405     .8617503    1.445602
  Female#Fair  |   1.602814   .2444008     3.09   0.002     1.188748    2.161107
  Female#Poor  |   1.569438   .3080657     2.30   0.022     1.068222    2.305828
         _cons |   .0008451   .0003694   -16.19   0.000     .0003587    .0019908
--------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

The p-values for the categories of hlthstat and for the interaction of sex and hlthstat are mixed: some are less than 0.05, and some are greater than 0.05. The p-values in the regression table test whether each individual coefficient equals 0. They tell us nothing about the joint significance of all the coefficients for the categories of hlthstat or for the interaction of sex and hlthstat. We can use testparm to test all the coefficients for a categorical variable simultaneously.

First, let's test the interaction of sex and hlthstat.

. testparm i.sex#i.hlthstat

 ( 1)  [highbp]2.sex#2.hlthstat = 0
 ( 2)  [highbp]2.sex#3.hlthstat = 0
 ( 3)  [highbp]2.sex#4.hlthstat = 0
 ( 4)  [highbp]2.sex#5.hlthstat = 0

           chi2(  4) =   13.20
         Prob > chi2 =    0.0103

The p-value for the joint test of the interaction of sex and health status is less than 0.05, even though the p-values for some of the individual interaction coefficients are greater than 0.05.

The results are similar when we test the main effect of hlthstat.

. testparm i.hlthstat

 ( 1)  [highbp]2.hlthstat = 0
 ( 2)  [highbp]3.hlthstat = 0
 ( 3)  [highbp]4.hlthstat = 0
 ( 4)  [highbp]5.hlthstat = 0

           chi2(  4) =   13.12
         Prob > chi2 =    0.0107

Let's finish by using estat gof to check the goodness of fit for our model.

. estat gof

Goodness-of-fit test after logistic model
Variable: highbp

      Number of observations =  10,335
Number of covariate patterns =   9,998
          Pearson chi2(9984) = 9873.40
                 Prob > chi2 =  0.7826

. estat gof, group(20)
note: obs collapsed on 20 quantiles of estimated probabilities.

Goodness-of-fit test after logistic model
Variable: highbp

  Number of observations = 10,335
        Number of groups =     20
Hosmer–Lemeshow chi2(18) =  20.20
             Prob > chi2 = 0.3217

. estat gof, group(30)
note: obs collapsed on 30 quantiles of estimated probabilities.

Goodness-of-fit test after logistic model
Variable: highbp

  Number of observations = 10,335
        Number of groups =     30
Hosmer–Lemeshow chi2(28) =  32.08
             Prob > chi2 = 0.2713 

All the p-values are greater than 0.05, which suggests that our model fits the data reasonably well.
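
If you would like to see how the observed and expected counts compare within each group, you can add the table option to estat gof; we omit the output here:

. estat gof, group(20) table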

You can use many other options and postestimation commands with logistic, and you can read about them by clicking on the link to the logistic manual entry below. You can also watch a demonstration of these commands on YouTube by clicking on the links below.