Logistic regression 1: Introduction

Logistic regression is a popular tool for quantifying the relationship between a binary outcome variable and one or more predictor variables. The outcome variable is often called the dependent variable, and the predictor variables are often called independent variables. The predictor variables can be binary, categorical, or continuous. Here we will learn how to use Stata's logistic command to fit logistic regression models with categorical predictor variables, and we will explore other features later.

Let's begin by opening the nhanes2l dataset. Then let's describe the variables highbp, diabetes, and hlthstat.

. webuse nhanes2l
(Second National Health and Nutrition Examination Survey)

. describe highbp diabetes hlthstat

Variable      Storage   Display    Value
    name         type    format    label       Variable label
-------------------------------------------------------------------------
highbp          byte    %8.0g      *           High blood pressure
diabetes        byte    %12.0g     diabetes    Diabetes status
hlthstat        byte    %20.0g     hlth        Health status

The description tells us that highbp contains information about “High blood pressure”, diabetes contains information about “Diabetes status”, and hlthstat contains information about “Health status”.

Let's tabulate our outcome variable, highbp, to learn more about it.

. tabulate highbp

 High blood |
   pressure |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      5,975       57.72       57.72
          1 |      4,376       42.28      100.00
------------+-----------------------------------
      Total |     10,351      100.00

The table shows that 57.7% of the sample is coded as 0 and 42.3% is coded as 1. Let's label 0 as “Normotensive” and 1 as “Hypertensive”, and then let's type tabulate highbp again to check our work.

. label define highbp 0 "Normotensive" 1 "Hypertensive"
. label values highbp highbp

. tabulate highbp

  High blood |
    pressure |      Freq.     Percent        Cum.
-------------+-----------------------------------
Normotensive |      5,975       57.72       57.72
Hypertensive |      4,376       42.28      100.00
-------------+-----------------------------------
       Total |     10,351      100.00

Next let's type tabulate diabetes highbp, row to cross-tabulate the two variables. The row option tells Stata to display row percentages in addition to the frequencies.

. tabulate diabetes highbp, row

+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

    Diabetes |  High blood pressure
      status | Normotens  Hypertens |     Total
-------------+----------------------+----------
Not diabetic |     5,795      4,055 |     9,850
             |     58.83      41.17 |    100.00
-------------+----------------------+----------
    Diabetic |       178        321 |       499
             |     35.67      64.33 |    100.00
-------------+----------------------+----------
       Total |     5,973      4,376 |    10,349
             |     57.72      42.28 |    100.00

The table shows that 41.2% of people without diabetes have hypertension, while 64.3% of people with diabetes have hypertension. This suggests that diabetes status may be a predictor of hypertension.
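
If we want a quick formal check of this association before modeling it, we can add tabulate's chi2 option, which appends a Pearson chi-squared test of independence to the table. This is our addition here, not part of the original commands; chi2 is a standard tabulate option.

* Cross-tabulation with a Pearson chi-squared test of independence
tabulate diabetes highbp, row chi2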

Let's test this hypothesis by fitting a logistic regression model using highbp as the binary outcome variable. We include the i. prefix for diabetes, which is factor-variable notation that tells Stata to treat diabetes as a categorical (here binary) predictor variable.

. logistic highbp i.diabetes

Logistic regression                                     Number of obs = 10,349
                                                        LR chi2(1)    = 103.12
                                                        Prob > chi2   = 0.0000
Log likelihood = -6998.1054                             Pseudo R2     = 0.0073

------------------------------------------------------------------------------
      highbp | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
    diabetes |
    Diabetic |   2.577195   .2465557     9.90   0.000     2.136556     3.10871
       _cons |   .6997412   .0143263   -17.44   0.000     .6722181    .7283911
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

The output includes a column labeled Odds ratio, and the odds ratio for diabetes is 2.58. The literal interpretation is that the odds of having hypertension for someone with diabetes are 2.58 times the odds for someone without diabetes. It is tempting to interpret the odds ratio as a relative risk, or risk ratio, but that is not correct, because odds are not the same as risk or probability.

Let's use Stata's cs command to clarify the difference between odds and risk (or probability).

. cs highbp diabetes, or

                 |   Diabetes status      |
                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+-----------
           Cases |       321        4055  |       4376
        Noncases |       178        5795  |       5973
-----------------+------------------------+-----------
           Total |       499        9850  |      10349
                 |                        |
            Risk |  .6432866    .4116751  |   .4228428
                 |                        |
                 |      Point estimate    |    [95% conf. interval]
                 |------------------------+------------------------
 Risk difference |         .2316114       |    .1884724    .2747505
      Risk ratio |         1.562607       |    1.457737    1.675022
 Attr. frac. ex. |          .360044       |     .314005    .4029931
 Attr. frac. pop |         .0264109       |
      Odds ratio |         2.577197       |    2.137069     3.10796  (Cornfield)
-----------------+-------------------------------------------------
                               chi2(1) = 104.40  Pr>chi2 = 0.0000

The output tells us that 499 people in our sample have diabetes and 9,850 do not have diabetes. The output also tells us that 321 people with diabetes also have hypertension. So the risk (or probability) of having hypertension for someone with diabetes equals 321/499 = 0.64. Similarly, the risk of having hypertension for someone without diabetes is 4055/9850 = 0.41. The ratio of these risks is labeled Risk ratio in the output and equals 0.64/0.41 = 1.56.

The odds are calculated differently. The odds of having hypertension for someone with diabetes are 321/178 = 1.80. And the odds of having hypertension for someone without diabetes are 4055/5795 = 0.70. The ratio of these odds is labeled Odds ratio in the output and equals 1.80/0.70 = 2.58. This is the same as the odds ratio reported in our logistic regression output.
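
Because every number in these calculations appears in the cs table above, we can verify the arithmetic ourselves with display. This is a quick sketch of our own; the approximate results are shown as comments.

* Risks and odds computed from the cell counts in the table above
display "risk, diabetic:     " 321/499                  // about 0.64
display "risk, not diabetic: " 4055/9850                // about 0.41
display "risk ratio:         " (321/499)/(4055/9850)    // about 1.56
display "odds, diabetic:     " 321/178                  // about 1.80
display "odds, not diabetic: " 4055/5795                // about 0.70
display "odds ratio:         " (321/178)/(4055/5795)    // about 2.58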

It is often easier to interpret the results of logistic regression models with the margins command. margins diabetes displays the predicted probability that the outcome variable equals 1 for each category of the predictor variable. Let's try this for our model.

. margins diabetes

Adjusted predictions                                    Number of obs = 10,349
Model VCE: OIM

Expression: Pr(highbp), predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
    diabetes |
Not diabetic |   .4116751   .0049587    83.02   0.000     .4019563     .421394
    Diabetic |   .6432864   .0214443    30.00   0.000     .6012563    .6853164
------------------------------------------------------------------------------

The output shows us that the expected probability of hypertension equals 0.41 for someone without diabetes and equals 0.64 for someone with diabetes. We can use marginsplot to graph the results.

. marginsplot

Variables that uniquely identify margins: diabetes

The default graph is often called a profile plot. We can add a few options to create a bar graph if we prefer.

. marginsplot, recast(bar) xdimension(diabetes)
>              horizontal plotopts(barwidth(.9))

Variables that uniquely identify margins: diabetes

We can also use margins followed by nlcom to estimate the relative risk and the risk difference. The margins option post stores the results in memory, and the coeflegend option displays the names of the estimates so that we can refer to them with nlcom.

. margins diabetes, post coeflegend

Adjusted predictions                                    Number of obs = 10,349
Model VCE: OIM

Expression: Pr(highbp), predict()

------------------------------------------------------------------
             |     Margin   Legend
-------------+----------------------------------------------------
    diabetes |
Not diabetic |   .4116751   _b[0bn.diabetes]
    Diabetic |   .6432864   _b[1.diabetes]
------------------------------------------------------------------

. nlcom (risk_ratio: _b[1.diabetes] / _b[0bn.diabetes])

  risk_ratio: _b[1.diabetes] / _b[0bn.diabetes]

------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
  risk_ratio |   1.562607   .0553865    28.21   0.000     1.454051    1.671162
------------------------------------------------------------------------------

. nlcom (risk_diff: _b[1.diabetes] - _b[0bn.diabetes])

   risk_diff: _b[1.diabetes] - _b[0bn.diabetes]

------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
   risk_diff |   .2316112   .0220101    10.52   0.000     .1884721    .2747503
------------------------------------------------------------------------------

The logistic regression output also reported a standard error, z statistic, p-value, and 95% confidence interval for the odds ratio. The z statistic is not the ratio of the odds ratio to its standard error; it is the ratio of the coefficient to its standard error. The coefficient is the natural logarithm of the odds ratio, and we can view it and its standard error by using the coef option (or the logit command). We can then use the coefficient and its standard error to compute the z statistic, p-value, and 95% confidence bounds. The z statistic and p-value test the null hypothesis that the coefficient equals 0 or, equivalently, that the odds ratio equals 1. The confidence interval is computed on the coefficient scale and then transformed to a confidence interval for the odds ratio.

. logistic highbp i.diabetes, coef coeflegend

Logistic regression                                     Number of obs = 10,349
                                                        LR chi2(1)    = 103.12
                                                        Prob > chi2   = 0.0000
Log likelihood = -6998.1054                             Pseudo R2     = 0.0073

------------------------------------------------------------------------------
      highbp | Coefficient  Legend
-------------+----------------------------------------------------------------
    diabetes |
    Diabetic |   .9467015   _b[1.diabetes]
       _cons |  -.3570448   _b[_cons]
------------------------------------------------------------------------------

. display "Odds ratio = " %9.6f exp(_b[1.diabetes])
Odds ratio =  2.577195

. display "z = " %9.2f _b[1.diabetes] / _se[1.diabetes]
z =      9.90

. display "p-value = " %9.3f 2*(1 - normal(_b[1.diabetes] / _se[1.diabetes]))
p-value =     0.000

. display "lower 95% confidence bound = " %9.6f exp(_b[1.diabetes] + invnormal(0.025)*_se[1.diabetes])
lower 95% confidence bound =  2.136556

. display "upper 95% confidence bound = " %9.5f exp(_b[1.diabetes] + invnormal(0.975)*_se[1.diabetes])
upper 95% confidence bound =   3.10871

The top of the logistic regression output shows the sample size; a likelihood-ratio test labeled LR chi2(1); its p-value labeled Prob > chi2; and a pseudo R-squared labeled Pseudo R2.

The likelihood-ratio chi-squared statistic compares the specified model with a model that contains no predictor variables, often called the intercept-only or null model. We can calculate this test manually by fitting our model, storing the results in memory, fitting the intercept-only model, storing those results in memory, and finally using lrtest to compare the two models.

. quietly logistic highbp i.diabetes
. estimates store full
. quietly logistic highbp if e(sample)
. estimates store reduced
. lrtest full reduced

Likelihood-ratio test
Assumption: reduced nested within full

 LR chi2(1) = 103.12
Prob > chi2 = 0.0000
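
The statistic is simply twice the difference between the two log likelihoods, so we can also reproduce it from the values reported above. A quick sketch:

* LR chi2 = 2 * (ll(full) - ll(reduced))
display "LR chi2(1) = " %6.2f 2*(-6998.1054 - (-7049.666))    // 103.12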

The pseudo R-squared also compares the fitted model with the intercept-only (null) model using each model's log likelihood. Because the intercept-only model is still the active estimation result, typing estat ic displays its log likelihood; the log likelihood for our fitted model, -6998.1054, was reported in the logistic output above. We can use these two values to calculate the pseudo R-squared.

. estat ic

Akaike's information criterion and Bayesian information criterion

-----------------------------------------------------------------------------
       Model |          N   ll(null)  ll(model)      df        AIC        BIC
-------------+---------------------------------------------------------------
     reduced |     10,349  -7049.666  -7049.666       1   14101.33   14108.58
-----------------------------------------------------------------------------
Note: BIC uses N = number of observations. See [R] IC note.

. disp "Pseudo R2 = " %9.4f 1 - (-6998.105 / -7049.666)
Pseudo R2 =    0.0073

Stata reports McFadden's pseudo R-squared, but there are many others. You can use the ssc install command to install the community-contributed command fitstat, which reports many different kinds of pseudo R-squareds.

. ssc install fitstat, replace
. quietly logistic highbp i.diabetes
. fitstat

Measures of Fit for logistic of highbp

Log-Lik Intercept Only:    -7049.666     Log-Lik Full Model:        -6998.105
D(10346):                  13996.211     LR(1):                       103.122
                                         Prob > LR:                     0.000
McFadden's R2:                 0.007     McFadden's Adj R2:             0.007
Maximum Likelihood R2:         0.010     Cragg & Uhler's R2:            0.013
McKelvey and Zavoina's R2:     0.012     Efron's R2:                    0.010
Variance of y*:                3.331     Variance of error:             3.290
Count R2:                      0.591     Adj Count R2:                  0.033
AIC:                           1.353     AIC*n:                     14002.211
BIC:                      -81648.888     BIC':                        -93.877

It is tempting to interpret the pseudo R-squared using the “proportion of variance explained” language we use for linear regression. Unfortunately, that interpretation is incorrect, and there is no general, intuitive interpretation of the pseudo R-squared statistic. It can, however, be used to select the best-fitting model among a collection of models with the same outcome variable: models with larger values of the same pseudo R-squared statistic fit better than models with smaller values.
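
For example, because logistic stores McFadden's pseudo R-squared in e(r2_p), we could compare candidate models like this. This is a minimal sketch; the second model is purely illustrative.

* Comparing McFadden's pseudo R-squared across models with the same outcome
quietly logistic highbp i.diabetes
display "diabetes only:  " %6.4f e(r2_p)
quietly logistic highbp age i.diabetes
display "age + diabetes: " %6.4f e(r2_p)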

We can use the results of our model to predict the probability that a person has hypertension. The predict prob, pr command generates a new variable named prob that contains the predicted probability of hypertension for each person. We can then generate a variable named highbp_pr that classifies each person as hypertensive if their predicted probability is at least 0.50 and as not hypertensive otherwise. Finally, we can list a few observations to check our predictions.

. quietly logistic highbp i.diabetes
. predict prob, pr
(2 missing values generated)

. generate highbp_pr = (prob >= 0.5)
. list highbp prob highbp_pr if inlist(_n,2,4,6,22), abbrev(10)

     +--------------------------------------+
     |       highbp       prob   highbp_pr  |
     |--------------------------------------|
  2. | Normotensive   .4116751           0  |
  4. | Hypertensive   .6432863           1  |
  6. | Hypertensive   .4116751           0  |
 22. | Normotensive   .6432863           1  |
     +--------------------------------------+

The person in observation 2 does not have hypertension, their predicted probability of hypertension is 0.412, and we classified them correctly as not having hypertension.

The person in observation 4 does have hypertension, their predicted probability of hypertension is 0.643, and we classified them correctly as having hypertension.

The person in observation 6 has hypertension, their predicted probability of hypertension is 0.412, and we classified them incorrectly as not having hypertension.

The person in observation 22 does not have hypertension, their predicted probability of hypertension is 0.643, and we classified them incorrectly as having hypertension.

Our model does not predict hypertension perfectly. We can calculate the proportion of correctly classified people by tabulating the observed and predicted number of people with hypertension.

. tabulate highbp_pr highbp

           |  High blood pressure
 highbp_pr | Normotens  Hypertens |     Total
-----------+----------------------+----------
         0 |     5,795      4,055 |     9,850
         1 |       180        321 |       501
-----------+----------------------+----------
     Total |     5,975      4,376 |    10,351

Our model correctly predicted 5,795 people without hypertension and 321 people with hypertension, which means that our model predicted 59.1% of the observations correctly. (Note that the table includes 10,351 people rather than the 10,349 used to fit the model: diabetes is missing for 2 people, and because Stata treats missing values as larger than any number, the expression (prob >= 0.5) classifies those 2 people as hypertensive.)

. display "Correctly classified = " %9.3f (5795 + 321) / 10351
Correctly classified =     0.591

We can use estat classification to do all these steps for us.

. estat classification

Logistic model for highbp

              -------- True --------
Classified |         D            ~D  |      Total
-----------+--------------------------+-----------
     +     |       321           178  |        499
     -     |      4055          5795  |       9850
-----------+--------------------------+-----------
   Total   |      4376          5973  |      10349

Classified + if predicted Pr(D) >= .5
True D defined as highbp != 0
--------------------------------------------------
Sensitivity                     Pr( +| D)    7.34%
Specificity                     Pr( -|~D)   97.02%
Positive predictive value       Pr( D| +)   64.33%
Negative predictive value       Pr(~D| -)   58.83%
--------------------------------------------------
False + rate for true ~D        Pr( +|~D)    2.98%
False - rate for true D         Pr( -| D)   92.66%
False + rate for classified +   Pr(~D| +)   35.67%
False - rate for classified -   Pr( D| -)   41.17%
--------------------------------------------------
Correctly classified                        59.10%
--------------------------------------------------

estat classification reports many other numbers, such as sensitivity and specificity. Sensitivity is the probability that a person with hypertension is predicted to have hypertension. Specificity is the probability that a person who does not have hypertension is predicted to not have hypertension. We can calculate these values using the frequencies reported in the table.

. display "Sensitivity = " %9.4f  321/4376
Sensitivity =    0.0734

. display "Specificity = " %9.4f  5795/5973
Specificity =    0.9702

We could change the cutoff value for predicting hypertension from 0.5 (the default) to 0.4 to see if this improves our predictions.

. estat classification, cutoff(0.4)

Logistic model for highbp

              -------- True --------
Classified |         D            ~D  |      Total
-----------+--------------------------+-----------
     +     |      4376          5973  |      10349
     -     |         0             0  |          0
-----------+--------------------------+-----------
   Total   |      4376          5973  |      10349

Classified + if predicted Pr(D) >= .4
True D defined as highbp != 0
--------------------------------------------------
Sensitivity                     Pr( +| D)  100.00%
Specificity                     Pr( -|~D)    0.00%
Positive predictive value       Pr( D| +)   42.28%
Negative predictive value       Pr(~D| -)       .%
--------------------------------------------------
False + rate for true ~D        Pr( +|~D)  100.00%
False - rate for true D         Pr( -| D)    0.00%
False + rate for classified +   Pr(~D| +)   57.72%
False - rate for classified -   Pr( D| -)       .%
--------------------------------------------------
Correctly classified                        42.28%
--------------------------------------------------

The percentage of correctly classified observations decreased from 59.10% to 42.28%, the sensitivity increased from 7.34% to 100%, and the specificity decreased from 97.02% to 0%. In other words, changing the cutoff from 0.5 to 0.4 greatly improved the sensitivity at the cost of poorer specificity and overall accuracy. We might ask ourselves whether there is an optimal cutpoint and how we might identify it. Stata's lsens command can help us by plotting sensitivity and specificity against a range of cutpoints.

. lsens

The graph shows the probability cutoff on the horizontal axis and sensitivity/specificity on the vertical axis. Sensitivity is plotted in blue and specificity is plotted in red. Sensitivity equals 1 for small values of the probability cutoff, decreases rapidly for middle values, and equals 0 for larger values. Specificity displays the opposite pattern.
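
One common heuristic, though by no means the only one, is to choose the cutpoint that maximizes Youden's index, sensitivity + specificity - 1. Because lsens can save the quantities it plots with its genprob(), gensens(), and genspec() options, a sketch might look like this; the variable names cut, sens, spec, and youden are our own.

* Find the cutpoint that maximizes Youden's index (a heuristic)
quietly logistic highbp i.diabetes
lsens, genprob(cut) gensens(sens) genspec(spec)
generate youden = sens + spec - 1
gsort -youden
list cut sens spec youden in 1    // cutpoint with the largest index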

The lroc command plots the receiver operating characteristic (ROC) curve with sensitivity on the vertical axis and 1 minus specificity on the horizontal axis. The area under the ROC curve has a range of 0.5 to 1.0 and is reported at the bottom of the graph. A value near 0.5 indicates that the model is no better than a random guess at distinguishing between the two possible outcomes. A model with a value above 0.8 is generally considered to be good at distinguishing between the two outcomes. The area under our ROC curve is 0.5218, which indicates that our model does not distinguish well between the two outcomes.

. lroc

Logistic model for highbp

Number of observations =    10349
Area under ROC curve   =   0.5218

We can check the goodness of fit for our model by using estat gof.

. estat gof

Goodness-of-fit test after logistic model
Variable: highbp

      Number of observations = 10,349
Number of covariate patterns =      2
             Pearson chi2(0) =   0.00
                 Prob > chi2 =      .

The p-value is missing because the degrees of freedom equal the number of covariate patterns minus the number of model parameters, including the constant. Our model has two covariate patterns and two parameters (the indicator for diabetes and the constant), so the degrees of freedom equal 0. Let's fit a model that includes age to increase the number of covariate patterns. Then we will run estat gof again.

. quietly logistic highbp age i.diabetes
. estat gof

Goodness-of-fit test after logistic model
Variable: highbp

      Number of observations = 10,349
Number of covariate patterns =    104
           Pearson chi2(101) = 119.76
                 Prob > chi2 = 0.0981

Now the model has 104 covariate patterns and three parameters (age, the indicator for diabetes, and the constant), so the degrees of freedom equal 101. The p-value is larger than 0.05, so we do not find evidence of poor fit; a p-value less than 0.05 would have suggested poor model fit.
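
We can also count the covariate patterns ourselves to see where 104 comes from. This is a sketch of our own; the variable name pattern is ours.

* Count distinct covariate patterns among estimation-sample observations
quietly logistic highbp age i.diabetes
egen long pattern = group(age diabetes) if e(sample)
quietly tabulate pattern
display "covariate patterns = " r(r)    // 104, matching estat gof
drop pattern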

We can add the group() option to use the Hosmer and Lemeshow goodness-of-fit test. The Hosmer and Lemeshow test is known to be sensitive to the number of quantiles, so you may wish to consider multiple values and check for agreement.

. estat gof, group(20)
note: obs collapsed on 20 quantiles of estimated probabilities.

Goodness-of-fit test after logistic model
Variable: highbp

  Number of observations = 10,349
        Number of groups =     20
Hosmer–Lemeshow chi2(18) =  44.42
             Prob > chi2 = 0.0005

. estat gof, group(30)
note: obs collapsed on 30 quantiles of estimated probabilities.

Goodness-of-fit test after logistic model
Variable: highbp

  Number of observations = 10,349
        Number of groups =     30
Hosmer–Lemeshow chi2(28) =  49.06
             Prob > chi2 = 0.0082

You can use many other options with logistic, and you can read about them by clicking on the link to the logistic manual entry below. You can also watch a demonstration of these commands on YouTube by clicking on the link below.