Logistic regression 1: Introduction

Logistic regression is a popular tool for quantifying the relationship between a binary outcome variable and one or more predictor variables. The outcome variable is often called the dependent variable, and the predictor variables are often called independent variables. The predictor variables can be binary, categorical, or continuous. Here we will learn how to use Stata's logistic command to fit logistic regression models with categorical predictor variables, and we will explore other features later.

Let's begin by opening the nhanes2l dataset. Then let's describe the variables highbp, diabetes, and hlthstat.

. webuse nhanes2l
(Second National Health and Nutrition Examination Survey)

. describe highbp diabetes hlthstat

Variable      Storage   Display    Value
    name         type    format    label       Variable label
-------------------------------------------------------------------------
highbp          byte    %8.0g      *           High blood pressure
diabetes        byte    %12.0g     diabetes    Diabetes status
hlthstat        byte    %20.0g     hlth        Health status

The description tells us that highbp contains information about “High blood pressure”, diabetes contains information about “Diabetes status”, and hlthstat contains information about “Health status”.

Let's tabulate our outcome variable, highbp, to learn more about it.

. tabulate highbp

 High blood |
   pressure |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      5,975       57.72       57.72
          1 |      4,376       42.28      100.00
------------+-----------------------------------
      Total |     10,351      100.00

The table shows that 57.7% of the sample is coded as 0 and 42.3% is coded as 1. Let's label 0 as “Normotensive” and 1 as “Hypertensive”, and then let's type tabulate highbp again to check our work.

. label define highbp 0 "Normotensive" 1 "Hypertensive"
. label values highbp highbp

. tabulate highbp

  High blood |
    pressure |      Freq.     Percent        Cum.
-------------+-----------------------------------
Normotensive |      5,975       57.72       57.72
Hypertensive |      4,376       42.28      100.00
-------------+-----------------------------------
       Total |     10,351      100.00

Next let's type tabulate diabetes highbp, row to cross-tabulate the two variables. The row option tells Stata to display row percentages in addition to the frequencies.

. tabulate diabetes highbp, row

+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

    Diabetes |  High blood pressure
      status | Normotens  Hypertens |     Total
-------------+----------------------+----------
Not diabetic |     5,795      4,055 |     9,850
             |     58.83      41.17 |    100.00
-------------+----------------------+----------
    Diabetic |       178        321 |       499
             |     35.67      64.33 |    100.00
-------------+----------------------+----------
       Total |     5,973      4,376 |    10,349
             |     57.72      42.28 |    100.00

The table shows that 41.2% of people without diabetes have hypertension, while 64.3% of people with diabetes have hypertension. This suggests that diabetes status may be a predictor of hypertension.
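
If we want a quick formal check of this association before modeling it, we can add tabulate's chi2 option, which appends a Pearson chi-squared test of independence to the table. This is our addition here, not part of the original commands; chi2 is a standard tabulate option.

* Cross-tabulation with a Pearson chi-squared test of independence
tabulate diabetes highbp, row chi2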

Let's test this hypothesis by fitting a logistic regression model using highbp as the binary outcome variable. We include the i. prefix for diabetes, which is factor-variable notation that tells Stata to treat diabetes as a categorical (here binary) predictor variable.

. logistic highbp i.diabetes

Logistic regression                                     Number of obs = 10,349
                                                        LR chi2(1)    = 103.12
                                                        Prob > chi2   = 0.0000
Log likelihood = -6998.1054                             Pseudo R2     = 0.0073

------------------------------------------------------------------------------
      highbp | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
    diabetes |
    Diabetic |   2.577195   .2465557     9.90   0.000     2.136556     3.10871
       _cons |   .6997412   .0143263   -17.44   0.000     .6722181    .7283911
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

The output includes a column labeled Odds ratio, and the odds ratio for diabetes is 2.58. The literal interpretation is that the odds of having hypertension for someone with diabetes are 2.58 times the odds for someone without diabetes. It is tempting to interpret the odds ratio as a relative risk, or risk ratio, but that is not correct, because odds are not the same as risk or probability.

Let's use Stata's cs command to clarify the difference between odds and risk (or probability).

. cs highbp diabetes, or

                 |   Diabetes status      |
                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+-----------
           Cases |       321        4055  |       4376
        Noncases |       178        5795  |       5973
-----------------+------------------------+-----------
           Total |       499        9850  |      10349
                 |                        |
            Risk |  .6432866    .4116751  |   .4228428
                 |                        |
                 |      Point estimate    |    [95% conf. interval]
                 |------------------------+------------------------
 Risk difference |         .2316114       |    .1884724    .2747505
      Risk ratio |         1.562607       |    1.457737    1.675022
 Attr. frac. ex. |          .360044       |     .314005    .4029931
 Attr. frac. pop |         .0264109       |
      Odds ratio |         2.577197       |    2.137069     3.10796  (Cornfield)
-----------------+-------------------------------------------------
                               chi2(1) = 104.40  Pr>chi2 = 0.0000

The output tells us that 499 people in our sample have diabetes and 9,850 do not have diabetes. The output also tells us that 321 people with diabetes also have hypertension. So the risk (or probability) of having hypertension for someone with diabetes equals 321/499 = 0.64. Similarly, the risk of having hypertension for someone without diabetes is 4055/9850 = 0.41. The ratio of these risks is labeled Risk ratio in the output and equals 0.64/0.41 = 1.56.

The odds are calculated differently. The odds of having hypertension for someone with diabetes are 321/178 = 1.80. And the odds of having hypertension for someone without diabetes are 4055/5795 = 0.70. The ratio of these odds is labeled Odds ratio in the output and equals 1.80/0.70 = 2.58. This is the same as the odds ratio reported in our logistic regression output.
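
Because every number in these calculations appears in the cs table above, we can verify the arithmetic ourselves with display. This is a quick sketch of our own; the approximate results are shown as comments.

* Risks and odds computed from the cell counts in the table above
display "risk, diabetic:     " 321/499                  // about 0.64
display "risk, not diabetic: " 4055/9850                // about 0.41
display "risk ratio:         " (321/499)/(4055/9850)    // about 1.56
display "odds, diabetic:     " 321/178                  // about 1.80
display "odds, not diabetic: " 4055/5795                // about 0.70
display "odds ratio:         " (321/178)/(4055/5795)    // about 2.58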

It is often easier to interpret the results of logistic regression models with the margins command. margins diabetes displays the predicted probability that the outcome variable equals 1 for each category of the predictor variable. Let's try this for our model.

. margins diabetes

Adjusted predictions                                    Number of obs = 10,349
Model VCE: OIM

Expression: Pr(highbp), predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
    diabetes |
Not diabetic |   .4116751   .0049587    83.02   0.000     .4019563     .421394
    Diabetic |   .6432864   .0214443    30.00   0.000     .6012563    .6853164
------------------------------------------------------------------------------

The output shows us that the expected probability of hypertension equals 0.41 for someone without diabetes and equals 0.64 for someone with diabetes. We can use marginsplot to graph the results.

. marginsplot

Variables that uniquely identify margins: diabetes

The default graph is often called a profile plot. We can add a few options to create a bar graph if we prefer.

. marginsplot, recast(bar) xdimension(diabetes)
>              horizontal plotopts(barwidth(.9))

Variables that uniquely identify margins: diabetes

We can also use margins followed by nlcom to estimate the relative risk and the risk difference. The margins option post stores the results in memory, and the coeflegend option displays the names of the estimates so that we can refer to them with nlcom.

. margins diabetes, post coeflegend

Adjusted predictions                                    Number of obs = 10,349
Model VCE: OIM

Expression: Pr(highbp), predict()

------------------------------------------------------------------
             |     Margin   Legend
-------------+----------------------------------------------------
    diabetes |
Not diabetic |   .4116751   _b[0bn.diabetes]
    Diabetic |   .6432864   _b[1.diabetes]
------------------------------------------------------------------

. nlcom (risk_ratio: _b[1.diabetes] / _b[0bn.diabetes])

  risk_ratio: _b[1.diabetes] / _b[0bn.diabetes]

------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
  risk_ratio |   1.562607   .0553865    28.21   0.000     1.454051    1.671162
------------------------------------------------------------------------------

. nlcom (risk_diff: _b[1.diabetes] - _b[0bn.diabetes])

   risk_diff: _b[1.diabetes] - _b[0bn.diabetes]

------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
   risk_diff |   .2316112   .0220101    10.52   0.000     .1884721    .2747503
------------------------------------------------------------------------------

The logistic regression output also reported a standard error, z statistic, p-value, and 95% confidence interval for the odds ratio. The z statistic is not the ratio of the odds ratio to its standard error; it is the ratio of the coefficient to its standard error. The coefficient is the natural logarithm of the odds ratio, and we can view it and its standard error by using the coef option (or the logit command). We can then use the coefficient and its standard error to compute the z statistic, p-value, and 95% confidence bounds. The z statistic and p-value test the null hypothesis that the coefficient equals 0 or, equivalently, that the odds ratio equals 1. The confidence interval is computed on the coefficient scale and then transformed to a confidence interval for the odds ratio.

. logistic highbp i.diabetes, coef coeflegend

Logistic regression                                     Number of obs = 10,349
                                                        LR chi2(1)    = 103.12
                                                        Prob > chi2   = 0.0000
Log likelihood = -6998.1054                             Pseudo R2     = 0.0073

------------------------------------------------------------------------------
      highbp | Coefficient  Legend
-------------+----------------------------------------------------------------
    diabetes |
    Diabetic |   .9467015   _b[1.diabetes]
       _cons |  -.3570448   _b[_cons]
------------------------------------------------------------------------------

. display "Odds ratio = " %9.6f exp(_b[1.diabetes])
Odds ratio =  2.577195

. display "z = " %9.2f _b[1.diabetes] / _se[1.diabetes]
z =      9.90

. display "p-value = " %9.3f 2*(1 - normal(_b[1.diabetes] / _se[1.diabetes]))
p-value =     0.000

. display "lower 95% confidence bound = " %9.6f exp(_b[1.diabetes] + invnormal(0.025)*_se[1.diabetes])
lower 95% confidence bound =  2.136556

. display "upper 95% confidence bound = " %9.5f exp(_b[1.diabetes] + invnormal(0.975)*_se[1.diabetes])
upper 95% confidence bound =   3.10871

The top of the logistic regression output shows the sample size; a likelihood-ratio test labeled LR chi2(1); its p-value labeled Prob > chi2; and a pseudo R-squared labeled Pseudo R2.

The likelihood-ratio chi-squared statistic compares the specified model with a model that contains no predictor variables, often called the intercept-only or null model. We can calculate this test manually by fitting our model, storing the results in memory, fitting the intercept-only model, storing those results in memory, and finally using lrtest to compare the two models.

. quietly logistic highbp i.diabetes
. estimates store full
. quietly logistic highbp if e(sample)
. estimates store reduced
. lrtest full reduced

Likelihood-ratio test
Assumption: reduced nested within full

 LR chi2(1) = 103.12
Prob > chi2 = 0.0000
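
The statistic is simply twice the difference between the two log likelihoods, so we can also reproduce it from the values reported above. A quick sketch:

* LR chi2 = 2 * (ll(full) - ll(reduced))
display "LR chi2(1) = " %6.2f 2*(-6998.1054 - (-7049.666))    // 103.12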

The pseudo R-squared also compares the fitted model with the intercept-only (null) model using each model's log likelihood. Because the intercept-only model is still the active estimation result, typing estat ic displays its log likelihood; the log likelihood for our fitted model, -6998.1054, was reported in the logistic output above. We can use these two values to calculate the pseudo R-squared.

. estat ic

Akaike's information criterion and Bayesian information criterion

-----------------------------------------------------------------------------
       Model |          N   ll(null)  ll(model)      df        AIC        BIC
-------------+---------------------------------------------------------------
     reduced |     10,349  -7049.666  -7049.666       1   14101.33   14108.58
-----------------------------------------------------------------------------
Note: BIC uses N = number of observations. See [R] IC note.

. disp "Pseudo R2 = " %9.4f 1 - (-6998.105 / -7049.666)
Pseudo R2 =    0.0073

Stata reports McFadden's pseudo R-squared, but there are many others. You can use the ssc install command to install the community-contributed command fitstat, which reports many different kinds of pseudo R-squareds.

. ssc install fitstat, replace
. quietly logistic highbp i.diabetes
. fitstat

Measures of Fit for logistic of highbp

Log-Lik Intercept Only:    -7049.666     Log-Lik Full Model:        -6998.105
D(10346):                  13996.211     LR(1):                       103.122
                                         Prob > LR:                     0.000
McFadden's R2:                 0.007     McFadden's Adj R2:             0.007
Maximum Likelihood R2:         0.010     Cragg & Uhler's R2:            0.013
McKelvey and Zavoina's R2:     0.012     Efron's R2:                    0.010
Variance of y*:                3.331     Variance of error:             3.290
Count R2:                      0.591     Adj Count R2:                  0.033
AIC:                           1.353     AIC*n:                     14002.211
BIC:                      -81648.888     BIC':                        -93.877

It is tempting to interpret the pseudo R-squared using the “proportion of variance explained” language we use for linear regression. Unfortunately, that interpretation is incorrect, and there is no general, intuitive interpretation of the pseudo R-squared statistic. It can, however, be used to select the best-fitting model among a collection of models with the same outcome variable: models with larger values of the same pseudo R-squared statistic fit better than models with smaller values.
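
For example, because logistic stores McFadden's pseudo R-squared in e(r2_p), we could compare candidate models like this. This is a minimal sketch; the second model is purely illustrative.

* Comparing McFadden's pseudo R-squared across models with the same outcome
quietly logistic highbp i.diabetes
display "diabetes only:  " %6.4f e(r2_p)
quietly logistic highbp age i.diabetes
display "age + diabetes: " %6.4f e(r2_p)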

We can use the results of our model to predict the probability that a person has hypertension. The predict prob, pr command generates a new variable named prob that contains the predicted probability of hypertension for each person. We can then generate a variable named highbp_pr that classifies each person as hypertensive if their predicted probability is at least 0.50 and as not hypertensive otherwise. Finally, we can list a few observations to check our predictions.

. quietly logistic highbp i.diabetes
. predict prob, pr
(2 missing values generated)

. generate highbp_pr = (prob >= 0.5)
. list highbp prob highbp_pr if inlist(_n,2,4,6,22), abbrev(10)

     +--------------------------------------+
     |       highbp       prob   highbp_pr  |
     |--------------------------------------|
  2. | Normotensive   .4116751           0  |
  4. | Hypertensive   .6432863           1  |
  6. | Hypertensive   .4116751           0  |
 22. | Normotensive   .6432863           1  |
     +--------------------------------------+

The person in observation 2 does not have hypertension, their predicted probability of hypertension is 0.412, and we classified them correctly as not having hypertension.

The person in observation 4 does have hypertension, their predicted probability of hypertension is 0.643, and we classified them correctly as having hypertension.

The person in observation 6 has hypertension, their predicted probability of hypertension is 0.412, and we classified them incorrectly as not having hypertension.

The person in observation 22 does not have hypertension, their predicted probability of hypertension is 0.643, and we classified them incorrectly as having hypertension.

Our model does not predict hypertension perfectly. We can calculate the proportion of correctly classified people by tabulating the observed and predicted number of people with hypertension.

. tabulate highbp_pr highbp

           |  High blood pressure
 highbp_pr | Normotens  Hypertens |     Total
-----------+----------------------+----------
         0 |     5,795      4,055 |     9,850
         1 |       180        321 |       501
-----------+----------------------+----------
     Total |     5,975      4,376 |    10,351

Our model correctly predicted 5,795 people without hypertension and 321 people with hypertension, which means that our model predicted 59.1% of the observations correctly. (Note that the table includes 10,351 people rather than the 10,349 used to fit the model: diabetes is missing for 2 people, and because Stata treats missing values as larger than any number, the expression (prob >= 0.5) classifies those 2 people as hypertensive.)

. display "Correctly classified = " %9.3f (5795 + 321) / 10351
Correctly classified =     0.591

We can use estat classification to do all these steps for us.

. estat classification

Logistic model for highbp

              -------- True --------
Classified |         D            ~D  |      Total
-----------+--------------------------+-----------
     +     |       321           178  |        499
     -     |      4055          5795  |       9850
-----------+--------------------------+-----------
   Total   |      4376          5973  |      10349

Classified + if predicted Pr(D) >= .5
True D defined as highbp != 0
--------------------------------------------------
Sensitivity                     Pr( +| D)    7.34%
Specificity                     Pr( -|~D)   97.02%
Positive predictive value       Pr( D| +)   64.33%
Negative predictive value       Pr(~D| -)   58.83%
--------------------------------------------------
False + rate for true ~D        Pr( +|~D)    2.98%
False - rate for true D         Pr( -| D)   92.66%
False + rate for classified +   Pr(~D| +)   35.67%
False - rate for classified -   Pr( D| -)   41.17%
--------------------------------------------------
Correctly classified                        59.10%
--------------------------------------------------

estat classification reports many other numbers, such as sensitivity and specificity. Sensitivity is the probability that a person with hypertension is predicted to have hypertension. Specificity is the probability that a person who does not have hypertension is predicted to not have hypertension. We can calculate these values using the frequencies reported in the table.

. display "Sensitivity = " %9.4f  321/4376
Sensitivity =    0.0734

. display "Specificity = " %9.4f  5795/5973
Specificity =    0.9702

We could change the cutoff value for predicting hypertension from 0.5 (the default) to 0.4 to see if this improves our predictions.

. estat classification, cutoff(0.4)

Logistic model for highbp

              -------- True --------
Classified |         D            ~D  |      Total
-----------+--------------------------+-----------
     +     |      4376          5973  |      10349
     -     |         0             0  |          0
-----------+--------------------------+-----------
   Total   |      4376          5973  |      10349

Classified + if predicted Pr(D) >= .4
True D defined as highbp != 0
--------------------------------------------------
Sensitivity                     Pr( +| D)  100.00%
Specificity                     Pr( -|~D)    0.00%
Positive predictive value       Pr( D| +)   42.28%
Negative predictive value       Pr(~D| -)       .%
--------------------------------------------------
False + rate for true ~D        Pr( +|~D)  100.00%
False - rate for true D         Pr( -| D)    0.00%
False + rate for classified +   Pr(~D| +)   57.72%
False - rate for classified -   Pr( D| -)       .%
--------------------------------------------------
Correctly classified                        42.28%
--------------------------------------------------

The percentage of correctly classified observations decreased from 59.10% to 42.28%, the sensitivity increased from 7.34% to 100%, and the specificity decreased from 97.02% to 0%. In other words, changing the cutoff from 0.5 to 0.4 greatly improved the sensitivity at the cost of poorer specificity and overall accuracy. We might ask ourselves whether there is an optimal cutpoint and how we might identify it. Stata's lsens command can help us by plotting sensitivity and specificity against a range of cutpoints.

. lsens

The graph shows the probability cutoff on the horizontal axis and sensitivity/specificity on the vertical axis. Sensitivity is plotted in blue and specificity is plotted in red. Sensitivity equals 1 for small values of the probability cutoff, decreases rapidly for middle values, and equals 0 for larger values. Specificity displays the opposite pattern.
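
One common heuristic, though by no means the only one, is to choose the cutpoint that maximizes Youden's index, sensitivity + specificity - 1. Because lsens can save the quantities it plots with its genprob(), gensens(), and genspec() options, a sketch might look like this; the variable names cut, sens, spec, and youden are our own.

* Find the cutpoint that maximizes Youden's index (a heuristic)
quietly logistic highbp i.diabetes
lsens, genprob(cut) gensens(sens) genspec(spec)
generate youden = sens + spec - 1
gsort -youden
list cut sens spec youden in 1    // cutpoint with the largest index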

The lroc command plots the receiver operating characteristic (ROC) curve with sensitivity on the vertical axis and 1 minus specificity on the horizontal axis. The area under the ROC curve has a range of 0.5 to 1.0 and is reported at the bottom of the graph. A value near 0.5 indicates that the model is no better than a random guess at distinguishing between the two possible outcomes. A model with a value above 0.8 is generally considered to be good at distinguishing between the two outcomes. The area under our ROC curve is 0.5218, which indicates that our model does not distinguish well between the two outcomes.

. lroc

Logistic model for highbp

Number of observations =    10349
Area under ROC curve   =   0.5218

We can check the goodness of fit for our model by using estat gof.

. estat gof

Goodness-of-fit test after logistic model
Variable: highbp

      Number of observations = 10,349
Number of covariate patterns =      2
             Pearson chi2(0) =   0.00
                 Prob > chi2 =      .

The p-value is missing because the degrees of freedom equal the number of covariate patterns minus the number of model parameters, including the constant. Our model has two covariate patterns and two parameters (the indicator for diabetes and the constant), so the degrees of freedom equal 0. Let's fit a model that includes age to increase the number of covariate patterns. Then we will run estat gof again.

. quietly logistic highbp age i.diabetes
. estat gof

Goodness-of-fit test after logistic model
Variable: highbp

      Number of observations = 10,349
Number of covariate patterns =    104
           Pearson chi2(101) = 119.76
                 Prob > chi2 = 0.0981

Now the model has 104 covariate patterns and three parameters (age, the indicator for diabetes, and the constant), so the degrees of freedom equal 101. The p-value is larger than 0.05, so we do not find evidence of poor fit; a p-value less than 0.05 would have suggested poor model fit.
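
We can also count the covariate patterns ourselves to see where 104 comes from. This is a sketch of our own; the variable name pattern is ours.

* Count distinct covariate patterns among estimation-sample observations
quietly logistic highbp age i.diabetes
egen long pattern = group(age diabetes) if e(sample)
quietly tabulate pattern
display "covariate patterns = " r(r)    // 104, matching estat gof
drop pattern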

We can add the group() option to use the Hosmer and Lemeshow goodness-of-fit test. The Hosmer and Lemeshow test is known to be sensitive to the number of quantiles, so you may wish to consider multiple values and check for agreement.

. estat gof, group(20)
note: obs collapsed on 20 quantiles of estimated probabilities.

Goodness-of-fit test after logistic model
Variable: highbp

  Number of observations = 10,349
        Number of groups =     20
Hosmer–Lemeshow chi2(18) =  44.42
             Prob > chi2 = 0.0005

. estat gof, group(30)
note: obs collapsed on 30 quantiles of estimated probabilities.

Goodness-of-fit test after logistic model
Variable: highbp

  Number of observations = 10,349
        Number of groups =     30
Hosmer–Lemeshow chi2(28) =  49.06
             Prob > chi2 = 0.0082

You can use many other options with logistic, and you can read about them by clicking on the link to the logistic manual entry below. You can also watch a demonstration of these commands on YouTube by clicking on the link below.