Logistic regression is a popular tool for quantifying the relationship between a binary outcome variable and one or more predictor variables. The outcome variable is often called the dependent variable, and the predictor variables are often called independent variables. The predictor variables can be binary, categorical, or continuous. Here we will learn how to use Stata's logistic command to fit logistic regression models with continuous and categorical predictor variables, and we will explore other features later.
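As a reminder, logistic regression models the log odds of a binary outcome as a linear function of the predictors,

$$ \ln\!\left(\frac{\Pr(y=1\mid \mathbf{x})}{1-\Pr(y=1\mid \mathbf{x})}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k , $$

and exponentiating a coefficient yields an odds ratio, the metric in which logistic reports results by default.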
Let's begin by opening the nhanes2l dataset. Then let's describe the variables highbp, age, sex, weight, and hlthstat.
. webuse nhanes2l
(Second National Health and Nutrition Examination Survey)

. describe highbp age sex weight hlthstat
Variable      Storage   Display    Value
    name         type    format    label      Variable label
---------------------------------------------------------------------
highbp           byte    %8.0g              * High blood pressure
age              byte    %9.0g                Age (years)
sex              byte    %9.0g      sex       Sex
weight          float    %9.0g                Weight (kg)
hlthstat         byte    %20.0g     hlth      Health status
The description tells us that highbp contains information about “High blood pressure”, age and weight record “Age (years)” and “Weight (kg)”, sex contains information about “Sex”, and hlthstat contains information about “Health status”.
Let's tabulate our outcome variable, highbp, to learn more about it.
. tabulate highbp
 High blood |
   pressure |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      5,975       57.72       57.72
          1 |      4,376       42.28      100.00
------------+-----------------------------------
      Total |     10,351      100.00
The table shows that 57.7% of the sample is coded as 0 and 42.3% is coded as 1. Let's label 0 as “Normotensive” and 1 as “Hypertensive”, and then let's type tabulate highbp again to check our work.
. label define highbp 0 "Normotensive" 1 "Hypertensive"

. label values highbp highbp

. tabulate highbp
  High blood |
    pressure |      Freq.     Percent        Cum.
-------------+-----------------------------------
Normotensive |      5,975       57.72       57.72
Hypertensive |      4,376       42.28      100.00
-------------+-----------------------------------
       Total |     10,351      100.00
Next let's use dtable to explore our predictor variables for each category of highbp. You can type help dtable to learn more about dtable.
. dtable age weight i.sex i.hlthstat, by(highbp, tests)
>        continuous(, statistics(mean sd)) nformat(%9.1f mean sd)
>        sformat((%s) semean)
>        factor(, statistics(fvfrequency fvpercent))
>        nformat(%16.0fc fvfrequency) nformat(%9.1f fvpercent)
>        title(Table 1: Descriptive statistics by hypertension status)
>        note("Note: Mean (SD) for continuous variables, frequency (%) for categorical variables")
>        export(table1.docx, replace)
note: using test regress across levels of highbp for age and weight.
note: using test pearson across levels of highbp for sex and hlthstat.

Table 1: Descriptive statistics by hypertension status
----------------------------------------------------------------------------------
                            High blood pressure
                 Normotensive       Hypertensive       Total               Test
----------------------------------------------------------------------------------
N                5,975 (57.7%)      4,376 (42.3%)      10,351 (100.0%)
Age (years)       42.2 (16.8)        55.0 (14.9)         47.6 (17.2)      <0.001
Weight (kg)       68.3 (13.6)        76.9 (16.2)         71.9 (15.4)      <0.001
Sex
  Male           2,611 (43.7%)      2,304 (52.7%)       4,915 (47.5%)     <0.001
  Female         3,364 (56.3%)      2,072 (47.3%)       5,436 (52.5%)
Health status
  Excellent      1,649 (27.7%)        758 (17.3%)       2,407 (23.3%)     <0.001
  Very good      1,666 (27.9%)        925 (21.2%)       2,591 (25.1%)
  Good           1,572 (26.4%)      1,366 (31.2%)       2,938 (28.4%)
  Fair             766 (12.8%)        904 (20.7%)       1,670 (16.2%)
  Poor             310 (5.2%)         419 (9.6%)          729 (7.1%)
----------------------------------------------------------------------------------
Note: Mean (SD) for continuous variables, frequency (%) for categorical variables
All the predictor variables have small p-values in these unadjusted comparisons, which suggests that we should consider them in more complex models.
Let's fit a model that includes each predictor variable with no interactions.
. logistic highbp c.age i.sex c.weight i.hlthstat

Logistic regression                                    Number of obs = 10,335
                                                       LR chi2(7)    = 2352.02
                                                       Prob > chi2   = 0.0000
Log likelihood = -5864.7164                            Pseudo R2     = 0.1670
------------------------------------------------------------------------------
      highbp | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   1.049891   .0015677    32.61   0.000     1.046823    1.052969
             |
         sex |
      Female |   1.021781   .0494017     0.45   0.656     .9294014    1.123342
      weight |   1.044387   .0017632    25.73   0.000     1.040937    1.047849
             |
    hlthstat |
   Very good |   1.007905   .0675317     0.12   0.906      .883868    1.149349
        Good |   1.308293   .0845285     4.16   0.000     1.152681    1.484913
        Fair |   1.275262   .0960363     3.23   0.001     1.100266     1.47809
        Poor |   1.250367    .121646     2.30   0.022     1.033298    1.513037
             |
       _cons |   .0025079   .0004114   -36.51   0.000     .0018185    .0034589
------------------------------------------------------------------------------
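Because logistic stores the fitted coefficients on the log-odds scale and displays their exponentiated values, you can recover any odds ratio in the table by exponentiating the corresponding coefficient. For example, for age:

. display %9.6f exp(_b[age])
 1.049891

The result matches the odds ratio for age in the table above.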
Then let's use estat classification to estimate the sensitivity, specificity, and percent correctly classified for our model.
. estat classification
              -------- True --------
Classified |         D            ~D  |      Total
-----------+--------------------------+-----------
     +     |      2687          1457  |       4144
     -     |      1685          4506  |       6191
-----------+--------------------------+-----------
   Total   |      4372          5963  |      10335

--------------------------------------------------
Sensitivity                     Pr( +| D)   61.46%
Specificity                     Pr( -|~D)   75.57%
Positive predictive value       Pr( D| +)   64.84%
Negative predictive value       Pr(~D| -)   72.78%
--------------------------------------------------
False + rate for true ~D        Pr( +|~D)   24.43%
False - rate for true D         Pr( -| D)   38.54%
False + rate for classified +   Pr(~D| +)   35.16%
False - rate for classified -   Pr( D| -)   27.22%
--------------------------------------------------
Correctly classified                        69.60%
--------------------------------------------------
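These rates come directly from the classification table above. For example, sensitivity is the share of true hypertensives (D) that the model classifies as positive. We can verify it, along with the specificity and the percent correctly classified, by hand:

. display %5.2f 100*2687/4372           // sensitivity
61.46

. display %5.2f 100*4506/5963           // specificity
75.57

. display %5.2f 100*(2687+4506)/10335   // percent correctly classified
69.60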
Next let's use etable to gather our results. You can type help etable to learn more about etable. We can include sensitivity, specificity, and percent correctly classified in our table by using the scalars returned by estat classification. You can view all of these scalars by typing return list.
. return list

scalars:
              r(P_1n) =  27.21692779841706
              r(P_0p) =  35.15926640926641
              r(P_0n) =  72.78307220158294
              r(P_1p) =  64.84073359073359
              r(P_n1) =  38.54071363220494
              r(P_p0) =  24.43400972664766
              r(P_n0) =  75.56599027335234
              r(P_p1) =  61.45928636779507
            r(P_corr) =  69.5984518626028

matrices:
            r(ctable) :  3 x 3

. etable, keep(age i.sex weight i.hlthstat) showstars showstarsnote
>         cstat(_r_b, nformat(%9.2f))
>         mstat(N, nformat(%9.0fc) label("Observations"))
>         mstat(chi2, nformat(%9.2f))
>         mstat(df_m, nformat(%9.0f))
>         mstat(p, nformat(%9.4f))
>         mstat(r2_p, nformat(%9.4f))
>         mstat(correct=(r(P_corr)), nformat(%6.2f) label("% correctly classified"))
>         mstat(Sensitivity=(r(P_p1)), nformat(%6.2f))
>         mstat(Specificity=(r(P_n0)), nformat(%6.2f))
>         mstat(aic, nformat(%6.0fc))
>         mstat(bic, nformat(%6.0fc))
------------------------------------------
                                  highbp
------------------------------------------
Age (years)                      1.05 **
Sex
  Female                         1.02
Weight (kg)                      1.04 **
Health status
  Very good                      1.01
  Good                           1.31 **
  Fair                           1.28 **
  Poor                           1.25 *
Observations                    10,335
χ²                             2352.02
Model DF                             7
Model test p-value              0.0000
Pseudo R-squared                0.1670
% correctly classified           69.60
Sensitivity                      61.46
Specificity                      75.57
AIC                             11,745
BIC                             11,803
------------------------------------------
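The AIC and BIC rows are computed from the model's log likelihood with the usual formulas, AIC = -2ll + 2k and BIC = -2ll + k*ln(N), where k is the number of estimated parameters (8 in this model, counting the constant) and N is the number of observations. We can verify the values in the table from the output above:

. display %9.3f 2*5864.7164 + 2*8              // AIC
11745.433

. display %9.3f 2*5864.7164 + 8*ln(10335)      // BIC
11803.379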
Let's consider another model that includes the interaction of age and sex. We can again use estat classification to estimate the sensitivity, specificity, and percent correctly classified for our new model. Then we can type etable, append to append the new model to our table.
. logistic highbp c.age##i.sex c.weight i.hlthstat

Logistic regression                                    Number of obs = 10,335
                                                       LR chi2(8)    = 2444.84
                                                       Prob > chi2   = 0.0000
Log likelihood = -5818.3051                            Pseudo R2     = 0.1736
------------------------------------------------------------------------------
      highbp | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   1.036594   .0020228    18.42   0.000     1.032637    1.040567
             |
         sex |
      Female |   .2559988   .0392532    -8.89   0.000     .1895484     .345745
             |
   sex#c.age |
      Female |   1.027598   .0029315     9.54   0.000     1.021869     1.03336
             |
      weight |   1.044088   .0017719    25.42   0.000     1.040621    1.047567
             |
    hlthstat |
   Very good |   1.019104    .068511     0.28   0.778     .8932953    1.162631
        Good |   1.337438   .0867651     4.48   0.000     1.177748    1.518779
        Fair |   1.300784   .0982347     3.48   0.000     1.121819      1.5083
        Poor |   1.309155   .1276285     2.76   0.006     1.081454    1.584798
             |
       _cons |    .004704   .0008202   -30.74   0.000     .0033424    .0066203
------------------------------------------------------------------------------
. estat classification

              -------- True --------
Classified |         D            ~D  |      Total
-----------+--------------------------+-----------
     +     |      2724          1506  |       4230
     -     |      1648          4457  |       6105
-----------+--------------------------+-----------
   Total   |      4372          5963  |      10335

--------------------------------------------------
Sensitivity                     Pr( +| D)   62.31%
Specificity                     Pr( -|~D)   74.74%
Positive predictive value       Pr( D| +)   64.40%
Negative predictive value       Pr(~D| -)   73.01%
--------------------------------------------------
False + rate for true ~D        Pr( +|~D)   25.26%
False - rate for true D         Pr( -| D)   37.69%
False + rate for classified +   Pr(~D| +)   35.60%
False - rate for classified -   Pr( D| -)   26.99%
--------------------------------------------------
Correctly classified                        69.48%
--------------------------------------------------
. etable, append

-----------------------------------------------
                           highbp     highbp
-----------------------------------------------
Age (years)               1.05 **    1.04 **
Sex
  Female                  1.02       0.26 **
Weight (kg)               1.04 **    1.04 **
Health status
  Very good               1.01       1.02
  Good                    1.31 **    1.34 **
  Fair                    1.28 **    1.30 **
  Poor                    1.25 *     1.31 **
Observations             10,335     10,335
χ²                      2352.02    2444.84
Model DF                      7          8
Model test p-value       0.0000     0.0000
Pseudo R-squared         0.1670     0.1736
% correctly classified    69.60      69.48
Sensitivity               61.46      62.31
Specificity               75.57      74.74
AIC                      11,745     11,655
BIC                      11,803     11,720
-----------------------------------------------
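Before fitting more models, note that in model 2 the odds ratio for age (1.04) applies to males, the base category of sex; the age odds ratio for females is the product of the age and sex#age odds ratios, about 1.037 × 1.028 ≈ 1.065. While model 2 is still the current set of estimates, one way to compute that product directly, along with a confidence interval, is lincom, which reports results as odds ratios after logistic (a sketch; output not shown):

. lincom _b[age] + _b[2.sex#c.age]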
We can use the same strategy to fit other candidate models and append the results to our table. Note that the quietly block keeps the output from being displayed in the Results window.
. quietly {
        logistic highbp c.age##c.weight i.sex##i.hlthstat
        estat classification
        etable, append
        logistic highbp c.age c.weight i.sex##i.hlthstat
        estat classification
        etable, append
        logistic highbp c.age##i.sex c.age##c.weight i.sex##i.hlthstat
        estat classification
  }
Let's add some features to our final table.
. etable, append
>         keep(age i.sex weight i.hlthstat c.age##i.sex c.age##c.weight i.sex##i.hlthstat)
>         column(index) showstars showstarsnote
>         cstat(_r_b, nformat(%9.2f))
>         mstat(N, nformat(%9.0fc) label("Observations"))
>         mstat(chi2, nformat(%9.2f))
>         mstat(df_m, nformat(%9.0f))
>         mstat(p, nformat(%9.4f))
>         mstat(r2_p, nformat(%9.4f))
>         mstat(correct=(r(P_corr)), nformat(%6.2f) label("% correctly classified"))
>         mstat(Sensitivity=(r(P_p1)), nformat(%6.2f))
>         mstat(Specificity=(r(P_n0)), nformat(%6.2f))
>         mstat(aic, nformat(%6.0fc))
>         mstat(bic, nformat(%6.0fc))
>         title("Table 2: Candidate logistic models for hypertension")

Table 2: Candidate logistic models for hypertension
----------------------------------------------------------------------------------
                                  1          2          3          4          5
----------------------------------------------------------------------------------
Age (years)                    1.05 **    1.04 **    1.10 **    1.05 **    1.08 **
Sex
  Female                       1.02       0.26 **    0.76 **    0.72 **    0.31 **
Weight (kg)                    1.04 **    1.04 **    1.08 **    1.04 **    1.07 **
Health status
  Very good                    1.01       1.02       0.94       0.93       0.97
  Good                         1.31 **    1.34 **    1.20 *     1.16       1.28 **
  Fair                         1.28 **    1.30 **    0.90       0.85       1.02
  Poor                         1.25 *     1.31 **    0.90       0.86       1.06
Age (years)                    1.05 **    1.04 **    1.10 **    1.05 **    1.08 **
Sex
  Female                       1.02       0.26 **    0.76 **    0.72 **    0.31 **
Sex # Age (years)
  Female                                  1.03 **                          1.02 **
Age (years)                    1.05 **    1.04 **    1.10 **    1.05 **    1.08 **
Weight (kg)                    1.04 **    1.04 **    1.08 **    1.04 **    1.07 **
Age (years) # Weight (kg)                            1.00 **               1.00 **
Sex
  Female                       1.02       0.26 **    0.76 **    0.72 **    0.31 **
Health status
  Very good                    1.01       1.02       0.94       0.93       0.97
  Good                         1.31 **    1.34 **    1.20 *     1.16       1.28 **
  Fair                         1.28 **    1.30 **    0.90       0.85       1.02
  Poor                         1.25 *     1.31 **    0.90       0.86       1.06
Sex # Health status
  Female # Very good                                 1.19       1.23       1.13
  Female # Good                                      1.25       1.33 *     1.12
  Female # Fair                                      2.00 **    2.21 **    1.60 **
  Female # Poor                                      2.07 **    2.28 **    1.57 *
Observations                  10,335     10,335     10,335     10,335     10,335
χ²                           2352.02    2444.84    2438.90    2392.57    2477.47
Model DF                           7          8         12         11         13
Model test p-value            0.0000     0.0000     0.0000     0.0000     0.0000
Pseudo R-squared              0.1670     0.1736     0.1732     0.1699     0.1759
% correctly classified         69.60      69.48      69.96      69.79      69.62
Sensitivity                    61.46      62.31      65.81      62.24      64.89
Specificity                    75.57      74.74      73.00      75.33      73.08
AIC                           11,745     11,655     11,669     11,713     11,632
BIC                           11,803     11,720     11,763     11,800     11,733
----------------------------------------------------------------------------------
Normally, we would use the export() option to export our table to a Microsoft Word document. But our table is wide, so let's first specify the layout(autofitcontents) option with collect style putdocx, which redistributes the column widths, and then export the table with collect export.
. collect style putdocx, layout(autofitcontents)
. collect export table2.docx, as(docx) replace
(collection ETable exported to file table2.docx)
Just to be thorough, let's use a likelihood-ratio test to compare model 5 with model 1.
. quietly logistic highbp c.age i.sex c.weight i.hlthstat

. estimates store reduced

. quietly logistic highbp c.age##i.sex c.age##c.weight i.sex##i.hlthstat

. estimates store full

. lrtest full reduced

Likelihood-ratio test
Assumption: reduced nested within full

 LR chi2(6) = 125.45
Prob > chi2 = 0.0000
The small p-value suggests that the full model that includes interactions fits the data better than the reduced model that does not include interactions.
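The LR statistic is simply twice the difference between the two models' log likelihoods (-5864.7164 for the reduced model, as reported earlier, and -5801.9907 for the full model, as reported below), and its degrees of freedom equal the difference in the number of model parameters (13 - 7 = 6). We can verify the statistic by hand:

. display 2*(-5801.9907 - (-5864.7164))
125.4514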
Let's explore the results of the full model further.
. logistic highbp c.age##i.sex c.age##c.weight i.sex##i.hlthstat
note: age omitted because of collinearity.

Logistic regression                                    Number of obs = 10,335
                                                       LR chi2(13)   = 2477.47
                                                       Prob > chi2   = 0.0000
Log likelihood = -5801.9907                            Pseudo R2     = 0.1759
-------------------------------------------------------------------------------
       highbp | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
--------------+----------------------------------------------------------------
          age |   1.075828   .0089424     8.79   0.000     1.058443    1.093498
              |
          sex |
       Female |   .3095554    .054965    -6.60   0.000     .2185738    .4384082
              |
    sex#c.age |
       Female |   1.019876   .0032379     6.20   0.000      1.01355    1.026242
              |
          age |          1   (omitted)
       weight |   1.067281   .0057565    12.07   0.000     1.056058    1.078623
              |
       c.age#
     c.weight |   .9995402   .0001042    -4.41   0.000     .9993359    .9997446
              |
     hlthstat |
    Very good |   .9721125   .0863504    -0.32   0.750     .8167823    1.156982
         Good |   1.282753   .1120339     2.85   0.004     1.080937     1.52225
         Fair |   1.023111   .1083771     0.22   0.829      .831296    1.259185
         Poor |    1.06005    .138177     0.45   0.655     .8210553    1.368611
              |
 sex#hlthstat |
     Female #
    Very good |   1.129374   .1548714     0.89   0.375     .8632015    1.477622
  Female#Good |   1.116131   .1472965     0.83   0.405     .8617503    1.445602
  Female#Fair |   1.602814   .2444008     3.09   0.002     1.188748    2.161107
  Female#Poor |   1.569438   .3080657     2.30   0.022     1.068222    2.305828
              |
        _cons |   .0008451   .0003694   -16.19   0.000     .0003587    .0019908
-------------------------------------------------------------------------------
Some of the p-values for hlthstat and for the interaction of sex and hlthstat are less than 0.05, and some are greater than 0.05. But the p-values in the regression table test only whether each individual coefficient equals 0. They tell us nothing about the joint contribution of all the categories of hlthstat or of the interaction of sex and hlthstat. We can use testparm to simultaneously test all the coefficients for a categorical variable.
First, let's test the interaction of sex and hlthstat.
. testparm i.sex#i.hlthstat

 ( 1)  [highbp]2.sex#2.hlthstat = 0
 ( 2)  [highbp]2.sex#3.hlthstat = 0
 ( 3)  [highbp]2.sex#4.hlthstat = 0
 ( 4)  [highbp]2.sex#5.hlthstat = 0

           chi2(  4) =   13.20
         Prob > chi2 =    0.0103
The joint p-value for the interaction of sex and health status is less than 0.05, even though some of the p-values for the individual interaction coefficients are greater than 0.05.
The results are similar when we test the main effect of hlthstat.
. testparm i.hlthstat

 ( 1)  [highbp]2.hlthstat = 0
 ( 2)  [highbp]3.hlthstat = 0
 ( 3)  [highbp]4.hlthstat = 0
 ( 4)  [highbp]5.hlthstat = 0

           chi2(  4) =   13.12
         Prob > chi2 =    0.0107
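If we want to see what the significant interaction means on the probability scale, we could follow up with margins and marginsplot, which estimate and graph the average predicted probability of hypertension for each combination of sex and health status (a sketch; output and graph not shown):

. margins sex#hlthstat

. marginsplot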
Let's finish by using estat gof to check the goodness of fit for our model.
. estat gof

Goodness-of-fit test after logistic model
Variable: highbp

      Number of observations =  10,335
Number of covariate patterns =   9,998
          Pearson chi2(9984) = 9873.40
                 Prob > chi2 =  0.7826

. estat gof, group(20)
note: obs collapsed on 20 quantiles of estimated probabilities.

Goodness-of-fit test after logistic model
Variable: highbp

  Number of observations =  10,335
        Number of groups =      20
Hosmer–Lemeshow chi2(18) =   20.20
             Prob > chi2 =  0.3217

. estat gof, group(30)
note: obs collapsed on 30 quantiles of estimated probabilities.

Goodness-of-fit test after logistic model
Variable: highbp

  Number of observations =  10,335
        Number of groups =      30
Hosmer–Lemeshow chi2(28) =   32.08
             Prob > chi2 =  0.2713
All the p-values are greater than 0.05, which suggests that our model fits the data reasonably well.
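Other postestimation checks are available as well; for example, lroc graphs the ROC curve and reports the area under it, and lsens graphs sensitivity and specificity as functions of the classification cutoff (sketches; graphs not shown):

. lroc

. lsens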
You can use many other options and postestimation commands with logistic, and you can read about them by clicking on the link to the logistic manual entry below. You can also watch a demonstration of these commands on YouTube by clicking on the links below.
Watch Fitting & interpreting regression models: Logistic regression with categorical predictors, Fitting & interpreting regression models: Logistic regression with continuous predictors, and Fitting & interpreting regression models: Logistic regression with continuous/categorical predictors.
Read more in the Stata Base Reference Manual: see [R] logistic, [R] test, [R] lrtest, [R] estimates store, [R] dtable, [R] etable, [R] margins, [R] marginsplot, [R] estat classification, [R] estat gof, [R] lsens, and [R] lroc. In the Stata Data Management Reference Manual, see [D] label and [D] list. In the Stata Customizable Tables and Collected Results Reference Manual, see [TABLES] Intro, [TABLES] collect style putdocx, and [TABLES] collect export. In the Stata User’s Guide, see [U] 11.4.3 Factor variables. In the Stata Functions Reference Manual, see [FN] Programming functions.