Logistic regression is a popular tool for quantifying the relationship between a binary outcome variable and one or more predictor variables. The outcome variable is often called the dependent variable, and the predictor variables are often called independent variables. The predictor variables can be binary, categorical, or continuous. Here we will learn how to use Stata's logistic command to fit logistic regression models with categorical predictor variables, and we will explore other features later.
Let's begin by opening the nhanes2l dataset. Then let's describe the variables highbp, diabetes, and hlthstat.
. webuse nhanes2l
(Second National Health and Nutrition Examination Survey)

. describe highbp diabetes hlthstat
Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------------
highbp          byte    %8.0g      *          High blood pressure
diabetes        byte    %12.0g     diabetes   Diabetes status
hlthstat        byte    %20.0g     hlth       Health status
The description tells us that highbp contains information about “High blood pressure”, diabetes contains information about “Diabetes status”, and hlthstat contains information about “Health status”.
Let's tabulate our outcome variable, highbp, to learn more about it.
. tabulate highbp
 High blood |
   pressure |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      5,975       57.72       57.72
          1 |      4,376       42.28      100.00
------------+-----------------------------------
      Total |     10,351      100.00
The table shows that 57.7% of the sample is coded as 0 and 42.3% is coded as 1. Let's label 0 as “Normotensive” and 1 as “Hypertensive”, and then let's type tabulate highbp again to check our work.
. label define highbp 0 "Normotensive" 1 "Hypertensive"

. label values highbp highbp

. tabulate highbp
  High blood |
    pressure |      Freq.     Percent        Cum.
-------------+-----------------------------------
Normotensive |      5,975       57.72       57.72
Hypertensive |      4,376       42.28      100.00
-------------+-----------------------------------
       Total |     10,351      100.00
Next let's cross-tabulate diabetes and highbp by typing tabulate diabetes highbp. The row option tells Stata to display row percentages in addition to the frequencies.
. tabulate diabetes highbp, row
+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

    Diabetes |  High blood pressure
      status | Normotens  Hypertens |     Total
-------------+----------------------+----------
Not diabetic |     5,795      4,055 |     9,850
             |     58.83      41.17 |    100.00
-------------+----------------------+----------
    Diabetic |       178        321 |       499
             |     35.67      64.33 |    100.00
-------------+----------------------+----------
       Total |     5,973      4,376 |    10,349
             |     57.72      42.28 |    100.00
The table shows that 41.2% of people without diabetes have hypertension, while 64.3% of people with diabetes have hypertension. This suggests that diabetes status may be a predictor of hypertension.
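These row percentages are easy to verify from the cell frequencies. As a quick check, we can reproduce them with display:

. display "Percent hypertensive, no diabetes = " %5.2f 100*4055/9850
Percent hypertensive, no diabetes = 41.17

. display "Percent hypertensive, diabetes    = " %5.2f 100*321/499
Percent hypertensive, diabetes    = 64.33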
Let's test this hypothesis by fitting a logistic regression model using highbp as the binary outcome variable. We include the i. prefix for diabetes, which is factor-variable notation that tells Stata to treat diabetes as a categorical predictor variable.
. logistic highbp i.diabetes

Logistic regression                                     Number of obs = 10,349
                                                        LR chi2(1)    = 103.12
                                                        Prob > chi2   = 0.0000
Log likelihood = -6998.1054                             Pseudo R2     = 0.0073

------------------------------------------------------------------------------
      highbp | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
    diabetes |
   Diabetic  |   2.577195   .2465557     9.90   0.000     2.136556     3.10871
       _cons |   .6997412   .0143263   -17.44   0.000     .6722181    .7283911
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
The output includes a column labeled Odds ratio, and the odds ratio for diabetes is 2.58. The literal interpretation is that the odds of having hypertension for someone with diabetes are 2.58 times the odds for someone without diabetes. It is tempting to interpret the odds ratio as a relative risk or risk ratio, but that is not correct, because odds are not the same as risk or probability.
Let's use Stata's cs command to clarify the difference between odds and risk (or probability).
. cs highbp diabetes, or
                 | Diabetes status        |
                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+-----------
           Cases |       321        4055  |       4376
        Noncases |       178        5795  |       5973
-----------------+------------------------+-----------
           Total |       499        9850  |      10349
                 |                        |
            Risk |  .6432866    .4116751  |   .4228428
                 |                        |
                 |      Point estimate    |    [95% conf. interval]
                 |------------------------+------------------------
 Risk difference |         .2316114       |    .1884724    .2747505
      Risk ratio |         1.562607       |    1.457737    1.675022
 Attr. frac. ex. |         .3600439       |     .314005    .4029931
 Attr. frac. pop |         .0264109       |
      Odds ratio |         2.577197       |    2.137069     3.10796  (Cornfield)
-----------------+------------------------+------------------------
The output tells us that 499 people in our sample have diabetes and 9,850 do not have diabetes. The output also tells us that 321 people with diabetes also have hypertension. So the risk (or probability) of having hypertension for someone with diabetes equals 321/499 = 0.64. Similarly, the risk of having hypertension for someone without diabetes is 4055/9850 = 0.41. The ratio of these risks is labeled Risk ratio in the output and equals 0.64/0.41 = 1.56.
The odds are calculated differently. The odds of having hypertension for someone with diabetes are 321/178 = 1.80. And the odds of having hypertension for someone without diabetes are 4055/5795 = 0.70. The ratio of these odds is labeled Odds ratio in the output and equals 1.80/0.70 = 2.58. This is the same as the odds ratio reported in our logistic regression output.
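If you would like to verify these numbers yourself, here is a quick sketch that reproduces the risks, the odds, and both ratios from the cell counts in the cs table:

. display "Risk with diabetes    = " %6.4f 321/499
Risk with diabetes    = 0.6433

. display "Risk without diabetes = " %6.4f 4055/9850
Risk without diabetes = 0.4117

. display "Risk ratio            = " %6.4f (321/499)/(4055/9850)
Risk ratio            = 1.5626

. display "Odds with diabetes    = " %6.4f 321/178
Odds with diabetes    = 1.8034

. display "Odds without diabetes = " %6.4f 4055/5795
Odds without diabetes = 0.6997

. display "Odds ratio            = " %6.4f (321/178)/(4055/5795)
Odds ratio            = 2.5772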
It is often easier to interpret the results of logistic regression models with the margins command. margins diabetes displays the predicted probability that the outcome variable equals 1 for each category of the predictor variable. Let's try this for our model.
. margins diabetes

Adjusted predictions                                    Number of obs = 10,349
Model VCE: OIM

Expression: Pr(highbp), predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
    diabetes |
Not diabetic |   .4116751   .0049587    83.02   0.000     .4019563     .421394
    Diabetic |   .6432864   .0214443    30.00   0.000     .6012563    .6853164
------------------------------------------------------------------------------
The output shows us that the expected probability of hypertension equals 0.41 for someone without diabetes and equals 0.64 for someone with diabetes. We can use marginsplot to graph the results.
. marginsplot

Variables that uniquely identify margins: diabetes
The default graph is often called a profile plot. We can add a few options to create a bar graph if we prefer.
. marginsplot, recast(bar) xdimension(diabetes) horizontal plotopts(barwidth(.9))

Variables that uniquely identify margins: diabetes
We can also use margins followed by nlcom to estimate the relative risk and the risk difference. The margins option post stores the results in memory, and the coeflegend option displays the names of the estimates so that we can refer to them with nlcom.
. margins diabetes, post coeflegend

Adjusted predictions                                    Number of obs = 10,349
Model VCE: OIM

Expression: Pr(highbp), predict()

------------------------------------------------------------------------------
             |     Margin   Legend
-------------+----------------------------------------------------------------
    diabetes |
Not diabetic |   .4116751   _b[0bn.diabetes]
    Diabetic |   .6432864   _b[1.diabetes]
------------------------------------------------------------------------------

. nlcom risk_ratio: _b[1.diabetes] / _b[0bn.diabetes]

  risk_ratio: _b[1.diabetes] / _b[0bn.diabetes]

------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
  risk_ratio |   1.562607   .0553865    28.21   0.000     1.454051    1.671162
------------------------------------------------------------------------------

. nlcom risk_diff: _b[1.diabetes] - _b[0bn.diabetes]

   risk_diff: _b[1.diabetes] - _b[0bn.diabetes]

------------------------------------------------------------------------------
             | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
   risk_diff |   .2316112   .0220101    10.52   0.000     .1884721    .2747503
------------------------------------------------------------------------------
The logistic regression output also reported a standard error, z statistic, p-value, and 95% confidence interval for the odds ratio. The z statistic is not the ratio of the odds ratio to its standard error; it is the ratio of the coefficient to its standard error. The coefficient is the natural logarithm of the odds ratio, and we can view it, and its standard error, by using the coef option (or the logit command). We can then use the coefficient and its standard error to compute the z statistic, p-value, and 95% confidence bounds. The z statistic and p-value test the null hypothesis that the coefficient equals 0 or, equivalently, that the odds ratio equals 1. The confidence interval for the coefficient is exponentiated to obtain the confidence interval for the odds ratio.
. logistic highbp i.diabetes, coef coeflegend

Logistic regression                                     Number of obs = 10,349
                                                        LR chi2(1)    = 103.12
                                                        Prob > chi2   = 0.0000
Log likelihood = -6998.1054                             Pseudo R2     = 0.0073

------------------------------------------------------------------------------
      highbp | Coefficient  Legend
-------------+----------------------------------------------------------------
    diabetes |
   Diabetic  |   .9467015   _b[1.diabetes]
       _cons |  -.3570448   _b[_cons]
------------------------------------------------------------------------------
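To make the relationship concrete, we can reproduce the odds ratio, the z statistic, and the confidence bounds from the coefficient by hand. One caveat: the coefficient's standard error is not shown above because coeflegend replaces it with the legend, but we can recover it from the delta-method relationship se(odds ratio) = odds ratio × se(coefficient), which gives se(coefficient) = .2465557/2.577195, or about .0956681. A quick sketch, up to rounding:

. display "Odds ratio      = " %6.4f exp(.9467015)
Odds ratio      = 2.5772

. display "z statistic     = " %4.2f .9467015/.0956681
z statistic     = 9.90

. display "95% lower bound = " %6.4f exp(.9467015 - invnormal(.975)*.0956681)
95% lower bound = 2.1366

. display "95% upper bound = " %6.4f exp(.9467015 + invnormal(.975)*.0956681)
95% upper bound = 3.1087

These values match the Odds ratio row of the logistic output.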
The top of the logistic regression output shows the sample size; a likelihood-ratio test labeled LR chi2(1); its p-value labeled Prob > chi2; and a pseudo R-squared labeled Pseudo R2.
The likelihood-ratio chi-squared compares the specified model to a model with no predictor variables, which is often called the intercept-only model. We can calculate this test manually by fitting our model, storing the results in memory, fitting the intercept-only model, storing those results in memory, and finally using lrtest to compare the two models.
. quietly logistic highbp i.diabetes

. estimates store full

. quietly logistic highbp if e(sample)

. estimates store reduced

. lrtest full reduced

Likelihood-ratio test
Assumption: reduced nested within full

 LR chi2(1) = 103.12
Prob > chi2 = 0.0000
The pseudo R-squared also compares the model we fit with the intercept-only (null) model using each model's log likelihood. We can type estat ic to display the log likelihood for both models and use the results to calculate the pseudo R-squared.
. estat ic

Akaike's information criterion and Bayesian information criterion

-----------------------------------------------------------------------------
       Model |          N   ll(null)  ll(model)      df        AIC        BIC
-------------+---------------------------------------------------------------
     reduced |     10,349  -7049.666  -7049.666       1   14101.33   14108.58
-----------------------------------------------------------------------------
Note: BIC uses N = number of observations. See [R] IC note.
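Because the intercept-only model was fit most recently, estat ic reports its log likelihood, -7049.666, in both the ll(null) and ll(model) columns. McFadden's pseudo R-squared equals 1 minus the ratio of the full model's log likelihood (reported in the logistic header above) to the intercept-only log likelihood, so we can reproduce the value ourselves:

. display "Pseudo R2 = " %6.4f 1 - (-6998.1054)/(-7049.666)
Pseudo R2 = 0.0073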
Stata reports McFadden's pseudo R-squared, but there are many others. You can use the ssc install command to install the community-contributed command fitstat, which reports many different kinds of pseudo R-squareds.
. ssc install fitstat, replace

. quietly logistic highbp i.diabetes

. fitstat

Measures of Fit for logistic of highbp

Log-Lik Intercept Only:      -7049.666   Log-Lik Full Model:         -6998.105
D(10346):                    13996.211   LR(1):                        103.122
                                         Prob > LR:                      0.000
McFadden's R2:                   0.007   McFadden's Adj R2:              0.007
Maximum Likelihood R2:           0.010   Cragg & Uhler's R2:             0.013
McKelvey and Zavoina's R2:       0.012   Efron's R2:                     0.010
Variance of y*:                  3.331   Variance of error:              3.290
Count R2:                        0.591   Adj Count R2:                   0.033
AIC:                             1.353   AIC*n:                      14002.211
BIC:                        -81648.888   BIC':                         -93.877
It is tempting to interpret the pseudo R-squared using the “proportion of variance explained” language we use for linear regression. Unfortunately, this is incorrect, and there is no general, intuitive interpretation for the pseudo R-squared statistic. It can, however, be used to select the best fitting model among a collection of models with the same outcome variable. Models with larger values of the same pseudo R-squared statistic fit better than models with smaller values.
We can use the results of our model to predict the probability that a person has hypertension. The predict prob, pr command generates a new variable named prob that contains the predicted probability that each person has hypertension. Then we can generate a variable named highbp_pr that classifies each person as having hypertension if their predicted probability is greater than or equal to 0.50 and classifies each person as not having hypertension if their predicted probability is less than 0.50. We can list some of the observations to check our predictions.
. quietly logistic highbp i.diabetes

. predict prob, pr
(2 missing values generated)

. generate highbp_pr = (prob >= 0.5)

. list highbp prob highbp_pr if inlist(_n,2,4,6,22), abbrev(10)

      +--------------------------------------+
      |       highbp       prob   highbp_pr  |
      |--------------------------------------|
   2. | Normotensive   .4116751           0  |
   4. | Hypertensive   .6432863           1  |
   6. | Hypertensive   .4116751           0  |
  22. | Normotensive   .6432863           1  |
      +--------------------------------------+
The person in observation 2 does not have hypertension, their predicted probability of hypertension is 0.412, and we classified them correctly as not having hypertension.
The person in observation 4 does have hypertension, their predicted probability of hypertension is 0.643, and we classified them correctly as having hypertension.
The person in observation 6 has hypertension, their predicted probability of hypertension is 0.412, and we classified them incorrectly as not having hypertension.
The person in observation 22 does not have hypertension, their predicted probability of hypertension is 0.643, and we classified them incorrectly as having hypertension.
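These two predicted probabilities are simply the inverse logit of the linear predictor evaluated at each level of diabetes. As a quick check using the coefficients displayed earlier and Stata's invlogit() function:

. display "Pr(highbp | no diabetes) = " %6.4f invlogit(-.3570448)
Pr(highbp | no diabetes) = 0.4117

. display "Pr(highbp | diabetes)    = " %6.4f invlogit(-.3570448 + .9467015)
Pr(highbp | diabetes)    = 0.6433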
Our model does not predict hypertension perfectly. We can calculate the proportion of people classified correctly by cross-tabulating the predicted and observed hypertension status.
. tabulate highbp_pr highbp
           | High blood pressure
 highbp_pr | Normotens  Hypertens |     Total
-----------+----------------------+----------
         0 |     5,795      4,055 |     9,850
         1 |       180        321 |       501
-----------+----------------------+----------
     Total |     5,975      4,376 |    10,351
Our model correctly predicted 5,795 people without hypertension and 321 people with hypertension, which means that our model predicted 59.1% of the observations correctly. Note that the two people with a missing predicted probability are counted as hypertensive in highbp_pr because Stata treats missing values as larger than any number; estat classification below drops them, which is why its totals are 10,349 rather than 10,351.
. display "Correctly classified = " %9.3f (5795 + 321) / 10351 Correctly classified = 0.591
We can use estat classification to do all these steps for us.
. estat classification

Logistic model for highbp

              -------- True --------
Classified |         D            ~D  |      Total
-----------+--------------------------+-----------
     +     |       321           178  |        499
     -     |      4055          5795  |       9850
-----------+--------------------------+-----------
   Total   |      4376          5973  |      10349

Classified + if predicted Pr(D) >= .5
True D defined as highbp != 0
--------------------------------------------------
Sensitivity                     Pr( +| D)    7.34%
Specificity                     Pr( -|~D)   97.02%
Positive predictive value       Pr( D| +)   64.33%
Negative predictive value       Pr(~D| -)   58.83%
--------------------------------------------------
False + rate for true ~D        Pr( +|~D)    2.98%
False - rate for true D         Pr( -| D)   92.66%
False + rate for classified +   Pr(~D| +)   35.67%
False - rate for classified -   Pr( D| -)   41.17%
--------------------------------------------------
Correctly classified                        59.10%
--------------------------------------------------
estat classification reports many other numbers, such as sensitivity and specificity. Sensitivity is the probability that a person with hypertension is predicted to have hypertension. Specificity is the probability that a person who does not have hypertension is predicted to not have hypertension. We can calculate these values using the frequencies reported in the table.
. display "Sensitivity = " %9.4f 321/4376 Sensitivity = 0.0734 . display "Specificity = " %9.4f 5795/5973 Specificity = 0.9702
We could change the cutoff value for predicting hypertension from 0.5 (the default) to 0.4 to see if this improves our predictions.
. estat classification, cutoff(0.4)

Logistic model for highbp

              -------- True --------
Classified |         D            ~D  |      Total
-----------+--------------------------+-----------
     +     |      4376          5973  |      10349
     -     |         0             0  |          0
-----------+--------------------------+-----------
   Total   |      4376          5973  |      10349

Classified + if predicted Pr(D) >= .4
True D defined as highbp != 0
--------------------------------------------------
Sensitivity                     Pr( +| D)  100.00%
Specificity                     Pr( -|~D)    0.00%
Positive predictive value       Pr( D| +)   42.28%
Negative predictive value       Pr(~D| -)       .%
--------------------------------------------------
False + rate for true ~D        Pr( +|~D)  100.00%
False - rate for true D         Pr( -| D)    0.00%
False + rate for classified +   Pr(~D| +)   57.72%
False - rate for classified -   Pr( D| -)       .%
--------------------------------------------------
Correctly classified                        42.28%
--------------------------------------------------
The percentage of correctly classified observations decreased from 59.10% to 42.28%, the sensitivity increased from 7.34% to 100%, and the specificity decreased from 97.02% to 0%. This happens because our model produces only two predicted probabilities, 0.412 and 0.643, and both are greater than 0.4, so every person is classified as hypertensive. Changing the cutoff from 0.5 to 0.4 therefore greatly improved the sensitivity at the cost of poorer overall accuracy and specificity. We might ask ourselves whether there is an optimal cutpoint and how we might identify it. Stata's lsens command can help us by plotting sensitivity and specificity for a range of cutpoints.
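Before looking at the graph, we can sketch the idea by computing the proportion correctly classified at a few cutoffs by hand. The cutoff values below are arbitrary choices for illustration; because our model produces only two predicted probabilities, any cutoff below .412 classifies everyone as hypertensive, and any cutoff above .643 classifies everyone as normotensive:

. foreach c in 0.35 0.45 0.55 {
  2.     quietly count if ((prob >= `c') == highbp) & !missing(prob)
  3.     display "cutoff = `c'   correctly classified = " %6.4f r(N)/10349
  4. }
cutoff = 0.35   correctly classified = 0.4228
cutoff = 0.45   correctly classified = 0.5910
cutoff = 0.55   correctly classified = 0.5772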
. lsens
The graph shows the probability cutoff on the horizontal axis and sensitivity/specificity on the vertical axis. Sensitivity is plotted in blue and specificity is plotted in red. Sensitivity equals 1 for small values of the probability cutoff, decreases rapidly for middle values, and equals 0 for larger values. Specificity displays the opposite pattern.
The lroc command plots the receiver operating characteristic (ROC) curve, with sensitivity on the vertical axis and 1 minus specificity on the horizontal axis. The area under the ROC curve is reported at the bottom of the graph and, for a model that performs at least as well as chance, ranges from 0.5 to 1.0. A value near 0.5 indicates that the model is no better than a random guess at distinguishing between the two possible outcomes, and a model with a value above 0.8 is generally considered good at distinguishing between them. The area under our ROC curve is 0.5218, which indicates that our model does not distinguish well between the two outcomes.
. lroc

Logistic model for highbp

Number of observations = 10349
Area under ROC curve   = 0.5218
We can check the goodness of fit for our model by using estat gof.
. estat gof

Goodness-of-fit test after logistic model
Variable: highbp

      Number of observations = 10,349
Number of covariate patterns =      2
             Pearson chi2(0) =   0.00
                 Prob > chi2 =      .
The p-value is missing because the degrees of freedom equal the number of covariate patterns minus the number of predictor variables, including the constant. Our model has two covariate patterns and two predictors (the indicator for diabetes and the constant), so the degrees of freedom equal 0. Let's fit a model that includes age to increase the number of covariate patterns. Then we will run estat gof again.
. quietly logistic highbp age i.diabetes

. estat gof

Goodness-of-fit test after logistic model
Variable: highbp

      Number of observations = 10,349
Number of covariate patterns =    104
           Pearson chi2(101) = 119.76
                 Prob > chi2 = 0.0981
Now the model has 104 covariate patterns and three predictors (age, the indicator for diabetes, and the constant), so the test has 104 - 3 = 101 degrees of freedom. The p-value is larger than 0.05, so we do not find evidence of poor fit; a p-value less than 0.05 would have suggested poor model fit.
We can add the group() option to use the Hosmer and Lemeshow goodness-of-fit test. The Hosmer and Lemeshow test is known to be sensitive to the number of quantiles, so you may wish to consider multiple values and check for agreement.
. estat gof, group(20)
note: obs collapsed on 20 quantiles of estimated probabilities.

Goodness-of-fit test after logistic model
Variable: highbp

  Number of observations = 10,349
        Number of groups =     20
Hosmer–Lemeshow chi2(18) =  44.42
             Prob > chi2 = 0.0005
. estat gof, group(30)
note: obs collapsed on 30 quantiles of estimated probabilities.

Goodness-of-fit test after logistic model
Variable: highbp

  Number of observations = 10,349
        Number of groups =     30
Hosmer–Lemeshow chi2(28) =  49.06
             Prob > chi2 = 0.0082

In this case, the two choices of groups agree: both p-values are small, which is evidence of poor fit by the Hosmer and Lemeshow criterion even though the Pearson test above did not detect it.
You can use many other options with logistic, and you can read about them by clicking on the link to the logistic manual entry below. You can also watch a demonstration of these commands on YouTube by clicking on the link below.
Watch Fitting & interpreting regression models: Logistic regression with categorical predictors.
Read more in the Stata Base Reference Manual: see [R] logistic, [R] test, [R] margins, [R] marginsplot, [R] estat classification, [R] estat gof, [R] lsens, and [R] lroc. In the Stata User’s Guide, see [U] 11.4.3 Factor variables. In the Stata Data Management Reference Manual, see [D] list. In the Stata Functions Reference Manual, see [FN] Programming functions.