Stata's margins and marginsplot commands are powerful tools for visualizing the results of regression models. We will use linear regression below, but the same principles and syntax work with nearly all of Stata's regression commands, including probit, logistic, poisson, and others. You will want to review Stata's factor-variable notation if you have not used it before.
Let's begin by opening the nhanes2l dataset. Then let's describe and summarize the variables bpsystol, hlthstat, diabetes, age, and bmi.
. webuse nhanes2l
(Second National Health and Nutrition Examination Survey)

. describe bpsystol hlthstat diabetes age bmi

Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------------
bpsystol        int     %9.0g                 Systolic blood pressure
hlthstat        byte    %20.0g     hlth       Health status
diabetes        byte    %12.0g     diabetes   Diabetes status
age             byte    %9.0g                 Age (years)
bmi             float   %9.0g                 Body mass index (BMI)
. summarize bpsystol hlthstat diabetes age bmi

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
    bpsystol |     10,351    130.8817    23.33265         65        300
    hlthstat |     10,335    2.586164    1.206196          1          5
    diabetes |     10,349    .0482172    .2142353          0          1
         age |     10,351    47.57965    17.21483         20         74
         bmi |     10,351     25.5376    4.914969    12.3856    61.1297
We are going to fit a series of linear regression models for the outcome variable bpsystol, which measures systolic blood pressure (SBP) and ranges from 65 to 300 mmHg. hlthstat measures health status on a scale from 1 to 5. diabetes is a binary indicator of diabetes status (0 = not diabetic, 1 = diabetic). age measures age and ranges from 20 to 74 years. And bmi measures body mass index and ranges from 12.4 to 61.1 kg/m².
Let's fit a linear regression model using the continuous outcome variable bpsystol, the binary predictor variable diabetes, and the continuous predictor variable age. Note that I have used factor-variable notation to tell Stata that diabetes is categorical and age is continuous, and I have used the “##” operator to request the main effects and interaction of both predictor variables.
. regress bpsystol i.diabetes##c.age

      Source |       SS           df       MS      Number of obs   =    10,349
-------------+----------------------------------   F(3, 10345)     =   1071.05
       Model |  1335031.79         3  445010.595   Prob > F        =    0.0000
    Residual |  4298248.26    10,345  415.490407   R-squared       =    0.2370
-------------+----------------------------------   Adj R-squared   =    0.2368
       Total |  5633280.05    10,348   544.38346   Root MSE        =    20.384

------------------------------------------------------------------------------
    bpsystol | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
    diabetes |
    Diabetic |  -5.669005   4.952369    -1.14   0.252    -15.37661    4.038595
         age |   .6303981   .0119464    52.77   0.000     .6069808    .6538154
             |
   diabetes# |
       c.age |
    Diabetic |   .2233087   .0804934     2.77   0.006      .065526    .3810913
             |
       _cons |   100.5111   .5969456   168.38   0.000     99.34096    101.6812
------------------------------------------------------------------------------
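As a small illustration of the "##" operator, the model above can also be written with the main effects and the interaction listed separately, using the "#" operator for the interaction alone. This is only a sketch of the equivalent syntax; it should reproduce the same coefficient table.

```stata
* Same model as i.diabetes##c.age, with each term spelled out:
* two main effects plus the diabetes-by-age interaction
regress bpsystol i.diabetes c.age i.diabetes#c.age
```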
The output can be challenging to interpret because we have two predictors and an interaction. We could spend our time carefully interpreting each coefficient, or we could calculate the expected SBP for combinations of diabetes status and various values of age. But Stata's margins command will estimate the expected SBP for combinations of the two predictor variables or for one predictor “adjusted for” the other. Note that the “i.” prefix is required in the regress command but not in the margins command.
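To see what margins is computing, it can help to write the fitted model as an equation. The symbols below (D for the diabetes indicator and the b's for the estimated coefficients) are notation introduced here for illustration; the numbers come from the regression table above.

```latex
\begin{aligned}
\widehat{E}(\text{SBP} \mid D, \text{age})
  &= b_0 + b_1 D + b_2\,\text{age} + b_3\,(D \times \text{age}) \\
D = 0:\quad \widehat{E}(\text{SBP})
  &= 100.5111 + 0.6303981\,\text{age} \\
D = 1:\quad \widehat{E}(\text{SBP})
  &= (100.5111 - 5.669005) + (0.6303981 + 0.2233087)\,\text{age} \\
  &= 94.842095 + 0.8537068\,\text{age}
\end{aligned}
```

Each group has its own intercept and its own age slope, which is why a single coefficient cannot be read as "the effect of diabetes".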
Let's estimate marginal predictions of SBP for a 20-year-old with and without diabetes.
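. margins diabetes, at(age=20)

Adjusted predictions                                    Number of obs = 10,349
Model VCE: OLS

Expression: Linear prediction, predict()
At: age = 20

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
    diabetes |
Not diabetic |    113.119   .3815637   296.46   0.000     112.3711     113.867
    Diabetic |   111.9162   3.364884    33.26   0.000     105.3204     118.512
------------------------------------------------------------------------------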
We could do this manually, but it would be a lot of typing.
. display "E(SBP | no diabetes, age=20) = " 100.5111 + (-5.669005) * 0 + 0.6303981 * 20 + 0.2233087 * 0 * 20
E(SBP | no diabetes, age=20) = 113.11906

. display "E(SBP | diabetes, age=20) = " 100.5111 + (-5.669005) * 1 + 0.6303981 * 20 + 0.2233087 * 1 * 20
E(SBP | diabetes, age=20) = 111.91623
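A less error-prone way to do the same hand calculation is to let Stata supply the coefficients. The sketch below uses lincom, which also reports a standard error and confidence interval for each linear combination. It assumes the interaction model fit with regress bpsystol i.diabetes##c.age is still the active estimation result.

```stata
* Expected SBP at age 20 without diabetes: _cons + 20*age
lincom _cons + 20*age

* Expected SBP at age 20 with diabetes: add the diabetes main effect
* and 20 times the diabetes-by-age interaction coefficient
lincom _cons + 1.diabetes + 20*age + 20*1.diabetes#c.age
```

The point estimates should agree with the display results above and with the margins output at age 20.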
Next let's use margins to estimate the expected SBP for each category of diabetes at ages 20–60 in increments of 5 years.
. margins diabetes, at(age=(20(5)60))

Adjusted predictions                                    Number of obs = 10,349
Model VCE: OLS

Expression: Linear prediction, predict()
1._at: age = 20
2._at: age = 25
3._at: age = 30
4._at: age = 35
5._at: age = 40
6._at: age = 45
7._at: age = 50
8._at: age = 55
9._at: age = 60

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
_at#diabetes |
         1 # |
Not diabetic |    113.119   .3815637   296.46   0.000     112.3711     113.867
  1#Diabetic |   111.9162   3.364884    33.26   0.000     105.3204     118.512
         2 # |
Not diabetic |    116.271   .3327796   349.39   0.000     115.6187    116.9234
  2#Diabetic |   116.1847   2.983741    38.94   0.000      110.336    122.0335
         3 # |
Not diabetic |    119.423   .2881485   414.45   0.000     118.8582    119.9879
  3#Diabetic |   120.4533   2.607642    46.19   0.000     115.3418    125.5648
         4 # |
Not diabetic |    122.575   .2499055   490.49   0.000     122.0852    123.0649
  4#Diabetic |   124.7218   2.239132    55.70   0.000     120.3327    129.1109
         5 # |
Not diabetic |    125.727   .2213861   567.91   0.000      125.293     126.161
  5#Diabetic |   128.9904   1.882671    68.51   0.000        125.3    132.6808
         6 # |
Not diabetic |    128.879    .206656   623.64   0.000     128.4739    129.2841
  6#Diabetic |   133.2589   1.546613    86.16   0.000     130.2272    136.2905
         7 # |
Not diabetic |    132.031   .2086565   632.77   0.000      131.622      132.44
  7#Diabetic |   137.5274   1.247557   110.24   0.000      135.082    139.9729
         8 # |
Not diabetic |    135.183   .2269454   595.66   0.000     134.7381    135.6278
  8#Diabetic |    141.796    1.01863   139.20   0.000     139.7992    143.7927
         9 # |
Not diabetic |    138.335   .2580829   536.01   0.000     137.8291    138.8409
  9#Diabetic |   146.0645   .9141335   159.78   0.000     144.2726    147.8564
------------------------------------------------------------------------------
The numbers reported in the Margin column are average values of the linear prediction of SBP for each combination of diabetes category and age. For example, the output tells us that the expected SBP is 113.119 for a 20-year-old person without diabetes and the expected SBP is 146.0645 for a 60-year-old person with diabetes.
The output also reports a standard error, t statistic, p-value, and 95% confidence interval for each estimate. The t statistic tests the null hypothesis that the expected SBP is zero.
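A test of whether expected SBP equals zero is rarely the question of interest. If you would rather test whether the two diabetes groups differ at a given age, one option is to ask margins for contrasts or for the discrete change. The following is a sketch, assuming the interaction model is still the active estimation result; the 10-year age grid is chosen here only to keep the output short.

```stata
* Difference in expected SBP (Diabetic vs. Not diabetic) at selected ages,
* using the r. contrast operator (differences from the reference level)
margins r.diabetes, at(age=(20(10)60))

* The same differences expressed as the discrete change in the
* prediction when diabetes goes from its base level to 1
margins, dydx(diabetes) at(age=(20(10)60))
```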
We can plot the marginal predictions and their 95% confidence intervals by typing marginsplot.
. marginsplot

Variables that uniquely identify margins: age diabetes
Let's add more options to make our graph look nicer. We can use the legend() option to customize the look of the legend. And we can use the title(), subtitle(), and ytitle() options to add various titles to our graph.
. marginsplot, ytitle("Expected systolic blood pressure (mmHg)") title("Expected systolic blood pressure") subtitle("By age and diabetes status") legend(order(1 "No diabetes" 2 "Diabetes") rows(1) position(12))

Variables that uniquely identify margins: age diabetes
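If you prefer shaded confidence bands to the default capped spikes, marginsplot has recast options for that. This is a sketch of one possible variation rather than a recommended style, and it assumes the margins results from the at(age=(20(5)60)) command are still in memory.

```stata
* Draw the predictions as lines and the confidence intervals as shaded areas
marginsplot, recast(line) recastci(rarea) ///
    ytitle("Expected systolic blood pressure (mmHg)") ///
    title("Expected systolic blood pressure") ///
    subtitle("By age and diabetes status")
```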
We can also use margins to estimate marginal predictions for one variable averaged over other variables in the model. For example, we can estimate the expected SBP for categories of diabetes averaged over age.
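. margins diabetes

Predictive margins                                      Number of obs = 10,349
Model VCE: OLS

Expression: Linear prediction, predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
    diabetes |
Not diabetic |   130.5066   .2055351   634.96   0.000     130.1037    130.9094
    Diabetic |    135.463   1.385992    97.74   0.000     132.7462    138.1798
------------------------------------------------------------------------------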
Let's work a simpler example without the interaction to help us understand how margins works. Let's fit a linear regression model including diabetes and age without the interaction. The option coeflegend displays a legend that shows how to refer to each coefficient in the model.
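. regress bpsystol i.diabetes c.age, coeflegend

      Source |       SS           df       MS      Number of obs   =    10,349
-------------+----------------------------------   F(2, 10346)     =   1601.69
       Model |  1331833.99         2  665916.993   Prob > F        =    0.0000
    Residual |  4301446.06    10,346  415.759333   R-squared       =    0.2364
-------------+----------------------------------   Adj R-squared   =    0.2363
       Total |  5633280.05    10,348   544.38346   Root MSE        =     20.39

------------------------------------------------------------------------------
    bpsystol | Coefficient  Legend
-------------+----------------------------------------------------------------
    diabetes |
    Diabetic |   7.815281  _b[1.diabetes]
         age |   .6353169  _b[age]
       _cons |   100.2803  _b[_cons]
------------------------------------------------------------------------------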
Let's display the contents of _b[1.diabetes] to verify that it equals 7.815281.
. display _b[1.diabetes]
7.8152815
Now we can use coefficients and indicator variables to generate a new variable that equals the expected SBP assuming every observation in the sample does not have diabetes.
. generate double sbp_diab0 = _b[_cons] + _b[1.diabetes]*0 + _b[age] * age
Next we can generate a new variable that equals the expected SBP assuming every observation in the sample has diabetes.
. generate double sbp_diab1 = _b[_cons] + _b[1.diabetes]*1 + _b[age] * age
Then we can average each of the two variables to estimate the expected SBP if no one in the sample had diabetes and if everyone in the sample had diabetes. The qualifier if e(sample) restricts the calculation to the observations used to fit the model, that is, those with no missing values for bpsystol, diabetes, or age.
. table () if e(sample), statistic(mean sbp_diab0 sbp_diab1)

----------------------
sbp_diab0  |  130.5098
sbp_diab1  |  138.3251
----------------------
This matches the results reported by margins.
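. margins diabetes

Predictive margins                                      Number of obs = 10,349
Model VCE: OLS

Expression: Linear prediction, predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
    diabetes |
Not diabetic |   130.5098   .2055982   634.78   0.000     130.1068    130.9128
    Diabetic |   138.3251   .9258365   149.41   0.000     136.5103    140.1399
------------------------------------------------------------------------------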
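The same counterfactual idea can be sketched with predict instead of typing out the coefficients: temporarily set diabetes to 1 for everyone, predict, and average. The variable name xb_all1 is made up for this illustration, and the code assumes the model without the interaction is the active estimation result.

```stata
preserve                                   // keep a copy of the original data
replace diabetes = 1                       // pretend everyone has diabetes
predict double xb_all1 if e(sample), xb    // linear prediction, estimation sample only
summarize xb_all1                          // the mean is the predictive margin
restore                                    // put the original data back
```

The mean reported by summarize should match the Diabetic row of the margins output above. Note that this gives only the point estimate; margins additionally computes the delta-method standard error.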
In the previous example, we first calculated the response for each observation and then calculated the average of those responses. This is the default method. But we could also calculate the average covariate values first and then report the response at those average values.
Let's begin by using table to estimate the mean of age. As before, the qualifier if e(sample) restricts the calculation to the observations used to fit the model.
. table () if e(sample), statistic(mean age)

------------------
Mean  |  47.5818
------------------
Then we can use the mean age to estimate the expected SBP assuming no one in the sample has diabetes.
. display _b[_cons] + _b[1.diabetes] * 0 + _b[age] * 47.5818
We can also calculate the expected SBP assuming everyone in the sample has diabetes.
. display _b[_cons] + _b[1.diabetes] * 1 + _b[age] * 47.5818
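Rather than retyping the rounded mean, you could store the mean of age and reuse it. The following sketch does that with summarize and a local macro; the macro name mean_age is an arbitrary choice made here, and the lines are meant to be run together, for example from a do-file.

```stata
* Mean age over the estimation sample, saved at full precision
quietly summarize age if e(sample)
local mean_age = r(mean)

* Expected SBP at the mean age, without and with diabetes
display "E(SBP | no diabetes, mean age) = " _b[_cons] + _b[1.diabetes]*0 + _b[age]*`mean_age'
display "E(SBP | diabetes, mean age)    = " _b[_cons] + _b[1.diabetes]*1 + _b[age]*`mean_age'
```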
And we can check our work using margins with the atmeans option.
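. margins diabetes, atmeans

Adjusted predictions                                    Number of obs = 10,349
Model VCE: OLS

Expression: Linear prediction, predict()
At: 0.diabetes = .9517828 (mean)
    1.diabetes = .0482172 (mean)
    age        =  47.5818 (mean)

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
    diabetes |
Not diabetic |   130.5098   .2055982   634.78   0.000     130.1068    130.9128
    Diabetic |   138.3251   .9258365   149.41   0.000     136.5103    140.1399
------------------------------------------------------------------------------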
Again, the manually calculated results match the results produced by margins.
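One way to see why the two calculations agree for this model is to write both out. The notation below (d for the value assigned to diabetes, b for the estimated coefficients, and a bar for the sample mean) is introduced here for illustration.

```latex
\underbrace{\frac{1}{n}\sum_{i=1}^{n}\bigl(b_0 + b_1 d + b_2\,\mathrm{age}_i\bigr)}_{\text{average of the predictions}}
\;=\;
b_0 + b_1 d + b_2\,\frac{1}{n}\sum_{i=1}^{n}\mathrm{age}_i
\;=\;
\underbrace{b_0 + b_1 d + b_2\,\overline{\mathrm{age}}}_{\text{prediction at the average}}
```

The equality relies on the prediction being a linear function of age, so the average can be moved inside the expression.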
Estimating the average response (method 1) and the response at the average (method 2) gives us the same results for linear regression. But the results may differ for generalized linear models such as probit, logistic, or Poisson regression.
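To see the two methods diverge, you could try a nonlinear model. The sketch below fits a logistic regression and requests both kinds of margins. It assumes that nhanes2l contains a binary high-blood-pressure indicator named highbp; if your copy of the data differs, substitute any binary outcome.

```stata
* Assumption: highbp is a 0/1 indicator for high blood pressure in nhanes2l
logistic highbp i.diabetes c.age

* Method 1: average the predicted probabilities over the sample
margins diabetes

* Method 2: predicted probability at the mean of the covariates
margins diabetes, atmeans
```

Because the predicted probability is a nonlinear function of age, the two sets of margins will generally not be identical.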
You can read more about factor-variable notation, margins, and marginsplot in the Stata documentation.
Read more in the Stata Base Reference Manual; see [R] margins, [R] marginsplot, and [R] regress. And in the Stata User's Guide, see [U] 11.4.3 Factor variables.