margins and marginsplot for the interaction of categorical and continuous predictor variables

Stata's margins and marginsplot commands are powerful tools for visualizing the results of regression models. We will use linear regression below, but the same principles and syntax work with nearly all of Stata's regression commands, including probit, logistic, poisson, and others. You will want to review Stata's factor-variable notation if you have not used it before.

Let's begin by opening the nhanes2l dataset. Then let's describe and summarize the variables bpsystol, hlthstat, diabetes, age, and bmi.

. webuse nhanes2l
(Second National Health and Nutrition Examination Survey)

. describe bpsystol hlthstat diabetes age bmi

Variable      Storage   Display    Value
    name         type    format    label      Variable label
bpsystol        int     %9.0g                 Systolic blood pressure
hlthstat        byte    %20.0g     hlth       Health status
diabetes        byte    %12.0g     diabetes   Diabetes status
age             byte    %9.0g                 Age (years)
bmi             float   %9.0g                 Body mass index (BMI)

. summarize bpsystol hlthstat diabetes age bmi
Variable Obs Mean Std. dev. Min Max
bpsystol 10,351 130.8817 23.33265 65 300
hlthstat 10,335 2.586164 1.206196 1 5
diabetes 10,349 .0482172 .2142353 0 1
age 10,351 47.57965 17.21483 20 74
bmi 10,351 25.5376 4.914969 12.3856 61.1297

We are going to fit a series of linear regression models for the outcome variable bpsystol, which measures systolic blood pressure (SBP) and ranges from 65 to 300 mmHg. hlthstat measures health status on a scale from 1 to 5. diabetes is a binary indicator of diabetes status (0 = not diabetic, 1 = diabetic). age measures age and ranges from 20 to 74 years. And bmi measures body mass index and ranges from 12.4 to 61.1 kg/m².

Let's fit a linear regression model using the continuous outcome variable bpsystol, the binary predictor variable diabetes, and the continuous predictor variable age. Note that I have used factor-variable notation to tell Stata that diabetes is categorical and age is continuous, and I have used the “##” operator to request the main effects and interaction of both predictor variables.

. regress bpsystol i.diabetes##c.age

      Source          SS           df       MS        Number of obs   =    10,349
                                                      F(3, 10345)     =   1071.05
       Model    1335031.79          3   445010.595    Prob > F        =    0.0000
    Residual    4298248.26     10,345   415.490407    R-squared       =    0.2370
                                                      Adj R-squared   =    0.2368
       Total    5633280.05     10,348    544.38346    Root MSE        =    20.384

      bpsystol   Coefficient   Std. err.        t    P>|t|     [95% conf. interval]
      diabetes
      Diabetic     -5.669005    4.952369    -1.14    0.252    -15.37661    4.038595
           age      .6303981    .0119464    52.77    0.000     .6069808    .6538154
diabetes#c.age
      Diabetic      .2233087    .0804934     2.77    0.006      .065526    .3810913
         _cons      100.5111    .5969456   168.38    0.000     99.34096    101.6812
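
The "##" operator used in the command above is shorthand: A##B expands to A, B, and A#B. As a minimal sketch, assuming the nhanes2l data are still in memory, the following command should fit the same model with the main effects and the interaction written out explicitly.

```stata
* Assumes nhanes2l is already loaded (webuse nhanes2l).
* i.diabetes##c.age is shorthand for the three terms spelled out here,
* so the coefficients should match the output above.
regress bpsystol i.diabetes c.age i.diabetes#c.age
```

Writing the terms out can be useful when you want the interaction without one of the main effects, although the "##" form is usually easier to read.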

The output can be challenging to interpret because we have two predictors and an interaction. We could spend our time carefully interpreting each coefficient, or we could calculate the expected SBP for combinations of diabetes status and various values of age. But Stata's margins command will estimate the expected SBP for combinations of the two predictor variables or for one predictor “adjusted for” the other. Note that the “i.” prefix is required in the regress command but not in the margins command.

Let's estimate marginal predictions of SBP for a 20-year-old with and without diabetes.

. margins diabetes, at(age=20)

Adjusted predictions                                    Number of obs = 10,349
Model VCE: OLS

Expression: Linear prediction, predict()
At: age = 20

                         Delta-method
    diabetes      Margin   std. err.        t    P>|t|     [95% conf. interval]
Not diabetic     113.119    .3815637   296.46    0.000     112.3711     113.867
    Diabetic    111.9162    3.364884    33.26    0.000     105.3204     118.512

We could do this manually, but it would be a lot of typing.

. display "E(SBP | no diabetes, age=20) = "    100.5111          ///
                                              + (-5.669005) * 0  ///
                                              + 0.6303981   * 20 ///
                                              + 0.2233087   * 0 * 20
E(SBP | no diabetes, age=20) = 113.11906

. display "E(SBP | diabetes, age=20) = "    100.5111          ///
                                           + (-5.669005) * 1  ///
                                           + 0.6303981   * 20 ///
                                           + 0.2233087   * 1 * 20
E(SBP | diabetes, age=20) = 111.91623
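
The hand calculation above retypes rounded coefficients. A hedged alternative is to let Stata look up the stored coefficients with _b[] and combine them with lincom, which also reports a standard error and confidence interval for each linear combination. This sketch assumes the interaction model is still the active estimation result; the results should agree with the margins output for age 20.

```stata
* Assumes "regress bpsystol i.diabetes##c.age" was the last model fit.

* Expected SBP at age 20 without diabetes
lincom _b[_cons] + 20*_b[age]

* Expected SBP at age 20 with diabetes
lincom _b[_cons] + _b[1.diabetes] + 20*_b[age] + 20*_b[1.diabetes#c.age]
```

Because lincom does not replace the estimation results, the margins commands that follow can be run afterward without refitting the model.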

Next let's use margins to estimate the expected SBP for each category of diabetes at ages 20–60 in increments of 5 years.

. margins diabetes, at(age=(20(5)60))


Adjusted predictions                                    Number of obs = 10,349
Model VCE: OLS

Expression: Linear prediction, predict()
1._at: age = 20
2._at: age = 25
3._at: age = 30
4._at: age = 35
5._at: age = 40
6._at: age = 45
7._at: age = 50
8._at: age = 55
9._at: age = 60

                            Delta-method
   _at#diabetes      Margin   std. err.        t    P>|t|     [95% conf. interval]
 1#Not diabetic     113.119    .3815637   296.46    0.000     112.3711     113.867
     1#Diabetic    111.9162    3.364884    33.26    0.000     105.3204     118.512
 2#Not diabetic     116.271    .3327796   349.39    0.000     115.6187    116.9234
     2#Diabetic    116.1847    2.983741    38.94    0.000      110.336    122.0335
 3#Not diabetic     119.423    .2881485   414.45    0.000     118.8582    119.9879
     3#Diabetic    120.4533    2.607642    46.19    0.000     115.3418    125.5648
 4#Not diabetic     122.575    .2499055   490.49    0.000     122.0852    123.0649
     4#Diabetic    124.7218    2.239132    55.70    0.000     120.3327    129.1109
 5#Not diabetic     125.727    .2213861   567.91    0.000      125.293     126.161
     5#Diabetic    128.9904    1.882671    68.51    0.000        125.3    132.6808
 6#Not diabetic     128.879     .206656   623.64    0.000     128.4739    129.2841
     6#Diabetic    133.2589    1.546613    86.16    0.000     130.2272    136.2905
 7#Not diabetic     132.031    .2086565   632.77    0.000      131.622      132.44
     7#Diabetic    137.5274    1.247557   110.24    0.000      135.082    139.9729
 8#Not diabetic     135.183    .2269454   595.66    0.000     134.7381    135.6278
     8#Diabetic     141.796     1.01863   139.20    0.000     139.7992    143.7927
 9#Not diabetic     138.335    .2580829   536.01    0.000     137.8291    138.8409
     9#Diabetic    146.0645    .9141335   159.78    0.000     144.2726    147.8564

The numbers reported in the Margin column are average values of the linear prediction of SBP for each combination of diabetes category and age. For example, the output tells us that the expected SBP is 113.119 for a 20-year-old person without diabetes and the expected SBP is 146.0645 for a 60-year-old person with diabetes.

The output also reports a standard error, t statistic, p-value, and 95% confidence interval for each estimate. The t statistic tests the null hypothesis that the expected SBP is zero.

We can plot the marginal predictions and their 95% confidence intervals by typing marginsplot.

. marginsplot

Variables that uniquely identify margins: age diabetes

Let's add more options to make our graph look nicer. We can use the legend() option to customize the look of the legend. And we can use the title(), subtitle(), and ytitle() options to add various titles to our graph.

. marginsplot, ytitle("Expected systolic blood pressure (mmHg)")   ///
                title("Expected systolic blood pressure")           ///
                subtitle("By age and diabetes status")              ///
                legend(order(1 "No diabetes" 2 "Diabetes")          ///
                rows(1) position(12))

Variables that uniquely identify margins: age diabetes
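
With nine ages and two groups, the capped confidence-interval spikes can look busy. One possible variation, sketched here under the assumption that the results of margins diabetes, at(age=(20(5)60)) are still in memory, is to draw the predictions as lines and the intervals as shaded bands. The opacity setting in ciopts() assumes Stata 15 or later.

```stata
* Assumes the most recent margins command was:
*   margins diabetes, at(age=(20(5)60))
marginsplot, recast(line) recastci(rarea)                  ///
    ciopts(color(%30))                                     ///
    ytitle("Expected systolic blood pressure (mmHg)")
```

Here recast(line) changes how the predictions are drawn, recastci(rarea) changes how the intervals are drawn, and ciopts() passes options to the interval plots.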

Marginal effects

We can also use margins to estimate marginal predictions for one variable averaged over other variables in the model. For example, we can estimate the expected SBP for categories of diabetes averaged over age.

. margins diabetes

Predictive margins                                      Number of obs = 10,349
Model VCE: OLS

Expression: Linear prediction, predict()

                         Delta-method
    diabetes      Margin   std. err.        t    P>|t|     [95% conf. interval]
Not diabetic    130.5066    .2055351   634.96    0.000     130.1037    130.9094
    Diabetic     135.463    1.385992    97.74    0.000     132.7462    138.1798
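
The table above reports the expected SBP for each group; the difference between the groups is the marginal effect of diabetes. As a sketch, assuming the interaction model is still the active estimation result, the dydx() option of margins estimates that difference directly. For a factor variable such as diabetes, dydx() reports the discrete change from the base category, so the first command should return roughly 4.96, the difference between the two margins above.

```stata
* Assumes "regress bpsystol i.diabetes##c.age" is still the active model.

* Difference in expected SBP (diabetic vs. not), averaged over observed ages
margins, dydx(diabetes)

* The same difference evaluated at ages 20, 25, ..., 60
margins, dydx(diabetes) at(age=(20(5)60))
```

Because the model includes the interaction, the second command should show the difference growing with age, which mirrors the diverging lines in the plot.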

How does it work?

Method 1: Average response

Let's work a simpler example without the interaction to help us understand how margins works. Let's fit a linear regression model including diabetes and age without the interaction. The option coeflegend displays a legend that includes terms that refer to the coefficients in the model.

. regress bpsystol i.diabetes c.age, coeflegend

      Source          SS           df       MS        Number of obs   =    10,349
                                                      F(2, 10346)     =   1601.69
       Model    1331833.99          2   665916.993    Prob > F        =    0.0000
    Residual    4301446.06     10,346   415.759333    R-squared       =    0.2364
                                                      Adj R-squared   =    0.2363
       Total    5633280.05     10,348    544.38346    Root MSE        =     20.39

    bpsystol   Coefficient   Legend
    diabetes
    Diabetic      7.815281   _b[1.diabetes]
         age      .6353169   _b[age]
       _cons      100.2803   _b[_cons]

Let's display the contents of _b[1.diabetes] to verify that it equals 7.815281.

. display _b[1.diabetes]
7.8152815

Now we can use coefficients and indicator variables to generate a new variable that equals the expected SBP assuming every observation in the sample does not have diabetes.

. generate double sbp_diab0 = _b[_cons] + _b[1.diabetes]*0 + _b[age] * age

Next we can generate a new variable that equals the expected SBP assuming every observation in the sample has diabetes.

. generate double sbp_diab1 = _b[_cons] + _b[1.diabetes]*1 + _b[age] * age

Then we can average each of the two variables to estimate the expected SBP if no one in the sample had diabetes and if everyone did. The qualifier if e(sample) restricts the calculation to the observations used to fit the model, that is, those with no missing values for bpsystol, diabetes, or age.

. table () if e(sample), statistic(mean sbp_diab0 sbp_diab1)

sbp_diab0 130.5098
sbp_diab1 138.3251

This matches the results reported by margins.

. margins diabetes

Predictive margins                                      Number of obs = 10,349
Model VCE: OLS

Expression: Linear prediction, predict()

                         Delta-method
    diabetes      Margin   std. err.        t    P>|t|     [95% conf. interval]
Not diabetic    130.5098    .2055982   634.78    0.000     130.1068    130.9128
    Diabetic    138.3251    .9258365   149.41    0.000     136.5103    140.1399
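
The same counterfactual averages can also be obtained without typing the prediction formula by hand. The sketch below temporarily sets diabetes to 0 and then to 1 for every observation and asks predict for the linear prediction each time. It assumes the model without the interaction is the active estimation result; the variable names xb_diab0 and xb_diab1 are made up for this illustration.

```stata
* Assumes "regress bpsystol i.diabetes c.age" is the active model.
preserve                        // keep a copy of the data to restore later
keep if e(sample)               // use only the estimation sample
replace diabetes = 0            // counterfactual: no one has diabetes
predict double xb_diab0, xb
replace diabetes = 1            // counterfactual: everyone has diabetes
predict double xb_diab1, xb
summarize xb_diab0 xb_diab1     // means should match the margins above
restore                         // put the original data back
```

This approach carries over to models with interactions or nonlinear links, because predict applies whatever model was fit rather than a formula written out by hand.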

Method 2: Response at average

In the previous example, we first calculated the response for each observation and then calculated the average of those responses. This is the default method. But we could also calculate the average covariate values first and then report the response at those average values.

Let's begin by using table to estimate the mean of age. As before, the qualifier if e(sample) restricts the calculation to the observations used to fit the model.

. table () if e(sample), statistic(mean age)

Mean 47.5818

Then we can use the mean age to estimate the expected SBP assuming no one in the sample has diabetes.

. display _b[_cons] + _b[1.diabetes] * 0  + _b[age] * 47.5818

We can also calculate the expected SBP assuming everyone in the sample has diabetes.

. display _b[_cons] + _b[1.diabetes] * 1  + _b[age] * 47.5818 

And we can check our work using margins with the atmeans option.

. margins diabetes, atmeans

Adjusted predictions                                    Number of obs = 10,349
Model VCE: OLS

Expression: Linear prediction, predict()
At: 0.diabetes = .9517828 (mean)
    1.diabetes = .0482172 (mean)
    age        =  47.5818 (mean)

                         Delta-method
    diabetes      Margin   std. err.        t    P>|t|     [95% conf. interval]
Not diabetic    130.5098    .2055982   634.78    0.000     130.1068    130.9128
    Diabetic    138.3251    .9258365   149.41    0.000     136.5103    140.1399

Again, the manually calculated results match the results produced by margins.
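
The agreement between the two methods is a consequence of linearity, which a short derivation makes explicit. The notation here is introduced only for illustration: b_0, b_1, and b_2 stand for _b[_cons], _b[1.diabetes], and _b[age]; d is the diabetes value being imposed (0 or 1); and g^{-1} is the inverse link function of a generalized linear model.

```latex
\begin{align*}
\frac{1}{N}\sum_{i=1}^{N}\bigl(b_0 + b_1 d + b_2\,\mathrm{age}_i\bigr)
  &= b_0 + b_1 d + b_2\,\overline{\mathrm{age}}
  && \text{(linear prediction)} \\
\frac{1}{N}\sum_{i=1}^{N} g^{-1}\bigl(b_0 + b_1 d + b_2\,\mathrm{age}_i\bigr)
  &\neq g^{-1}\bigl(b_0 + b_1 d + b_2\,\overline{\mathrm{age}}\bigr)
  && \text{(nonlinear } g^{-1}\text{, in general)}
\end{align*}
```

The first line says that averaging the predictions and predicting at the average age give the same number when the prediction is linear in age. The second line says that this equality generally fails once a nonlinear transformation is applied to the linear prediction.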

Estimating the average response (method 1) and the response at the average (method 2) gives us the same results for linear regression. But the results may differ for generalized linear models such as probit, logistic, or Poisson regression.
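
To see the two methods diverge, one could repeat the comparison with a logistic model. This is only a sketch: it assumes nhanes2l is loaded and that it contains the binary variable highbp (an indicator for high blood pressure), which is not used elsewhere in this tutorial. Note that it replaces the active estimation results.

```stata
* Assumes nhanes2l is in memory and includes the 0/1 variable highbp.
logistic highbp i.diabetes c.age

* Method 1: average of the predicted probabilities
margins diabetes

* Method 2: predicted probability at the mean of age
margins diabetes, atmeans
```

The two margins tables should be close but not identical, because the predicted probability is a nonlinear function of age.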

You can read more about factor-variable notation, margins, and marginsplot in the Stata documentation. You can also watch a demonstration of these commands by clicking on the links to the YouTube videos below.