
Factor-variable notation is a collection of prefixes and operators that allows us to specify regression models quickly and easily. We can distinguish between continuous and categorical variables, select reference categories, specify interactions between variables, and include polynomials of continuous variables. And factor-variable notation works with nearly all of Stata's regression commands, such as regress, probit, logit, and poisson.

Let's begin by opening the nhanes2l dataset. Then let's describe and summarize the variables bpsystol, age, bmi, diabetes, and hlthstat.

. webuse nhanes2l
(Second National Health and Nutrition Examination Survey)

. describe bpsystol hlthstat diabetes age bmi

Variable name   Storage type   Display format   Value label   Variable label
bpsystol        int            %9.0g                          Systolic blood pressure
hlthstat        byte           %20.0g           hlth          Health status
diabetes        byte           %12.0g           diabetes      Diabetes status
age             byte           %9.0g                          Age (years)
bmi             float          %9.0g                          Body mass index (BMI)

. summarize bpsystol hlthstat diabetes age bmi
Variable Obs Mean Std. dev. Min Max
bpsystol 10,351 130.8817 23.33265 65 300
hlthstat 10,335 2.586164 1.206196 1 5
diabetes 10,349 .0482172 .2142353 0 1
age 10,351 47.57965 17.21483 20 74
bmi 10,351 25.5376 4.914969 12.3856 61.1297

We are going to fit a series of linear regression models for the outcome variable bpsystol, which measures systolic blood pressure with a range of 65 to 300 mmHg. hlthstat measures health status on a scale from 1 to 5. diabetes is a binary indicator of diabetes status coded 0 or 1. age measures age with a range of 20 to 74 years. And bmi measures body mass index with a range of 12.4 to 61.1 kg/m2.

Factor-variable notation for categorical variables

Let's begin with a model including the predictor variable hlthstat. We suspect that hlthstat is a categorical variable because its description shows a value label named “hlth” and its summary has a minimum value of 1 and a maximum value of 5. Let's use label list to view the category labels.

. label list hlth
hlth:
           1 Excellent
           2 Very good
           3 Good
           4 Fair
           5 Poor
          .a Blank but applicable

hlthstat has five categories labeled Excellent, Very good, Good, Fair, and Poor. Stata's regression commands treat predictor variables as continuous by default, so we need to create indicator variables for each category of hlthstat. We could do this manually, but it is easier to use the “i.” prefix. The “i.” prefix is factor-variable notation that tells Stata a variable is categorical, and Stata will create temporary indicator variables for us automatically. Let's type list hlthstat i.hlthstat in 1/10 to see how it works for the first 10 observations.

. list hlthstat i.hlthstat in 1/10

hlthstat   1.hlthstat   2.hlthstat   3.hlthstat   4.hlthstat   5.hlthstat
1. Very good 0 1 0 0 0
2. Very good 0 1 0 0 0
3. Good 0 0 1 0 0
4. Fair 0 0 0 1 0
5. Very good 0 1 0 0 0
6. Poor 0 0 0 0 1
7. Very good 0 1 0 0 0
8. Excellent 1 0 0 0 0
9. Very good 0 1 0 0 0
10. Poor 0 0 0 0 1

The first column lists the value of hlthstat for the first 10 observations in our dataset. The next five columns, named 1.hlthstat through 5.hlthstat, are temporary indicator variables that Stata created for us. Category 1 in hlthstat is labeled “Excellent”, so the indicator variable 1.hlthstat will equal 1 when hlthstat equals “Excellent” and 0 otherwise. Category 2 in hlthstat is labeled “Very good”, so the indicator variable 2.hlthstat will equal 1 when hlthstat equals “Very good” and 0 otherwise. The indicator variables 3.hlthstat, 4.hlthstat, and 5.hlthstat follow the same pattern for “Good”, “Fair”, and “Poor”, respectively. Note that the indicator variables do not remain in the dataset after the command finishes running.
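
If we wanted indicator variables that remain in the dataset, we could create them manually. Here is a minimal sketch using generate; the variable names exc through poor are hypothetical choices, and the if qualifier keeps the indicators missing when hlthstat is missing.

. generate byte exc   = (hlthstat == 1) if !missing(hlthstat)
. generate byte vgood = (hlthstat == 2) if !missing(hlthstat)
. generate byte good  = (hlthstat == 3) if !missing(hlthstat)
. generate byte fair  = (hlthstat == 4) if !missing(hlthstat)
. generate byte poor  = (hlthstat == 5) if !missing(hlthstat)

The “i.” prefix saves us this work and, just as important, tells Stata and its postestimation commands that the variable is categorical.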

We can use the “i.” prefix with regress to treat hlthstat as a categorical predictor variable.

. regress bpsystol i.hlthstat

Source SS df MS Number of obs = 10,335
F(4, 10330) = 158.34
Model 325244.686 4 81311.1715 Prob > F = 0.0000
Residual 5304728.67 10,330 513.526492 R-squared = 0.0578
Adj R-squared = 0.0574
Total 5629973.35 10,334 544.800982 Root MSE = 22.661
bpsystol Coefficient Std. err. t P>|t| [95% conf. interval]
hlthstat
Very good 2.981587 .6415165 4.65 0.000 1.72409 4.239083
Good 8.034913 .6230047 12.90 0.000 6.813703 9.256123
Fair 14.71925 .721698 20.40 0.000 13.30459 16.13392
Poor 16.42304 .9580047 17.14 0.000 14.54517 18.30092
_cons 124.3191 .4618951 269.15 0.000 123.4137 125.2245

The output includes a coefficient for the intercept, labeled “_cons”, as well as coefficients for “Very good”, “Good”, “Fair”, and “Poor”. The “Excellent” category was automatically omitted from the model and used as the comparison group, called the “reference category”. By default, Stata selects the category with the smallest value as the reference category, and in a model with a single categorical predictor the intercept equals the mean of the outcome in that category. So the mean systolic blood pressure for the “Excellent” category is 124.3 mmHg. The coefficient for each remaining category is the difference between the mean outcome in that category and the mean outcome in the reference category. For example, the coefficient for the “Poor” group is 16.4, so the mean systolic blood pressure in the “Poor” group is 16.4 mmHg higher than in the “Excellent” group.
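
If we want to see the estimated mean systolic blood pressure in every category, including the reference category, one option is to follow the regression with the margins command. The sketch below assumes the model above has just been fit; output is omitted here.

. margins hlthstat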

We can select a different reference category using the “ib(#).” prefix, where “#” is the category number for the reference category. Let's use hlthstat category 5, “Poor”, as the reference category.

. regress bpsystol ib(5).hlthstat

Source SS df MS Number of obs = 10,335
F(4, 10330) = 158.34
Model 325244.686 4 81311.1715 Prob > F = 0.0000
Residual 5304728.67 10,330 513.526492 R-squared = 0.0578
Adj R-squared = 0.0574
Total 5629973.35 10,334 544.800982 Root MSE = 22.661
bpsystol Coefficient Std. err. t P>|t| [95% conf. interval]
hlthstat
Excellent -16.42304 .9580047 -17.14 0.000 -18.30092 -14.54517
Very good -13.44146 .9500643 -14.15 0.000 -15.30377 -11.57915
Good -8.38813 .937664 -8.95 0.000 -10.22613 -6.550127
Fair -1.703789 1.005946 -1.69 0.090 -3.675638 .2680593
_cons 140.7421 .8393008 167.69 0.000 139.0969 142.3873

The “Poor” category is now omitted from the output and “Excellent” is included. The coefficient for _cons, 140.7, is now the mean systolic blood pressure in the “Poor” group, and the mean systolic blood pressure in the “Excellent” group is 16.4 mmHg lower than the “Poor” group.

We can also use the prefix “ib(frequent).” to select the category with the largest sample size as the reference category. We can type tabulate hlthstat to verify that the “Good” category has the largest sample size.

. tabulate hlthstat

Health status Freq. Percent Cum.
Excellent 2,407 23.29 23.29
Very good 2,591 25.07 48.36
Good 2,938 28.43 76.79
Fair 1,670 16.16 92.95
Poor 729 7.05 100.00
Total 10,335 100.00

. regress bpsystol ib(frequent).hlthstat

Source SS df MS Number of obs = 10,335
F(4, 10330) = 158.34
Model 325244.686 4 81311.1715 Prob > F = 0.0000
Residual 5304728.67 10,330 513.526492 R-squared = 0.0578
Adj R-squared = 0.0574
Total 5629973.35 10,334 544.800982 Root MSE = 22.661
bpsystol Coefficient Std. err. t P>|t| [95% conf. interval]
hlthstat
Excellent -8.034913 .6230047 -12.90 0.000 -9.256123 -6.813703
Very good -5.053326 .6107242 -8.27 0.000 -6.250464 -3.856189
Fair 6.684341 .6944701 9.63 0.000 5.323045 8.045637
Poor 8.38813 .937664 8.95 0.000 6.550127 10.22613
_cons 132.354 .4180763 316.58 0.000 131.5345 133.1735

We can also use the prefix “ib(none).” to tell Stata not to select a reference category. Combined with the noconstant option, this will display the mean outcome for each category.

. regress bpsystol ib(none).hlthstat, noconstant

Source SS df MS Number of obs = 10,335
F(5, 10330) = 69083.04
Model 177379866 5 35475973.3 Prob > F = 0.0000
Residual 5304728.67 10,330 513.526492 R-squared = 0.9710
Adj R-squared = 0.9709
Total 182684595 10,335 17676.3033 Root MSE = 22.661
bpsystol Coefficient Std. err. t P>|t| [95% conf. interval]
hlthstat
Excellent 124.3191 .4618951 269.15 0.000 123.4137 125.2245
Very good 127.3007 .4451924 285.95 0.000 126.428 128.1733
Good 132.354 .4180763 316.58 0.000 131.5345 133.1735
Fair 139.0383 .5545276 250.73 0.000 137.9513 140.1253
Poor 140.7421 .8393008 167.69 0.000 139.0969 142.3873

The output tells us that the mean systolic blood pressure in the “Excellent” category is 124.3 and the mean systolic blood pressure in the “Poor” group is 140.7.

Factor-variable notation for binary variables

Binary variables are simply categorical variables with two categories, so everything we discussed above applies to them as well. Binary variables are often coded as 0/1 indicator variables, but you should still use the “i.” prefix if you plan to use postestimation commands, such as margins, after you fit a regression model. Let's look at a few quick examples in the interest of completeness.

Here is a model that includes diabetes as a binary predictor variable.

. regress bpsystol i.diabetes

Source SS df MS Number of obs = 10,349
F(1, 10347) = 244.99
Model 130296.034 1 130296.034 Prob > F = 0.0000
Residual 5502984.01 10,347 531.843434 R-squared = 0.0231
Adj R-squared = 0.0230
Total 5633280.05 10,348 544.38346 Root MSE = 23.062
bpsystol Coefficient Std. err. t P>|t| [95% conf. interval]
diabetes
Diabetic 16.56328 1.058212 15.65 0.000 14.48898 18.63758
_cons 130.088 .2323666 559.84 0.000 129.6325 130.5435

Let's use factor-variable notation to select people with diabetes as the reference category.

. regress bpsystol ib(1).diabetes

Source SS df MS Number of obs = 10,349
F(1, 10347) = 244.99
Model 130296.034 1 130296.034 Prob > F = 0.0000
Residual 5502984.01 10,347 531.843434 R-squared = 0.0231
Adj R-squared = 0.0230
Total 5633280.05 10,348 544.38346 Root MSE = 23.062
bpsystol Coefficient Std. err. t P>|t| [95% conf. interval]
diabetes
Not diabetic -16.56328 1.058212 -15.65 0.000 -18.63758 -14.48898
_cons 146.6513 1.032385 142.05 0.000 144.6276 148.675

Let's fit a model with no intercept and no reference category.

. regress bpsystol ib(none).diabetes, noconstant

Source SS df MS Number of obs = 10,349
F(2, 10347) > 99999.00
Model 177422292 2 88711146 Prob > F = 0.0000
Residual 5502984.01 10,347 531.843434 R-squared = 0.9699
Adj R-squared = 0.9699
Total 182925276 10,349 17675.6475 Root MSE = 23.062
bpsystol Coefficient Std. err. t P>|t| [95% conf. interval]
diabetes
Not diabetic 130.088 .2323666 559.84 0.000 129.6325 130.5435
Diabetic 146.6513 1.032385 142.05 0.000 144.6276 148.675
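
Because we used the “i.” prefix, postestimation commands such as margins recognize that diabetes is categorical. As a quick sketch (output omitted), we could refit the first model and then estimate the mean systolic blood pressure in each group and the difference between the groups:

. regress bpsystol i.diabetes
. margins diabetes
. margins, dydx(diabetes)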

Factor-variable notation for continuous variables

Stata's regression commands treat predictor variables as continuous by default. But you can use the “c.” prefix to tell Stata explicitly that a predictor variable should be treated as continuous. This will be necessary when you include continuous variables in interactions with other variables.

Here is a quick example treating age as a continuous predictor variable.

. regress bpsystol c.age

Source SS df MS Number of obs = 10,351
F(1, 10349) = 3116.79
Model 1304200.02 1 1304200.02 Prob > F = 0.0000
Residual 4330470.01 10,349 418.443328 R-squared = 0.2315
Adj R-squared = 0.2314
Total 5634670.03 10,350 544.412563 Root MSE = 20.456
bpsystol Coefficient Std. err. t P>|t| [95% conf. interval]
age .6520775 .0116801 55.83 0.000 .6291823 .6749727
_cons 99.85603 .5909867 168.96 0.000 98.69758 101.0145
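
Because continuous is the default, typing the command without the prefix gives the same estimates; the “c.” prefix simply makes our intent explicit.

. regress bpsystol age

The prefix becomes essential once age appears in an interaction, which we turn to next.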

Factor-variable notation for interactions

Factor-variable notation also includes two interaction operators. The “#” operator specifies an interaction between two variables, and the “##” operator specifies both the main effects and the interaction of two variables.

Let's fit a model that includes the main effects for hlthstat and diabetes and use the “#” operator to include their interaction.

. regress bpsystol i.hlthstat i.diabetes i.hlthstat#i.diabetes

Source SS df MS Number of obs = 10,335
F(9, 10325) = 86.92
Model 396524.045 9 44058.2272 Prob > F = 0.0000
Residual 5233449.31 10,325 506.871604 R-squared = 0.0704
Adj R-squared = 0.0696
Total 5629973.35 10,334 544.800982 Root MSE = 22.514
bpsystol Coefficient Std. err. t P>|t| [95% conf. interval]
hlthstat
Very good 2.636051 .6417076 4.11 0.000 1.37818 3.893922
Good 7.648725 .6272209 12.19 0.000 6.419251 8.8782
Fair 13.50647 .7408272 18.23 0.000 12.0543 14.95863
Poor 14.77223 1.032484 14.31 0.000 12.74837 16.7961
diabetes
Diabetic 5.780232 4.618696 1.25 0.211 -3.273308 14.83377
hlthstat#
diabetes
Very good #
Diabetic 17.43339 5.726714 3.04 0.002 6.207924 28.65886
Good #
Diabetic 4.023894 5.032308 0.80 0.424 -5.840404 13.88819
Fair #
Diabetic 7.316062 4.97969 1.47 0.142 -2.445096 17.07722
Poor #
Diabetic 3.445358 5.09316 0.68 0.499 -6.538222 13.42894
_cons 124.2614 .4611975 269.43 0.000 123.3574 125.1655

We could fit the same model using the “##” operator.

. regress bpsystol i.hlthstat##i.diabetes

Source SS df MS Number of obs = 10,335
F(9, 10325) = 86.92
Model 396524.045 9 44058.2272 Prob > F = 0.0000
Residual 5233449.31 10,325 506.871604 R-squared = 0.0704
Adj R-squared = 0.0696
Total 5629973.35 10,334 544.800982 Root MSE = 22.514
bpsystol Coefficient Std. err. t P>|t| [95% conf. interval]
hlthstat
Very good 2.636051 .6417076 4.11 0.000 1.37818 3.893922
Good 7.648725 .6272209 12.19 0.000 6.419251 8.8782
Fair 13.50647 .7408272 18.23 0.000 12.0543 14.95863
Poor 14.77223 1.032484 14.31 0.000 12.74837 16.7961
diabetes
Diabetic 5.780232 4.618696 1.25 0.211 -3.273308 14.83377
hlthstat#
diabetes
Very good #
Diabetic 17.43339 5.726714 3.04 0.002 6.207924 28.65886
Good #
Diabetic 4.023894 5.032308 0.80 0.424 -5.840404 13.88819
Fair #
Diabetic 7.316062 4.97969 1.47 0.142 -2.445096 17.07722
Poor #
Diabetic 3.445358 5.09316 0.68 0.499 -6.538222 13.42894
_cons 124.2614 .4611975 269.43 0.000 123.3574 125.1655
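
If we want the estimated mean systolic blood pressure for every combination of health status and diabetes status, we could follow either version of the model with margins. A minimal sketch (output omitted):

. margins hlthstat#diabetes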

We can include interactions with continuous variables too.

. regress bpsystol i.diabetes##c.age

Source SS df MS Number of obs = 10,349
F(3, 10345) = 1071.05
Model 1335031.79 3 445010.595 Prob > F = 0.0000
Residual 4298248.26 10,345 415.490407 R-squared = 0.2370
Adj R-squared = 0.2368
Total 5633280.05 10,348 544.38346 Root MSE = 20.384
bpsystol Coefficient Std. err. t P>|t| [95% conf. interval]
diabetes
Diabetic -5.669005 4.952369 -1.14 0.252 -15.37661 4.038595
age .6303981 .0119464 52.77 0.000 .6069808 .6538154
diabetes#
c.age
Diabetic .2233087 .0804934 2.77 0.006 .065526 .3810913
_cons 100.5111 .5969456 168.38 0.000 99.34096 101.6812
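
The interaction means that the effect of diabetes on systolic blood pressure changes with age. One way to see this is to estimate the mean for each diabetes group at a few ages; the ages 30, 50, and 70 below are arbitrary choices for this sketch (output omitted).

. margins diabetes, at(age=(30 50 70))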

We can even include three-way and higher-order interactions using the “#” and “##” operators.

. regress bpsystol i.hlthstat##i.diabetes##c.age

Source SS df MS Number of obs = 10,335
F(19, 10315) = 173.56
Model 1363865.23 19 71782.3807 Prob > F = 0.0000
Residual 4266108.12 10,315 413.582949 R-squared = 0.2423
Adj R-squared = 0.2409
Total 5629973.35 10,334 544.800982 Root MSE = 20.337
bpsystol Coefficient Std. err. t P>|t| [95% conf. interval]
hlthstat
Very good -.2522701 1.571793 -0.16 0.872 -3.333289 2.828748
Good -1.269239 1.640212 -0.77 0.439 -4.484373 1.945895
Fair -1.892737 2.323042 -0.81 0.415 -6.446351 2.660877
Poor -1.470403 4.440142 -0.33 0.741 -10.17394 7.233137
diabetes
Diabetic 5.648359 16.10149 0.35 0.726 -25.91369 37.21041
hlthstat#
diabetes
Very good #
Diabetic .6634293 26.12969 0.03 0.980 -50.55583 51.88269
Good #
Diabetic -16.56507 18.00713 -0.92 0.358 -51.86255 18.7324
Fair #
Diabetic -7.761426 18.83079 -0.41 0.680 -44.67343 29.15058
Poor #
Diabetic -5.055061 20.09251 -0.25 0.801 -44.44028 34.33016
age .5505586 .0261998 21.01 0.000 .499202 .6019153
hlthstat#
c.age
Very good .026618 .0352546 0.76 0.450 -.0424879 .0957239
Good .084684 .0349617 2.42 0.015 .0161522 .1532157
Fair .1210264 .0438944 2.76 0.006 .0349849 .2070679
Poor .0900039 .0752338 1.20 0.232 -.057469 .2374768
diabetes#
c.age
Diabetic -.1428421 .2867743 -0.50 0.618 -.7049754 .4192913
hlthstat#
diabetes#
c.age
Very good #
Diabetic .2297988 .4324672 0.53 0.595 -.6179209 1.077518
Good #
Diabetic .3910658 .316956 1.23 0.217 -.2302295 1.012361
Fair #
Diabetic .3139083 .3258971 0.96 0.335 -.3249132 .9527298
Poor #
Diabetic .26957 .3465917 0.78 0.437 -.409817 .948957
_cons 102.2407 1.127687 90.66 0.000 100.0302 104.4512

We have already learned that Stata treats predictor variables as continuous by default. But the opposite is true with interaction operators: both “#” and “##” treat variables as categorical predictors if you do not specify a prefix. So typing hlthstat##diabetes would work. But typing diabetes##age would make a mess, because age would be treated as a categorical variable and Stata would create an indicator variable for every distinct value of age. When in doubt, use the “i.” and “c.” prefixes to avoid mistakes.
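
For example, the sketch below contrasts the two (output omitted). The first command would expand age into a separate indicator variable for each observed age; the second treats age as continuous and matches the model we fit above.

. regress bpsystol diabetes##age
. regress bpsystol i.diabetes##c.age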

The prefixes also have a “distributive property” when used with parentheses. The syntax below treats hlthstat and diabetes as categorical predictors and fits a model that includes their main effects as well as their interactions with age. Note that the model will not include the interaction of hlthstat and diabetes.

. regress bpsystol i.(hlthstat diabetes)##c.age

Source SS df MS Number of obs = 10,335
F(11, 10323) = 298.72
Model 1359359.05 11 123578.096 Prob > F = 0.0000
Residual 4270614.3 10,323 413.698954 R-squared = 0.2415
Adj R-squared = 0.2406
Total 5629973.35 10,334 544.800982 Root MSE = 20.34
bpsystol Coefficient Std. err. t P>|t| [95% conf. interval]
hlthstat
Very good -.5801787 1.56339 -0.37 0.711 -3.644726 2.484369
Good -1.453802 1.627043 -0.89 0.372 -4.643121 1.735517
Fair -2.078403 2.286625 -0.91 0.363 -6.56063 2.403824
Poor -.9296666 4.211361 -0.22 0.825 -9.18475 7.325417
diabetes
Diabetic -5.664698 5.022147 -1.13 0.259 -15.50908 4.179683
age .5433911 .0259983 20.90 0.000 .4924295 .5943527
hlthstat#
c.age
Very good .0382409 .034885 1.10 0.273 -.0301404 .1066222
Good .0887067 .0345224 2.57 0.010 .021036 .1563773
Fair .1300174 .0430386 3.02 0.003 .0456535 .2143813
Poor .0888559 .0713922 1.24 0.213 -.0510867 .2287985
diabetes#
c.age
Diabetic .2067666 .0816404 2.53 0.011 .0467356 .3667976
_cons 102.4518 1.122841 91.24 0.000 100.2508 104.6528
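
For reference, the distributed syntax above should be equivalent to spelling out each term, as in this sketch:

. regress bpsystol i.hlthstat i.diabetes c.age i.hlthstat#c.age i.diabetes#c.age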

Factor-variable notation for polynomials

We can also use the “#” and “##” operators to specify polynomial terms for continuous variables. For example, we may wish to fit a model that includes both age and the square of age. We can do this by interacting age with itself.

. regress bpsystol c.age##c.age

Source SS df MS Number of obs = 10,351
F(2, 10348) = 1592.42
Model 1326071.99 2 663035.995 Prob > F = 0.0000
Residual 4308598.04 10,348 416.370123 R-squared = 0.2353
Adj R-squared = 0.2352
Total 5634670.03 10,350 544.412563 Root MSE = 20.405
bpsystol Coefficient Std. err. t P>|t| [95% conf. interval]
age .0345687 .0859928 0.40 0.688 -.1339939 .2031312
c.age#c.age .0066366 .0009157 7.25 0.000 .0048417 .0084315
_cons 112.2463 1.808325 62.07 0.000 108.7017 115.791
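
Before factor-variable notation, we might have created the squared term by hand, as in the sketch below (agesq is a hypothetical variable name). The factor-variable version is preferred because Stata then knows that both terms are functions of age, which matters for postestimation commands such as margins.

. generate agesq = age^2
. regress bpsystol age agesq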

We could include a term for age cubed.

. regress bpsystol c.age##c.age##c.age

Source SS df MS Number of obs = 10,351
F(3, 10347) = 1065.37
Model 1329759.5 3 443253.167 Prob > F = 0.0000
Residual 4304910.52 10,347 416.053979 R-squared = 0.2360
Adj R-squared = 0.2358
Total 5634670.03 10,350 544.412563 Root MSE = 20.397
bpsystol Coefficient Std. err. t P>|t| [95% conf. interval]
age -1.107037 .3929805 -2.82 0.005 -1.877355 -.3367196
c.age#c.age .0329455 .0088844 3.71 0.000 .0155303 .0503607
c.age#c.age#
c.age -.0001879 .0000631 -2.98 0.003 -.0003116 -.0000642
_cons 112.2463 1.808325 62.07 0.000 108.7017 115.791

We can also include the square of age when we include an interaction of age with another variable.

. regress bpsystol i.diabetes##c.age c.age#c.age

Source SS df MS Number of obs = 10,349
F(4, 10344) = 817.53
Model 1353111.75 4 338277.939 Prob > F = 0.0000
Residual 4280168.29 10,344 413.782704 R-squared = 0.2402
Adj R-squared = 0.2399
Total 5633280.05 10,348 544.38346 Root MSE = 20.342
bpsystol Coefficient Std. err. t P>|t| [95% conf. interval]
diabetes
Diabetic -.8886553 4.994811 -0.18 0.859 -10.67945 8.902141
age .0640567 .0865028 0.74 0.459 -.1055054 .2336188
diabetes#
c.age
Diabetic .1403559 .0813022 1.73 0.084 -.0190122 .2997239
c.age#c.age .0061116 .0009246 6.61 0.000 .0042992 .0079239
_cons 111.823 1.812009 61.71 0.000 108.2711 115.3749

You can read more about factor-variable notation in the Stata documentation listed below. You can also watch demonstrations of these commands in Stata's YouTube videos.

Tell me more

Read more in the Stata Base Reference Manual; see [R] regress. And in the Stata User’s Guide, see [U] 11 Factor variables.