
Stata handles factor (categorical) variables elegantly. Prefix a variable with i. to specify indicators for each level (category) of the variable. Put # between two variables to create an interaction: indicators for each combination of the categories of the variables. Put ## instead to specify a full factorial of the variables: main effects for each variable plus their interaction. To interact a continuous variable with a factor variable, prefix the continuous variable with c. (for example, c.bmi). You can specify up to eight-way interactions.
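As a minimal sketch of these operators, the commands below use the nlsw88 example dataset that ships with Stata, not the cholesterol data shown on this page. The choice of wage, union, race, and tenure is purely illustrative.

```stata
* Illustrative only: nlsw88 ships with Stata and is not the data used on this page
sysuse nlsw88, clear

* i.  indicators for each level of race (one level is omitted as the base)
regress wage i.race

* #   indicators for each union-by-race combination, without main effects
regress wage i.union#i.race

* ##  full factorial: main effects plus the interaction
regress wage i.union##i.race

* c.  interact continuous tenure with the factor variable union
regress wage i.union##c.tenure

* fvexpand lists the terms a specification expands to, without estimating
fvexpand i.union##i.race
display "`r(varlist)'"
```

Running fvexpand before a long estimation command is a quick way to check that a specification expands to the terms you intended.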

Here we run a linear regression of cholesterol level on a full factorial of smoking status (smoker) and age group (agegrp), together with continuous body mass index (bmi) and the interaction of bmi with smoking status.

. regress cholesterol i.smoker##agegrp bmi i.smoker#c.bmi

       Source |       SS           df       MS      Number of obs   =     4,049
--------------+----------------------------------   F(9, 4039)      =     15.30
        Model |  137.845627         9  15.3161808   Prob > F        =    0.0000
     Residual |  4044.55849     4,039   1.0013762   R-squared       =    0.0330
--------------+----------------------------------
        Total |  4182.40412     4,048   1.0332026   Root MSE        =    1.0007

-------------------------------------------------------------------------------
  cholesterol | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
--------------+----------------------------------------------------------------
       smoker |
      smoker  |   -.7699108    .337665   -2.28   0.023    -1.431921   -.1079012
              |
       agegrp |
       45-49  |    .1554985   .0620537    2.51   0.012     .0338391    .2771579
       50-54  |    .1838839   .0618467    2.97   0.003     .0626303    .3051375
       55-59  |    .1746813   .0763244    2.29   0.022     .0250433    .3243193
              |
smoker#agegrp |
smoker#45-49  |    -.118553   .1367914   -0.87   0.386    -.3867396    .1496336
smoker#50-54  |   -.1332379   .1363604   -0.98   0.329    -.4005796    .1341038
smoker#55-59  |   -.2466412   .1717679   -1.44   0.151    -.5834009    .0901185
              |
          bmi |    .0253916   .0059336    4.28   0.000     .0137585    .0370246
              |
 smoker#c.bmi |
      smoker  |    .0501707   .0129223    3.88   0.000     .0248358    .0755055
              |
        _cons |    5.437234   .1520921   35.75   0.000     5.139049    5.735418
-------------------------------------------------------------------------------



We could have used parenthesis binding to type the same model more briefly:

. regress cholesterol smoker##(agegrp c.bmi)
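One way to convince yourself that the two forms are equivalent is to fit both and compare the stored estimates. The sketch below does this with the nlsw88 example dataset that ships with Stata; the variables and the stored-estimate names short and long are illustrative assumptions, not part of the example above.

```stata
* Illustrative only: nlsw88 ships with Stata and is not the data used on this page
sysuse nlsw88, clear

* Short form using parenthesis binding
regress wage i.union##(i.race c.tenure)
estimates store short

* The same model written out term by term
regress wage i.union##i.race c.tenure i.union#c.tenure
estimates store long

* Coefficients and standard errors should agree column for column
estimates table short long, b(%9.4f) se
```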


Base levels can be changed on the fly: i.agegrp uses the default base level, which is the lowest value of the variable (here 1), whereas b3.agegrp makes 3 the base level.
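A short sketch of the base-level operators follows, again on the nlsw88 example dataset that ships with Stata (an assumption made for illustration; race there is coded 1, 2, 3). The ib3. spelling used below is the longer form of b3.

```stata
* Illustrative only: nlsw88 ships with Stata and is not the data used on this page
sysuse nlsw88, clear

* Default: the lowest level of race (1) is the base
regress wage i.race

* ib3. makes level 3 the base for this command only
regress wage ib3.race

* ib(last). and ib(freq). choose the highest and the most frequent level
regress wage ib(last).race
regress wage ib(freq).race

* fvset base changes the default base recorded for the variable
fvset base 2 race
regress wage i.race
```

The operator forms affect a single command, whereas fvset base persists for the variable until it is reset, so the operator is usually the safer choice for a one-off comparison.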

The level indicator variables are never created in your dataset, which can save a great deal of space when a variable has many levels or enters many interactions.
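A simple check of this, sketched on the nlsw88 example dataset that ships with Stata, is to compare the number of variables in memory before and after estimation. The local macro name k_before is a hypothetical name chosen for this illustration.

```stata
* Illustrative only: nlsw88 ships with Stata and is not the data used on this page
sysuse nlsw88, clear

* Record the number of variables in the dataset before estimation
local k_before = c(k)

regress wage i.race i.union#i.race

* The assertion passes silently if no indicator variables were added
assert c(k) == `k_before'

* Factor-variable notation also works outside estimation commands
summarize i.race
```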

Factor variables are integrated deeply into Stata’s processing of variable lists, providing a consistent way of interacting with both estimation and postestimation commands.
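As a sketch of what this integration looks like in practice, the postestimation commands below refer to model terms with the same factor-variable notation used to fit the model. The example uses the nlsw88 dataset that ships with Stata, and the model is an illustrative assumption rather than the cholesterol model above.

```stata
* Illustrative only: nlsw88 ships with Stata and is not the data used on this page
sysuse nlsw88, clear
regress wage i.union##(i.race c.tenure)

* Joint test of the union-by-race interaction terms
testparm i.union#i.race

* Predictive margins for every union-by-race cell
margins union#race

* Average marginal effect of tenure, separately by union status
margins union, dydx(tenure)

* Contrasts of union within each level of race
contrast union@race
```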