How do I keep all levels of my categorical variable in my model?

How do I specify a cell means model?

Title   Keeping all levels of a variable in the model
Author Kenneth Higbee, StataCorp

The following example uses regress as the estimation command, but the same approach applies to any estimation command that has a noconstant option.

You might try

. sysuse auto, clear
(1978 Automobile Data)

. regress mpg i.rep78, noconstant

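      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(4, 65)        =    188.12
       Model |  30942.2129         4  7735.55322   Prob > F        =    0.0000
    Residual |  2672.78712        65  41.1198019   R-squared       =    0.9205
-------------+----------------------------------   Adj R-squared   =    0.9156
       Total |       33615        69  487.173913   Root MSE        =    6.4125

------------------------------------------------------------------------------
         mpg | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       rep78 |
          2  |     19.125   2.267151     8.44   0.000     14.59719    23.65281
          3  |   19.43333   1.170752    16.60   0.000     17.09518    21.77149
          4  |   21.66667   1.511434    14.34   0.000     18.64812    24.68521
          5  |   27.36364   1.933433    14.15   0.000      23.5023    31.22497
------------------------------------------------------------------------------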

and then wonder why the first level of rep78 does not appear in your regression table. Adding the baselevels option to the regress command shows that the first level is treated as the base level and has been omitted from the model. Because the model also has no constant, the fitted value for level 1 is forced to be zero.

. regress mpg i.rep78, noconstant baselevels

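      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(4, 65)        =    188.12
       Model |  30942.2129         4  7735.55322   Prob > F        =    0.0000
    Residual |  2672.78712        65  41.1198019   R-squared       =    0.9205
-------------+----------------------------------   Adj R-squared   =    0.9156
       Total |       33615        69  487.173913   Root MSE        =    6.4125

------------------------------------------------------------------------------
         mpg | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       rep78 |
          1  |          0  (base)
          2  |     19.125   2.267151     8.44   0.000     14.59719    23.65281
          3  |   19.43333   1.170752    16.60   0.000     17.09518    21.77149
          4  |   21.66667   1.511434    14.34   0.000     18.64812    24.68521
          5  |   27.36364   1.933433    14.15   0.000      23.5023    31.22497
------------------------------------------------------------------------------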
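
One way to see what omitting the base level means when there is no constant is to look at the fitted values by level. The following is a minimal sketch in do-file form (no ". " prompt); the variable name fitted is made up for illustration.

```stata
* Sketch: what the no-constant model implies for the omitted base level
sysuse auto, clear
quietly regress mpg i.rep78, noconstant

* Linear prediction for the estimation sample ("fitted" is a hypothetical name)
predict double fitted if e(sample), xb

* Mean fitted value within each level of rep78
tabulate rep78, summarize(fitted)
```

The mean fitted value for level 1 should be 0, while the means for levels 2 through 5 should match the coefficients in the table above.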

The ibn. factor-variable operator specifies that a categorical variable should be treated as if it has no base, or, in other words, that all levels of the categorical variable are to be included in the model; see [U] 11.4.3 Factor variables.
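
If you would rather not type ibn. in every command, the fvset command can record the same request as a setting on the variable. This is a sketch of that alternative, not something the examples in this FAQ rely on; see the fvset documentation for details.

```stata
sysuse auto, clear

* Declare that rep78 has no base level whenever it is used as a factor variable
fvset base none rep78

* With that setting, i.rep78 should behave like ibn.rep78
regress mpg i.rep78, noconstant

* Restore the default factor-variable settings for rep78
fvset clear rep78
```

The setting applies to the dataset in memory and persists only if the dataset is saved.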

What happens when you specify that rep78 should have no base level but leave the constant in the model?

. regress mpg ibn.rep78

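note: 5.rep78 omitted because of collinearity

      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(4, 64)        =      4.91
       Model |  549.415777         4  137.353944   Prob > F        =    0.0016
    Residual |  1790.78712        64  27.9810488   R-squared       =    0.2348
-------------+----------------------------------   Adj R-squared   =    0.1869
       Total |   2340.2029        68  34.4147485   Root MSE        =    5.2897

------------------------------------------------------------------------------
         mpg | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       rep78 |
          1  |  -6.363636   4.066234    -1.56   0.123    -14.48687    1.759599
          2  |  -8.238636   2.457918    -3.35   0.001    -13.14889    -3.32838
          3  |  -7.930303    1.86452    -4.25   0.000    -11.65511   -4.205497
          4  |   -5.69697    2.02441    -2.81   0.006    -9.741193   -1.652747
          5  |          0  (omitted)
             |
       _cons |   27.36364   1.594908    17.16   0.000     24.17744    30.54983
------------------------------------------------------------------------------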

One level of rep78 is omitted from the model despite your request that there be no base level for rep78. When a model contains the constant and every level of a categorical variable, the level indicators are perfectly collinear with the constant, so one term must be dropped; here Stata dropped level 5.
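
The collinearity can be written out explicitly. In this sketch, d_{ki} is the indicator that observation i has rep78 equal to k, mu_k is the mean of mpg within level k, and beta_k is the coefficient on level k; this notation is introduced here for illustration and does not appear in the Stata output.

```latex
\[
  \sum_{k=1}^{5} d_{ki} = 1 \quad \text{for every observation } i
\]
\[
  \texttt{\_cons} = \mu_5, \qquad \beta_k = \mu_k - \mu_5 \quad (k = 1, \dots, 4)
\]
```

The first line says that the five indicators add up to the constant. The second gives the parameterization that results when level 5 is dropped. The output above agrees: _cons is 27.36364, the level 5 coefficient from the earlier noconstant fit, and the level 2 coefficient is 19.125 - 27.36364 = -8.23864.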

To obtain a cell means model, in which each coefficient is the mean of the dependent variable within one level, use both the ibn. operator on your categorical variable and the noconstant option on your estimation command.

. regress mpg ibn.rep78, noconstant

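      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(5, 64)        =    227.47
       Model |  31824.2129         5  6364.84258   Prob > F        =    0.0000
    Residual |  1790.78712        64  27.9810488   R-squared       =    0.9467
-------------+----------------------------------   Adj R-squared   =    0.9426
       Total |       33615        69  487.173913   Root MSE        =    5.2897

------------------------------------------------------------------------------
         mpg | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       rep78 |
          1  |         21   3.740391     5.61   0.000     13.52771    28.47229
          2  |     19.125   1.870195    10.23   0.000     15.38886    22.86114
          3  |   19.43333   .9657648    20.12   0.000       17.504    21.36267
          4  |   21.66667   1.246797    17.38   0.000      19.1759    24.15743
          5  |   27.36364   1.594908    17.16   0.000     24.17744    30.54983
------------------------------------------------------------------------------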
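
As a quick check, the coefficients can be compared with the raw group means, and the comparisons reported by the model with a constant can be recovered from the cell means fit. The following is a minimal sketch in do-file form (no ". " prompt), using only standard postestimation commands.

```stata
* Sketch: check the cell means model against the raw group means
sysuse auto, clear

* Mean of mpg within each level of rep78
tabulate rep78, summarize(mpg)

* Refit the cell means model without displaying the table
quietly regress mpg ibn.rep78, noconstant

* Difference between the level 1 and level 5 means
lincom 1.rep78 - 5.rep78

* Joint test that all five level means are equal
test 1.rep78 = 2.rep78 = 3.rep78 = 4.rep78 = 5.rep78
```

The group means from tabulate should match the five coefficients. The lincom estimate should reproduce the level 1 coefficient (-6.363636) from the model that kept the constant, and the joint test should reproduce that model's F(4, 64) = 4.91, because both fits describe the same five group means.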