
Factor-variable notation is a collection of prefixes and operators that allows us to specify regression models quickly and easily. We can distinguish between continuous and categorical variables, select reference categories, specify interactions between variables, and include polynomials of continuous variables. And factor-variable notation works with nearly all of Stata's regression commands, such as regress, probit, logit, and poisson.

Let's begin by opening the nhanes2l dataset. Then let's describe and summarize the variables bpsystol, age, bmi, diabetes, and hlthstat.

. webuse nhanes2l
(Second National Health and Nutrition Examination Survey)

. describe bpsystol hlthstat diabetes age bmi

Variable name   Storage type   Display format   Value label   Variable label
bpsystol        int            %9.0g                          Systolic blood pressure
hlthstat        byte           %20.0g           hlth          Health status
diabetes        byte           %12.0g           diabetes      Diabetes status
age             byte           %9.0g                          Age (years)
bmi             float          %9.0g                          Body mass index (BMI)

. summarize bpsystol hlthstat diabetes age bmi
Variable Obs Mean Std. dev. Min Max
bpsystol 10,351 130.8817 23.33265 65 300
hlthstat 10,335 2.586164 1.206196 1 5
diabetes 10,349 .0482172 .2142353 0 1
age 10,351 47.57965 17.21483 20 74
bmi 10,351 25.5376 4.914969 12.3856 61.1297

We are going to fit a series of linear regression models for the outcome variable bpsystol, which measures systolic blood pressure with a range of 65 to 300 mmHg. hlthstat measures health status on a scale from 1 to 5. diabetes is a binary indicator of diabetes status coded 0 or 1. age measures age with a range of 20 to 74 years. And bmi measures body mass index with a range of 12.4 to 61.1 kg/m2.

Factor-variable notation for categorical variables

Let's begin with a model including the predictor variable hlthstat. We suspect that hlthstat is a categorical variable because its description shows a value label named “hlth” and its summary has a minimum value of 1 and a maximum value of 5. Let's use label list to view the category labels.

. label list hlth
hlth:
           1 Excellent
           2 Very good
           3 Good
           4 Fair
           5 Poor
          .a Blank but applicable

hlthstat has five categories labeled Excellent, Very good, Good, Fair, and Poor. Stata's regression commands treat predictor variables as continuous by default, so we need to create indicator variables for each category of hlthstat. We could do this manually, but it is easier to use the “i.” prefix. The “i.” prefix is factor-variable notation that tells Stata a variable is categorical, and Stata will create temporary indicator variables for us automatically. Let's type list hlthstat i.hlthstat in 1/10 to see how it works for the first 10 observations.

. list hlthstat i.hlthstat in 1/10

hlthstat   1.hlthstat   2.hlthstat   3.hlthstat   4.hlthstat   5.hlthstat
1. Very good 0 1 0 0 0
2. Very good 0 1 0 0 0
3. Good 0 0 1 0 0
4. Fair 0 0 0 1 0
5. Very good 0 1 0 0 0
6. Poor 0 0 0 0 1
7. Very good 0 1 0 0 0
8. Excellent 1 0 0 0 0
9. Very good 0 1 0 0 0
10. Poor 0 0 0 0 1

The first column lists the value of hlthstat for the first 10 observations in our dataset. The next five columns, named 1.hlthstat through 5.hlthstat, are temporary indicator variables that Stata created for us. Category 1 in hlthstat is labeled “Excellent”, so the indicator variable 1.hlthstat will equal 1 when hlthstat equals “Excellent” and 0 otherwise. Category 2 in hlthstat is labeled “Very good”, so the indicator variable 2.hlthstat will equal 1 when hlthstat equals “Very good” and 0 otherwise. The indicator variables 3.hlthstat, 4.hlthstat, and 5.hlthstat follow the same pattern for “Good”, “Fair”, and “Poor”, respectively. Note that the indicator variables do not remain in the dataset after the command finishes running.
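
If we wanted indicator variables that remain in the dataset, we could create them manually. Here is a minimal sketch using generate; the variable names exc through poor are hypothetical choices, and the if qualifier keeps the indicators missing when hlthstat is missing.

. generate byte exc   = (hlthstat == 1) if !missing(hlthstat)
. generate byte vgood = (hlthstat == 2) if !missing(hlthstat)
. generate byte good  = (hlthstat == 3) if !missing(hlthstat)
. generate byte fair  = (hlthstat == 4) if !missing(hlthstat)
. generate byte poor  = (hlthstat == 5) if !missing(hlthstat)

The “i.” prefix saves us this work and, just as important, tells Stata and its postestimation commands that the variable is categorical.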

We can use the “i.” prefix with regress to treat hlthstat as a categorical predictor variable.

. regress bpsystol i.hlthstat

Source SS df MS Number of obs = 10,335
F(4, 10330) = 158.34
Model 325244.686 4 81311.1715 Prob > F = 0.0000
Residual 5304728.67 10,330 513.526492 R-squared = 0.0578
Adj R-squared = 0.0574
Total 5629973.35 10,334 544.800982 Root MSE = 22.661
bpsystol Coefficient Std. err. t P>|t| [95% conf. interval]
hlthstat
Very good 2.981587 .6415165 4.65 0.000 1.72409 4.239083
Good 8.034913 .6230047 12.90 0.000 6.813703 9.256123
Fair 14.71925 .721698 20.40 0.000 13.30459 16.13392
Poor 16.42304 .9580047 17.14 0.000 14.54517 18.30092
_cons 124.3191 .4618951 269.15 0.000 123.4137 125.2245

The output includes a coefficient for the intercept, labeled “_cons”, as well as coefficients for “Very good”, “Good”, “Fair”, and “Poor”. The “Excellent” category was automatically omitted from the model and used as the comparison group, called the “reference category”. By default, Stata selects the category with the smallest value as the reference category, and in a model with a single categorical predictor the intercept equals the mean of the outcome in that category. So the mean systolic blood pressure for the “Excellent” category is 124.3 mmHg. The coefficient for each remaining category is the difference between the mean outcome in that category and the mean outcome in the reference category. For example, the coefficient for the “Poor” group is 16.4, so the mean systolic blood pressure in the “Poor” group is 16.4 mmHg higher than in the “Excellent” group.
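
If we want to see the estimated mean systolic blood pressure in every category, including the reference category, one option is to follow the regression with the margins command. The sketch below assumes the model above has just been fit; output is omitted here.

. margins hlthstat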

We can select a different reference category using the “ib(#).” prefix, where “#” is the category number for the reference category. Let's use hlthstat category 5, “Poor”, as the reference category.

. regress bpsystol ib(5).hlthstat

Source SS df MS Number of obs = 10,335
F(4, 10330) = 158.34
Model 325244.686 4 81311.1715 Prob > F = 0.0000
Residual 5304728.67 10,330 513.526492 R-squared = 0.0578
Adj R-squared = 0.0574
Total 5629973.35 10,334 544.800982 Root MSE = 22.661
bpsystol Coefficient Std. err. t P>|t| [95% conf. interval]
hlthstat
Excellent -16.42304 .9580047 -17.14 0.000 -18.30092 -14.54517
Very good -13.44146 .9500643 -14.15 0.000 -15.30377 -11.57915
Good -8.38813 .937664 -8.95 0.000 -10.22613 -6.550127
Fair -1.703789 1.005946 -1.69 0.090 -3.675638 .2680593
_cons 140.7421 .8393008 167.69 0.000 139.0969 142.3873

The “Poor” category is now omitted from the output and “Excellent” is included. The coefficient for _cons, 140.7, is now the mean systolic blood pressure in the “Poor” group, and the mean systolic blood pressure in the “Excellent” group is 16.4 mmHg lower than the “Poor” group.

We can also use the prefix “ib(frequent).” to select the category with the largest sample size as the reference category. We can type tabulate hlthstat to verify that the “Good” category has the largest sample size.

. tabulate hlthstat

Health status Freq. Percent Cum.
Excellent 2,407 23.29 23.29
Very good 2,591 25.07 48.36
Good 2,938 28.43 76.79
Fair 1,670 16.16 92.95
Poor 729 7.05 100.00
Total 10,335 100.00

. regress bpsystol ib(frequent).hlthstat

Source SS df MS Number of obs = 10,335
F(4, 10330) = 158.34
Model 325244.686 4 81311.1715 Prob > F = 0.0000
Residual 5304728.67 10,330 513.526492 R-squared = 0.0578
Adj R-squared = 0.0574
Total 5629973.35 10,334 544.800982 Root MSE = 22.661
bpsystol Coefficient Std. err. t P>|t| [95% conf. interval]
hlthstat
Excellent -8.034913 .6230047 -12.90 0.000 -9.256123 -6.813703
Very good -5.053326 .6107242 -8.27 0.000 -6.250464 -3.856189
Fair 6.684341 .6944701 9.63 0.000 5.323045 8.045637
Poor 8.38813 .937664 8.95 0.000 6.550127 10.22613
_cons 132.354 .4180763 316.58 0.000 131.5345 133.1735

We can also use the prefix “ib(none).” to tell Stata not to select a reference category. Combined with the noconstant option, this will display the mean outcome for each category.

. regress bpsystol ib(none).hlthstat, noconstant

Source SS df MS Number of obs = 10,335
F(5, 10330) = 69083.04
Model 177379866 5 35475973.3 Prob > F = 0.0000
Residual 5304728.67 10,330 513.526492 R-squared = 0.9710
Adj R-squared = 0.9709
Total 182684595 10,335 17676.3033 Root MSE = 22.661
bpsystol Coefficient Std. err. t P>|t| [95% conf. interval]
hlthstat
Excellent 124.3191 .4618951 269.15 0.000 123.4137 125.2245
Very good 127.3007 .4451924 285.95 0.000 126.428 128.1733
Good 132.354 .4180763 316.58 0.000 131.5345 133.1735
Fair 139.0383 .5545276 250.73 0.000 137.9513 140.1253
Poor 140.7421 .8393008 167.69 0.000 139.0969 142.3873

The output tells us that the mean systolic blood pressure in the “Excellent” category is 124.3 and the mean systolic blood pressure in the “Poor” group is 140.7.

Factor-variable notation for binary variables

Binary variables are simply categorical variables with two categories, so everything we discussed above applies to them as well. Binary variables are often coded as 0/1 indicator variables, but you should still use the “i.” prefix if you plan to use postestimation commands, such as margins, after you fit a regression model. Let's look at a few quick examples in the interest of completeness.

Here is a model that includes diabetes as a binary predictor variable.

. regress bpsystol i.diabetes

Source SS df MS Number of obs = 10,349
F(1, 10347) = 244.99
Model 130296.034 1 130296.034 Prob > F = 0.0000
Residual 5502984.01 10,347 531.843434 R-squared = 0.0231
Adj R-squared = 0.0230
Total 5633280.05 10,348 544.38346 Root MSE = 23.062
bpsystol Coefficient Std. err. t P>|t| [95% conf. interval]
diabetes
Diabetic 16.56328 1.058212 15.65 0.000 14.48898 18.63758
_cons 130.088 .2323666 559.84 0.000 129.6325 130.5435

Let's use factor-variable notation to select people with diabetes as the reference category.

. regress bpsystol ib(1).diabetes

Source SS df MS Number of obs = 10,349
F(1, 10347) = 244.99
Model 130296.034 1 130296.034 Prob > F = 0.0000
Residual 5502984.01 10,347 531.843434 R-squared = 0.0231
Adj R-squared = 0.0230
Total 5633280.05 10,348 544.38346 Root MSE = 23.062
bpsystol Coefficient Std. err. t P>|t| [95% conf. interval]
diabetes
Not diabetic -16.56328 1.058212 -15.65 0.000 -18.63758 -14.48898
_cons 146.6513 1.032385 142.05 0.000 144.6276 148.675

Let's fit a model with no intercept and no reference category.

. regress bpsystol ib(none).diabetes, noconstant

Source SS df MS Number of obs = 10,349
F(2, 10347) > 99999.00
Model 177422292 2 88711146 Prob > F = 0.0000
Residual 5502984.01 10,347 531.843434 R-squared = 0.9699
Adj R-squared = 0.9699
Total 182925276 10,349 17675.6475 Root MSE = 23.062
bpsystol Coefficient Std. err. t P>|t| [95% conf. interval]
diabetes
Not diabetic 130.088 .2323666 559.84 0.000 129.6325 130.5435
Diabetic 146.6513 1.032385 142.05 0.000 144.6276 148.675
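
Because we used the “i.” prefix, postestimation commands such as margins recognize that diabetes is categorical. As a quick sketch (output omitted), we could refit the first model and then estimate the mean systolic blood pressure in each group and the difference between the groups:

. regress bpsystol i.diabetes
. margins diabetes
. margins, dydx(diabetes)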

Factor-variable notation for continuous variables

Stata's regression commands treat predictor variables as continuous by default. But you can use the “c.” prefix to tell Stata explicitly that a predictor variable should be treated as continuous. This will be necessary when you include continuous variables in interactions with other variables.

Here is a quick example treating age as a continuous predictor variable.

. regress bpsystol c.age

Source SS df MS Number of obs = 10,351
F(1, 10349) = 3116.79
Model 1304200.02 1 1304200.02 Prob > F = 0.0000
Residual 4330470.01 10,349 418.443328 R-squared = 0.2315
Adj R-squared = 0.2314
Total 5634670.03 10,350 544.412563 Root MSE = 20.456
bpsystol Coefficient Std. err. t P>|t| [95% conf. interval]
age .6520775 .0116801 55.83 0.000 .6291823 .6749727
_cons 99.85603 .5909867 168.96 0.000 98.69758 101.0145
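
Because continuous is the default, typing the command without the prefix gives the same estimates; the “c.” prefix simply makes our intent explicit.

. regress bpsystol age

The prefix becomes essential once age appears in an interaction, which we turn to next.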

Factor-variable notation for interactions

Factor-variable notation also includes two interaction operators. The “#” operator specifies an interaction between two variables, and the “##” operator specifies both the main effects and the interaction of two variables.

Let's fit a model that includes the main effects for hlthstat and diabetes and use the “#” operator to include their interaction.

. regress bpsystol i.hlthstat i.diabetes i.hlthstat#i.diabetes

Source SS df MS Number of obs = 10,335
F(9, 10325) = 86.92
Model 396524.045 9 44058.2272 Prob > F = 0.0000
Residual 5233449.31 10,325 506.871604 R-squared = 0.0704
Adj R-squared = 0.0696
Total 5629973.35 10,334 544.800982 Root MSE = 22.514
bpsystol Coefficient Std. err. t P>|t| [95% conf. interval]
hlthstat
Very good 2.636051 .6417076 4.11 0.000 1.37818 3.893922
Good 7.648725 .6272209 12.19 0.000 6.419251 8.8782
Fair 13.50647 .7408272 18.23 0.000 12.0543 14.95863
Poor 14.77223 1.032484 14.31 0.000 12.74837 16.7961
diabetes
Diabetic 5.780232 4.618696 1.25 0.211 -3.273308 14.83377
hlthstat#
diabetes
Very good #
Diabetic 17.43339 5.726714 3.04 0.002 6.207924 28.65886
Good #
Diabetic 4.023894 5.032308 0.80 0.424 -5.840404 13.88819
Fair #
Diabetic 7.316062 4.97969 1.47 0.142 -2.445096 17.07722
Poor #
Diabetic 3.445358 5.09316 0.68 0.499 -6.538222 13.42894
_cons 124.2614 .4611975 269.43 0.000 123.3574 125.1655

We could fit the same model using the “##” operator.

. regress bpsystol i.hlthstat##i.diabetes

Source SS df MS Number of obs = 10,335
F(9, 10325) = 86.92
Model 396524.045 9 44058.2272 Prob > F = 0.0000
Residual 5233449.31 10,325 506.871604 R-squared = 0.0704
Adj R-squared = 0.0696
Total 5629973.35 10,334 544.800982 Root MSE = 22.514
bpsystol Coefficient Std. err. t P>|t| [95% conf. interval]
hlthstat
Very good 2.636051 .6417076 4.11 0.000 1.37818 3.893922
Good 7.648725 .6272209 12.19 0.000 6.419251 8.8782
Fair 13.50647 .7408272 18.23 0.000 12.0543 14.95863
Poor 14.77223 1.032484 14.31 0.000 12.74837 16.7961
diabetes
Diabetic 5.780232 4.618696 1.25 0.211 -3.273308 14.83377
hlthstat#
diabetes
Very good #
Diabetic 17.43339 5.726714 3.04 0.002 6.207924 28.65886
Good #
Diabetic 4.023894 5.032308 0.80 0.424 -5.840404 13.88819
Fair #
Diabetic 7.316062 4.97969 1.47 0.142 -2.445096 17.07722
Poor #
Diabetic 3.445358 5.09316 0.68 0.499 -6.538222 13.42894
_cons 124.2614 .4611975 269.43 0.000 123.3574 125.1655
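
If we want the estimated mean systolic blood pressure for every combination of health status and diabetes status, we could follow either version of the model with margins. A minimal sketch (output omitted):

. margins hlthstat#diabetes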

We can include interactions with continuous variables too.

. regress bpsystol i.diabetes##c.age

Source SS df MS Number of obs = 10,349
F(3, 10345) = 1071.05
Model 1335031.79 3 445010.595 Prob > F = 0.0000
Residual 4298248.26 10,345 415.490407 R-squared = 0.2370
Adj R-squared = 0.2368
Total 5633280.05 10,348 544.38346 Root MSE = 20.384
bpsystol Coefficient Std. err. t P>|t| [95% conf. interval]
diabetes
Diabetic -5.669005 4.952369 -1.14 0.252 -15.37661 4.038595
age .6303981 .0119464 52.77 0.000 .6069808 .6538154
diabetes#
c.age
Diabetic .2233087 .0804934 2.77 0.006 .065526 .3810913
_cons 100.5111 .5969456 168.38 0.000 99.34096 101.6812
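
The interaction means that the effect of diabetes on systolic blood pressure changes with age. One way to see this is to estimate the mean for each diabetes group at a few ages; the ages 30, 50, and 70 below are arbitrary choices for this sketch (output omitted).

. margins diabetes, at(age=(30 50 70))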

We can even include three-way and higher-order interactions using the “#” and “##” operators.

. regress bpsystol i.hlthstat##i.diabetes##c.age

Source SS df MS Number of obs = 10,335
F(19, 10315) = 173.56
Model 1363865.23 19 71782.3807 Prob > F = 0.0000
Residual 4266108.12 10,315 413.582949 R-squared = 0.2423
Adj R-squared = 0.2409
Total 5629973.35 10,334 544.800982 Root MSE = 20.337
bpsystol Coefficient Std. err. t P>|t| [95% conf. interval]
hlthstat
Very good -.2522701 1.571793 -0.16 0.872 -3.333289 2.828748
Good -1.269239 1.640212 -0.77 0.439 -4.484373 1.945895
Fair -1.892737 2.323042 -0.81 0.415 -6.446351 2.660877
Poor -1.470403 4.440142 -0.33 0.741 -10.17394 7.233137
diabetes
Diabetic 5.648359 16.10149 0.35 0.726 -25.91369 37.21041
hlthstat#
diabetes
Very good #
Diabetic .6634293 26.12969 0.03 0.980 -50.55583 51.88269
Good #
Diabetic -16.56507 18.00713 -0.92 0.358 -51.86255 18.7324
Fair #
Diabetic -7.761426 18.83079 -0.41 0.680 -44.67343 29.15058
Poor #
Diabetic -5.055061 20.09251 -0.25 0.801 -44.44028 34.33016
age .5505586 .0261998 21.01 0.000 .499202 .6019153
hlthstat#
c.age
Very good .026618 .0352546 0.76 0.450 -.0424879 .0957239
Good .084684 .0349617 2.42 0.015 .0161522 .1532157
Fair .1210264 .0438944 2.76 0.006 .0349849 .2070679
Poor .0900039 .0752338 1.20 0.232 -.057469 .2374768
diabetes#
c.age
Diabetic -.1428421 .2867743 -0.50 0.618 -.7049754 .4192913
hlthstat#
diabetes#
c.age
Very good #
Diabetic .2297988 .4324672 0.53 0.595 -.6179209 1.077518
Good #
Diabetic .3910658 .316956 1.23 0.217 -.2302295 1.012361
Fair #
Diabetic .3139083 .3258971 0.96 0.335 -.3249132 .9527298
Poor #
Diabetic .26957 .3465917 0.78 0.437 -.409817 .948957
_cons 102.2407 1.127687 90.66 0.000 100.0302 104.4512

We have already learned that Stata treats predictor variables as continuous by default. But the opposite is true with interaction operators: both “#” and “##” treat variables as categorical predictors if you do not specify a prefix. So typing hlthstat##diabetes would work. But typing diabetes##age would make a mess, because age would be treated as a categorical variable and Stata would create an indicator variable for every distinct value of age. When in doubt, use the “i.” and “c.” prefixes to avoid mistakes.
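
For example, the sketch below contrasts the two (output omitted). The first command would expand age into a separate indicator variable for each observed age; the second treats age as continuous and matches the model we fit above.

. regress bpsystol diabetes##age
. regress bpsystol i.diabetes##c.age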

The prefixes also have a “distributive property” when used with parentheses. The syntax below treats hlthstat and diabetes as categorical predictors and fits a model that includes their main effects as well as their interactions with age. Note that the model will not include the interaction of hlthstat and diabetes.

. regress bpsystol i.(hlthstat diabetes)##c.age

Source SS df MS Number of obs = 10,335
F(11, 10323) = 298.72
Model 1359359.05 11 123578.096 Prob > F = 0.0000
Residual 4270614.3 10,323 413.698954 R-squared = 0.2415
Adj R-squared = 0.2406
Total 5629973.35 10,334 544.800982 Root MSE = 20.34
bpsystol Coefficient Std. err. t P>|t| [95% conf. interval]
hlthstat
Very good -.5801787 1.56339 -0.37 0.711 -3.644726 2.484369
Good -1.453802 1.627043 -0.89 0.372 -4.643121 1.735517
Fair -2.078403 2.286625 -0.91 0.363 -6.56063 2.403824
Poor -.9296666 4.211361 -0.22 0.825 -9.18475 7.325417
diabetes
Diabetic -5.664698 5.022147 -1.13 0.259 -15.50908 4.179683
age .5433911 .0259983 20.90 0.000 .4924295 .5943527
hlthstat#
c.age
Very good .0382409 .034885 1.10 0.273 -.0301404 .1066222
Good .0887067 .0345224 2.57 0.010 .021036 .1563773
Fair .1300174 .0430386 3.02 0.003 .0456535 .2143813
Poor .0888559 .0713922 1.24 0.213 -.0510867 .2287985
diabetes#
c.age
Diabetic .2067666 .0816404 2.53 0.011 .0467356 .3667976
_cons 102.4518 1.122841 91.24 0.000 100.2508 104.6528
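
For reference, the distributed syntax above should be equivalent to spelling out each term, as in this sketch:

. regress bpsystol i.hlthstat i.diabetes c.age i.hlthstat#c.age i.diabetes#c.age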

Factor-variable notation for polynomials

We can also use the “#” and “##” operators to specify polynomial terms for continuous variables. For example, we may wish to fit a model that includes both age and the square of age. We can do this by interacting age with itself.

. regress bpsystol c.age##c.age

Source SS df MS Number of obs = 10,351
F(2, 10348) = 1592.42
Model 1326071.99 2 663035.995 Prob > F = 0.0000
Residual 4308598.04 10,348 416.370123 R-squared = 0.2353
Adj R-squared = 0.2352
Total 5634670.03 10,350 544.412563 Root MSE = 20.405
bpsystol Coefficient Std. err. t P>|t| [95% conf. interval]
age .0345687 .0859928 0.40 0.688 -.1339939 .2031312
c.age#c.age .0066366 .0009157 7.25 0.000 .0048417 .0084315
_cons 112.2463 1.808325 62.07 0.000 108.7017 115.791
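
Before factor-variable notation, we might have created the squared term by hand, as in the sketch below (agesq is a hypothetical variable name). The factor-variable version is preferred because Stata then knows that both terms are functions of age, which matters for postestimation commands such as margins.

. generate agesq = age^2
. regress bpsystol age agesq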

We could include a term for age cubed.

. regress bpsystol c.age##c.age##c.age

Source SS df MS Number of obs = 10,351
F(3, 10347) = 1065.37
Model 1329759.5 3 443253.167 Prob > F = 0.0000
Residual 4304910.52 10,347 416.053979 R-squared = 0.2360
Adj R-squared = 0.2358
Total 5634670.03 10,350 544.412563 Root MSE = 20.397
bpsystol Coefficient Std. err. t P>|t| [95% conf. interval]
age -1.107037 .3929805 -2.82 0.005 -1.877355 -.3367196
c.age#c.age .0329455 .0088844 3.71 0.000 .0155303 .0503607
c.age#c.age#
c.age -.0001879 .0000631 -2.98 0.003 -.0003116 -.0000642
_cons 112.2463 1.808325 62.07 0.000 108.7017 115.791

We can also include the square of age when we include an interaction of age with another variable.

. regress bpsystol i.diabetes##c.age c.age#c.age

Source SS df MS Number of obs = 10,349
F(4, 10344) = 817.53
Model 1353111.75 4 338277.939 Prob > F = 0.0000
Residual 4280168.29 10,344 413.782704 R-squared = 0.2402
Adj R-squared = 0.2399
Total 5633280.05 10,348 544.38346 Root MSE = 20.342
bpsystol Coefficient Std. err. t P>|t| [95% conf. interval]
diabetes
Diabetic -.8886553 4.994811 -0.18 0.859 -10.67945 8.902141
age .0640567 .0865028 0.74 0.459 -.1055054 .2336188
diabetes#
c.age
Diabetic .1403559 .0813022 1.73 0.084 -.0190122 .2997239
c.age#c.age .0061116 .0009246 6.61 0.000 .0042992 .0079239
_cons 111.823 1.812009 61.71 0.000 108.2711 115.3749

You can read more about factor-variable notation in the Stata documentation listed below. You can also watch demonstrations of these commands in Stata's YouTube videos.

Tell me more

Read more in the Stata Base Reference Manual; see [R] regress. And in the Stata User’s Guide, see [U] 11 Factor variables.