How do I create dummy variables?

Title   Creating dummy variables
Author William Gould, StataCorp

A dummy variable is a variable that takes on the values 1 and 0; 1 means something is true (such as age < 25, sex is male, or in the category “very much”), and 0 means it is false.

Dummy variables are also called indicator variables.

As we will see shortly, in most cases, if you use factor-variable notation, you do not need to create dummy variables.

In cases where factor variables are not the answer, you may use generate to create one dummy variable at a time and tabulate to create a set of dummies at once.

Using factor variables instead of generating dummy variables

I have a discrete variable, size, that takes on values from 0 to 4:

 . tabulate size

       size |      Freq.     Percent        Cum.
------------+-----------------------------------
  miniature |         19       19.00       19.00
      small |         37       37.00       56.00
     normal |         30       30.00       86.00
      large |         12       12.00       98.00
       huge |          2        2.00      100.00
------------+-----------------------------------
      Total |        100      100.00

If I want a dummy for all levels of size except for a comparison group or base level, I do not need to create 4 dummies. Using [U] factor variables, I may type

        . summarize i.size

or use an estimator

        . regress y x i.size

If I want to use a dummy that is 1 if size is large (size==3) and 0 otherwise, I type

        . regress y x 3.size

If I want to make the comparison group, or base level, of size be size==3 instead of the default size==0, I type

        . regress y x ib3.size
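
As a hedged sketch of the base-level syntax, the do-file fragment below uses the auto dataset that ships with Stata instead of the y, x, and size variables above; price, mpg, and rep78 are stand-ins chosen only for illustration. It fits the same model with the default base level and with rep78==3 as the base. Only the comparison group changes, so the overall fit should be the same in both runs.

```stata
* Illustrative stand-ins: price for y, mpg for x, rep78 for size
sysuse auto, clear

* Default base level (the lowest value of rep78)
regress price mpg i.rep78
scalar r2_default = e(r2)

* Same model with rep78==3 as the base level
regress price mpg ib3.rep78
scalar r2_base3 = e(r2)

* Coefficients are reported relative to a different group,
* but R-squared is unchanged
display r2_default "  " r2_base3
```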

You can also use factor-variable notation to refer to categorical variables, their interactions, or interactions between categorical and continuous variables.

For example, I can specify the interaction of each level of size (except the base level) and the continuous variable x by typing

        . regress y x i.size#c.x

The c. instructs Stata that variable x is continuous.
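
A minimal sketch of the interaction notation, again assuming the auto dataset as a stand-in (foreign plays the role of the categorical variable and mpg the continuous one). It also uses the ## operator, which is not shown above: ## is Stata's shorthand for the main effects plus their interaction, so the two commands below should fit the same model.

```stata
sysuse auto, clear

* Main effects and interaction written out term by term
regress price i.foreign c.mpg i.foreign#c.mpg
scalar r2_long = e(r2)

* Equivalent shorthand: ## expands to main effects plus interaction
regress price i.foreign##c.mpg
scalar r2_short = e(r2)

display r2_long "  " r2_short
```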

In all the cases above, you did not need to create a variable.

Moreover, many of Stata's postestimation facilities, including in particular the margins command, are aware of factor variables and will handle them elegantly when making computations.
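
To illustrate the point about margins, here is a short sketch under the same assumption that the auto dataset stands in for the example data. Because rep78 enters the model as a factor variable, margins knows it is categorical and can report a predictive margin for each of its levels without any hand-built dummies.

```stata
sysuse auto, clear

* rep78 enters as a factor variable; no dummies are created
regress price mpg i.rep78

* Predictive margins of price at each level of rep78
margins rep78
```

Had the model been fit with manually generated dummies instead, margins would have no way of knowing that they represent levels of one categorical variable.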

There are some instances where creating dummies might be worthwhile. We illustrate these below.

Using generate to create dummy variables

You could type

        . generate young = 0 
        . replace young = 1 if age<25

or

        . generate young = (age<25)

This statement does the same thing as the first two statements. age<25 is an expression, and Stata evaluates it, returning 1 if the expression is true and 0 if it is false.

If you have missing values in your data, it is better to type

        . generate young = 0 
        . replace young = 1 if age<25
        . replace young = . if missing(age)

or

        . generate young = (age<25) if !missing(age) 

Stata treats a missing value as positive infinity, so the expression age<25 evaluates to 0, not missing, when age is missing. (If the expression were age>25, the expression would evaluate to 1 when age is missing.)
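
The behavior described above can be checked on a tiny made-up dataset; the three ages below are invented for illustration and the variable names young_naive and old_naive are hypothetical.

```stata
clear
input age
20
30
.
end

* Without the missing-value guard, the third observation is misclassified
generate young_naive = (age < 25)
generate old_naive   = (age > 25)

* With the guard, the dummy is missing whenever age is missing
generate young = (age < 25) if !missing(age)

list
```

In the listing, the observation with missing age should show 0 for young_naive, 1 for old_naive, and a missing value for young.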

You do not have to type the parentheses around the expression.

        . generate young = age<25 if !missing(age)

is good enough. Here are some more illustrations of generating dummy variables:

        . generate male = sex==1

        . generate top = answer=="very much"

        . generate eligible = sex=="male" & (age>55 | (age>40 & enrolled)) if !missing(age)

In the above line, enrolled is itself a dummy variable—a variable taking on values zero and one. We could have typed & enrolled==1, but typing & enrolled is good enough.

Just as Stata returns 1 for true and 0 for false, Stata assumes that 1 means true and that 0 means false.
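
One caveat worth sketching: when an expression is used as a condition, Stata treats any nonzero value as true, and that includes missing values. The example below uses invented data in which enrolled is missing for one observation; the scalar names are hypothetical.

```stata
clear
input enrolled
1
0
.
end

* "if enrolled" means "if enrolled != 0", so the missing value counts as true
count if enrolled
scalar n_truthy = r(N)

* Comparing with 1 explicitly excludes the missing value
count if enrolled == 1
scalar n_one = r(N)

display n_truthy "  " n_one
```

So the shorthand & enrolled is safe only when the dummy contains no missing values; otherwise & enrolled==1 is the more careful form.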

Using tabulate to create dummy variables

tabulate with the generate() option will generate whole sets of dummy variables.

Say that variable group takes on the values 1, 2, and 3. If you type

        . tabulate group

you will see a frequency table of how many times group takes on each of those values. If you type

        . tabulate group, generate(g)

you will see the table, and tabulate will create variables named g1, g2, and g3 that take on values 1 and 0, g1 being 1 when group==1, g2 being 1 when group==2, and g3 being 1 when group==3. Watch:

 . list

      +-------+
      | group |
      |-------|
   1. |     1 |
   2. |     3 |
   3. |     2 |
   4. |     1 |
   5. |     2 |
      |-------|
   6. |     2 |
      +-------+

 . tabulate group, generate(g)
 
       group |      Freq.     Percent        Cum.
 ------------+-----------------------------------
           1 |          2       33.33       33.33
           2 |          3       50.00       83.33
           3 |          1       16.67      100.00
 ------------+-----------------------------------
       Total |          6      100.00

 . list

      +----------------------+
      | group   g1   g2   g3 |
      |----------------------|
   1. |     1    1    0    0 |
   2. |     3    0    0    1 |
   3. |     2    0    1    0 |
   4. |     1    1    0    0 |
   5. |     2    0    1    0 |
      |----------------------|
   6. |     2    0    1    0 |
      +----------------------+

What you name the variables is up to you. If we had typed

        . tabulate group, generate(res)

the new variables would have been named res1, res2, and res3.
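
The do-file sketch below re-creates the six-observation group example so that it can be run on its own, then shows two simple checks one might make on the generated dummies. The scalar name share2 is hypothetical.

```stata
clear
input group
1
3
2
1
2
2
end

tabulate group, generate(g)

* Each observation belongs to exactly one category,
* so the dummies sum to 1 in every observation
assert g1 + g2 + g3 == 1

* The mean of a dummy is the share of observations in that category
summarize g2
scalar share2 = r(mean)
display share2
```

The stub also makes the whole set easy to refer to with a wildcard, for example summarize g*.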

It is also not necessary for the variable being tabulated to take sequential values or even be integers. Here is another example:

 . list

      +------+
      |    x |
      |------|
   1. |   -1 |
   2. | 3.14 |
   3. |    8 |
   4. |   -1 |
   5. |    8 |
      +------+

 . tabulate x, generate(xval)

           x |      Freq.     Percent        Cum.
 ------------+-----------------------------------
          -1 |          2       40.00       40.00
        3.14 |          1       20.00       60.00
           8 |          2       40.00      100.00
 ------------+-----------------------------------
       Total |          5      100.00

 . list

      +------------------------------+
      |    x   xval1   xval2   xval3 |
      |------------------------------|
   1. |   -1       1       0       0 |
   2. | 3.14       0       1       0 |
   3. |    8       0       0       1 |
   4. |   -1       1       0       0 |
   5. |    8       0       0       1 |
      +------------------------------+

You can find out which value each dummy corresponds to from the variable labels shown by describe:

 . describe

 Contains data
   obs:             5                          
  vars:             4                          
  size:            55 
 ------------------------------------------------------------------------
               storage  display     value
 variable name   type   format      label      variable label
 ------------------------------------------------------------------------
 x               float  %9.0g                  
 xval1           byte   %8.0g                  x==    -1.0000
 xval2           byte   %8.0g                  x==     3.1400
 xval3           byte   %8.0g                  x==     8.0000
 ------------------------------------------------------------------------
 Sorted by:  
      Note:  dataset has changed since last saved
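
Noninteger values are one case where tabulate can be more convenient than generate. As a sketch of why, the fragment below rebuilds the x example and tries to create the dummy for 3.14 by hand; the variable names naive and exact are hypothetical. Because x is stored as a float while the literal 3.14 is evaluated in double precision, the plain comparison does not match, and the float() function is needed.

```stata
clear
input x
-1
3.14
8
-1
8
end

tabulate x, generate(xval)

* Plain comparison: float-stored 3.14 is not equal to double-precision 3.14
generate byte naive = (x == 3.14)

* Rounding the literal to float precision makes the comparison match
generate byte exact = (x == float(3.14))

list x xval2 naive exact
```

tabulate works from the values actually stored in x, so xval2 should agree with exact, not with naive.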

Finally, tabulate can be used with string variables:

 . list

      +-----------+
      |    result |
      |-----------|
   1. |      good |
   2. |       bad |
   3. |      good |
   4. | excellent |
   5. |       bad |
      +-----------+

 . tabulate result, generate(res)
 
          result |      Freq.     Percent        Cum.
 ----------------+-----------------------------------
             bad |          2       40.00       40.00
       excellent |          1       20.00       60.00
            good |          2       40.00      100.00
 ----------------+-----------------------------------
           Total |          5      100.00

 . describe

 Contains data
   obs:             5                          
  vars:             4                          
  size:           110 (99.9% of memory free)
 ------------------------------------------------------------------------
               storage  display     value
 variable name   type   format      label      variable label
 ------------------------------------------------------------------------
 result          str15  %15s                   
 res1            byte   %8.0g                  result==bad
 res2            byte   %8.0g                  result==excellent
 res3            byte   %8.0g                  result==good
 ------------------------------------------------------------------------
 Sorted by:  
      Note:  dataset has changed since last saved
 
 . list

      +--------------------------------+
      |    result   res1   res2   res3 |
      |--------------------------------|
   1. |      good      0      0      1 |
   2. |       bad      1      0      0 |
   3. |      good      0      0      1 |
   4. | excellent      0      1      0 |
   5. |       bad      1      0      0 |
      +--------------------------------+
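
Factor-variable notation requires a numeric variable, so i.result will not work on a string. A possible alternative to creating dummies, sketched below on the same five observations, is to convert the string with encode and then use factor-variable notation on the result. The name result_n is hypothetical.

```stata
clear
input str15 result
"good"
"bad"
"good"
"excellent"
"bad"
end

* encode assigns integer codes 1, 2, 3 in alphabetical order
* and attaches the original strings as value labels
encode result, generate(result_n)

* The dummies from tabulate line up with the encoded levels
tabulate result, generate(res)
assert res1 == (result_n == 1)
assert res2 == (result_n == 2)
assert res3 == (result_n == 3)

* The encoded variable can be used with factor-variable notation
summarize i.result_n
```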