How do I create dummy variables?

Title   Creating dummy variables
Author William Gould, StataCorp
Date March 1997; updated July 2011

A dummy variable is a variable that takes on the values 1 and 0; 1 means something is true (such as age < 25, sex is male, or the answer is in the category “very much”), and 0 means it is false.

Dummy variables are also called indicator variables.

There are three ways to create dummy variables: one is to use generate, which creates one dummy variable at a time; another is to use tabulate, which creates whole sets of dummies at once; and the third is to use factor variables, which may allow you to avoid creating dummies altogether.

Answer 1 of 3: Use generate

You could type

        . gen young = 0 
        . replace young = 1 if age<25

or

        . gen young = (age<25)

This statement does the same thing as the first two statements. age<25 is an expression, and Stata evaluates it, returning 1 if the expression is true and 0 if it is false.

If you have missing values in your data, it would be better to type

        . gen young = 0 
        . replace young = 1 if age<25
        . replace young = . if missing(age)

or

        . gen young = (age<25) if !missing(age) 

Stata treats a missing value as positive infinity, so the expression age<25 evaluates to 0, not missing, when age is missing. (If the expression were age>25, the expression would evaluate to 1 when age is missing.)
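
To see the difference, here is a minimal sketch on a small made-up dataset (the three ages are illustrative only, and the variable name young_naive is invented for the comparison):

```stata
clear
input age
18
30
.
end

* Without the guard, the missing age is coded 0, that is, "not young"
generate young_naive = (age<25)

* With the guard, the observation with missing age stays missing
generate young = (age<25) if !missing(age)

list
```

In the third observation, young_naive should be 0 while young is missing, which is usually what you want when the age is unknown.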

You do not have to type the parentheses around the expression.

        . gen young = age<25 if !missing(age)

is good enough. Here are some more illustrations of generating dummy variables:

        . gen male = sex==1

        . gen top = answer=="very much"

        . gen eligible = sex=="male" & (age>55 | (age>40 & enrolled))

In the above line, enrolled is itself a dummy variable—a variable taking on values zero and one. We could have typed & enrolled==1, but typing & enrolled is good enough.

Just as Stata returns 1 for true and 0 for false, Stata assumes that 1 means true and that 0 means false.
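
One caveat worth checking in your own data: when a variable is used by itself as a condition, Stata treats any nonzero value as true, and missing counts as nonzero. The sketch below uses made-up values of enrolled to show where the two forms differ:

```stata
clear
input enrolled
1
0
.
end

* "if enrolled" is true for any nonzero value, including missing
count if enrolled

* "if enrolled==1" is true only when enrolled is exactly 1
count if enrolled==1
```

The first count includes the observation with missing enrolled; the second does not. If enrolled might contain missing values, typing & enrolled==1 is the safer choice.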

Answer 2 of 3: Use tabulate

tabulate with the generate() option will generate whole sets of dummy variables.

Say that variable group takes on the values 1, 2, and 3. If you type

        . tabulate group

you will see a frequency table of how many times group takes on each of those values. If you type

        . tabulate group, gen(g)

you will see the table, and tabulate will create variables named g1, g2, and g3 that take on values 1 and 0, g1 being 1 when group==1, g2 being 1 when group==2, and g3 being 1 when group==3. Watch:

 . list

      +-------+
      | group |
      |-------|
   1. |     1 |
   2. |     3 |
   3. |     2 |
   4. |     1 |
   5. |     2 |
      |-------|
   6. |     2 |
      +-------+

 . tabulate group, gen(g)
 
       group |      Freq.     Percent        Cum.
 ------------+-----------------------------------
           1 |          2       33.33       33.33
           2 |          3       50.00       83.33
           3 |          1       16.67      100.00
 ------------+-----------------------------------
       Total |          6      100.00

 . list

      +----------------------+
      | group   g1   g2   g3 |
      |----------------------|
   1. |     1    1    0    0 |
   2. |     3    0    0    1 |
   3. |     2    0    1    0 |
   4. |     1    1    0    0 |
   5. |     2    0    1    0 |
      |----------------------|
   6. |     2    0    1    0 |
      +----------------------+

What you name the variables is up to you; the name given in gen() is used as a prefix. If we had typed

        . tabulate group, gen(res)

the new variables would have been named res1, res2, and res3.

It is also not necessary for the variable being tabulated to take sequential values or even be integers. Here is another example:

 . list

      +------+
      |    x |
      |------|
   1. |   -1 |
   2. | 3.14 |
   3. |    8 |
   4. |   -1 |
   5. |    8 |
      +------+

 . tab x, gen(xval)

           x |      Freq.     Percent        Cum.
 ------------+-----------------------------------
          -1 |          2       40.00       40.00
        3.14 |          1       20.00       60.00
           8 |          2       40.00      100.00
 ------------+-----------------------------------
       Total |          5      100.00

 . list

      +------------------------------+
      |    x   xval1   xval2   xval3 |
      |------------------------------|
   1. |   -1       1       0       0 |
   2. | 3.14       0       1       0 |
   3. |    8       0       0       1 |
   4. |   -1       1       0       0 |
   5. |    8       0       0       1 |
      +------------------------------+

You can find out which value each dummy corresponds to from the variable labels shown by describe:

 . describe

 Contains data
   obs:             5                          
  vars:             4                          
  size:            55 (99.9% of memory free)
 ------------------------------------------------------------------------
               storage  display     value
 variable name   type   format      label      variable label
 ------------------------------------------------------------------------
 x               float  %9.0g                  
 xval1           byte   %8.0g                  x==    -1.0000
 xval2           byte   %8.0g                  x==     3.1400
 xval3           byte   %8.0g                  x==     8.0000
 ------------------------------------------------------------------------
 Sorted by:  
      Note:  dataset has changed since last saved

Finally, tabulate can be used with string variables:

 . list

      +-----------+
      |    result |
      |-----------|
   1. |      good |
   2. |       bad |
   3. |      good |
   4. | excellent |
   5. |       bad |
      +-----------+

 . tabulate result, gen(res)
 
          result |      Freq.     Percent        Cum.
 ----------------+-----------------------------------
             bad |          2       40.00       40.00
       excellent |          1       20.00       60.00
            good |          2       40.00      100.00
 ----------------+-----------------------------------
           Total |          5      100.00

 . describe

 Contains data
   obs:             5                          
  vars:             4                          
  size:           110 (99.9% of memory free)
 ------------------------------------------------------------------------
               storage  display     value
 variable name   type   format      label      variable label
 ------------------------------------------------------------------------
 result          str15  %15s                   
 res1            byte   %8.0g                  result==bad
 res2            byte   %8.0g                  result==excellent
 res3            byte   %8.0g                  result==good
 ------------------------------------------------------------------------
 Sorted by:  
      Note:  dataset has changed since last saved
 
 . list

      +--------------------------------+
      |    result   res1   res2   res3 |
      |--------------------------------|
   1. |      good      0      0      1 |
   2. |       bad      1      0      0 |
   3. |      good      0      0      1 |
   4. | excellent      0      1      0 |
   5. |       bad      1      0      0 |
      +--------------------------------+
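
It is also worth knowing how tabulate with gen() handles missing values. The following is a minimal sketch on made-up data; it assumes the default behavior, in which tabulate leaves missing values out of the table unless the missing option is specified:

```stata
clear
input group
1
3
2
.
end

* Creates g1, g2, and g3 for the three observed values of group
tabulate group, generate(g)

list
```

The observation with missing group should have missing values in g1, g2, and g3 rather than zeros, so it is not silently counted as belonging to none of the categories.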

Answer 3 of 3: Use factor variables

There are two reasons to create dummy variables. One is convenience:

        . gen eligible = sex=="male" & (age>55 | (age>40 & enrolled))

It will be more convenient and less error prone to use eligible in subsequent statements:

        . list if eligible

        . tabulate age if eligible

The generate approach is best here.

Another reason is that you wish to fit a model with dummy variables:

        . tabulate group, gen(g)

        . regress y age tenure g2 g3

Here tabulate is convenient, but factor variables are even more convenient because you could simply type

        . regress y age tenure i.group

There are several factor-variable operators available in Stata for creating indicators and interactions. For more details, type help fvvarlist or see [U] Factor variables.
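
As a rough illustration of the two approaches side by side, the sketch below uses the auto dataset shipped with Stata in place of the y, age, tenure, and group variables above; the stub r and the scalar names b_dummy and b_factor are made up for this example:

```stata
sysuse auto, clear

* Approach 1: create the dummies explicitly and omit r1 as the base category
tabulate rep78, generate(r)
regress price mpg r2 r3 r4 r5
scalar b_dummy = _b[r3]

* Approach 2: let factor-variable notation create the indicators on the fly
regress price mpg i.rep78
scalar b_factor = _b[3.rep78]

* The two coefficients for category 3 should agree
display b_dummy, b_factor

* ib3. makes category 3 the base level instead of the lowest value
regress price mpg ib3.rep78
```

The factor-variable version leaves no extra variables in the dataset, and changing the base category is a matter of changing the operator rather than regenerating dummies.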
