Title: Creating dummy variables

Author: William Gould, StataCorp

Date: March 1997; updated July 2016

A dummy variable is a variable that takes on the values 1 and 0; 1 means something is true (such as age < 25, sex is male, or in the category “very much”).

Dummy variables are also called indicator variables.

As we will see shortly, in most cases, if you use factor-variable notation, you do not need to create dummy variables.

In cases where factor variables are not the answer, you may use
**generate** to create one dummy variable at a time and
**tabulate** to create a set of dummies at once.

I have a discrete variable, **size**, that takes on the integer values 0 to 4:

. tabulate size

       size |      Freq.     Percent        Cum.
------------+-----------------------------------
  miniature |         19       19.00       19.00
      small |         37       37.00       56.00
     normal |         30       30.00       86.00
      large |         12       12.00       98.00
       huge |          2        2.00      100.00
------------+-----------------------------------
      Total |        100      100.00

If I want a dummy for all levels of **size** except for a comparison group
or base level, I do not need to create 4 dummies. Using
[U] **factor variables**, I may type

. summarize i.size

or use an estimator

. regress y x i.size

If I want to use a dummy that is 1 if **size** is large (**size==3**) and 0 otherwise, I type

. regress y x 3.size

If I want to make the comparison group, or base level, of **size**
be **size==3** instead of the default **size==0**, I type

. regress y x ib3.size

You can also use factor-variable notation to refer to categorical variables, their interactions, or interactions between categorical and continuous variables.

For example, I can specify the interaction of each level of **size**
(except the base level) and the continuous variable **x** by typing

. regress y x i.size#c.x

The **c.** instructs Stata that variable **x** is continuous.

In all the cases above, you did not need to create a variable.

Moreover, many of Stata's postestimation facilities, including in particular
the **margins** command, are aware of factor variables and will handle them
elegantly when making computations.
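
As a concrete sketch of the factor-variable approach, the commands below use the auto dataset that ships with Stata rather than the **size** data above. The variables **price**, **mpg**, and **rep78** belong to that dataset and are stand-ins for **y**, **x**, and **size**; they are assumptions made for illustration only.

```stata
* Illustration only: auto.dta stands in for the data discussed above
sysuse auto, clear

* Indicators for each level of rep78 except the base level (rep78==1)
regress price mpg i.rep78

* Same model with rep78==3 as the base level
regress price mpg ib3.rep78

* Predictive margins of price at each level of rep78
margins rep78
```

No indicator variables are added to the dataset by any of these commands; the indicators exist only for the duration of each command.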

There are some instances where creating dummies might be worthwhile. We illustrate these below.

To create a dummy variable **young** that is 1 if **age** is less than 25 and 0 otherwise, you could type

. generate young = 0
. replace young = 1 if age<25

or

. generate young = (age<25)

This statement does the same thing as the first two statements.
**age<25** is an expression, and Stata evaluates it, returning 1 if
the expression is true and 0 if it is false.

If you have missing values in your data, it is better to type

. generate young = 0
. replace young = 1 if age<25
. replace young = . if missing(age)

or

. generate young = (age<25) if !missing(age)

Stata treats a missing value as positive infinity, so the expression
**age<25** evaluates to 0, not missing, when **age** is missing.
(If the expression were **age>25**, the expression would evaluate to 1
when **age** is missing.)
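
The effect of the missing-value guard can be checked on a small dataset. The sketch below uses a hypothetical three-observation dataset typed in with **input**; the variable name **young_naive** is made up for the comparison.

```stata
* Hypothetical three-observation dataset; the third age is missing
clear
input age
20
30
.
end

* Without the guard, the missing age compares as larger than 25, giving 0
generate young_naive = age<25

* With the guard, the dummy is missing wherever age is missing
generate young = age<25 if !missing(age)

list

* Both checks pass silently if the behavior is as described
assert young_naive == 0 if missing(age)
assert missing(young) if missing(age)
```

The two variables agree wherever **age** is observed and differ only in the third observation.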

You do not have to type the parentheses around the expression.

. generate young = age<25 if !missing(age)

is good enough. Here are some more illustrations of generating dummy variables:

. generate male = sex==1
. generate top = answer=="very much"
. generate eligible = sex=="male" & (age>55 | (age>40 & enrolled)) if !missing(age)

In the last command above, **enrolled** is itself a dummy variable, that is, a
variable taking on the values 0 and 1. We could have typed **&
enrolled==1**, but typing **& enrolled** is good enough.

Just as Stata returns 1 for true and 0 for false, Stata assumes that 1 means true and that 0 means false.
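
One caveat: Stata treats any nonzero value as true, and missing values count as nonzero. The shorthand **& enrolled** therefore matches **& enrolled==1** only when **enrolled** has no missing values. The sketch below shows the difference on a hypothetical three-observation dataset; the variable names **a** and **b** are made up for the comparison.

```stata
* Hypothetical data: enrolled is 1, 0, and missing
clear
input age enrolled
45 1
45 0
45 .
end

* Shorthand: a missing enrolled is nonzero, so it counts as true
generate byte a = age>40 & enrolled

* Explicit comparison: a missing enrolled is not equal to 1
generate byte b = age>40 & enrolled==1

list

* The two versions differ only where enrolled is missing
assert a == b if !missing(enrolled)
assert a == 1 & b == 0 if missing(enrolled)
```

If the dummy may contain missing values, the explicit **==1** form, or an added **if !missing(enrolled)** qualifier, is the safer choice.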

**tabulate** with the **generate()** option will generate whole sets
of dummy variables.

Say that variable **group** takes on the values 1, 2, and 3. If you type

. tabulate group

you will see a frequency table of how many times group takes on each of those values. If you type

. tabulate group, generate(g)

you will see the table, and **tabulate** will create variables named
**g1**, **g2**, and **g3** that take on values 1 and 0, **g1**
being 1 when **group==1**, **g2** being 1 when **group==2**, and
**g3** being 1 when **group==3**. Watch:

. list

     +-------+
     | group |
     |-------|
  1. |     1 |
  2. |     3 |
  3. |     2 |
  4. |     1 |
  5. |     2 |
     |-------|
  6. |     2 |
     +-------+

. tabulate group, generate(g)

      group |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2       33.33       33.33
          2 |          3       50.00       83.33
          3 |          1       16.67      100.00
------------+-----------------------------------
      Total |          6      100.00

. list

     +----------------------+
     | group   g1   g2   g3 |
     |----------------------|
  1. |     1    1    0    0 |
  2. |     3    0    0    1 |
  3. |     2    0    1    0 |
  4. |     1    1    0    0 |
  5. |     2    0    1    0 |
     |----------------------|
  6. |     2    0    1    0 |
     +----------------------+

What you name the variable is up to you. If we had typed

. tabulate group, generate(res)

the new variables would have been named **res1**, **res2**, and
**res3**.
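
To see how a set of dummies created this way relates to factor-variable notation, the sketch below again uses the auto dataset shipped with Stata. The stub **rep_** and the choice of **price**, **mpg**, and **rep78** are assumptions made for illustration.

```stata
sysuse auto, clear

* Creates rep_1 through rep_5, one indicator per observed level of rep78
tabulate rep78, generate(rep_)

* Leaving out rep_1 makes rep78==1 the comparison group
regress price mpg rep_2 rep_3 rep_4 rep_5

* The factor-variable specification of the same model
regress price mpg i.rep78
```

The two regressions should report the same coefficients. The difference is that the first leaves five new variables in the dataset and the second leaves none.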

The variable being tabulated does not have to take sequential values, or even integer values. Here is another example:

. list

     +------+
     |    x |
     |------|
  1. |   -1 |
  2. | 3.14 |
  3. |    8 |
  4. |   -1 |
  5. |    8 |
     +------+

. tabulate x, generate(xval)

          x |      Freq.     Percent        Cum.
------------+-----------------------------------
         -1 |          2       40.00       40.00
       3.14 |          1       20.00       60.00
          8 |          2       40.00      100.00
------------+-----------------------------------
      Total |          5      100.00

. list

     +------------------------------+
     |    x   xval1   xval2   xval3 |
     |------------------------------|
  1. |   -1       1       0       0 |
  2. | 3.14       0       1       0 |
  3. |    8       0       0       1 |
  4. |   -1       1       0       0 |
  5. |    8       0       0       1 |
     +------------------------------+

You can find out which value each dummy corresponds to from the variable labels shown by **describe**:

. describe

Contains data
  obs:             5
 vars:             4
 size:            55 (99.9% of memory free)
------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
------------------------------------------------------------------------
x               float  %9.0g
xval1           byte   %8.0g                  x== -1.0000
xval2           byte   %8.0g                  x== 3.1400
xval3           byte   %8.0g                  x== 8.0000
------------------------------------------------------------------------
Sorted by:
     Note:  dataset has changed since last saved

Finally, **tabulate** can be used with string variables:

. list

     +-----------+
     |    result |
     |-----------|
  1. |      good |
  2. |       bad |
  3. |      good |
  4. | excellent |
  5. |       bad |
     +-----------+

. tabulate result, generate(res)

         result |      Freq.     Percent        Cum.
----------------+-----------------------------------
            bad |          2       40.00       40.00
      excellent |          1       20.00       60.00
           good |          2       40.00      100.00
----------------+-----------------------------------
          Total |          5      100.00

. describe

Contains data
  obs:             5
 vars:             4
 size:           110 (99.9% of memory free)
------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
------------------------------------------------------------------------
result          str15  %15s
res1            byte   %8.0g                  result==bad
res2            byte   %8.0g                  result==excellent
res3            byte   %8.0g                  result==good
------------------------------------------------------------------------
Sorted by:
     Note:  dataset has changed since last saved

. list

     +--------------------------------+
     |    result   res1   res2   res3 |
     |--------------------------------|
  1. |      good      0      0      1 |
  2. |       bad      1      0      0 |
  3. |      good      0      0      1 |
  4. | excellent      0      1      0 |
  5. |       bad      1      0      0 |
     +--------------------------------+