How do I create dummy variables?
Title: Creating dummy variables
Author: William Gould, StataCorp
Date: March 1997; updated July 2011
A dummy variable is a variable that takes on the values 1 and 0; 1 means
something is true (such as age < 25, sex is male, or in the category
“very much”).
Dummy variables are also called indicator variables.
There are three ways to create dummy variables: one is to use
generate, which creates one dummy variable at a time; another is to use
tabulate, which creates whole sets of dummies at once; and the third
is to use factor variables, which may let you avoid creating dummy
variables altogether.
Answer 1 of 3: Use generate
You could type
. gen young = 0
. replace young = 1 if age<25
or
. gen young = (age<25)
This single statement does the same thing as the first two statements.
age<25 is an expression, and Stata evaluates it, returning 1 if
the expression is true and 0 if it is false.
If you have missing values in your data, it is better to type
. gen young = 0
. replace young = 1 if age<25
. replace young = . if missing(age)
or
. gen young = (age<25) if !missing(age)
Stata treats a missing value as positive infinity, so the expression
age<25 evaluates to 0, not missing, when age is missing.
(If the expression were age>25, the expression would evaluate to 1
when age is missing.)
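To see the difference, here is a minimal sketch that can be run as a do-file. The five ages are made up for illustration, and young_naive is a hypothetical name for the unguarded version.

```stata
* Made-up data: four known ages and one missing age.
clear
input age
18
24
30
61
.
end

* Unguarded version: missing compares as larger than 25, so the
* observation with missing age gets young_naive = 0.
generate young_naive = age < 25

* Guarded version: young stays missing where age is missing.
generate young = age < 25 if !missing(age)

* The missing option makes tabulate show missing values as a row.
tabulate young_naive, missing
tabulate young, missing

* Observations coded 0 only because age is missing.
count if young_naive == 0 & missing(age)
```

Comparing the two tables shows the observation with missing age counted among the zeros in the first and reported as missing in the second.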
You do not have to type the parentheses around the expression.
. gen young = age<25 if !missing(age)
is good enough. Here are some more illustrations of generating dummy
variables:
. gen male = sex==1
. gen top = answer=="very much"
. gen eligible = sex=="male" & (age>55 | (age>40 & enrolled))
In the above line, enrolled is itself a dummy variable—a
variable taking on values zero and one. We could have typed &
enrolled==1, but typing & enrolled is good enough.
Just as Stata returns 1 for true and 0 for false, Stata assumes that 1 means
true and that 0 means false.
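One caution when using a variable by itself as a condition: Stata treats any nonzero value as true, and that includes missing. The sketch below uses a made-up three-observation dataset to show that if enrolled and if enrolled==1 can differ when enrolled contains missing values.

```stata
* Made-up data: one 0, one 1, and one missing value.
clear
input enrolled
0
1
.
end

* Counts the observation with 1 and the observation with missing,
* because missing is nonzero.
count if enrolled

* Counts only the observation with 1.
count if enrolled == 1
```

If a dummy may contain missing values, typing the ==1 explicitly is the safer choice.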
Answer 2 of 3: Use tabulate
tabulate with the generate() option will generate whole sets
of dummy variables.
Say that variable group takes on the values 1, 2, and 3. If you type
. tabulate group
you will see a frequency table of how many times group takes on each of
those values. If you type
. tabulate group, gen(g)
you will see the table, and tabulate will create variables named
g1, g2, and g3 that take on values 1 and 0, g1
being 1 when group==1, g2 being 1 when group==2, and
g3 being 1 when group==3. Watch:
. list
     +-------+
     | group |
     |-------|
  1. |     1 |
  2. |     3 |
  3. |     2 |
  4. |     1 |
  5. |     2 |
     |-------|
  6. |     2 |
     +-------+
. tabulate group, gen(g)
      group |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2       33.33       33.33
          2 |          3       50.00       83.33
          3 |          1       16.67      100.00
------------+-----------------------------------
      Total |          6      100.00
. list
     +----------------------+
     | group   g1   g2   g3 |
     |----------------------|
  1. |     1    1    0    0 |
  2. |     3    0    0    1 |
  3. |     2    0    1    0 |
  4. |     1    1    0    0 |
  5. |     2    0    1    0 |
     |----------------------|
  6. |     2    0    1    0 |
     +----------------------+
What you name the variable is up to you. If we had typed
. tabulate group, gen(res)
the new variables would have been named res1, res2, and
res3.
The variable being tabulated does not need to take sequential values or
even integer values. Here is another example:
. list
     +------+
     |    x |
     |------|
  1. |   -1 |
  2. | 3.14 |
  3. |    8 |
  4. |   -1 |
  5. |    8 |
     +------+
. tab x, gen(xval)
          x |      Freq.     Percent        Cum.
------------+-----------------------------------
         -1 |          2       40.00       40.00
       3.14 |          1       20.00       60.00
          8 |          2       40.00      100.00
------------+-----------------------------------
      Total |          5      100.00
. list
     +------------------------------+
     |    x   xval1   xval2   xval3 |
     |------------------------------|
  1. |   -1       1       0       0 |
  2. | 3.14       0       1       0 |
  3. |    8       0       0       1 |
  4. |   -1       1       0       0 |
  5. |    8       0       0       1 |
     +------------------------------+
To find out which value each new variable corresponds to, use
describe; the variable labels record it:
. describe
Contains data
  obs:             5
 vars:             4
 size:            55 (99.9% of memory free)
------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
------------------------------------------------------------------------
x               float  %9.0g
xval1           byte   %8.0g                  x== -1.0000
xval2           byte   %8.0g                  x== 3.1400
xval3           byte   %8.0g                  x== 8.0000
------------------------------------------------------------------------
Sorted by:
     Note:  dataset has changed since last saved
Finally, tabulate can be used with string variables:
. list
     +-----------+
     |    result |
     |-----------|
  1. |      good |
  2. |       bad |
  3. |      good |
  4. | excellent |
  5. |       bad |
     +-----------+
. tabulate result, gen(res)
         result |      Freq.     Percent        Cum.
----------------+-----------------------------------
            bad |          2       40.00       40.00
      excellent |          1       20.00       60.00
           good |          2       40.00      100.00
----------------+-----------------------------------
          Total |          5      100.00
. describe
Contains data
  obs:             5
 vars:             4
 size:           110 (99.9% of memory free)
------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
------------------------------------------------------------------------
result          str15  %15s
res1            byte   %8.0g                  result==bad
res2            byte   %8.0g                  result==excellent
res3            byte   %8.0g                  result==good
------------------------------------------------------------------------
Sorted by:
     Note:  dataset has changed since last saved
. list
     +--------------------------------+
     |    result   res1   res2   res3 |
     |--------------------------------|
  1. |      good      0      0      1 |
  2. |       bad      1      0      0 |
  3. |      good      0      0      1 |
  4. | excellent      0      1      0 |
  5. |       bad      1      0      0 |
     +--------------------------------+
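In each listing above, every observation has a 1 in exactly one dummy of its set. A minimal sketch of how one might verify that after running tabulate, written as a do-file and using the six group values from the first example; check is a hypothetical variable name:

```stata
* The six group values from the first example in this section.
clear
input group
1
3
2
1
2
2
end

* Create g1, g2, and g3.
tabulate group, generate(g)

* Sum the dummies across each observation.
egen check = rowtotal(g1 g2 g3)

* assert stops with an error if any observation violates the condition.
assert check == 1
```

If assert returns silently, the dummies are mutually exclusive and exhaustive for these data.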
Answer 3 of 3: Use factor variables
There are two reasons to create dummy variables. One is convenience:
. gen eligible = sex=="male" & (age>55 | (age>40 & enrolled))
It will be more convenient and less error prone to use eligible in
subsequent statements:
. list if eligible
. tabulate age if eligible
The generate approach is best here.
Another reason is that you want to fit a model with dummy variables:
. tabulate group, gen(g)
. regress y age tenure g2 g3
Here tabulate is convenient, but factor variables are even more
convenient because you could simply type
. regress y age tenure i.group
There are several factor-variable operators available in Stata for
creating indicators and interactions. For more details, type
help fvvarlist
or see [U] Factor variables.
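As a sketch of a few of these operators, the commands below use the auto dataset that ships with Stata rather than the y, age, tenure, and group variables above. The choice of model is only for illustration.

```stata
* Load the example dataset installed with Stata.
sysuse auto, clear

* i. enters an indicator for each level of rep78 except the base level,
* which by default is the lowest level.
regress price mpg weight i.rep78

* ib3. does the same but uses rep78==3 as the base level.
regress price mpg weight ib3.rep78

* ## enters the indicator for foreign, continuous mpg (marked with c.),
* and their interaction.
regress price weight i.foreign##c.mpg
```

None of these commands adds a variable to the dataset; the indicators exist only for the duration of the estimation command.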