
st: dummy variable generation [was: simple question]

From   "Nick Cox" <>
To   <>
Subject   st: dummy variable generation [was: simple question]
Date   Tue, 5 Feb 2008 19:31:24 -0000

This "simple question" has generated a thread with eight replies so far,
so it clearly poses a challenge. By the way, please use more informative
titles for your postings, Renuka! 

Maarten Buis, Svend Juul, Martin Weiss and E. Paul Wileyto all made good
points, but none gave my favoured solution, leaving scope for a ninth reply.

Solution first, then comment: 

gen byte redundism = cond(missing(zredundab, zdismissa), ., ///
	(zredundab == 1 | zdismissa == 1)) 

For a problem like this we seek first correctness and then as far as
possible clarity, conciseness and efficiency. 

Paul and Maarten flagged that missing values need to be handled properly
whenever they exist. Coding on the assumption that missings might be
present is always safe. 

missing(a, b) will evaluate to 1, meaning true, whenever one or both of
a or b is missing. Hence the first two arguments of the call to
-cond(,)- above: 

missing(zredundab, zdismissa), . 

yield missing results for the dummy if either variable is missing. 
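
To see this at work, here is a minimal sketch on a made-up dataset. The four observations are invented for illustration; only the two variable names come from the question, and anymiss is a hypothetical name.

```stata
* Toy data: two indicators, some values missing
clear
input zredundab zdismissa
1 0
0 .
. .
0 0
end

* missing() returns 1 if any of its arguments is missing, 0 otherwise
gen byte anymiss = missing(zredundab, zdismissa)
list, noobs
```

Only the second and third observations should be flagged, as those are the ones with at least one missing value.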

The third argument 

(zredundab == 1 | zdismissa == 1) 

will evaluate to 1 when true and 0 when false (as Martin stressed),
completing the assignment in a single command. The FAQ "What is true and
false in Stata?" (2/03) gives a longer discussion with more examples. 
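
A quick way to convince yourself is a sketch like this, typed at the command line (the numbers are arbitrary):

```stata
* Logical expressions return 1 when true and 0 when false
display (2 > 1)
display (2 == 1)
display (1 == 1 | 2 == 1)

* Comparing a missing value with 1 is false, so the result is 0, not missing
display (. == 1)
```

The last line is the reason for wrapping the condition in -cond()- with -missing()-: left alone, a missing value would be coded 0 rather than missing.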

Insisting on a -byte- variable is for efficiency in storage. If you
generate lots of floats for dummies, it may be Stata that will bite you
when you run into memory problems. More bytes, fewer bites. 
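
As a rough illustration, assuming the default storage type has not been changed with -set type-, compare the two (d_byte and d_float are hypothetical names):

```stata
clear
set obs 1000

* Explicit byte: 1 byte per observation
gen byte d_byte = 1

* No type given: Stata uses the default, float, at 4 bytes per observation
gen d_float = 1

* The storage type column shows the difference
describe d_byte d_float
```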

As Svend signalled, considering all the cross-combinations in a truth
table is good technique. 

                     zredundab
                     0       1
     zdismissa 0     a       b 
               1     c       d

(zredundab == 1 | zdismissa == 1) covers cells b, c and d of the table
above. That leaves just cell a, which is defined by (zredundab == 0 &
zdismissa == 0). But you need not puzzle that out. Just negating the
condition would solve the problem. That is, the two conditions 

(zredundab == 1 | zdismissa == 1)

and

!(zredundab == 1 | zdismissa == 1)

are complementary and divide up the field. Just parenthesising and
negating is especially useful as conditions get more and more
complicated. 
I know that some people may want to spell out each step 

gen redundism = 1 if (zredundab == 1 | zdismissa == 1)
replace redundism = 0 if !(zredundab == 1 | zdismissa == 1)
replace redundism = . if missing(zredundab, zdismissa) 

but the only advantage of that is that it may appear clearer to you or
your readers. It is best to internalise, as soon as you can, the Stata
fact that logical conditions evaluate to 0 or 1, as it is so useful. 
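
If you want reassurance that the two routes agree, a sketch along these lines on made-up data should do it (one_step and three_step are hypothetical names):

```stata
* Toy data: all 0/1 combinations plus some missings
clear
input zredundab zdismissa
0 0
0 1
1 0
1 1
1 .
. 0
. .
end

* One command
gen byte one_step = cond(missing(zredundab, zdismissa), ., ///
    (zredundab == 1 | zdismissa == 1))

* Three commands
gen three_step = 1 if (zredundab == 1 | zdismissa == 1)
replace three_step = 0 if !(zredundab == 1 | zdismissa == 1)
replace three_step = . if missing(zredundab, zdismissa)

* The two variables agree in every observation, missings included
assert one_step == three_step
```

The order of the three commands matters: the negated condition is true for missings too, so the final -replace- has to come last.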

Finally, -egen- is evidently not needed here. Its use to compute a row
sum of two variables is very inefficient, replacing one command by
dozens once -egen- is interpreted. 
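
For comparison, here is a sketch of a row sum done both ways. I am assuming the -egen- function in question is rowtotal(); the data and the names total_egen and total_plus are made up for illustration.

```stata
clear
input zredundab zdismissa
0 0
0 1
1 1
1 .
end

* Row sum via -egen-: rowtotal() treats missing values as 0 by default
egen total_egen = rowtotal(zredundab zdismissa)

* Direct arithmetic: the sum is missing if either variable is missing
gen byte total_plus = zredundab + zdismissa

list, noobs
```

Beyond speed, note the difference in the last observation: the -egen- total is 1, whereas the direct sum is missing.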


Renuka Metcalfe

I want to create a dummy variable redundism which
equals 1 if the establishment has had any
dismissals or redundancies in the past 12 months. I
would be grateful if anyone would let me know whether the
following is the correct way to do it. There is a
debate amongst us as to whether in the second line it
should be "|" or "&". I would be grateful if you
would confirm whether the following is correct or
whether it should be "&".

ge redundism=.
replace redundism=1 if zredundab==1|zdismissa==1
replace redundism=0 if zredundab==0|zdismissa==0
