How do I create a variable recording whether any members of a group (or all
members of a group) possess some characteristic?
Title: Creating variables recording whether any or all members of a group
possess some characteristic
Author: Nicholas J. Cox, Durham University, UK
Date: October 1999, updated February 2003
In the simplest case, we have a binary variable recording whether, for example,
persons are male or female, unemployed or employed, or whatever, and some
group variable, like a variable recording a family identifier. For example,
       family   person   sex
  1.        1        1     1
  2.        1        2     1
  3.        1        3     1
  4.        2        1     0
  5.        2        2     0
  6.        2        3     0
  7.        3        1     0
  8.        3        2     0
  9.        3        3     0
 10.        3        4     1
 11.        3        5     1
 12.        3        6     1
Suppose that sex is recorded as 1 for female and 0 for male. Such
0–1 coding is in a sense arbitrary but makes life easier, especially
for statistical modeling in which the response is a binary variable.
Imagine various families:
- Family 1 contains 3 females, so values of sex are 1, 1, 1
- Family 2 contains 3 males, so values of sex are 0, 0, 0
- Family 3 contains 3 males and 3 females, so values of sex
are 0, 0, 0, 1, 1, 1
From these examples, we can see a correspondence between two ways of
thinking about such families:
- If all members of a family are female, the minimum value of sex is
1 in that family and vice versa.
- If no members of a family are female, the maximum value of sex is
0 in that family and vice versa.
- If any member of a family is female, the maximum value of sex is 1
in that family and vice versa.
Thus egen provides a one-line answer here to each part of the question:
. egen anyfem = max(sex), by(family)
. egen allfem = min(sex), by(family)
anyfem or allfem will be 1 or 0 according to whether it is
true (1) or false (0) that any or all in a family are female.
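As a quick check, the following sketch rebuilds the small dataset shown earlier and applies the two egen commands. The variable nofem is a hypothetical addition, not part of the original answer, included only to illustrate the "no members" case, which is the negation of anyfem.

```stata
* Rebuild the example data: sex is 1 for female, 0 for male
clear
input family person sex
1 1 1
1 2 1
1 3 1
2 1 0
2 2 0
2 3 0
3 1 0
3 2 0
3 3 0
3 4 1
3 5 1
3 6 1
end

* Maximum of a 0/1 variable is 1 if any value is 1
egen anyfem = max(sex), by(family)

* Minimum of a 0/1 variable is 1 only if every value is 1
egen allfem = min(sex), by(family)

* nofem (hypothetical name): no females means the maximum of sex is 0
generate nofem = !anyfem

list, sepby(family)
```

With these data, family 1 should show anyfem and allfem both 1, family 2 both 0, and family 3 anyfem 1 but allfem 0.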
Real examples could be more complicated than this.
First, what if the characteristic of interest is not coded as a 0–1
variable? This approach is only barely more difficult. The syntax of
egen, min() and egen, max() is that each feeds on an
expression; see [D] egen. We could have typed
. egen anymale = max(sex == 0), by(family)
. egen allmale = min(sex == 0), by(family)
. egen anyDemo = max(pty == "D"), by(family)
. egen allDemo = min(pty == "D"), by(family)
In other words, we can use any expression that is true or false. That
expression, fed to max() or min(), will be evaluated
observation by observation with a result of 1 if true or 0 if false. The
expression can refer to numeric or string variables or to a combination of
the two.
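Here is a minimal sketch of expressions that involve a string variable, alone and combined with a numeric one. The data are invented for illustration, and anyfemDemo is a hypothetical variable name not used in the original answer.

```stata
* Invented data: pty is a string variable, sex is 1 for female, 0 for male
clear
input family sex str1 pty
1 1 "D"
1 0 "D"
2 1 "R"
2 0 "D"
3 1 "R"
3 0 "R"
end

* Each expression is evaluated observation by observation as 1 or 0
egen anyDemo = max(pty == "D"), by(family)
egen allDemo = min(pty == "D"), by(family)

* A combination of numeric and string conditions (hypothetical name):
* is there any female Democrat in the family?
egen anyfemDemo = max(sex == 1 & pty == "D"), by(family)

list, sepby(family)
```

Family 2 has a Democrat and a female, but they are different people, so anyDemo is 1 while anyfemDemo is 0.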
Second, what if missing values are present? For numeric variables, missing
counts as higher than any other numeric value, but egen, max() is
smart enough to ignore it. Only if all values in a group are missing will
the result variable be missing.
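A small sketch with invented data may make this behavior concrete. It assumes, as [D] egen documents, that both max() and min() ignore missing values unless every value in the group is missing.

```stata
* Invented data: some values of sex are missing
clear
input family sex
1 1
1 .
2 0
2 .
3 .
3 .
end

egen anyfem = max(sex), by(family)
egen allfem = min(sex), by(family)

list, sepby(family)
```

Families 1 and 2 get results based on their single known value, and only family 3, in which sex is missing for everyone, gets missing results. Note that allfem is 1 for family 1 even though one member's sex is unknown.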
Occasionally, you may want a strict definition of all—that literally
all values in a group must possess the characteristic, with no missing
values allowed. Here is one approach
. egen anymiss = max(missing(sex)), by(family)
. egen allfem = min(sex) if !anymiss, by(family)
and another
. egen anymiss = max(missing(sex)), by(family)
. egen allfem = min(sex), by(family)
. replace allfem = 0 if anymiss
The difference is, in the first case, any family with a member with
unknown sex will be coded as missing, whereas, in the second case, any family
with such a member will be coded as 0.
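The two approaches can be compared side by side in a sketch like the one below. It assumes that anymiss is meant to flag families with at least one missing value of sex, computed here with missing(sex). The data are invented, and the names allfem1 and allfem2 are hypothetical, used only so that both results can coexist in one dataset.

```stata
* Invented data: family 2 has one member of unknown sex
clear
input family sex
1 1
1 1
2 1
2 .
3 0
3 1
end

* 1 if any member of the family has missing sex, 0 otherwise
egen anymiss = max(missing(sex)), by(family)

* First approach: families with any unknown sex are coded as missing
egen allfem1 = min(sex) if !anymiss, by(family)

* Second approach: families with any unknown sex are coded as 0
egen allfem2 = min(sex), by(family)
replace allfem2 = 0 if anymiss

list, sepby(family)
```

The two results agree for families 1 and 3 and differ only for family 2, where allfem1 is missing and allfem2 is 0.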
In expressions, sex==0, for example, is false (0) if sex is
missing (that is, sex==0 does not evaluate to missing). But if we had
another variable in our data, say, grade taking on values 1, 2, 3, 4,
and so on, then grade>3 would be true even if grade were missing, because
numeric missing counts as higher than any other numeric value.
Think of a missing value as positive infinity. In some instances, excluding
missing values explicitly is the most appropriate specification.
. egen anyhigh = max(grade > 3 & grade < .), by(group)
. egen allhigh = min(grade > 3 & grade < .), by(group)
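To see what the extra condition changes, the sketch below computes each variable with and without it. The data are invented, and the _naive names are hypothetical labels for the versions that do not exclude missing values.

```stata
* Invented data: groups 1 and 2 each have one missing grade
clear
input group grade
1 4
1 .
2 2
2 .
3 4
3 5
end

* Without the guard, a missing grade counts as greater than 3
egen anyhigh_naive = max(grade > 3), by(group)
egen allhigh_naive = min(grade > 3), by(group)

* With the guard, a missing grade makes the expression false (0)
egen anyhigh = max(grade > 3 & grade < .), by(group)
egen allhigh = min(grade > 3 & grade < .), by(group)

list, sepby(group)
```

In group 2, the only known grade is 2, yet anyhigh_naive is 1 because of the missing value, whereas anyhigh is 0. In group 1, allhigh_naive is 1 but allhigh is 0, because the missing grade is treated as not high. The condition !missing(grade) could be used in place of grade < . with the same effect.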