How do I create a variable recording whether any members of a group (or all members of a group) possess some characteristic?

Title   Creating variables recording whether any or all members of a group possess some characteristic
Author Nicholas J. Cox, Durham University, UK

In the simplest case, we have a binary variable recording whether, for example, persons are male or female, unemployed or employed, or whatever, and some group variable, like a variable recording a family identifier. For example,

          family     person       female 
  1.         1          1          1  
  2.         1          2          1  
  3.         1          3          1  

  4.         2          1          0  
  5.         2          2          0  
  6.         2          3          0  

  7.         3          1          0  
  8.         3          2          0  
  9.         3          3          0  
 10.         3          4          1  
 11.         3          5          1  
 12.         3          6          1  

Suppose that female is recorded as 1 for female and 0 for male. Such 0–1 coding is in a sense arbitrary but makes life easier, especially for statistical modeling in which the response is a binary variable.

Imagine various families:

  1. contains 3 females, so values of female are 1, 1, 1
  2. contains 3 males, so values of female are 0, 0, 0
  3. contains 3 males and 3 females, so values of female are 0, 0, 0, 1, 1, 1

From these examples, we can see a correspondence between two ways of thinking about such families:

  1. If all members of a family are female, the minimum value of female is 1 in that family and vice versa.
  2. If no members of a family are female, the maximum value of female is 0 in that family and vice versa.
  3. If any member of a family is female, the maximum value of female is 1 in that family and vice versa.

Thus egen provides a one-line answer here to each part of the question:

        . egen anyfem = max(female), by(family) 
        . egen allfem = min(female), by(family) 

anyfem will be 1 if any member of a family is female and 0 otherwise; allfem will be 1 if all members of a family are female and 0 otherwise.
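
As a quick check on results like these, one observation per family is enough to show the group-level values. The following is only a sketch: tag is a variable name chosen here for illustration, and egen's tag() function marks exactly one observation in each family.

        . * mark one observation per family (tag is an illustrative name)
        . egen tag = tag(family)
        . * show the group-level results once per family
        . list family anyfem allfem if tag

With the example data above, family 1 should show anyfem and allfem both equal to 1, family 2 should show both equal to 0, and family 3 should show anyfem equal to 1 but allfem equal to 0.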

Real examples could be more complicated than this.

First, what if the characteristic of interest is not coded as a 0–1 variable? This approach is only barely more difficult. The syntax of egen, min() and egen, max() is that each feeds on an expression; see [D] egen. We could have typed

        . egen anymale = max(female == 0), by(family)
        . egen allmale = min(female == 0), by(family)
        . egen anyDemo = max(pty == "D"), by(family)
        . egen allDemo = min(pty == "D"), by(family)

In other words, we can use any expression that is true or false. That expression, fed to max() or min(), will be evaluated observation by observation with a result of 1 if true or 0 if false. The expression can refer to numeric or string variables or to a combination of the two.
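
To illustrate that last point, a single expression can combine a numeric condition and a string condition. This sketch assumes the same 0-1 variable female and the string variable pty used above; the new variable names are invented for the example.

        . * 1 if any member is both female and coded "D"
        . egen anyfemDemo = max(female == 1 & pty == "D"), by(family)
        . * 1 if every member is either female or coded "D"
        . egen allfemorDemo = min(female == 1 | pty == "D"), by(family)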

Second, what if missing values are present? For numeric variables, missing counts as higher than any other numeric value, but egen, max() is smart enough to ignore it, and egen, min() ignores it too. Only if all values in a group are missing will the result variable be missing.
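
Before relying on that behavior, it can help to see which groups contain missing values at all. A minimal sketch, with nmiss as an invented variable name:

        . * number of missing values of female in each family
        . egen nmiss = total(missing(female)), by(family)
        . * inspect the members of every family with at least one missing value
        . list family person female if nmiss > 0, sepby(family)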

Occasionally, you may want a strict definition of all—that literally all values in a group must possess the characteristic, with no missing values allowed. Here is one approach:

        . egen anymiss = max(missing(female)), by(family)
        . egen allfem = min(female) if !anymiss, by(family)

Here is another:

        . egen anymiss = max(missing(female)), by(family)
        . egen allfem = min(female), by(family)
        . replace allfem = 0 if anymiss

The difference is that in the first case any family with a member of unknown sex will be coded as missing, whereas in the second case any such family will be coded as 0.
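
If female takes only the values 0, 1, and missing, coding every family with a member of unknown sex as 0 can also be done in one line, because female == 1 evaluates to 0 when female is missing. This is a sketch under that assumption, not a replacement for the commands above, and allfem0 is an invented name.

        . * 1 only if every member is recorded as female; 0 if any is male or missing
        . egen allfem0 = min(female == 1), by(family)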

In expressions, for example, female==0 is false (0) if female is missing (that is, female==0 does not evaluate to missing). If we had another variable in our data, say, grade taking on values 1, 2, 3, 4, ..., then grade>3 is true even if grade is missing, because numeric missing values count as higher than any other numeric value. Think of missing values as positive infinity. In some instances, excluding missing values explicitly is the most appropriate specification.

        . egen anyhigh = max(grade > 3 & grade < .), by(group)
        . egen allhigh = min(grade > 3 & grade < .), by(group)
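
The behavior of missing values in such comparisons is easy to confirm directly. In this sketch, the display commands use only the system missing value, and the last command shows an equivalent way to write the guard with the missing() function; anyhigh2 is an invented name.

        . * missing counts as higher than any number, so this displays 1
        . display (. > 3)
        . * missing is not equal to 0, so this displays 0
        . display (. == 0)
        . * same condition as grade > 3 & grade < ., written with missing()
        . egen anyhigh2 = max(grade > 3 & !missing(grade)), by(group)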

Acknowledgement

Thanks to Tom Rogers for highlighting an incorrect detail in an earlier version.