How do I create variables summarizing for each individual properties of the
other members of a group?
Title: Creating variables recording properties of the other members of a group
Author: Nicholas J. Cox, Durham University, UK
Date: May 2001; updated April 2005
1. Examples: data on families
Suppose you have data on families. For each person in each family, it
may be useful to calculate variables that summarize properties of the other
members of the same family. How many other children are there? What is their
average, maximum, or minimum age? Is there an older child or a younger
child? The more general problem can be described as summarizing properties,
for each individual, of the other members of the same group.
Let us look at some invented data. For what follows, it is essential to
have a group identifier, so, in this example, we have an identifier for each
family. It is not always essential to have an individual identifier, but
what follows does depend on each person occurring just once in the
dataset. In practice, however, such data usually include individual
identifiers.
      family   person   sex   age
  1.       1        1     1    36
  2.       1        2     1    16
  3.       1        3     1    14
  4.       2        1     0    45
  5.       2        2     1    42
  6.       2        3     0    14
  7.       2        4     1    12
  8.       2        5     0    10
  9.       3        1     0    39
 10.       3        2     1    36
 11.       3        3     0    11
 12.       3        4     1     9
 13.       3        5     1     7
 14.       3        6     1     3
We will suppose that sex is recorded as 1 for female and 0 for male.
Such 0–1 coding is in a sense arbitrary, but it makes life easier,
especially for statistical modeling in which the response is a binary
variable and (more directly important here) for counting values within each
group.
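To follow along, the listing above can be typed in with input. The lines below are a minimal sketch meant for a do-file (so without the dot prompts); the variable nfemale is a hypothetical name introduced here only to show how the 0/1 coding makes counting easy.

```stata
* Enter the invented example data
clear
input family person sex age
1 1 1 36
1 2 1 16
1 3 1 14
2 1 0 45
2 2 1 42
2 3 0 14
2 4 1 12
2 5 0 10
3 1 0 39
3 2 1 36
3 3 0 11
3 4 1  9
3 5 1  7
3 6 1  3
end

* Because sex is coded 0/1, its total within each family is the
* number of females in that family (nfemale is an illustrative name)
egen nfemale = total(sex), by(family)
list, sepby(family)
```

With this coding, no separate indicator variable is needed before counting: the total of a 0/1 variable is the count of 1s.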
2. Specific problem: for each child, how many other children are there?
Let us define children as those whose age is 17 and under. For each
child, how many other children are there? This is simply the number of
children in the family, minus 1 if the person in question is a child. (In
family 3, with 4 children, for each child there are 3 other children.)
For any calculation like this, it is always worth looking to see whether
egen provides an answer to at least part of the problem. Many functions have
been written for egen. In particular, egen, total() by() is natural for
producing totals, including counts, separately for groups defined by one or
more variables specified as arguments to by(). egen, count() by() is also
often useful but is a little less general in application, so we will
concentrate here on total(). total() in Stata 9 and later releases is a
replacement for sum() in Stata 8.
. egen nchild = total(age <= 17), by(family)
. replace nchild = nchild - (age <= 17)
age <= 17 is true (evaluates to 1) whenever age is less than or equal to
17, and false (evaluates to 0) otherwise. Adding up the 1s and 0s within
egen, total() is the same as counting the observations for which age <= 17.
We then subtract the value of age <= 17 (1 for a child, 0 for anyone else)
from each observation, so that children do not count themselves. The effect
of the by(family) option is to count within families, each family being a
group of observations with the same value of family. The effect of the
replace correction is confined to individual observations.
The syntax for egen indicates that total() works on an
expression exp. The argument need not be a single variable but can
usefully be something more complicated. Being interested only in other
female children is not any more difficult:
. egen nsisters = total(age <= 17 & sex == 1), by(family)
. replace nsisters = nsisters - (age <= 17 & sex == 1)
This solution also assigns values to adults, those with age greater
than or equal to 18. This could be useful, or not useful, depending on your
substantive problem. If you wanted to exclude adults completely from the
calculation, you could specify if age <= 17 on the egen
command, and values for adults would then be missing (.).
If we wanted to count not “other children” but “other
adults”, we should be a little more careful. The expression age >= 18
is also true when age is missing, because in Stata numeric missing counts
as higher than any other numeric value. Unless we know that we can treat
missing ages as adults, we will usually want to exclude them by using the
condition age >= 18 & age < . instead.
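As a minimal sketch of the point about missing ages, the do-file lines below count other adults in a tiny dataset invented for this illustration, in which one person has a missing age. The variable name nadult is hypothetical.

```stata
* Invented data: person 3 has a missing age
clear
input family person age
1 1 36
1 2 40
1 3  .
1 4 14
end

* age >= 18 alone would be true for person 3, so guard with age < .
egen nadult = total(age >= 18 & age < .), by(family)
replace nadult = nadult - (age >= 18 & age < .)
list
```

Persons 1 and 2 each have one other adult in the family, and persons 3 and 4 each have two. Without the guard age < ., person 3 would have been counted as a third adult.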
3. Generic problem: totals and means
Other totals, and by extension means, can be calculated using the same
general approach. Put simply,
- Calculate the total for each group.
- Subtract each member’s contribution from that total (possibly,
the contribution is 0).
- If needed, calculate the mean as the total divided by the number of
values.
What is the average age of the other children in each family? Here is one
solution:
. egen totalage = total(age) if age <= 17, by(family)
. replace totalage = totalage - age
. generate meanage = totalage/nchild
This solution excludes the adults. Not only are they not included in the
summation of age, but they also receive missing values for the
result. In the replace command, we can be cavalier about excluding
or including the adults; either way, the missing values will not be
changed.
If we want to include the adults—that is, we want a record for each
adult of the average age of the children—here is a solution:
. egen totalage = total(age * (age <= 17)), by(family)
. replace totalage = totalage - age * (age <= 17)
. generate meanage = totalage/nchild
Here the multiplier age <= 17 says the summand is 0
whenever age is 18 or more, so the total is the correct total
and is assigned to all observations in each family.
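The arithmetic can be checked by hand on family 1 of the example data. The sketch below, written for a do-file, repeats the commands from sections 2 and 3 on that one family and puts the expected values in comments.

```stata
* Family 1 of the example data: one adult (36) and two children (16, 14)
clear
input family person age
1 1 36
1 2 16
1 3 14
end

* Number of other children, as in section 2
egen nchild = total(age <= 17), by(family)
replace nchild = nchild - (age <= 17)

* Total and mean age of the other children, adults included as recipients
egen totalage = total(age * (age <= 17)), by(family)
replace totalage = totalage - age * (age <= 17)
generate meanage = totalage/nchild

* person 1 (adult): (16 + 14)/2 = 15
* person 2: 14/1 = 14;  person 3: 16/1 = 16
list
```

If a family contained only one child, nchild would be 0 for that child and meanage would be missing, because Stata returns missing for division by zero.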
4. Generic problem: other statistics
What we have done so far hinges rather delicately on two properties of sums:
first, the sum for “everybody else” is just the sum for
“everybody” minus the sum (the value) for this observation; and
second, that the value of a sum is not affected by adding or subtracting 0.
When we turn to other summary statistics, we can no longer rely on these
properties. We need a more general approach.
In broad terms, we need to do the work within a loop:
for each member in the family {
    calculate a statistic from data on the other members of the family
    assign the result to that member of that family
}
5. Specific problem: maximum age of the other children
Let us suppose that we want to know, for each child, the maximum age of
the other children in the same family. Within the loop, we will find
ourselves assigning chunks of values: for that task, we cannot use
generate repeatedly. We can use replace repeatedly, so we need
to generate a variable before we can do that:
. generate maxage = .
Next we need an identifier running from 1 and above to assign to each person
in the family. In our little dataset, there was already such an identifier,
but, if there were not, one could easily be created using by with the sort
option:
. by family, sort: gen pid = _n
. summarize pid
Under by varlist: _n is interpreted within each group
of observations, not for the whole dataset. For this problem, it does not
matter that pid is arbitrary; we just need a systematic way of doing
the calculations in turn for each member of the family. The summarize
shows us the maximum value of pid, which we will need shortly. We
could also pick up the value of the maximum as r(max), which is
important for any automation of the whole process.
Within the loop, we need a way of excluding each value of pid from
the calculation. Here is one way to do it, using forvalues:
. quietly forvalues i = 1/`r(max)' {
. generate include = 1 if pid != `i' & age <= 17
. egen work = max(age * include), by(family)
. replace maxage = work if pid == `i'
. drop include work
. }
The forvalues construct loops over values of the local macro
i, which is set in turn to 1, then to 2, and so on, up to the maximum
of pid as returned by summarize. The macro is automatically
incremented each time through the loop. In practice, most Stata programmers
use the abbreviation forval. Within the loop, the value of i
is referred to as `i'. The generate statement produces a
variable that is 1 if the observation is to be included in the calculation
and missing otherwise. The expression age * include, which is then
fed to egen, max(), is age * 1 or age when
include is 1, and age * . or missing . when
include is missing. What egen, max() does is exclude missings
from the calculation, and, only if all the values in each group are missing,
will the maximum be returned as missing. Although Stata has a general rule
that numeric missing is larger than any other numeric value, it assumes when
calculating maxima that you really want the largest nonmissing value. (See
what happens when you type display max(1,2,_pi,42,.).) We then use
the result of that calculation to replace the maxage value for
the current member of the family. Finally, it is easiest to drop the
variables include and work so that Stata can start afresh next
time around the loop.
Why is this loop not the following code?
. quietly forvalues i = 1/`r(max)' {
. egen work = max(age) if age <= 17 & pid != `i', by(family)
. replace maxage = work if pid == `i'
. drop work
. }
The reason this will not work as desired is that the result of the
egen calculation will be missing for observations excluded by the
if condition, and those include the very observations with pid equal to
`i' to which we want to assign the result. In fact, the result of the loop
is that all values of maxage will be missing.
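The claim is easy to confirm on a small case. The sketch below, for a do-file, runs the flawed loop on family 1 of the example data and then counts the nonmissing values of maxage; the local macro last is a name chosen here for illustration.

```stata
* Family 1 of the example data
clear
input family person age
1 1 36
1 2 16
1 3 14
end

generate maxage = .
by family (person), sort: generate pid = _n
summarize pid, meanonly
local last = r(max)

* The flawed loop: work is missing exactly where pid == `i'
quietly forvalues i = 1/`last' {
    egen work = max(age) if age <= 17 & pid != `i', by(family)
    replace maxage = work if pid == `i'
    drop work
}

* No observation receives a nonmissing value
count if !missing(maxage)
```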
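For each child, there is an older one (strictly, one or more) if
maxage is greater than age,
. generate olderch = maxage > age if age <= 17
Note that a child with no other children in the family has missing maxage,
and missing counts as greater than any age; add & maxage < . to the
expression if such families can occur in your data. We could use a similar
approach to get the minimum age of the other children and thus to determine
whether there are younger children.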
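A minimal sketch of the minimum-age variant follows, run on family 2 of the example data in a do-file. It simply swaps egen, min() for egen, max() in the loop of this section; the names minage, youngerch, and last are hypothetical, chosen by analogy with maxage and olderch.

```stata
* Family 2 of the example data: two adults and three children
clear
input family person age
2 1 45
2 2 42
2 3 14
2 4 12
2 5 10
end

generate minage = .
by family (person), sort: generate pid = _n
summarize pid, meanonly
local last = r(max)

quietly forvalues i = 1/`last' {
    * include is 1 for children other than member `i', missing otherwise
    generate include = 1 if pid != `i' & age <= 17
    egen work = min(age * include), by(family)
    replace minage = work if pid == `i'
    drop include work
}

* 1 if there is at least one younger child, 0 if not, missing for adults
generate youngerch = minage < age if age <= 17
list
```

Here the children aged 14 and 12 have a younger sibling (minage is 10), while the child aged 10 does not (minage is 12). If minage is missing, minage < age is false, so an only child is correctly recorded as having no younger child.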
The same general scheme can be used for other egen functions that
take an expression exp as an argument and allow by() as an
option. See help egen.
6. Specific problem: how many of a person’s own children are in the family?
Consider a family survey in which we do not have direct information about
the number of children of each person. We do have variables for family ID
family and individual ID person and also for father ID
fatherm and mother ID motherm (which are missing if a person’s
mother or father is not a member of the same family). Thus in the example,
   family   person   fatherm   motherm
        1        1         .         .
        1        2         .         .
        1        3         1         2
        1        4         1         2
        1        5         1         2
        2        1         .         .
        2        2         .         1
        2        3         .         2
family 1 includes a couple and three children, all of whom are children of
the same mother and father, whereas family 2 includes a grandmother, her
daughter, and a grandchild—the son or daughter of that daughter.
The problem is to create a variable ownchild giving the number of
each person’s own children living in the family. Thus in family 1,
both parents have three children living with them, whereas in family 2, both
the grandmother and her daughter have one child each living with them.
We first find the number of children of each father and each mother:
. by family fatherm, sort: gen fchild = _N if fatherm < .
. by family motherm, sort: gen mchild = _N if motherm < .
Under by varlist: _N is interpreted within each group
of observations, not for the whole dataset. Now we initialize the variable
to be produced and a variable we will need to produce it. Both can be
byte variables:
. gen byte ownchild = 0
. gen byte ischild = 0
We are going to loop over the values of person within each family.
We can see in the example that these range from 1 to 5, but, more generally,
we can pick up the maximum from summarize, as in the previous
problem:
. summarize person, meanonly
The main loop is like this, which we will look at first and then unpack:
. forval i = 1 / `r(max)' {
. replace ischild = (fatherm == `i') | (motherm == `i')
. #delimit ;
. qui by family (ischild), sort:
. replace ownchild =
. cond(motherm[_N] == `i', mchild[_N], fchild[_N])
. if person == `i' & ischild[_N] ;
. #delimit cr
. }
As we go around the forvalues loop, the local macro i is
varied from 1 to the maximum observed person, which we pick up as
r(max). Here we are capitalizing on the fact that person
takes small integers from 1 and above within each family. Later, we will
look at a method for mapping arbitrary identifiers to this setup, so what
may look like a special case is only one step away from handling any
identifier scheme.
Follow through as we start the loop with `i' and also person
equal to 1. Members of each family are children of this person if he
or she is their father or their mother. forval substitutes 1
for `i':
. replace ischild = (fatherm == 1) | (motherm == 1)
This indicator variable will be 0 (is not a child of 1) or 1 (is a child of
1). For more explanation of indicator variables as showing true or false,
see http://www.stata.com/support/faqs/data-management/true-and-false/.
Within each family, we are going to sort on this variable, so that all the
children of person 1 come at the end of each family. Then we can pick
up the number of children from the other variables in the last observation,
subject to conditions to be mentioned in a moment.
qui by family (ischild), sort:
replace ownchild =
cond(motherm[_N] == `i', mchild[_N], fchild[_N])
if person == `i' & ischild[_N]
This is a lot of information in one statement and is best taken in pieces:
- qui by family (ischild), sort:
We are going to do a replace separately by families (recall that
family is the family identifier). Within each family, we
sort first on ischild so that any children of person
1 go to the end of the family. As always, sort puts lowest values
first, so all values of 0 come before all values of 1 for indicator
variables such as ischild. Also, we do all this quietly,
although that is not essential.
- replace ownchild = ... if person == 1 & ischild[_N]
We are going to replace ownchild but only for observations with
person equal to 1 and only if the last person in the family is a
child of this person. As before, under by
varlist: _N is interpreted within each group defined
by varlist. Hence ischild[_N] is the value for the last
person in each family. (ischild[_N] is a shortcut for
ischild[_N] == 1 as they always evaluate to the same result. For
more, see the FAQ just cited.)
- What are we going to replace ownchild with?
The condition ischild[_N] ensures that we replace values only when
the last observation in each family is a child of the person for
whom person is 1. If that person is the child's mother, we use the
value of mchild; if not, we use the value of fchild:
cond(motherm[_N] == `i', mchild[_N], fchild[_N])
We went through the operations for person equal to 1.
forvalues automatically repeats them for the other values of
person.
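Putting the pieces of this section together, here is a sketch of the whole calculation as it might appear in a do-file, using the example data above. It uses /// line continuation instead of #delimit, which is only a stylistic choice, and the local macro last is a name introduced here for illustration.

```stata
* Example data from this section
clear
input family person fatherm motherm
1 1 . .
1 2 . .
1 3 1 2
1 4 1 2
1 5 1 2
2 1 . .
2 2 . 1
2 3 . 2
end

* Number of children of each father and of each mother
by family fatherm, sort: generate fchild = _N if fatherm < .
by family motherm, sort: generate mchild = _N if motherm < .

generate byte ownchild = 0
generate byte ischild = 0

summarize person, meanonly
local last = r(max)

forvalues i = 1/`last' {
    * flag the children of person `i', then sort them to the end
    quietly replace ischild = (fatherm == `i') | (motherm == `i')
    quietly by family (ischild), sort: replace ownchild =     ///
        cond(motherm[_N] == `i', mchild[_N], fchild[_N])      ///
        if person == `i' & ischild[_N]
}

sort family person
list family person ownchild, sepby(family)
```

In family 1, persons 1 and 2 end up with ownchild equal to 3; in family 2, persons 1 and 2 each have ownchild equal to 1. Everyone else keeps the initial value 0.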
7. Mapping from arbitrary identifiers to integers 1 and above
We have seen that for some problems there is an advantage in using integer
identifiers which run from 1 and above within each group. If such
identifiers do not exist, they can be created, as seen in section 5.
What needs more explanation is how to map arbitrary existing identifiers to
this setup. Suppose that the identifiers were, say,
   family   person   fatherm   motherm
        1     1001         .         .
        1     1002         .         .
        1     1003      1001      1002
        1     1004      1001      1002
        1     1005      1001      1002
        2     2001         .         .
        2     2002         .      2001
        2     2003         .      2002
First, we generate integers from 1 and above as before
. by family (person), sort: gen pid = _n
We need to map fatherm and motherm to consistent identifiers.
We initialize the variables we want
. gen byte fid = .
. gen byte mid = .
Now our main loop is to cycle through the values of pid, which by
construction contains integers 1 and above. We replace fid and
mid by each value as appropriate:
. summarize pid, meanonly
. qui forval i = 1 / `r(max)' {
. #delimit ;
. by family: replace fid = `i'
. if fatherm == person[`i'] & !missing(fatherm) ;
. by family: replace mid = `i'
. if motherm == person[`i'] & !missing(motherm) ;
. #delimit cr
. }
That is, by cycling through all the values of pid, we are also
cycling through all the values of person. Although the example
dataset contains numeric identifiers for person, fatherm, and
motherm, the code is general enough to apply to string identifiers as
well.
Doing this by family: covers the case in which a value of
person is unique for a person within a family but may also be an
identifier for another person in another family. That is, one person may be
person 1 in one family and another person may also be person
1, but in another family. If, on the other hand, person has a unique value
for each person in the dataset, we lose nothing by doing this under
by:, except that it may be a little slower in machine time.
The extra conditions & !missing(fatherm) and &
!missing(motherm) are needed. Why? In the example, family 1 has 5
members and family 2 has 3 members. When the forval loop gets to 4,
we are using the conditions if fatherm == person[4] and if motherm
== person[4]. Under by family: subscripting is interpreted
within groups defined by family, but there is no 4th observation for
family 2. Stata evaluates person[4] as missing in this
circumstance, but we then have a problem in that any values of
fatherm or motherm that are missing will get mapped to 4. To
prevent this mapping, we add the extra condition that the variable in
question must not be missing.
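The mapping can be tried end to end on the example identifiers. The following is a sketch for a do-file that repeats the commands of this section, with /// omitted because each command fits on one line; the local macro last is an illustrative name.

```stata
* Example data with arbitrary identifiers
clear
input family person fatherm motherm
1 1001    .    .
1 1002    .    .
1 1003 1001 1002
1 1004 1001 1002
1 1005 1001 1002
2 2001    .    .
2 2002    . 2001
2 2003    . 2002
end

* Integers from 1 and above within each family
by family (person), sort: generate pid = _n

generate byte fid = .
generate byte mid = .

summarize pid, meanonly
local last = r(max)

quietly forvalues i = 1/`last' {
    * person[`i'] is the i-th person within the current family
    by family: replace fid = `i' if fatherm == person[`i'] & !missing(fatherm)
    by family: replace mid = `i' if motherm == person[`i'] & !missing(motherm)
}

list, sepby(family)
```

After the loop, persons 1003 to 1005 have fid equal to 1 and mid equal to 2, person 2002 has mid equal to 1, and person 2003 has mid equal to 2. Missing fatherm and motherm values stay unmapped, which is what the !missing() conditions are there to guarantee.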
8. Acknowledgment
Thanks to Guillermo Cruces for posing the problem in sections 6 and 7.