The following material is based on a question and answer that appeared on
Statalist.
How do I list observations in a group that differ on a variable?
Title: Listing observations in a group that differ on a variable
Author: Nicholas J. Cox, Durham University, UK
Date: November 2001; Updated September 2012
The problem
I have data on various individuals with genotypes ascertained from
samples taken at different times. I want to list only those
samples with differing genotypes for each individual.
The data are
eid egenotype
0 vv
0 vv
1 vv
1 ww
2 ww
2 vv
2 ww
The solution
The question does not specify whether egenotype is a string variable
or a numeric variable with labels. The solution here applies to both and
also to numeric variables without labels. First, we sort the data on
eid and then on egenotype:
. sort eid egenotype
If all the values of egenotype are the same within an eid, then,
after sorting, the first value within that eid will equal the last. If
there is any variation within eid, this will not be true. This will
work irrespective of the number of observations for each eid, the
number of egenotypes, and the type of variable used. Thus, for
eid 0, the first value vv will equal the last, but, for
eid 1 and 2, the first and last values will differ. The example of
eid 2 also shows why sorting is essential, as at present the first
and third values are both ww, but the middle value is vv.
Accordingly, we work out which groups have different values and then
list those groups only:
. by eid (egenotype), sort: gen diff = egenotype[1] != egenotype[_N]
. list eid egenotype if diff
The by ..., sort: prefix combines sort eid egenotype with an ensuing
by eid: generate statement. Under the protection of by:, subscripts apply
to observations within each group. Thus [1] denotes the first
observation, and [_N] denotes the last observation within each group.
If the corresponding values differ, diff will be 1, and, if they are
the same, diff will be 0. (For more information on this, see
http://www.stata.com/support/faqs/data-management/true-and-false/.) Then the
list is restricted to observations in groups whose values differ.
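The steps above can be put together as a self-contained sketch that enters the example data and runs the solution. It assumes egenotype is a str2 string variable, which the question does not specify; the byte storage type and the sepby() option are illustrative choices rather than part of the original answer.

```stata
clear
* Example data from the question; str2 storage is an assumption
input eid str2 egenotype
0 "vv"
0 "vv"
1 "vv"
1 "ww"
2 "ww"
2 "vv"
2 "ww"
end

* Sort within eid on egenotype, then compare first and last values
by eid (egenotype), sort: gen byte diff = egenotype[1] != egenotype[_N]

* Show only the groups with more than one genotype
list eid egenotype if diff, sepby(eid)
```

With these data, the five observations for eid 1 and eid 2 are listed, and the two observations for eid 0 are not.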
How would this be extended to identifying groups that differ on at least one
of two or more variables? One way would be to use egen. For example,
egen, group() could be used to group values according to one or more
variables, and then the same method could be used on the resulting variable.
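A minimal sketch of that extension follows. The second variable ephenotype, its values, and the names combo and diffany are hypothetical and chosen only for illustration; they do not come from the original question.

```stata
clear
* ephenotype is a hypothetical second variable, invented for this example
input eid str2 egenotype str1 ephenotype
0 "vv" "a"
0 "vv" "a"
1 "vv" "a"
1 "vv" "b"
2 "ww" "a"
2 "ww" "a"
end

* One integer code for each distinct combination of the two variables
egen combo = group(egenotype ephenotype)

* Same first-versus-last comparison, now applied to the combined code
by eid (combo), sort: gen byte diffany = combo[1] != combo[_N]

list eid egenotype ephenotype if diffany, sepby(eid)
```

Here only eid 1 is listed, because its two observations agree on egenotype but differ on ephenotype. By default, group() returns missing for any observation with a missing value in one of the listed variables; its missing option treats missing values as a category of their own, which may or may not be what is wanted.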
The opposite problem: observations with the same values
It should be clear that the opposite problem, finding observations with the
same values, has an essentially similar solution. We could negate the
variable diff above, which would exchange 0s and 1s. Or, starting from
scratch, we could just change the operator from != to ==.
. by eid (egenotype), sort: gen same = egenotype[1] == egenotype[_N]
. list eid egenotype if same
Careful sorting remains essential here. If all the values in a group are
identical, then the first and last values will necessarily be the same, but
the converse does not always follow. The first and last of a group with two
or more distinct values could be identical as a matter of accident in an
unsorted group. So we need sorting within a group to shake different values
apart.
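The accident can be seen with the three observations for eid 2 in their original order. This is a minimal sketch: the names same_unsorted and same_sorted are illustrative, and sort eid, stable is used here only so that by eid: can run while the entered order within the group is preserved.

```stata
clear
* The eid 2 observations in the order given in the question
input eid str2 egenotype
2 "ww"
2 "vv"
2 "ww"
end

* Mark the data as sorted by eid without reordering within the group
sort eid, stable

* No sorting on egenotype: first and last are both ww, so this is 1
by eid: gen byte same_unsorted = egenotype[1] == egenotype[_N]

* Sorting on egenotype within eid: first is vv, last is ww, so this is 0
by eid (egenotype), sort: gen byte same_sorted = egenotype[1] == egenotype[_N]

list
```

The unsorted comparison wrongly reports the group as uniform, and the sorted comparison does not.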