How do I go through the groups of a variable in order of their first
occurrence in the dataset?
Title: Going through groups in order of first occurrence
Author: Nicholas J. Cox, Durham University, UK
Date: September 2005
Suppose that you wish to do something for each of several groups of your
data, but in the order of their first occurrence in the dataset. That
stipulation limits the use of levelsof or egen, group(), which work through
the distinct values in sorted order, regardless of the order in which they
occur in the data. For concreteness, imagine panel data with an identifier
variable id. We want analyses to respect the order of first occurrence of id.
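To see the limitation concretely, here is a minimal sketch on a made-up dataset in which id first occurs in the order 3, 1, 2. The data values and the names levels and g are assumptions for illustration only.

```stata
clear
input id
3
1
3
2
1
end

* levelsof returns the distinct values in sorted order
levelsof id, local(levels)
display "`levels'"

* egen, group() also numbers the groups by sorted value of id
egen g = group(id)
list id g
```

Both commands put id 1 first, although id 3 occurs first in the data.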
Order of occurrence in the data is encapsulated in the set of observation
numbers, so we put those in a variable:
. generate long obs = _n
Now we sort by id, breaking ties by obs. The first
observation in each block, defined by a value of id, then carries
information on first occurrence. We copy the observation number of first
occurrence to each other occurrence of the same id.
. by id (obs), sort: replace obs = obs[1]
Now we number the identifiers from 1 upward, according to first occurrence:
. by obs, sort: gen byte group = _n == 1
. replace group = sum(group)
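The first command flags the first observation of each block with 1 and all
others with 0; the running sum of those flags then numbers the blocks 1, 2,
and so on. Those familiar with egen, group() may recognize the basic idea here.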
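The steps so far can be traced on a small made-up dataset. This is only a sketch, written as a do-file without the dot prompt; the data values are assumptions chosen so that id first occurs in the order 3, 1, 2.

```stata
clear
input id
3
1
3
2
1
end

* record the original order of the observations
generate long obs = _n

* within each id, copy the observation number of first occurrence
by id (obs), sort: replace obs = obs[1]

* flag the first observation of each block, then take the running sum
by obs, sort: generate byte group = _n == 1
replace group = sum(group)

list id obs group, sepby(group)
```

Here id 3 becomes group 1, id 1 becomes group 2, and id 2 becomes group 3, matching the order of first occurrence.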
Now the number of groups is identifiable from
. summarize group, meanonly
. local max = r(max)
Typically, you then loop over the groups:
. forvalues i = 1/`max' {
      something for each group
. }
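As one possible filling-in of the loop, the sketch below summarizes a variable within each group. The dataset, the variable x, and the choice of a group mean as the "something" are all assumptions for illustration. Because local macros are involved, the lines should be run together, for example from a do-file.

```stata
clear
input id x
3 10
1 20
3 30
2 40
1 50
end

* number the groups by first occurrence of id
generate long obs = _n
by id (obs), sort: replace obs = obs[1]
by obs, sort: generate byte group = _n == 1
replace group = sum(group)

* the largest group number is the number of groups
summarize group, meanonly
local max = r(max)

forvalues i = 1/`max' {
    * hypothetical task: count and mean of x within the current group
    summarize x if group == `i', meanonly
    display "group `i': n = " r(N) ", mean of x = " r(mean)
}
```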
There is one common need we should mention. As we cycle over the
groups within the loop, we often wish to display the identifier of the
current group. Recall that we mapped the values of id, in order of their
first occurrence in the data, to the new variable group, which by
construction takes the integer values 1, 2, and so on.
For a numeric identifier, a look-up technique within the loop to get the
current identifier is
. summarize id if group == `i', meanonly
All values of id in each group are the same, so it matters
little whether we pick up the minimum, the mean, or the maximum. Typing
. local which = r(min)
will do, for example. However, for a string identifier, we need to work a
little harder. Outside the loop, before it starts, we must type
. replace obs = _n
Inside the loop, we type
. summarize obs if group == `i', meanonly
. local which = id[`r(min)']
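That is, as id is a string variable, we cannot feed it to summarize. We
must feed the observation numbers to summarize instead so that we can work
out where to look for the identifier string value. The replace before the
loop is needed because obs no longer holds the current observation numbers:
we overwrote it with the observation numbers of first occurrence and have
since re-sorted the data. (Here and in the earlier summarize commands, the
meanonly option makes calculations as fast as possible.)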
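Putting the pieces together for a string identifier, a minimal sketch might look like this. The region names are made-up data, and the lines are meant to be run together from a do-file so that the local macros stay defined.

```stata
clear
input str5 id
"north"
"east"
"north"
"south"
"east"
end

* number the groups by first occurrence of id
generate long obs = _n
by id (obs), sort: replace obs = obs[1]
by obs, sort: generate byte group = _n == 1
replace group = sum(group)

* reset obs to the current observation numbers before the loop
replace obs = _n

summarize group, meanonly
local max = r(max)

forvalues i = 1/`max' {
    * find one observation in the current group, then read id there
    summarize obs if group == `i', meanonly
    local which = id[`r(min)']
    display "group `i' is `which'"
}
```

With these data the loop visits north, east, and south in that order, which is the order of first occurrence rather than alphabetical order.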