
How do I go through the groups of a variable in order of their first occurrence in the dataset?

Title   Going through groups in order of first occurrence
Author Nicholas J. Cox, Durham University, UK

Suppose that you wish to do something for each of several groups of your data, but in the order of their first occurrence in your dataset. That stipulation limits the use of levelsof or egen, group(), both of which work through the distinct values in sorted order and so ignore the current order of observations. For concreteness, imagine panel data with an identifier variable id. We want analyses to respect the order of first occurrence of id.

For example:

     +----+
     | id |    
     |----|
  1. |  5 |
  2. |  1 |
  3. |  4 |
  4. |  1 |
  5. |  1 |
     |----|
  6. |  2 |
  7. |  2 |
  8. |  2 |
  9. |  4 |
 10. |  4 |
     |----|
 11. |  4 |
 12. |  1 |
 13. |  5 |
 14. |  4 |
 15. |  1 |
     +----+

We have variable id in this initial order. We want to go through all the values of id in the order 5, 1, 4, 2.
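
To see why levelsof and egen, group() do not deliver that order, here is a minimal sketch on a shortened, hypothetical version of the listing above (five observations only). The names levels and bygroup are chosen purely for illustration.

```stata
* Shortened, hypothetical version of the example data
clear
input byte id
5
1
4
1
2
end

* levelsof returns the distinct values in sorted order: 1 2 4 5
levelsof id, local(levels)

* egen, group() also numbers the values in sorted order,
* so id 5, which occurs first in the data, gets the highest number
egen long bygroup = group(id)
list id bygroup
```

Neither result reflects the order 5, 1, 4, 2 in which the identifiers first appear.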

Order of occurrence in the data is encapsulated in the set of observation numbers, so we put those in a variable:

        . generate long obs = _n 

Now we sort by id, breaking ties by obs. The first observation in each block, defined by a value of id, then carries information on first occurrence. We copy the observation number of first occurrence to each other occurrence of the same id.

        . by id (obs), sort: replace obs = obs[1] 

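Now we number the identifiers from 1 upward, according to first occurrence. We flag the first observation of each distinct value of obs with 1 (and all others with 0) and then take the running sum of those flags: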

        . by obs, sort: gen byte group = _n == 1
        . replace group = sum(group) 

Those familiar with egen, group() may recognize the basic idea here. Now the number of groups is identifiable from

        . summarize group, meanonly 
        . local max = r(max) 
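
The steps so far can be collected into one short do-file. This is a sketch that re-enters the 15 observations of the listing above with input; the indicator variable first is introduced only to list one line per group and is not part of the method.

```stata
* Rebuild the example data in its original order
clear
input byte id
5
1
4
1
1
2
2
2
4
4
4
1
5
4
1
end

* Record original order, then spread each id's first occurrence to all its rows
generate long obs = _n
by id (obs), sort: replace obs = obs[1]

* Number the ids 1, 2, ... in order of first occurrence
by obs, sort: generate byte group = _n == 1
replace group = sum(group)

* Number of groups
summarize group, meanonly
local max = r(max)

* One line per group: id 5 -> 1, id 1 -> 2, id 4 -> 3, id 2 -> 4
by obs: generate byte first = _n == 1
list id obs group if first, noobs
```

Note that the data end up sorted by obs, so the observations are no longer in their original order.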

Typically, you then loop over the groups:

        . forvalues i = 1/`max' {
                * something for each group
        . }
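
As one possible filling for the loop body, the following sketch counts the observations in each group. It runs on a small, hypothetical dataset (seven observations) rather than the listing above, and the local macro sizes is introduced only to collect the results.

```stata
* Small hypothetical dataset; ids first occur in the order 5, 1, 4, 2
clear
input byte id
5
1
4
1
1
2
5
end

* Number the ids in order of first occurrence
generate long obs = _n
by id (obs), sort: replace obs = obs[1]
by obs, sort: generate byte group = _n == 1
replace group = sum(group)

summarize group, meanonly
local max = r(max)

* Visit the groups in order of first occurrence
local sizes
forvalues i = 1/`max' {
    quietly count if group == `i'
    display "group `i' has " r(N) " observations"
    local sizes `sizes' `r(N)'
}
```

Any command that accepts an if qualifier can take the place of count here.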

There is one common need we should mention. As we cycle over the groups within the loop, we often wish to display the identifier of the current group. Recall that we mapped the values of id, in their order of first occurrence in the data, to the new variable group, which by construction takes on the integer values 1, 2, and so on. For a numeric identifier, a look-up technique within the loop to get the current identifier is

        . summarize id if group == `i', meanonly 

All values of id in each group are the same, so it matters little whether we pick up the minimum, the mean, or the maximum. Typing

        . local which = r(min) 

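will do, for example. However, for a string identifier, we need to work a little harder. At this point, obs holds the observation numbers of first occurrence in the original order, and these no longer point to the right observations because the data have since been sorted. So outside the loop, before it starts, we must reset obs to the current observation numbers by typing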

        . replace obs = _n 

Inside the loop, we type

        . summarize obs if group == `i', meanonly 
        . local which = id[`r(min)'] 

That is, because id is a string variable, we cannot feed it to summarize. Instead, we feed the observation numbers to summarize so that we can work out where to look for the identifier's string value: r(min) is the first observation in the current group, and id[`r(min)'] is the identifier stored there. (Here and in the earlier summarize commands, the meanonly option makes calculations as fast as possible.)
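
Putting the string case together, here is a sketch on a small, hypothetical dataset with a one-character string identifier. The data values and the local macro order are assumptions made for illustration.

```stata
* Small hypothetical dataset; ids first occur in the order e, a, d, b
clear
input str1 id
"e"
"a"
"d"
"a"
"b"
"e"
end

* Number the ids in order of first occurrence
generate long obs = _n
by id (obs), sort: replace obs = obs[1]
by obs, sort: generate byte group = _n == 1
replace group = sum(group)

* obs must hold the current observation numbers before the look-up
replace obs = _n

summarize group, meanonly
local max = r(max)

local order
forvalues i = 1/`max' {
    * locate the first observation of this group, then read its id
    summarize obs if group == `i', meanonly
    local which = id[`r(min)']
    display "group `i' is id `which'"
    local order `order' `which'
}
```

The same look-up through observation numbers also works for a numeric identifier, so it can serve as a single technique for both cases.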