Why does my do-file or ado-file produce different results every time I run it?

Title   Sorting on categorical variables
Author William Gould, StataCorp

The only general answer to this question is that your program has an error in it. There is, however, one common error that even experienced Stata users make:

If you sort on a variable that does not have unique values for every observation in the data and subsequently refer, implicitly or explicitly, to the order within group (say, by referring to _n or _N with by), the results can change from one run of the file to the next.

Consider the following dataset:

 . sort group

 . list

      +-----------------+
      | group   x1   x2 |
      |-----------------|
   1. |     1    5    7 |
   2. |     1    2    6 |
   3. |     1    3    9 |
   4. |     2    1    2 |
   5. |     2    7    4 |
      +-----------------+

The first value of x1 in the first group is 5. Now let us jumble up these data (we will sort on x1) and then sort the data again by group:

 . sort x1

 . sort group

 . list

      +-----------------+
      | group   x1   x2 |
      |-----------------|
   1. |     1    3    9 |
   2. |     1    5    7 |
   3. |     1    2    6 |
   4. |     2    1    2 |
   5. |     2    7    4 |
      +-----------------+

Before, the first value of x1 in the first group was 5; now it is 3. Why the change? The variable group takes on repeated values across observations, and we typed sort group without saying how the data should be ordered within group. Since we did not specify, Stata chose an order at random.
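One way to remove the ambiguity in the example above is sketched here. It re-enters the five observations with input and then shows two approaches: adding x1 as a tie-breaker, which works because group and x1 together happen to be unique in this small dataset, and using sort's stable option, which keeps the existing order within tied values. This is an illustrative sketch, not the only remedy.

```stata
clear
input group x1 x2
1 5 7
1 2 6
1 3 9
2 1 2
2 7 4
end

* Option 1: add a tie-breaker so the sort key is unique.
* (group, x1) identifies every observation in this small dataset.
sort x1
sort group x1
list

* Option 2: keep the existing order within ties.
* After sort x1, a stable sort on group leaves x1 ascending within group.
sort x1
sort group, stable
list
```

With either approach, the first value of x1 in the first group is the same every time the commands are run.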

People have sent us do-files that contain

        ...
        sort patid
        quietly by patid: keep if age[1]>20
        ...

The intent of these lines was to select patients who were older than 20, but that is not what the user got. Moreover, the user got a different sample every time the do-file was run. The problem was that patid took on repeated values, so saying sort patid was not enough to specify what the order should be within patid. The user meant to code

        ...
        sort patid age
        quietly by patid: keep if age[1]>20
        ...

or

        ...
        sort patid time
        quietly by patid: keep if age[1]>20
        ...
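The second form can be guarded with a uniqueness check before the selection runs. The sketch below uses a small made-up dataset (the values are hypothetical, chosen only for illustration) and assumes that time is meant to be unique within patid. The isid command stops with an error if the listed variables do not jointly identify each observation, and bysort patid (time): sorts on patid and time while forming the by-groups on patid alone.

```stata
clear
* Hypothetical data: two patients, two visits each
input patid time age
1 1 19
1 2 21
2 1 30
2 2 31
end

* Stop with an error unless patid and time jointly identify each observation
isid patid time

* Sort by patid and time, group by patid only,
* and keep patients whose age at the first visit exceeds 20
bysort patid (time): keep if age[1] > 20
list
```

In this made-up dataset, only the records for patient 2 remain, because patient 1 was 19 at the first visit.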

Now suppose that, rather than keeping all of a patient's records, we wanted to keep just the first record for each patient.

        ...
        sort patid age
        quietly by patid: keep if _n==1
        ...

Sorting on both patid and age might not be sufficient because each patient might have multiple records with the same age. We would then be selecting one record at random from among each patient's records with the lowest age. If our data included a variable time, and time was unique within patient,

        ...
        sort patid time
        quietly by patid: keep if _n==1
        ...

would be better.
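Whether a given sort key leaves ties can be checked before it is relied on. The sketch below, again on made-up values, uses capture with isid to test whether patid and age identify each record, and then falls back on patid and time. The dataset and the choice of time as the tie-breaker are assumptions for illustration.

```stata
clear
* Hypothetical data: patient 1 has two records at the same age
input patid age time
1 25 2
1 25 1
1 26 3
2 40 1
end

* isid fails here because patid and age do not identify each record;
* capture suppresses the error and leaves a nonzero return code in _rc
capture isid patid age
display _rc

* patid and time do identify each record, so this check passes
isid patid time

* Keep each patient's first record in time order
bysort patid (time): keep if _n == 1
list
```

Here the record kept for patient 1 is the one with time equal to 1, and it is the same record on every run.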

In other words, be careful. There is nothing wrong with sorting on categorical variables by themselves (sort patid and sort group are both fine); just do not assume that the order within the grouping variable is determined or reproducible. Be especially careful when selecting observations within groups.