|Title||Sorting on categorical variables|
|Author||William Gould, StataCorp|
There is really no general answer to this question other than your program has an error in it. There is, however, one common error even experienced Stata users make:
If you sort on a variable that does not have unique values for every observation in the data and subsequently refer, implicitly or explicitly, to the order within group (say, by referring to _n or _N with by), the results can change from one run of the file to the next.
Consider the following dataset:
    . sort group
    . list

         +-----------------+
         | group   x1   x2 |
         |-----------------|
      1. |     1    5    7 |
      2. |     1    2    6 |
      3. |     1    3    9 |
      4. |     2    1    2 |
      5. |     2    7    4 |
         +-----------------+
The first value of x1 in the first group is 5. Now let us jumble up these data (we will sort on x1) and then sort the data again by group:
    . sort x1
    . sort group
    . list

         +-----------------+
         | group   x1   x2 |
         |-----------------|
      1. |     1    3    9 |
      2. |     1    5    7 |
      3. |     1    2    6 |
      4. |     2    1    2 |
      5. |     2    7    4 |
         +-----------------+
Before, the first value of x1 in the first group was 5; now it is 3. Why the change? Because group takes on repeated values across observations, we said sort group, and we did not say how the data should be sorted within group. Because we did not specify, Stata chose an order at random.
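A minimal sketch of one way to guard against this, using the five observations listed above. It assumes that group and x1 together identify each observation, which happens to be true here; the isid check stops with an error if that assumption ever fails.

```stata
clear
input group x1 x2
1 5 7
1 2 6
1 3 9
2 1 2
2 7 4
end

* group alone does not identify observations, so isid fails
* and capture stores a nonzero return code in _rc
capture isid group
display "return code from isid group: " _rc

* group and x1 together are unique in these data, so this sort
* leaves no ties and the order is the same on every run
sort group x1
isid group x1
list
```

With the order fully specified, jumbling the data first (for example, with sort x2) and then repeating sort group x1 returns the same listing.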
People have sent us do-files that contain
    ...
    sort patid
    quietly by patid: keep if age[1]>20
    ...
The intent of these lines was to select patients who were at least age 20, but that is not what the user got and, moreover, the user got a different sample every time he ran the do-file. The problem was that patid took on repeated values, so saying sort patid was not enough to specify what the order should be within patid, and age[1] could come from any of a patient's records. The user meant to code
    ...
    sort patid age
    quietly by patid: keep if age[1]>20
    ...
or
    ...
    sort patid time
    quietly by patid: keep if age[1]>20
    ...
Now, pretend that, rather than keeping all patient records, we wanted to just keep the first record.
    ...
    sort patid age
    quietly by patid: keep if _n==1
    ...
Sorting on both patid and age might not be sufficient because each patient might have multiple records with the same age. We would be selecting one record at random from the earliest records for each patient. If our data included variable time and time was unique within patient,
    ...
    sort patid time
    quietly by patid: keep if _n==1
    ...
would be better.
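As a sketch of the same idea with a built-in safety check: the data values below are hypothetical and chosen so that patient 1 has two records with the same age. The isid line verifies that time really is unique within patid before any records are dropped, and bysort with time in parentheses sorts on patid and time while forming groups on patid only.

```stata
clear
* Hypothetical records; patient 1 has a tie on age (30 appears twice)
input patid time age
1 1 30
1 2 30
1 3 31
2 1 19
2 2 20
end

* Stop with an error unless (patid, time) identifies every record
isid patid time

* Sort by patid and time, group by patid, keep the earliest record
bysort patid (time): keep if _n == 1

* One record per patient remains
assert _N == 2
list
```

If isid reports duplicates, the sort keys do not yet determine which record is first, and another variable is needed to break the tie.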
These kinds of problems can be very subtle. For instance, consider
    ...
    use mydata
    logistic outcome x1 x2
    predict p
    sort p
    gen group = group(10)
    ...
The intent of the above code was to categorize observations into 10 equal-size groups based on their predicted probabilities of a positive outcome. The code does that but does not necessarily put the same observations in the same groups every time it is run. If x1 and x2 were categorical variables, then some people would have the same predicted probability because they have the same values of x1 and x2. Thus sort p will put the observations in predicted-probability order, but the order within p will be random.
Stata will not produce the same result every time the above do-file is run. Some statistical packages leave tied observations (observations with the same value of the sort key) in the same order that they appeared in the original dataset, but Stata does not. There is nothing wrong with either approach, but there is a slight advantage to Stata's reordering.
This logistic-regression example is real, and the user first implemented it using another package and later switched to Stata. The other package did not reorder tied observations, with the result that the user did not know there was a problem. He could run the problem over and over and always obtain the same results. When he converted the problem to Stata, his first question was "What was wrong with Stata?" Nothing was wrong with Stata. Something was wrong with his procedure—and he just did not know it. Thinking about that, the user realized he had to modify his grouping procedure.
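The FAQ does not say how the user changed his procedure. The following is only a sketch of two possible approaches, not his actual fix. The variables id, grpA, and grpB and the values of p are hypothetical, with p built by hand to contain many ties rather than taken from a fitted model.

```stata
clear
set obs 20
generate id = _n

* Hypothetical predicted probabilities with only four distinct values
generate p = mod(_n, 4) / 4

* Option A: break ties with a unique identifier, then cut the sorted
* data into 10 equal-size groups; membership is now reproducible
isid id
sort p id
generate grpA = ceil(10 * _n / _N)

* Option B: xtile places tied values of p in the same category,
* so the groups may no longer be equal in size
xtile grpB = p, nquantiles(10)

tabulate grpA
tabulate grpB
```

Option A keeps the groups equal in size but splits tied observations across groups by id, which is reproducible but arbitrary. Option B keeps tied observations together. Which is appropriate depends on what the groups are used for.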
In other words, be careful. There is nothing wrong with sorting on categorical variables by themselves (sort patid and sort group, for example); just do not assume that the order within the grouping variable is fixed. Be especially careful when selecting observations within groups.