Why does my do-file or ado-file produce different results every time I run it?
Title: Sorting on categorical variables
Author: William Gould, StataCorp
Date: February 2000
There is really no general answer to this question other than that your
program has an error in it. There is, however, one common error that even
experienced Stata users make:
If you sort on a variable that does not have unique values for every
observation in the data and subsequently refer, implicitly or explicitly, to
the order within group (say, by referring to _n or _N with
by), the results will vary every time you run the file.
Consider the following dataset:
. sort group
. list
+-----------------+
| group x1 x2 |
|-----------------|
1. | 1 5 7 |
2. | 1 2 6 |
3. | 1 3 9 |
4. | 2 1 2 |
5. | 2 7 4 |
+-----------------+
The first value of x1 in the first group is 5. Now let us
jumble up these data (we will sort on x1) and then sort the data
again by group:
. sort x1
. sort group
. list
+-----------------+
| group x1 x2 |
|-----------------|
1. | 1 3 9 |
2. | 1 5 7 |
3. | 1 2 6 |
4. | 2 1 2 |
5. | 2 7 4 |
+-----------------+
Before, the first value of x1 in the first group was 5; now it is 3.
Why the change? The variable group takes on repeated values across
observations, and we said sort group without saying how the data
should be sorted within group. Since we did not specify, Stata
chose an order at random.
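One way to guard against this, sketched here on the same five observations, is to check whether the sort key identifies observations uniquely and, if it does not, to add keys until it does. The data are re-entered with input so the sketch runs on its own. isid exits with an error when the listed variables do not uniquely identify the observations.

```stata
clear
input group x1 x2
1 5 7
1 2 6
1 3 9
2 1 2
2 7 4
end

* group alone does not identify observations, so isid fails;
* capture traps the error and leaves a nonzero return code in _rc
capture isid group
display _rc

* group and x1 together are unique in these data, so this passes
isid group x1

* sorting on both keys leaves no ties, so the order is fully determined
sort group x1
list
```

With both keys in the sort, the first observation in group 1 is the same on every run.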
People have sent us do-files that contain
...
sort patid
quietly by patid: keep if age[1]>20
...
The intent of these lines was to select patients who were older than 20,
but that is not what the user got and, moreover, the user got a different
sample every time he ran the do-file. The problem was that patid
took on repeated values, so saying sort patid was not enough to
specify what the order should be within patid. The user meant to
code
...
sort patid age
quietly by patid: keep if age[1]>20
...
or
...
sort patid time
quietly by patid: keep if age[1]>20
...
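The corrected selection can be sketched on a small hypothetical dataset (the values below are invented for illustration). Putting the second key in parentheses after bysort sorts on patid and age but forms by-groups on patid only, so age[1] is each patient's youngest recorded age. An alternative that does not depend on order at all is to compute the per-patient minimum with egen. The variable name minage is an assumption of this sketch.

```stata
clear
input patid time age
1 2 25
1 1 24
1 3 25
2 1 19
2 2 20
end

* Order-dependent version, made safe by sorting on age within patid
preserve
bysort patid (age): keep if age[1] > 20
list
restore

* Order-independent version: compare against the per-patient minimum
bysort patid: egen minage = min(age)
keep if minage > 20
list
```

Both versions keep the three records of patient 1 and drop patient 2, whose youngest recorded age is 19.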
Now, pretend that, rather than keeping all patient records, we wanted to
keep just the first record for each patient.
...
sort patid age
quietly by patid: keep if _n==1
...
Sorting on both patid and age might not be sufficient because
each patient might have multiple records with the same age. We would be
selecting one record at random from the earliest records for each patient.
If our data included variable time and time was unique within
patient,
...
sort patid time
quietly by patid: keep if _n==1
...
would be better.
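The claim that time is unique within patient can be checked rather than assumed. A minimal sketch on hypothetical data (values invented for illustration): isid stops the do-file with an error if patid and time do not jointly identify each record, and only then is the first record selected.

```stata
clear
input patid time age
1 2 25
1 1 24
1 3 25
2 1 19
2 2 20
end

* Exit with an error unless patid and time jointly identify each record
isid patid time

* Sort on patid and time, form by-groups on patid, keep the earliest record
bysort patid (time): keep if _n == 1
list
```

Because the check precedes the selection, a dataset with duplicate patid-time pairs fails loudly instead of yielding a record chosen at random.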
These kinds of problems can be very subtle. For instance, consider
...
use mydata
logistic outcome x1 x2
predict p
sort p
gen group = group(10)
...
The intent of the above code was to categorize observations into 10
equal-size groups based on their predicted probabilities of a positive
outcome. The code does that but does not necessarily put the same
observations in the same groups every time it is run. If x1 and
x2 were categorical variables, then some people would have the same
predicted probability because they have the same values of x1 and
x2. Thus sort p will put the observations in
predicted-probability order, but the order within p will be random.
Stata will not produce the same result every time the above do-file is run.
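One possible repair is to break ties in p with a variable that is unique for every observation. The sketch below uses the auto dataset shipped with Stata in place of mydata, with foreign, mpg, and make standing in for the variables of the original example. It also replaces the group() call with explicit arithmetic on _n, and the name pgroup is an assumption of this sketch.

```stata
sysuse auto, clear
logistic foreign mpg
predict p

* make is the car's name; confirm it identifies observations uniquely
isid make

* Break ties in p with make so the order is fully determined
sort p make

* Assign 10 groups of nearly equal size from the now-fixed order
gen byte pgroup = ceil(10 * _n / _N)
tabulate pgroup
```

This makes the assignment repeatable, but tied observations are still split between groups by an arbitrary rule (here, alphabetical order of make). Whether that is acceptable depends on the analysis.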
Some statistical packages leave tied observations (observations with the
same value of the sort key) in the same order that they appeared in the
original dataset, but Stata does not. There is nothing wrong with either
approach, but Stata's reordering has a slight advantage: it exposes code
that silently depends on an unspecified order.
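For comparison, Stata's sort accepts a stable option that keeps tied observations in their existing relative order, which is the behavior of those other packages. A minimal sketch on the five-observation dataset shown earlier:

```stata
clear
input group x1 x2
1 5 7
1 2 6
1 3 9
2 1 2
2 7 4
end

* x1 has no ties, so this first sort is fully determined
sort x1

* stable keeps observations with equal group in their current order,
* so within each group the data remain sorted by x1
sort group, stable
list
```

The result is repeatable, but it still depends on whatever order the data were in before the sort, so naming every key explicitly is usually the clearer fix.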
This logistic-regression example is real, and the user first implemented it
using another package and later switched to Stata. The other package did
not reorder tied observations, with the result that the user did not know
there was a problem. He could run the problem over and over and always
obtain the same results. When he converted the problem to Stata, his first
question was "What was wrong with Stata?" Nothing was wrong with
Stata. Something was wrong with his procedure—and he just did not
know it. Thinking about that, the user realized he had to modify his
grouping procedure.
In other words, be careful. There is nothing wrong with sorting on
categorical variables by themselves—sort patid and sort
group—just do not assume that the order within the grouping
variable is unique. Be especially careful when selecting observations
within groups.