The Stata listserver

Re: st: Problem with seed and bootstrap

From (William Gould, Stata)
Subject   Re: st: Problem with seed and bootstrap
Date   Mon, 19 Sep 2005 10:53:22 -0500

Svend Juul <SJ@SOCI.AU.DK> writes 

> Imagine that -sort ... , stable- was the default, but that you could
> avoid it with an -unstable- option. Can anybody imagine a situation
> where a user would benefit from the -unstable- option?

My reply is that (1) -sort, stable- does consume more computer time,
and (2) -sort, unstable- uncovers bugs.

(2) requires some explanation.  We write down formulas all the time 
that state the minimum assumptions necessary to carry forth a 
calculation.  Such a formula might go, 

     1.  Make the following calculation within group

                t_i = ...

     2.  Then sum t_i to obtain the test statistic.

I know of a researcher (who shall remain nameless) who wrote down a
calculation exactly like that.  He did simulations, too, not using Stata, and it
worked well.  I cannot remember whether the paper actually made it to print,
but if not, it was on the way.

We were implementing this same test statistic in Stata at the researcher's
request.  He gave us some datasets and certified answers.  We wrote our
program and discovered that sometimes we got the claimed answer, and sometimes
we did not.  When we examined our code, we discovered that we got the "right"
answer if we added -stable- to -sort- at the "Make the following calculation
within group" step.

Problem was, nobody had noticed that the t_i calculation was not deterministic:
not the original author, not reviewers, and not his test runs.  Even so, the
formula was a function of within-group sort order.  This led to a
reconsideration, and an improvement.

Let me add that I have other, less dramatic examples of the benefits of 
unstable.  In those less dramatic cases, it was not the formula that 
was wrong; it was our code.

In a programming or procedural language, one states what is to be done.
A good language makes the assumptions obvious.  When I code 

       . sort group

I mean the data are to be sorted by group, and not, say, by time within 
group.  (If the data happened to be sorted by time beforehand, 
then a stable sort would yield a dataset sorted by group and time, and 
that kind of behavior is what leads to undiscovered bugs.)
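
The accidental ordering described in the parenthesis can be sketched as follows, again in Python, whose built-in sort is stable. The rows and variable names are hypothetical. Sorting by group alone reproduces the group-and-time order only when the input happened to be in time order already; the same request on differently ordered input does not.

```python
# (group, time) rows that happen to be in time order already.
rows = [("b", 1), ("a", 2), ("b", 3), ("a", 4)]

# What was asked for: sort by group only (Python's sort is stable).
by_group_only = sorted(rows, key=lambda r: r[0])

# What later code may silently rely on: sorted by group, then time.
by_group_time = sorted(rows, key=lambda r: (r[0], r[1]))

# True, but only because the input was time-ordered to begin with.
accidental = by_group_only == by_group_time

# The same request on input that is not in time order.
rows_reordered = [("a", 4), ("b", 3), ("a", 2), ("b", 1)]
by_group_only_2 = sorted(rows_reordered, key=lambda r: r[0])

# False: within-group time order was never requested, so it is not there.
still_time_ordered = by_group_only_2 == by_group_time

print(accidental, still_time_ordered)
```

If the calculation needs time order within group, the sort key should say so, as by_group_time does.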

I have not been paying adequate attention to this thread, but I 
gather that for someone, -sort- was creating the problem and option 
-stable- solved it.

My first question, on hearing that, is "Why is that?".  What hidden 
assumption is lying around?  What is it that the user really needs 
to code?  When I type -sort group-, I should not be making assumptions about 
how the data are already sorted.

Alejandro's problem may well be that the formula he is bootstrapping is not
deterministic and, if so, that should bother him.  At this point, all we really
know is that -stable- solved the reproducibility problem.  There is one thing,
however, that I can guarantee:  there is a program bug or a substantive error,
and -stable- is covering it up.
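
One way to look for such a hidden dependence, sketched in Python under stated assumptions: for a small dataset, recompute the statistic under every within-group reordering and see whether the answer changes. The helper names and both example statistics (lag_based, min_based) are hypothetical and are not Alejandro's formula. Integer data are used so that floating-point summation order does not blur the comparison.

```python
from itertools import groupby, permutations, product

def within_group_orderings(rows):
    """Yield every reordering of rows that keeps each group's block in place."""
    blocks = [list(members) for _, members in groupby(rows, key=lambda r: r[0])]
    for choice in product(*(permutations(b) for b in blocks)):
        yield [row for block in choice for row in block]

def is_order_sensitive(statistic, rows):
    """True if the statistic takes more than one value across the orderings."""
    values = {statistic(ordering) for ordering in within_group_orderings(rows)}
    return len(values) > 1

def lag_based(rows):
    """Sum of |x_i - x_(i-1)| within group: depends on within-group order."""
    total = 0
    for _, members in groupby(rows, key=lambda r: r[0]):
        xs = [x for _, x in members]
        total += sum(abs(b - a) for a, b in zip(xs, xs[1:]))
    return total

def min_based(rows):
    """Sum of (x_i - group minimum): does not depend on within-group order."""
    total = 0
    for _, members in groupby(rows, key=lambda r: r[0]):
        xs = [x for _, x in members]
        total += sum(x - min(xs) for x in xs)
    return total

data = [(1, 10), (1, 20), (1, 40), (2, 5), (2, 7)]
print(is_order_sensitive(lag_based, data))  # True
print(is_order_sensitive(min_based, data))  # False
```

Exhaustive enumeration is only practical for tiny groups; with real data one would try a handful of random within-group shuffles instead. Either way, a statistic that fails this check needs its intended ordering stated explicitly rather than preserved by a stable sort.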

-- Bill