
Re: st: bootstrap command -- cluster and strata options


From   Stas Kolenikov <[email protected]>
To   [email protected]
Subject   Re: st: bootstrap command -- cluster and strata options
Date   Fri, 16 Jul 2004 17:50:01 -0400 (EDT)

see comments below

> I am trying to understand what the "cluster" and "strata" options do on
> -bootstrap-.  I may be misinterpreting the manual with respect to what
> these options do, because when I gin up a dataset for which I think I
> know what the result should be, the Stata answer doesn't seem to be what
> I expected.
>
> Basically, I set up a data set which is drawn from two distributions --
> 1000 observations from a uniform distribution from 0 to 100 and 1000
> observations from a uniform distribution from 0 to 1000.  "Score" is the
> value, group is a "1" or "2" indicating whether it was drawn from the
> U(0,100) or U(0,1000) distribution, and id is a unique identifier.
>
> I am interested in sampling by "group" so tried both the -cluster- and
> -strata- options (only the cluster option shown below -- but both
> produce results I did not expect).  Specifically, I would like Stata,
> each time it samples, to draw from only group 1 or only group 2
> (i.e., not mix a group 1 value with a group 2 value).  I am interested
> in the 95th percentile values that result from the exercise.  I would
> expect the -saving(bsout)- output from this command to contain a value
> close to 95 half  of the time and close to 950 the remainder of the
> time.  This would be true if Stata were consistently sampling from the
> U(0,100) half of the time and the U(0,1000) the remaining half.  I used
> the following command (output follows) :

What Stata can do for you is either resample whole groups (with the
-cluster(group)- option) or resample observations within each group
(with the -strata(group) cluster(id)- combination). That makes good sense
for survey data, where the only truly independent pieces of the sample are
the clusters identified by -cluster- (or -psu- in the -svy- context).
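
The difference between the two schemes can be seen by counting how many
group 1 observations end up in a single resample. The following is only a
minimal sketch: the simulated data mimic the setup described in the
question, and the seed, the scalar names, and the use of -bsample- (rather
than -bootstrap-) and of runiform() (the current name of uniform()) are
assumptions made for illustration.

```stata
* Simulated data resembling the question's setup (the seed is arbitrary)
clear
set seed 20040716
set obs 2000
generate id    = _n
generate group = cond(_n <= 1000, 1, 2)
generate score = cond(group == 1, 100, 1000) * runiform()

* Cluster resampling: whole groups are drawn with replacement, so a
* resample holds 0, 1000, or 2000 observations from group 1
preserve
bsample, cluster(group)
quietly count if group == 1
scalar n1_cluster = r(N)
restore

* Stratified resampling: observations are drawn within each group, so
* every resample holds exactly 1000 observations from each group
preserve
bsample, strata(group)
quietly count if group == 1
scalar n1_strata = r(N)
restore

display "group 1 obs with cluster(group): " n1_cluster
display "group 1 obs with strata(group):  " n1_strata
```

Neither scheme picks one group at random and then resamples within it,
which is what the question asks for.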

What you are asking for cannot be resolved within the framework of
existing commands. What you would need to do is to write your own piece of
code involving -post- that will do something like this:

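* Open the results file; each post adds one bootstrap replicate
postfile topost p95 using bsout, replace
* Keep a copy of the full data so that each replicate starts from it
preserve
forvalues k=1/500 {
  * Reload the full data and keep the preserved copy for the next pass
  restore, preserve
  * Pick one group at random, with probability 1/2 each
  if uniform()<0.5 keep if group==1
  else keep if group==2
  * Resample with replacement within the chosen group
  bsample
  sum score, d
  post topost (r(p95))
}
postclose topost
* Bring back the original data
restore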

Make sure you understand every step and every piece of those commands.

See also comments below on the 95th percentile.

> Two things:  (1) I see the values close to 95 and 950 which I expected,
> but also see a 899.03 which I don't expect if Stata is consistently
> drawing from either the U(0,100) or the U(0, 1000) distributions for any
> given trial; and (2) when Stata draws, it consistently gets the exact
> same value for the 95th percentile -- I would expect it to vary
> somewhat.

As Michael Blasnik explained, you are resampling whole clusters of
U(0,100) or U(0,1000) observations. With two clusters in the data, each
bootstrap sample consists of two cluster draws. If both draws are the
first group, you end up with something close to 95 (which is the 95th
percentile of that piece). If both are the second group, you get about
950. If you get one of each, then your 95th percentile is close to 900,
which is the appropriate 95th percentile for the mixture
0.5*U(0,100)+0.5*U(0,1000). Only these three resamples are possible, which
is why the same values come up again and again.
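
For the mixture, the distribution function between 100 and 1000 is
0.5 + x/2000, and setting it to 0.95 gives x = 900. With two clusters drawn
with replacement, the three possible resamples have probabilities 1/4
(both group 1), 1/4 (both group 2), and 1/2 (one of each). A minimal
sketch that computes the three corresponding values on simulated data
follows; the data, the seed, and the scalar names are assumptions for
illustration, and runiform() is the current name of uniform().

```stata
* Simulated data resembling the question's setup (the seed is arbitrary)
clear
set seed 20040716
set obs 2000
generate group = cond(_n <= 1000, 1, 2)
generate score = cond(group == 1, 100, 1000) * runiform()

* Both cluster draws are group 1: group 1 appears twice in the resample
quietly summarize score if group == 1, detail
scalar p95_g1 = r(p95)

* Both cluster draws are group 2
quietly summarize score if group == 2, detail
scalar p95_g2 = r(p95)

* One draw of each group: the resample is the full dataset
quietly summarize score, detail
scalar p95_mix = r(p95)

display "both group 1: " p95_g1
display "both group 2: " p95_g2
display "one of each:  " p95_mix
```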

Even if you did specify things correctly, you would not get enough
variability in your extreme percentiles. That's an intrinsic feature of
the bootstrap as a method, and the only way to overcome it is not to be so
greedy and to use only a portion of the data for each bootstrap sample,
say 10% of your data (provided you have enough to begin with). Otherwise,
your sample maximum will show up in a fraction of about 1-exp(-1), or
0.63, of the bootstrap samples regardless of what the tail of your
empirical distribution looks like, and the next largest observations will
have their probabilities predetermined, too. Hence, the -bsample- command
in the code above should read

-bsample int(0.1*_N)-
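
The 1-exp(-1) figure can be checked by simulation. The sketch below counts
how often the full-sample maximum reappears in a full-size resample and in
a 10% resample. The simulated U(0,100) data, the seed, the 500
replications, and the scalar names are assumptions for illustration, and
runiform() is the current name of uniform().

```stata
* Simulated sample of 1000 observations (the seed is arbitrary)
clear
set seed 20040716
set obs 1000
generate score = 100 * runiform()
quietly summarize score
scalar fullmax = r(max)

scalar hits_full = 0
scalar hits_sub  = 0
preserve
forvalues k = 1/500 {
    * Full-size resample: n draws out of n
    restore, preserve
    bsample
    quietly summarize score
    if r(max) == scalar(fullmax) {
        scalar hits_full = hits_full + 1
    }

    * 10% resample: n/10 draws out of n
    restore, preserve
    bsample int(0.1*_N)
    quietly summarize score
    if r(max) == scalar(fullmax) {
        scalar hits_sub = hits_sub + 1
    }
}
restore

scalar share_full = hits_full/500
scalar share_sub  = hits_sub/500
display "share with sample max, full-size resamples: " share_full
display "share with sample max, 10% resamples:       " share_sub
```

The first share should come out near 0.63 and the second near
1-(1-1/1000)^100, or roughly 0.10, so the smaller resamples are much less
dominated by the single largest observation.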

If that leaves you with a sample size of 50 for your real data
application, then it means the bootstrap cannot give you a consistent
answer. In other words, you cannot twist it in any way to get a consistent
estimator of the distribution of that percentile.

At any rate, you need to construct a pivotal statistic to bootstrap and
use for inference. For the maximum of a U(0,A) distribution, this would be
something like n(full sample max - subsample max)/full sample max, which
should have an asymptotic exponential distribution, so as to resemble the
asymptotic distribution of n(A - sample max)/A. Any decent book on the
bootstrap should have a discussion of this. I have Davison & Hinkley at
hand; they talk about this stuff in the middle of Chapter 2.
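
The exponential limit of n(A - sample max)/A can be verified directly. A
short derivation, assuming X_1, ..., X_n are i.i.d. U(0,A) and writing M_n
for the sample maximum (M_n and t are notation introduced here), for fixed
t >= 0 and n >= t:

```latex
\begin{aligned}
P\!\left(\frac{n\,(A - M_n)}{A} > t\right)
  &= P\!\left(M_n < A\left(1 - \frac{t}{n}\right)\right) \\
  &= \prod_{i=1}^{n} P\!\left(X_i < A\left(1 - \frac{t}{n}\right)\right)
   = \left(1 - \frac{t}{n}\right)^{n}
   \;\longrightarrow\; e^{-t} \quad (n \to \infty).
\end{aligned}
```

The limit is the survival function of the exponential distribution with
mean 1, and it does not depend on A, which is what makes the statistic
pivotal.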

If you are in the Triangle area, you should talk to Ed Carlstein from our
department about this stuff. His class on subsampling is probably the best
one I took at UNC Statistics. The bootstrap is tricky, and it only works
really well for estimating the distribution of the mean of a sample of
i.i.d. observations... which is reasonably well known, anyway :)

I know EPA would have an interest in high percentiles because that's
how many environmental regulations are written. Those percentiles and
their sampling distributions are more reliably estimated with extreme
value theory... and the appropriate references on that are Ross Leadbetter
and Richard Smith from our department.

 ---                                    Stas Kolenikov
 --       Ph.D. student in Statistics at UNC-Chapel Hill
 - http://www.komkon.org/~tacik/  -- [email protected]

* This e-mail and all attachments to it are not intended to provide any
* reasonable point of view and was transmitted to you in error. It
* should be immediately deleted by all recipients unless they really
* enjoy communicating with the author :). Other restrictions apply.



