
st: bootstrap command -- cluster and strata options


From   [email protected]
To   [email protected]
Subject   st: bootstrap command -- cluster and strata options
Date   Wed, 14 Jul 2004 10:14:22 -0400



Dear Statalisters:

I am trying to understand what the "cluster" and "strata" options do on
-bootstrap-.  I may be misinterpreting the manual with respect to what
these options do, because when I construct a dataset for which I think I
know what the result should be, the Stata answer is not what I expected.

Basically, I set up a dataset drawn from two distributions: 1000
observations from a uniform distribution on 0 to 100 and 1000
observations from a uniform distribution on 0 to 1000.  "score" is the
value, "group" is a 1 or 2 indicating whether the value was drawn from
the U(0,100) or the U(0,1000) distribution, and "id" is a unique
identifier.  The final dataset description and summary are as follows:

Contains data from D:\scorestrata.dta
  obs:         2,000
 vars:             3                          14 Jul 2004 07:17
 size:        26,000 (97.5% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
id              float  %9.0g
group           byte   %9.0g
score           float  %9.0g
-------------------------------------------------------------------------------
Sorted by:  group
     Note:  dataset has changed since last saved

. summarize score, detail

                            score
-------------------------------------------------------------
      Percentiles      Smallest
 1%         1.52            .01
 5%        8.695            .08
10%        19.22             .1       Obs                2000
25%        46.35            .12       Sum of Wgt.        2000

50%       91.035                      Mean           273.2597
                        Largest       Std. Dev.      302.6721
75%       476.29         997.14
90%       806.22         997.52       Variance        91610.4
95%       899.03         999.48       Skewness       1.018213
99%       976.93          999.9       Kurtosis       2.590768
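
For concreteness, a dataset of this kind could be set up roughly as
follows.  This is only a sketch: the seed and the use of uniform() are
illustrative assumptions, not the commands actually used, and it will
not reproduce the exact values shown above.

```stata
* Sketch only: build 2,000 observations, 1,000 per group.
* The seed value is an arbitrary placeholder.
clear
set obs 2000
set seed 12345

* Unique identifier and group indicator (1 = U(0,100), 2 = U(0,1000))
generate id = _n
generate byte group = cond(_n <= 1000, 1, 2)

* Draw the score from the distribution that matches the group
generate score = cond(group == 1, 100*uniform(), 1000*uniform())

sort group
```

After running this, -describe- and -summarize score, detail- should give
output of the same shape as the listings above, with different numbers.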


I am interested in sampling by "group", so I tried both the -cluster-
and -strata- options (only the cluster option is shown below, but both
produce results I did not expect).  Specifically, I would like Stata,
on each replication, to sample repeatedly from only group 1 or only
group 2 (i.e., not mix a group 1 value with a group 2 value).  I am
interested in the 95th percentile values that result from the exercise.
I would expect the -saving(bsout)- output from this command to contain
a value close to 95 half of the time and close to 950 the remainder of
the time.  This would be true if Stata were sampling only from the
U(0,100) group in half of the replications and only from the U(0,1000)
group in the other half.  I used the following command (output follows):


. bootstrap "summarize score, detail" r(p95), reps(500) saving(bsout) cluster(group) replace

command:      summarize score , detail
statistic:    _bs_1      = r(p95)

Warning:  Since summarize is not an estimation command or does not set e(sample),
          bootstrap has no way to determine which observations are used in calculating
          the statistics and so assumes that all observations are used.  This means no
          observations will be excluded from the resampling due to missing values or
          other reasons.

          If the assumption is not true, press Break, save the data, and drop the
          observations that are to be excluded.  Be sure the dataset in memory contains
          only the relevant data.


Bootstrap statistics                              Number of obs    =      2000
                                                  N of clusters    =         2
                                                  Replications     =       500

------------------------------------------------------------------------------
Variable     |  Reps  Observed      Bias  Std. Err. [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _bs_1 |   500    899.03 -166.8533  343.0897   224.9517   1573.108  (N)
             |                                          95.48     950.77  (P)
             |                                         899.03     950.77  (BC)
------------------------------------------------------------------------------
Note:  N   = normal
       P   = percentile
       BC  = bias-corrected


I then took a look at the file "bsout", which contains the 95th
percentile value from each of the 500 replications.  Here are the first
10 values:

. list in 1/10, sep(0)

     +--------+
     |  _bs_1 |
     |--------|
  1. |  95.48 |
  2. | 899.03 |
  3. | 899.03 |
  4. | 899.03 |
  5. |  95.48 |
  6. |  95.48 |
  7. | 950.77 |
  8. | 950.77 |
  9. | 950.77 |
 10. | 950.77 |
     +--------+



Two things:  (1) I see the values close to 95 and 950 that I expected,
but I also see 899.03, which I don't expect if Stata is consistently
drawing from either the U(0,100) or the U(0,1000) distribution in any
given replication; and (2) within each of these cases, Stata gets
exactly the same value for the 95th percentile every time -- I would
expect it to vary somewhat.

Here is a summary of the bsout file (tabulated):


. tabulate  _bs_1

     r(p95) |      Freq.     Percent        Cum.
------------+-----------------------------------
      95.48 |        112       22.40       22.40
     899.03 |        261       52.20       74.60
     950.77 |        127       25.40      100.00
------------+-----------------------------------
      Total |        500      100.00


Again, not at all what I expected (it's discrete and tri-valued, and I
thought it would be continuous).  For the continuous distribution I
expected, I thought the appropriate command would be -histogram _bs_1-
and that I would see a bimodal distribution centered on 95 and 950.
What I would like to see is the distribution that results from either
repeated sampling from group 1 (ca. half the time) OR repeated sampling
from group 2 (the remainder of the time).  My reading and understanding
of the -cluster- and -strata- options under -bootstrap- must be
faulty.  Can anyone let me know what I am missing here?  Or what I might
do to obtain what I am looking for?
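
To make the resampling scheme I have in mind concrete, here is a rough
sketch of it done by hand with -bsample- and -postfile- rather than
with -bootstrap-.  The output file name (expected_bs), the variable
name (p95), the seed, and the coin flip via uniform() are placeholders
of my own choosing, and this is only one way it might be written:

```stata
* Sketch only: on each replication, pick ONE group at random,
* resample within that group, and record the 95th percentile.
set seed 20040714
tempname memhold
postfile `memhold' p95 using expected_bs, replace

forvalues i = 1/500 {
    preserve
    * Choose group 1 or group 2 with probability 1/2 each
    local g = cond(uniform() < 0.5, 1, 2)
    quietly keep if group == `g'
    * Resample the retained observations with replacement
    bsample
    quietly summarize score, detail
    post `memhold' (r(p95))
    restore
}

postclose `memhold'
```

Loading expected_bs (which replaces the data in memory) and running
-histogram p95- is where I would expect to see the bimodal shape, with
one cluster of values near 95 and another near 950.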

I am sure that the problem lies with my (mis)understanding, but I am
using Stata 8.2:

. about

Intercooled Stata 8.2 for Windows
Born 1 July 2004
Copyright (C) 1985-2004

David Miller
Health Effects Division
Office of Pesticide Programs


visit: http://www.epa.gov/pesticides/



