st: Bootstrap variations
I have a question on stratified and/or cluster bootstrapping. I am using
Stata 8.2 and I am up-to-date for it.
Suppose I have a survey where I sample possibly multiple persons within
households (let's call the person-id variable SUBID and the household-id
variable HOMID).
Suppose I have a total of 30 households (HOMID = 1, 2, ..., 30), with a
total of 68 respondents (SUBID = 1, 2, ..., 68).
There are 1, 2, 3, 4, or 5 respondents per household. So, we can consider 5
strata, according to the number of respondents per household (let's call
this stratification variable HOMSIZ):
Stratum 1 (1 respondent per house, HOMSIZ = 1): 10 houses, 10 respondents
Stratum 2 (2 respondents per house, HOMSIZ = 2): 10 houses, 20 respondents
Stratum 3 (3 respondents per house, HOMSIZ = 3): 5 houses, 15 respondents
Stratum 4 (4 respondents per house, HOMSIZ = 4): 2 houses, 8 respondents
Stratum 5 (5 respondents per house, HOMSIZ = 5): 3 houses, 15 respondents
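
To make the layout concrete, here is a minimal sketch that rebuilds the design from the counts above and checks the totals. It is plain Python (standard library only), not Stata, and the names `HOUSES_PER_STRATUM` and `build_households` are made up for illustration.

```python
# Households per stratum, keyed by HOMSIZ (respondents per household).
# These counts are copied from the table above.
HOUSES_PER_STRATUM = {1: 10, 2: 10, 3: 5, 4: 2, 5: 3}


def build_households(houses_per_stratum):
    """Return a list of (homid, homsiz) pairs, one per household."""
    households = []
    homid = 0
    for homsiz in sorted(houses_per_stratum):
        for _ in range(houses_per_stratum[homsiz]):
            homid += 1
            households.append((homid, homsiz))
    return households


households = build_households(HOUSES_PER_STRATUM)
n_households = len(households)
n_respondents = sum(homsiz for _, homsiz in households)
respondents_by_stratum = {s: s * k for s, k in HOUSES_PER_STRATUM.items()}

print(n_households, n_respondents)   # 30 68
print(respondents_by_stratum)        # {1: 10, 2: 20, 3: 15, 4: 8, 5: 15}
```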
I am planning to use mixed effects or GEE regression for the analysis (and
use "homid" as the clustering variable).
What if I want to draw bootstraps from this setup?
I have the following alternatives:
(1) . bsample
(2) . bsample, strata(homsiz)
(3) . bsample, cluster(homid)
(4) . bsample, strata(homsiz) cluster(homid)
(1) will produce bootstrap samples w/ N=68 respondents (but will not
preserve any other feature of the setup).
(2) will produce bootstrap samples w/ N=68 respondents and also preserve
the number of respondents in each of the 5 strata (10, 20, 15, 8, 15).
Neither (1) nor (2) will preserve my cluster setup (households), so I will
not consider them further.
(3) will produce bootstrap samples w/ M=30 households, but the total number
of respondents in each resample will vary (from a minimum of 30, if all 30
households are from the 1st stratum, to a maximum of 150, if all 30
households are drawn from stratum 5).
(4) will produce bootstrap samples w/ both M=30 households and N=68
respondents (and also preserve the number of respondents in each stratum as
10, 20, 15, 8, and 15).
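
The difference in resample size between schemes (3) and (4) can be checked with a small simulation. This is only a sketch in Python (standard library only), not Stata. It assumes that cluster resampling draws whole households with replacement, and that adding strata repeats this independently within each stratum. The function names and the seed are arbitrary.

```python
import random

# Households per stratum, keyed by HOMSIZ (respondents per household).
HOUSES_PER_STRATUM = {1: 10, 2: 10, 3: 5, 4: 2, 5: 3}
# One entry per household; the value is its number of respondents.
HOUSEHOLDS = [s for s, k in HOUSES_PER_STRATUM.items() for _ in range(k)]


def cluster_resample_size(rng):
    """Scheme 3: draw 30 households with replacement, ignoring strata."""
    drawn = [rng.choice(HOUSEHOLDS) for _ in HOUSEHOLDS]
    return sum(drawn)


def stratified_cluster_resample_size(rng):
    """Scheme 4: within each stratum, redraw its households with replacement."""
    total = 0
    for homsiz, k in HOUSES_PER_STRATUM.items():
        stratum = [homsiz] * k
        total += sum(rng.choice(stratum) for _ in range(k))
    return total


rng = random.Random(2006)  # arbitrary seed, for reproducibility only
sizes3 = [cluster_resample_size(rng) for _ in range(1000)]
sizes4 = [stratified_cluster_resample_size(rng) for _ in range(1000)]

print("scheme 3: min N =", min(sizes3), " max N =", max(sizes3))
print("scheme 4: min N =", min(sizes4), " max N =", max(sizes4))
```

Under scheme (4) every household in a stratum has the same size, so the total is 68 in every resample. Under scheme (3) it moves around within the 30 to 150 bounds.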
In the 3rd scheme, the number of households with 1, 2, 3, etc. respondents
is not fixed (and that reflects the way the data were obtained). However,
the resamples may have a variable number of observations (units of
analysis) and I am worried that I may overestimate the variability.
With the 4th scheme, I am worried that I might underestimate the
variability. For example, imagine that the strata are very sparse (i.e.,
two clusters in each stratum). Then, with this scheme, I will be getting
resamples that are more-or-less the original dataset over and over again.
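
To put a rough number on this worry, the sketch below counts how many distinct resamples a stratum of m clusters can produce when m clusters are drawn with replacement, and how often the draw reproduces the original stratum exactly. It is Python (standard library only, version 3.8 or later for `math.comb`), not Stata, and the function names are made up for illustration.

```python
from math import comb, factorial


def distinct_resamples(m):
    """Number of distinct multisets of size m drawn from m clusters."""
    return comb(2 * m - 1, m)


def prob_original(m):
    """Chance that m draws with replacement pick each of m clusters once."""
    return factorial(m) / m ** m


# A sparse stratum with two clusters: only AA, AB, and BB are possible.
print(distinct_resamples(2))  # 3
print(prob_original(2))       # 0.5

# Five sparse strata with two clusters each, resampled independently.
print(distinct_resamples(2) ** 5)  # 243
print(prob_original(2) ** 5)       # 0.03125

# The design in this post: 10, 10, 5, 2, and 3 households per stratum.
design = [10, 10, 5, 2, 3]
per_stratum = [distinct_resamples(m) for m in design]
print(per_stratum)  # [92378, 92378, 126, 3, 10]
```

So a two-cluster stratum comes back unchanged half the time, and strata 4 and 5 in my data contribute very few distinct configurations.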
Has anyone dealt with this kind of problem before? Any advice as to the
choice between the 3rd and 4th schemes of bootstrapping?
Thank you in advance.
Constantine Daskalakis, ScD
Thomas Jefferson University, Division of Biostatistics,
211 S. 9th St., Suite 602, Philadelphia, PA 19107
*** NEW ADDRESS (AS OF 4/17/06) ***
*** 1015 Chestnut St., Suite M100, Philadelphia, PA 19107 ***