# st: Bootstrap variations

From: Constantine Daskalakis
To: statalist@hsphsun2.harvard.edu
Subject: st: Bootstrap variations
Date: Thu, 20 Apr 2006 17:12:39 -0400

Hi all:

I have a question on stratified and/or cluster bootstrapping. I am using Stata 8.2, fully updated.

Suppose I have a survey where I sample possibly multiple persons within households (let's call the person-id variable SUBID and the household-id variable HOMID).

Suppose I have a total of 30 households (HOMID = 1, 2, ..., 30), with a total of 68 respondents (SUBID = 1, 2, ..., 68).

There are 1, 2, 3, 4, or 5 respondents per household. So, we can consider 5 strata, according to the number of respondents per household (let's call this stratification variable HOMSIZ):

Stratum 1 (1 respondent per house, HOMSIZ = 1): 10 houses, 10 respondents
Stratum 2 (2 respondents per house, HOMSIZ = 2): 10 houses, 20 respondents
Stratum 3 (3 respondents per house, HOMSIZ = 3): 5 houses, 15 respondents
Stratum 4 (4 respondents per house, HOMSIZ = 4): 2 houses, 8 respondents
Stratum 5 (5 respondents per house, HOMSIZ = 5): 3 houses, 15 respondents

I am planning to use mixed effects or GEE regression for the analysis, with "homid" as the clustering variable.

What if I want to draw bootstraps from this setup?

I have the following alternatives:

(1)
. bsample

(2)
. bsample, strata(homsiz)

(3)
. bsample, cluster(homid)

(4)
. bsample, strata(homsiz) cluster(homid)

(1) will produce bootstrap samples with N=68 respondents (but will not preserve any other feature of the setup).

(2) will produce bootstrap samples with N=68 respondents and also preserve the number of respondents in each of the 5 strata (10, 20, 15, 8, 15).

Neither (1) nor (2) will preserve my cluster setup (households), so I will not consider them further.

(3) will produce bootstrap samples with M=30 households, but the total number of respondents in each resample will vary (from a minimum of 30, if all 30 households are drawn from stratum 1, to a maximum of 150, if all 30 households are drawn from stratum 5).

(4) will produce bootstrap samples with both M=30 households and N=68 respondents. Because every household in a stratum has the same number of respondents, it also preserves the number of households (10, 10, 5, 2, 3) and the number of respondents (10, 20, 15, 8, 15) in each stratum.
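The resample sizes claimed for (3) and (4) can be checked directly. What follows is only a minimal sketch, assuming the dataset in memory contains homid, subid, and homsiz as named in the commands above. The seed value and the new variable name newhom are arbitrary choices for illustration. The idcluster() option is added because a household drawn more than once would otherwise keep a single homid and be treated as one large cluster in the mixed-effects or GEE model.

```stata
* Sketch only: assumes homid, subid, and homsiz exist as described above.
* The seed is arbitrary and is set only so the draws can be reproduced.
set seed 20060420

* Scheme (3): resample whole households, ignoring the strata.
preserve
bsample, cluster(homid) idcluster(newhom)
* Number of respondents in this resample (varies from draw to draw).
count
* Respondents per stratum in this resample (also varies).
tabulate homsiz
restore

* Scheme (4): resample whole households within each stratum.
preserve
bsample, strata(homsiz) cluster(homid) idcluster(newhom)
* Always 68 respondents.
count
* Always 10, 20, 15, 8, and 15 respondents in strata 1 to 5.
tabulate homsiz
restore
```

When a model is fit to such a resample, newhom rather than homid would be the clustering variable.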

In the 3rd scheme, the number of households with 1, 2, 3, etc. respondents is not fixed (and that reflects the way the data were obtained). However, the resamples will have a variable number of observations (units of analysis), and I am worried that I may overestimate the variability.

With the 4th scheme, I am worried that I might underestimate the variability. For example, imagine that the strata are very sparse (e.g., two clusters in each stratum). Then, with this scheme, I will be getting resamples that are more or less the original dataset over and over again.
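To put a rough number on the sparse case, here is a back-of-the-envelope calculation. It assumes that the $m$ clusters in a stratum are drawn with replacement with equal probability, and it counts a resample as a copy of the original when it contains every original cluster exactly once.

```latex
P(\text{stratum resample} = \text{original stratum}) = \frac{m!}{m^{m}},
\qquad
m = 2:\ \frac{2!}{2^{2}} = \frac{1}{2}.
```

With two clusters A and B, the only possible resamples of the stratum are {A, A}, {A, B}, and {B, B}, with probabilities 1/4, 1/2, and 1/4. In the data above, stratum 4 is in this situation, since it has only 2 houses.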

Has anyone dealt with this kind of problem before? Any advice as to the choice between the 3rd and 4th schemes of bootstrapping?

Thank you in advance.
Constantine

________________________________________________________________
Assistant Professor,
Thomas Jefferson University, Division of Biostatistics,
211 S. 9th St., Suite 602, Philadelphia, PA 19107
*** NEW ADDRESS (AS OF 4/17/06) ***
*** 1015 Chestnut St., Suite M100, Philadelphia, PA 19107 ***
Tel: 215-955-5695
Fax: 215-955-5681