    # Re: st: Subsample Bootstrap

From: Stas Kolenikov
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Subsample Bootstrap
Date: Fri, 20 Jan 2006 11:59:32 -0600

Think about the central limit theorem first: it says that

\sqrt{n} (\hat\theta - \theta_0) \to N(0,\sigma^2)

in case of i.i.d. data. What the bootstrap does is say, "the
sample I have at hand is to the population as the bootstrap sample is
to the sample; so if I have theta-hat=theta(sample), its relation to
the population theta_0 is roughly that of theta*=theta(bootstrap
sample) to theta-hat". Any standard book on the bootstrap would
fill you in on the details. The additional beauty of the bootstrap is
that while converging to this normal distribution, it takes better
care of skewness and kurtosis (don't ask me why, I forgot it from my

If you do have i.i.d. data (which I doubt), and you have enough trust
in asymptotic derivations, then a fully (asymptotically) justified
bootstrap procedure may run as follows:

1. take the b-th subsample of size N*, 1<<N*<<N, of your data
(In your place, I would probably go down another order of magnitude
and use N*=10000, if the running time of your procedure is linear in
the # of data points.)
2. run your procedure and obtain \theta_b
3. repeat approximately infinitely many times
4. assess variability of your theta according to the above central
limit argument: \sqrt{N} (\hat\theta-\theta_0) would have the same
variance as \sqrt{N*} (\theta_b - \tilde\theta), where in place of
\tilde\theta you can take \hat\theta from the original sample, or the
mean of the simulated distribution. Again, a decent bootstrap book
would fill you in on the details as to which choice is better. Either
way, the standard error of \hat\theta is the spread of the \theta_b
around \tilde\theta multiplied by \sqrt{N*/N}.
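
Steps 1-4 can be sketched in a few lines. This is a minimal Python
sketch rather than Stata code, and it assumes i.i.d. data and sqrt(n)
convergence. The function name subsample_bootstrap_se, the simulated
normal data, the seed, and the choices N*=500 and 400 replications are
all illustrative assumptions; the sample mean stands in for your
procedure. Subsamples are drawn with replacement, and the spread is
taken around the mean of the simulated distribution.

```python
import math
import random
import statistics


def subsample_bootstrap_se(data, estimator, sub_size, reps, rng):
    """Subsample bootstrap standard error, rescaled to the full sample size."""
    n = len(data)
    estimates = []
    for _ in range(reps):
        # Step 1: draw a subsample of size N* << N (with replacement here).
        sub = rng.choices(data, k=sub_size)
        # Step 2: run the procedure on the subsample.
        estimates.append(estimator(sub))
    # Step 4: the spread of theta_b reflects samples of size N*, so scale
    # by sqrt(N*/N) to get a standard error for the full-sample estimate.
    sd_sub = statistics.stdev(estimates)
    return sd_sub * math.sqrt(sub_size / n)


rng = random.Random(12345)
# Hypothetical data: 20000 i.i.d. normal draws with standard deviation 2.
data = [rng.gauss(0.0, 2.0) for _ in range(20000)]
se_boot = subsample_bootstrap_se(data, statistics.fmean, 500, 400, rng)
# For the sample mean, the textbook formula gives a direct check.
se_formula = statistics.stdev(data) / math.sqrt(len(data))
print(se_boot, se_formula)
```

For the sample mean the two printed numbers should be close, which is
the only reason the mean is used here; with a real procedure there is
no such formula to compare against.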

The caveats you might have here: 1. dependent data (time series,
cluster correlated survey data) -- you need to resample in blocks or
whole clusters, respectively; 2. the rate of convergence may differ
from sqrt(n) -- and there is no telling whether it does unless you
analyse the problem rigorously. Also in the case of dependent
data, your convergence rate may be different (or rather scaled in a
different manner): for clustered survey data, it would be sqrt(#
clusters) rather than sqrt(# of observations).
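
To illustrate the first caveat, here is a rough Python sketch of
resampling whole clusters. Everything in it is an assumption made for
illustration: the function name cluster_bootstrap_se, the simulated
data (200 clusters of 20 observations sharing a common cluster
effect), the seed, and the use of the sample mean as the estimator.

```python
import random
import statistics


def cluster_bootstrap_se(clusters, estimator, reps, rng):
    """Bootstrap SE that resamples whole clusters, not single observations."""
    estimates = []
    for _ in range(reps):
        # Draw as many clusters as in the original data, with replacement.
        drawn = rng.choices(clusters, k=len(clusters))
        # Pool the observations of the drawn clusters and re-estimate.
        pooled = [y for cluster in drawn for y in cluster]
        estimates.append(estimator(pooled))
    return statistics.stdev(estimates)


rng = random.Random(2006)
# Hypothetical data: 200 clusters of 20 observations sharing a cluster effect.
clusters = []
for _ in range(200):
    effect = rng.gauss(0.0, 1.0)
    clusters.append([effect + rng.gauss(0.0, 1.0) for _ in range(20)])

flat = [y for cluster in clusters for y in cluster]
se_cluster = cluster_bootstrap_se(clusters, statistics.fmean, 300, rng)
# Naive i.i.d. formula that ignores the clustering, for comparison.
se_naive = statistics.stdev(flat) / (len(flat) ** 0.5)
print(se_cluster, se_naive)
```

In this setup the cluster-level standard error comes out several
times larger than the naive one, because the effective sample size is
closer to the number of clusters than to the number of observations.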

An interesting case where the regular bootstrap breaks down, and one
is forced to use the above reduced size bootstrap (note that you need
to take subsamples of the size much smaller than the original size) is
estimating the distribution of the sample maximum. If you resample the
whole data set, then the bootstrap distribution of the bootstrap
sample maximum converges to a fixed discrete distribution, with a point
mass of about 0.63 (that is, 1-exp(-1)) at the sample
max, and a geometric series of masses on the next largest order
statistics. The only way to unravel this is to use this smaller size
bootstrap. Also for this example, the caveat about faster convergence
may be in action -- the estimation error there may shrink as fast as
1/n, rather than 1/sqrt(n), so the scaling factor needs to be chosen
with care.
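
The point mass at the sample max is easy to check. With n distinct
values, a full-size resample misses the largest one with probability
(1-1/n)^n, so it contains it with probability 1-(1-1/n)^n, which tends
to 1-exp(-1), about 0.632. By the same argument the mass on the k-th
value below the max tends to exp(-k)(1-exp(-1)). The Python sketch
below compares the exact value with a small simulation; the function
names, n=500, the 2000 replications and the seed are arbitrary
choices, and ties in the data are assumed away.

```python
import math
import random


def prob_boot_max_equals_sample_max(n):
    """Exact chance that a size-n resample contains the sample maximum."""
    # Each of the n draws misses the maximum with probability 1 - 1/n.
    return 1.0 - (1.0 - 1.0 / n) ** n


def simulate(n, reps, rng):
    """Monte Carlo share of full-size resamples whose max is the sample max."""
    # Resample index positions; index n - 1 stands for the largest value.
    hits = sum(
        max(rng.randrange(n) for _ in range(n)) == n - 1 for _ in range(reps)
    )
    return hits / reps


rng = random.Random(7)
exact = prob_boot_max_equals_sample_max(500)
approx = simulate(500, 2000, rng)
limit = 1.0 - math.exp(-1.0)
print(exact, approx, limit)
```

The mass does not shrink as n grows, which is why the full-size
bootstrap cannot reproduce the continuous limiting distribution of
the maximum.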

If your procedure can be expressed in moments and estimating
equations, then probably you would be in the domain of the central
limit theorem, and your convergence rate should be sqrt(n).

On 1/20/06, Tim R. Sass <tsass@coss.fsu.edu> wrote:
> Dear Statalisters,
>
> I am iteratively estimating a model and then using the bootstrap procedure
> to derive the standard errors and t-stats.  The procedure works fine on a
> moderately sized sample, but each replication of the bootstrap takes over
> three hours on my full sample (over 1 million obs.).  In order to speed up
> the process I would like to perform each repetition of the bootstrap on a
> subsample of the data, say 100,000 observations.  This can of course be
> done by setting the size() option in bootstrap.  The FAQs warn against
> this, however, saying "the standard error estimates are dependent upon the
> number of observations in each replication.  Here, on average, we would
> expect the variance estimate of b[foreign] to be twice as large for a
> sample of 37 observations than that for 74 observations.  This is due
> mainly to the form of the variance of the sample mean."
>
> Is there some straightforward correction I can make to get the correct
> standard errors when using the size() option in bootstrap?
>
> Alternatively, I thought about just taking the estimated coefficients from
> each repetition of the bootstrap and then forming an empirical distribution
> from these estimates to get the standard errors.  But I am not quite sure
> how to accomplish this in Stata.
>
> Any help on these ideas or alternative solutions would be greatly appreciated.
>
> Tim
>
> Tim R. Sass
> Professor                               Voice:   (850)644-7087
> Department of Economics         Fax:      (850)644-4535
> Florida State University                E-mail:   tsass@coss.fsu.edu
> Tallahassee, FL  32306-2180     Internet: http://garnet.acns.fsu.edu/~tsass
>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/support/faqs/res/findit.html
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>

--
Stas Kolenikov
http://stas.kolenikov.name
