Statalist The Stata Listserver



st: Idea for a faster bootstrap


From   Mike Lacy <Michael.Lacy@colostate.edu>
To   statalist@hsphsun2.harvard.edu
Subject   st: Idea for a faster bootstrap
Date   Tue, 17 Oct 2006 15:46:01 -0600

I'm interested in ways to do resampling quickly. -bootstrap- can be excruciatingly slow, especially when the data set is large. While I much appreciate all the built-in features of -bootstrap-, I've thought that there might be approaches or algorithms for a DIY bootstrap that would be faster (presumably at the expense of being less general purpose). I didn't find anything in the archives.

What I came up with is an implementation of an algorithm popularized by a contributor to the SPSS list many years ago: create a file of Reps * Sample Size observations, each holding a random pointer into the original data, and then merge the original data onto it. I'll take the liberty of posting it below, since it is not much longer than reasonable pseudocode for it would be. On a larger data set (auto, expanded by 100 to 7,400 observations), it took less than about 4% as much time as -bootstrap-. (This was on a simple problem, e.g., -summarize price-.) Obviously, for some problems, this algorithm would eat up too much memory, but Stata seems to run it happily with, e.g., 10,000 samples of N = 100 on a data set of 100,000.
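
For anyone wanting to reproduce the -bootstrap- side of that comparison, a minimal timing sketch might look like the following. The reps() and size() values and the seed are placeholders chosen for illustration, not the settings behind the figure quoted above.

```stata
* Baseline timing for -bootstrap- on the expanded auto data.
* Assumption: reps(1000), size(50), and the seed are illustrative values only.
clear
sysuse auto
expand 100                      // 74 * 100 = 7,400 observations
set seed 20061017               // arbitrary seed so the run is repeatable
timer clear 1
timer on 1
bootstrap mean = r(mean), reps(1000) size(50) nodots: summarize price
timer off 1
timer list 1                    // elapsed seconds are also left in r(t1)
```

Wrapping the resampling code in its own -timer on 2- / -timer off 2- pair, with the same number of replicates and the same sample size, would give the matching figure for the merge-based approach.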

This looks suspiciously like a free lunch. Obviously, for -bootstrap- problems in which the statistical calculation itself is slow, the overhead of -bootstrap- won't matter, so improving on it might be irrelevant. Anyway, I'd appreciate any thoughts on the following as a possible -bootstrap- alternative.

*Example data
sysuse auto
expand 100 // make it bigger for demonstration
*Algorithm starts
*---------------
local reps = 10000 // number of bootstrap replicates (choose)
local sampsize = 50 // observations drawn per replicate (choose)
local popsize = _N
gen long ident = _n
sort ident
tempfile temp
save `temp', replace
clear
*
* Create an empty file to hold all of the resamples stacked together
local bigsize = `reps' * `sampsize'
set obs `bigsize'
* Number the replicates 1, 2, ..., reps, cycling so each replicate gets sampsize rows
gen long repnum = _n if _n <= `reps'
replace repnum = repnum[_n - `reps'] if _n > `reps'
* Create a random pointer into the population (1 to popsize) for each resample element
gen long ident = 1 + int(`popsize' * uniform())
sort ident
merge ident using `temp', uniqusing
keep if _merge ==3
drop _merge ident
sort repnum
*
statsby mean = r(mean), by(repnum) clear : summ price
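
As a follow-up, here is a rough sketch of how the replicate means left in memory by -statsby- might be turned into the usual bootstrap summaries. It assumes the code above has just run, so that the data consist of one observation per repnum with the variable mean. Note that when sampsize is smaller than the original N, the spread describes the mean of a sample of that size rather than the mean of the full data.

```stata
* Assumes the -statsby- result is in memory: variables repnum and mean
confirm variable repnum mean
* The SD of the replicate means serves as the bootstrap standard error
summarize mean
display "bootstrap SE of the mean (for samples of size sampsize): " r(sd)
* Percentile interval from the 2.5th and 97.5th percentiles of the replicate means
_pctile mean, percentiles(2.5 97.5)
display "95% percentile interval: " r(r1) " to " r(r2)
```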



Regards,

=-=-=-=-=-=-=-=-=-=-=-=-=
Mike Lacy
Fort Collins CO USA
(970) 491-6721 office









