Single-statistic bootstrap                                    (STB-21: ssi6.2)
--------------------------

    ^bs^ cmd [^exp^|^macro^] exp [^, d^ots ^e^ci ^lea^ve ^l^evel^(^#^) r^eps^(^#^)^ ]


Description
-----------

^bs^ provides bootstrap standard errors and confidence intervals for single
statistics.  It is faster and easier to use than ^bstrap^; see ^help bstrap^.

cmd may be any Stata command or ado-file, including user-written commands,
that calculates and saves a statistic of interest.

Although ^bs^ is limited to evaluating single statistics, it is not necessary
that cmd calculate only a single statistic.  ^bs^ allows any statistic
calculated by cmd to be selected as the single one of interest.  For instance,
^summarize^ calculates the mean, standard deviation, various percentiles,
skewness, and kurtosis.  ^summarize^ would be a good candidate for use with
^bs^.  Any of the statistics ^summarize^ calculates could be selected.


Options
-------

^dots^ requests that a dot be placed on the screen at the beginning of every
    replication, thus providing entertainment if a large number of ^reps()^
    are requested.

^eci^ requests that, in addition to normal-distribution-based confidence
    intervals, an empirically based confidence interval be presented.  ^eci^
    is the default for ^reps()^>199; for smaller numbers of replications,
    ^noeci^ is the default.  Empirical confidence intervals are calculated by
    the percentile method.

^leave^ specifies that a data set of the bootstrapped statistic be left behind
    in place of the data currently in memory.  The default is to leave the
    original data undisturbed and to discard the bootstrapped statistics once
    the summary table has been reported.  If ^leave^ is specified, the data
    left behind contain a single variable named ^result^ with ^reps()^
    observations.

^level(^#^)^ specifies the confidence level, in percent, for the confidence
    interval, whether empirically or normally based.


Options, continued
------------------

^reps(^#^)^ specifies the number of bootstrap replications to be performed;
    ^reps(50)^ is the default.  Conventional wisdom is that 50 to 200
    replications are adequate for estimates of standard errors and that at
    least 1,000 replications should be used for estimates of empirical
    confidence intervals.  We recommend you follow that advice.

    Alternatively, in approximations developed by us, the formulas
    47.4/sqrt(n) and 138.6/sqrt(n) provide a crude measure of the maximum
    percentage variation in the estimated standard error that will be
    observed 50 and 95 percent of the time in sequential runs of ^bs^, where
    n is the number of replications.  E.g., for n=50, 47.4/sqrt(50)=6.7% and
    138.6/sqrt(50)=19.6%, meaning that between two runs the estimated
    standard error will vary by less than 6.7% half the time and by less
    than 19.6% 95% of the time.  If you obtain an empirical 95% confidence
    interval (^eci^), increase the number of replications by 72% over what
    you would otherwise choose.

    A table of popular values for ^reps()^ is provided below.  The table is
    merely suggestive.

                    variation                         variation
        ^reps()^     50%      95%          ^reps()^      50%      95%
        -----------------------          -------------------------
            20     10.6%    31.0%             500      2.1%     6.2%
            50      6.7     19.6            1,000      1.5      4.4
           100      4.7     13.9           10,000       .5      1.4
           200      3.4      9.8          100,000       .1       .4


Example 1
---------

Problem:  Obtain a bootstrap standard error and confidence interval for the
median of mpg in the auto data.

Solution:  ^summarize^ with the ^detail^ option calculates, among other
things, medians and, according to Saved Results in [5s] summarize,
^summarize^ stores the median in ^_result(10)^.  Thus:

    . ^bs "summarize mpg, detail" _result(10)^

    Bootstrap Reps    Pt. Est.   Std. Err.     [95% Conf. Interval]
    -------------------------------------------------------------
                50          20    .8668498    18.30101    21.69899  (normal based)
         (average)       19.94


Example 2
---------

Problem:  Obtain a bootstrap standard error for the coefficient on weight in
a regression of mpg on weight and displ in the auto data.  Use 100
replications.

Solution:  ^regress^ estimates regression models.  ^_b[weight]^ is the
coefficient on weight after a regression:

    . ^bs "regress mpg weight displ" _b[weight], reps(100)^

    Bootstrap Reps    Pt. Est.   Std. Err.     [95% Conf. Interval]
    -------------------------------------------------------------
               100   -.0065671    .0009789   -.0084857   -.0046486  (normal based)
         (average)   -.0067433


Example 3
---------

Problem:  Obtain a bootstrap standard error for the standard error of the
mean of mpg in the auto data.

Solution 1:  The standard error of the mean is defined as sqrt(s^^2/n), where
s^^2 is the estimated variance of the sample and n is the number of
observations.  ^summarize^ saves s^^2 in ^_result(4)^ and n in ^_result(1)^:

    . ^bs "sum mpg" sqrt(_result(4)/_result(1))^

    Bootstrap Reps    Pt. Est.   Std. Err.     [95% Conf. Interval]
    -------------------------------------------------------------
                50    .6725511    .0785286    .5186379    .8264643  (normal based)
         (average)    .6607228


Example 3, continued
--------------------

Solution 2:  ^ci^ calculates the standard error of the mean.  The standard
error is saved in the global macro ^$S_4^.

    . ^bs "ci mpg" macro S_4^

    Bootstrap Reps    Pt. Est.   Std. Err.     [95% Conf. Interval]
    -------------------------------------------------------------
                50    .6725511    .0610494    .5528964    .7922058  (normal based)
         (average)     .667007

Note the use of the word ^macro^ in the ^bs^ command.  Without it, ^bs^ would
assume that what follows the command is an ordinary expression.  When the
statistic is stored in a macro, you type the word ^macro^ followed by the
macro name, omitting the dollar sign.

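
As a cross-check on the two solutions above, the point estimate can be
reproduced directly from the formula sqrt(s^^2/n), without ^bs^.  The sketch
below is an illustration under stated assumptions, not part of ^bs^: it
assumes a later Stata release in which ^summarize^ leaves the variance and
the number of observations in r(Var) and r(N) (the release documented here
uses ^_result(4)^ and ^_result(1)^) and in which ^sysuse^ loads the auto data.

```stata
* Sketch only: r(Var) and r(N) are assumed to be where summarize leaves
* the variance and the number of observations in later Stata releases.
sysuse auto, clear
summarize mpg
* Standard error of the mean, sqrt(s^2/n)
display sqrt(r(Var)/r(N))
```

The displayed value should agree with the point estimate .6725511 reported
by both solutions.
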

Example 3.1
-----------

Problem:  Continuing with the previous example, obtain a more accurate
estimate of the standard error of the standard error of the mean by using
1,000 replications.  Obtain the empirically based confidence interval.

Solution:  We specify ^reps(1000)^.  We could specify ^eci^, but do not have
to because we specified more than 199 replications.

    . ^bs "ci mpg" macro S_4, reps(1000)^

    Bootstrap Reps    Pt. Est.   Std. Err.     [95% Conf. Interval]
    -------------------------------------------------------------
              1000    .6725511    .0695833    .5361703    .8089319  (normal based)
         (average)    .6613738                .5329553    .8018642  (empirical)


Speeding execution
------------------

Consider the command:

    . ^bs "sum mpg, det" _result(10)^

^bs^ has no way of knowing that only mpg plays a role in the calculation and
is thus forced to make bootstrap samples that include all the variables in
the data set.  ^bs^ will run faster if you keep only the variables relevant
to the calculation:

    . ^keep mpg^
    . ^bs "sum mpg, det" _result(10)^

In practice, unless you have hundreds of variables in your data, keeping only
the relevant variables will make little difference.


Missing values
--------------

Data sets often have missing values for some variables.  Since ^bs^ does not
know which variables play a role in the specified command, it has no way of
excluding the missing values.  That causes no problem in one sense because
all Stata commands deal with missing values gracefully.  It does, however,
cause a statistical problem.  Bootstrap sampling is defined as drawing, with
replacement, samples of size N from a sample of size N.  ^bs^ determines N by
counting the number of observations in the data set, not the number of
nonmissing observations on the relevant variables.  The result is that too
many observations are sampled and, moreover, the resulting samples, since
they are drawn from data containing missing values, are of unequal sizes.

Given a small fraction of missing values, this will not make a difference.
If you have a large fraction, however, you should first omit the missing
values:

    . ^drop if mpg==.^
    . ^bs "sum mpg, d" _result(10)^


Saved results
-------------

^bs^ saves in the global S_# macros:

    S_1    point estimate calculated over entire data
    S_2    average estimate calculated across samples
    S_3    estimated standard error (standard deviation across samples)
    S_4    lower bound of empirical confidence interval
    S_5    upper bound of empirical confidence interval


Also see
--------

     STB:  ssi6.2 (STB-21)
 On-line:  ^help^ for ^bstrap^
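
The approximation 138.6/sqrt(n) given under ^reps()^ can be inverted to choose
a number of replications for a target run-to-run variation:  n = (138.6/v)^^2,
where v is the largest percentage variation you are willing to see 95% of the
time.  A minimal sketch, assuming a Stata release that provides the ^ceil()^
function; the 5% target is purely illustrative:

```stata
* Replications needed so that the estimated standard error varies by
* less than 5% (an illustrative target) in 95% of repeated runs,
* obtained by solving 138.6/sqrt(n) = 5 for n and rounding up.
display ceil((138.6/5)^2)
```

This gives 769 replications.  The same inversion with 47.4 in place of 138.6
applies to the 50% figure.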