
Title: Resampling and missing values
Author: Jeff Pitblado, StataCorp
When bootstrapping statistics on data with missing values, bootstrap may produce misleading or erroneous bias and variance statistics unless the command is an e-class command that generates e(sample). Without e(sample), bootstrap cannot tell which observations the command actually used, so it resamples from all of them. The following example illustrates the problem.
Consider the following dataset with one missing value:
. clear
. set obs 10
obs was 0, now 10
. set seed 570971
. generate x = uniform()
. generate y = invnormal(uniform())
. replace y = . in 5
(1 real change made, 1 to missing)
. save resample, replace
file resample.dta saved
. list

     +----------------------+
     |        x           y |
     |----------------------|
  1. | .0901624   -.8072783 |
  2. | .8839354    .0117225 |
  3. |  .423627    .6715007 |
  4. | .8497756    -.026581 |
  5. | .4759649           . |
     |----------------------|
  6. | .3587709   -.6098545 |
  7. | .2387148   -2.177713 |
  8. |  .915678    .6642656 |
  9. | .4609539    .9534492 |
 10. | .6992906    -1.15695 |
     +----------------------+
It is clear in the following output that only 9 values are used to calculate the sample standard deviation (SD) of y.
. summarize y

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
           y |         9    -.275271    1.013946  -2.177713   .9534492
Next we bootstrap r(mean) and r(N) from summarize and save the replicates in sum.dta. Running describe on that dataset shows that _bs_1 contains the bootstrap observations of r(mean) and _bs_2 contains the bootstrap observations of r(N).
. set seed 1423567
. bootstrap r(mean) r(N), reps(5) saving(sum, replace) nowarn: summarize y
(running summarize on estimation sample)
(note: file sum.dta not found)

Bootstrap replications (5)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
.....

Bootstrap results                               Number of obs      =         10
                                                Replications       =          5

      command:  summarize y
        _bs_1:  r(mean)
        _bs_2:  r(N)

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _bs_1 |   -.275271   .1767023    -1.56   0.119    -.6216012    .0710592
       _bs_2 |          9     .83666    10.76   0.000     7.360176    10.63982
------------------------------------------------------------------------------

. describe using sum

Contains data                                 bootstrap: summarize
  obs:             5                          4 May 2015 07:26
 vars:             2
 size:            60
-------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
_bs_1           float   %9.0g                 r(mean)
_bs_2           float   %9.0g                 r(N)
-------------------------------------------------------------------------------
Sorted by:

. use sum, clear
(bootstrap: summarize)
. list

     +-------------------+
     |     _bs_1   _bs_2 |
     |-------------------|
  1. | -.0924903      10 |
  2. |  .0861323      10 |
  3. |  -.088269       9 |
  4. | -.4005653       8 |
  5. | -.0740297       9 |
     +-------------------+
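The above listing of the bootstrap data reveals the problem: not all of the bootstrap samples contained 9 observations. Because bootstrap resampled from all 10 observations, including the one where y is missing, the number of nonmissing values of y varied from one replication to the next (here, from 8 to 10). This problem is easily fixed for this example because we can drop the observations that have a missing value from the original dataset before using bootstrap.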
. use resample, clear
. drop if y == .
(1 observation deleted)
. list

     +----------------------+
     |        x           y |
     |----------------------|
  1. | .0901624   -.8072783 |
  2. | .8839354    .0117225 |
  3. |  .423627    .6715007 |
  4. | .8497756    -.026581 |
  5. | .3587709   -.6098545 |
     |----------------------|
  6. | .2387148   -2.177713 |
  7. |  .915678    .6642656 |
  8. | .4609539    .9534492 |
  9. | .6992906    -1.15695 |
     +----------------------+

. set seed 1423567
. bootstrap r(mean) r(N), reps(5) saving(sum, replace) nowarn: summarize y
(running summarize on estimation sample)

Bootstrap replications (5)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
.....

Bootstrap results                               Number of obs      =          9
                                                Replications       =          5

      command:  summarize y
        _bs_1:  r(mean)
        _bs_2:  r(N)

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _bs_1 |   -.275271   .2803826    -0.98   0.326    -.8248108    .2742688
       _bs_2 |          9          .        .       .            .           .
------------------------------------------------------------------------------

. use sum, clear
(bootstrap: summarize)
. list

     +-------------------+
     |     _bs_1   _bs_2 |
     |-------------------|
  1. |  .0178111       9 |
  2. | -.5203212       9 |
  3. |  .1150261       9 |
  4. |   .092199       9 |
  5. | -.3069329       9 |
     +-------------------+
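The same fix can be written as a short do-file. The following is only a sketch of the approach shown above, with two changes that are my own assumptions rather than part of the original example: it uses the missing() function instead of the comparison y == ., so that extended missing values (.a through .z) are also removed, and it tabulates the saved r(N) replicates as a quick check that every bootstrap sample used the same number of observations.

```stata
* Sketch: keep only complete cases before resampling.
* missing() is true for . and for the extended missing values .a-.z,
* whereas the comparison y == . matches only the system missing value.
use resample, clear
keep if !missing(y)

set seed 1423567
bootstrap r(mean) r(N), reps(5) saving(sum, replace) nowarn: summarize y

* Check: _bs_2 holds r(N) for each replication, so it should take
* a single value if no unused observations entered the resamples.
use sum, clear
tabulate _bs_2
```

When the command uses several variables, missing() accepts more than one argument, for example keep if !missing(y, x).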
In the examples above, I used the nowarn option on bootstrap to suppress the warning message it issues when no e(sample) is available.
bootstrap will not produce a warning message when it is supplied with an estimation (e-class) command that generates e(sample). Here, e(sample) provides bootstrap with all the information it needs to keep unused observations out of the bootstrap samples. As with the mean of y, the following output shows that only 9 observations are used to estimate the coefficient on the predictor in a simple linear regression. The coefficient is stored in _b[x], and the number of observations used in the estimation is stored in e(N).
. use resample, clear
. regress y x

      Source |       SS           df       MS      Number of obs   =         9
-------------+----------------------------------   F(1, 7)         =      1.60
       Model |  1.53022378         1  1.53022378   Prob > F        =    0.2464
    Residual |  6.69446954         7  .956352791   R-squared       =    0.1861
-------------+----------------------------------   Adj R-squared   =    0.0698
       Total |  8.22469332         8  1.02808666   Root MSE        =    .97793

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   1.451699   1.147646     1.26   0.246    -1.262054    4.165451
       _cons |  -1.069013   .7071156    -1.51   0.174    -2.741075    .6030498
------------------------------------------------------------------------------

. set seed 1423567
. bootstrap _b[x] e(N), reps(5) saving(reg, replace): regress y x
(running regress on estimation sample)
(note: file reg.dta not found)

Bootstrap replications (5)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
.....

Linear regression                               Number of obs      =          9
                                                Replications       =          5

      command:  regress y x
        _bs_1:  _b[x]
        _bs_2:  e(N)

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _bs_1 |   1.451699   1.172467     1.24   0.216    -.8462939    3.749691
       _bs_2 |          9          .        .       .            .           .
------------------------------------------------------------------------------

. use reg, clear
(bootstrap: regress)
. list

     +-------------------+
     |     _bs_1   _bs_2 |
     |-------------------|
  1. | -.5315873       9 |
  2. |  2.245691       9 |
  3. |  .9832834       9 |
  4. |  1.318368       9 |
  5. |  2.373077       9 |
     +-------------------+
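The regression example works because regress is e-class and sets e(sample). One way to get the same behavior for a statistic such as the mean is to wrap the calculation in a small e-class program that posts its own e(sample). The following is a minimal sketch under stated assumptions, not an official or definitive implementation: the program name mymean and the coefficient name mean are hypothetical, and the program handles only a single numeric variable.

```stata
* Hypothetical wrapper: mymean is an illustrative name, not a Stata command.
capture program drop mymean
program define mymean, eclass
    syntax varname(numeric) [if] [in]

    * touse is 1 for observations to be used and 0 otherwise;
    * marksample excludes observations where the variable is missing.
    marksample touse

    quietly summarize `varlist' if `touse'
    local n = r(N)

    * Store the mean in a 1 x 1 matrix named so that _b[mean] works.
    tempname b
    matrix `b' = J(1, 1, r(mean))
    matrix colnames `b' = mean

    * Post the estimate together with e(sample) and e(N).
    ereturn post `b', esample(`touse') obs(`n')
    ereturn local cmd "mymean"
end

use resample, clear
set seed 1423567
bootstrap _b[mean] e(N), reps(5): mymean y
```

Because the wrapper posts e(sample), the nowarn option should not be needed, and the replicates of e(N) would be expected to equal 9 in every bootstrap sample, as in the regress example.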