Why does bootstrap give a warning message for non-eclass commands?
Title: Resampling and missing values
Author: Jeff Pitblado, StataCorp
Date: August 2001; updated July 2005
When you bootstrap statistics on data with missing values, bootstrap may
produce misleading or erroneous bias and variance statistics unless the
command is an eclass command that generates e(sample). The following
example illustrates the problem.
Consider this dataset with one missing value:
. clear
. set obs 10
obs was 0, now 10
. set seed 570971
. generate x = uniform()
. generate y = invnormal(uniform())
. replace y = . in 5
(1 real change made, 1 to missing)
. save resample, replace
file resample.dta saved
. list
     +----------------------+
     |        x           y |
     |----------------------|
  1. | .7503739    -.621165 |
  2. | .6177279    .4850219 |
  3. |  .989426   -1.084084 |
  4. | .4899037    -1.27354 |
  5. | .7327343           . |
     |----------------------|
  6. | .9458812    1.022817 |
  7. | .0838971    .2310362 |
  8. | .4090274    .8443562 |
  9. | .9312586   -.0218735 |
 10. | .8493695   -.6778926 |
     +----------------------+
The following output shows that only 9 of the 10 observations are used to
calculate the summary statistics of y, including the mean and the sample
standard deviation (SD).
. summarize y
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+-----------------------------------------------------
           y |         9   -.1217026      .833473  -1.27354   1.022817
The bootstrap command below saves the bootstrap replicates in sum.dta.
Using the describe command on that dataset, we see that _bs_1 contains the
bootstrap observations of r(mean) and that _bs_2 contains the bootstrap
observations of r(N).
. set seed 1423567
. bootstrap r(mean) r(N), reps(5) saving(sum, replace) nowarn: summarize y
(running summarize on estimation sample)
Bootstrap replications (5)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
.....
Bootstrap results                               Number of obs      =        10
                                                Replications       =         5
      command:  summarize y
        _bs_1:  r(mean)
        _bs_2:  r(N)
------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _bs_1 |  -.1217026   .4225559    -0.29   0.773    -.9498969    .7064918
       _bs_2 |          9    1.67332     5.38   0.000     5.720353    12.27965
------------------------------------------------------------------------------
. describe using sum
Contains data                                 bootstrap: summarize
  obs:             5                          2 Jul 2005 12:02
 vars:             2
 size:            60
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
_bs_1           float  %9.0g                  r(mean)
_bs_2           float  %9.0g                  r(N)
-------------------------------------------------------------------------------
Sorted by:
. use sum, clear
(bootstrap: summarize)
. list
     +-------------------+
     |     _bs_1   _bs_2 |
     |-------------------|
  1. |  .3876454       9 |
  2. | -.6965898       6 |
  3. |  .1137314      10 |
  4. |  -.381191       8 |
  5. | -.2104959      10 |
     +-------------------+
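The above listing of the bootstrap data reveals the problem: not all of the
bootstrap samples contained 9 observations. Because bootstrap resampled all
10 observations, including the one with missing y, the number of nonmissing
values of y varied from sample to sample (here r(N) ranges from 6 to 10).
This problem is easily fixed for this example because we can drop the
observations that have a missing value from the original dataset before
using bootstrap.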
. use resample, clear
. drop if y == .
(1 observation deleted)
. list
     +----------------------+
     |        x           y |
     |----------------------|
  1. | .7503739    -.621165 |
  2. | .6177279    .4850219 |
  3. |  .989426   -1.084084 |
  4. | .4899037    -1.27354 |
  5. | .9458812    1.022817 |
     |----------------------|
  6. | .0838971    .2310362 |
  7. | .4090274    .8443562 |
  8. | .9312586   -.0218735 |
  9. | .8493695   -.6778926 |
     +----------------------+
. set seed 1423567
. bootstrap r(mean) r(N), reps(5) saving(sum, replace) nowarn: summarize y
(running summarize on estimation sample)
Bootstrap replications (5)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
.....
Bootstrap results                               Number of obs      =         9
                                                Replications       =         5
      command:  summarize y
        _bs_1:  r(mean)
        _bs_2:  r(N)
------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _bs_1 |  -.1217026   .2345252    -0.52   0.604    -.5813635    .3379584
       _bs_2 |          9          .        .       .            .           .
------------------------------------------------------------------------------
. use sum, clear
(bootstrap: summarize)
. list
     +-------------------+
     |     _bs_1   _bs_2 |
     |-------------------|
  1. | -.1850747       9 |
  2. | -.4956241       9 |
  3. | -.1272637       9 |
  4. |  .1650546       9 |
  5. | -.1634492       9 |
     +-------------------+
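The same fix extends to commands that read more than one variable. The following is a minimal sketch, assuming the statistic of interest depends on both x and y; the use of correlate and the file name corr are illustrative choices and not part of the original example.

```stata
* Sketch: drop every observation the command would skip,
* then bootstrap on the complete cases only.
use resample, clear
* missing(x, y) is true when either variable is missing
drop if missing(x, y)

set seed 1423567
bootstrap r(rho) r(N), reps(5) saving(corr, replace) nowarn: correlate x y

* Check the saved replicates: every bootstrap sample should have used
* the same number of observations (9 complete cases in this dataset).
use corr, clear
assert _bs_2 == 9
```

If the assert fails, some bootstrap samples used fewer observations than others, which is the symptom described above.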
In the examples above, I used the nowarn option on bootstrap
to suppress the warning message it issues when no e(sample) is
available.
bootstrap will not produce a warning message when an estimation
command (eclass) that generates e(sample) is supplied. Here,
e(sample) provides bootstrap with all the information it needs
to keep unused observations out of the bootstrap samples. As with the
mean of y, the following output shows that only 9 observations are used
to estimate the coefficient on the predictor in a simple linear
regression. The coefficient is saved in _b[x], and the number of
observations used in the estimation is saved in e(N).
. use resample, clear
. regress y x
      Source |       SS       df       MS              Number of obs =       9
-------------+------------------------------           F(  1,     7) =    0.27
       Model |   .20640433     1   .20640433           Prob > F      =  0.6193
    Residual |  5.35101354     7  .764430506           R-squared     =  0.0371
-------------+------------------------------           Adj R-squared = -0.1004
       Total |  5.55741787     8  .694677234           Root MSE      =  .87432
------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |  -.5311304   1.022141    -0.52   0.619    -2.948109    1.885849
       _cons |   .2363303   .7481222     0.32   0.761    -1.532698    2.005358
------------------------------------------------------------------------------
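As an aside, one way to see what e(sample) records is to copy it into a variable right after estimation. This is a minimal sketch; the variable name used is an assumption made for illustration.

```stata
use resample, clear
quietly regress y x

* e(sample) is 1 for observations used in the estimation and 0 otherwise
generate byte used = e(sample)
tabulate used

* List the observations that regress skipped
list x y if !used
```

In this dataset, the only observation with used equal to 0 is observation 5, whose y is missing.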
. set seed 1423567
. bootstrap _b[x] e(N), reps(5) saving(reg, replace): regress y x
(running regress on estimation sample)
Bootstrap replications (5)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
.....
Linear regression                               Number of obs      =         9
                                                Replications       =         5
      command:  regress y x
        _bs_1:  _b[x]
        _bs_2:  e(N)
------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _bs_1 |  -.5311304   1.652094    -0.32   0.748    -3.769176    2.706915
       _bs_2 |          9          .        .       .            .           .
------------------------------------------------------------------------------
. use reg, clear
(bootstrap: regress)
. list
     +-------------------+
     |     _bs_1   _bs_2 |
     |-------------------|
  1. | -1.860403       9 |
  2. | -1.788892       9 |
  3. |  2.142807       9 |
  4. | -.7537238       9 |
  5. | -1.248657       9 |
     +-------------------+
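When the statistic comes from a command that is not eclass, another possible approach is to wrap it in a small eclass program that sets e(sample) itself. The following is a rough sketch under stated assumptions: the program name mymean and the stored result e(mean) are hypothetical, and the sketch assumes that bootstrap only needs the command to set e(sample) in order to exclude unused observations.

```stata
* Hypothetical wrapper: an eclass version of the mean that sets e(sample)
capture program drop mymean
program define mymean, eclass
    version 9
    syntax varname [if] [in]
    * touse is 1 for observations that satisfy if/in and have
    * a nonmissing value of the variable, 0 otherwise
    marksample touse
    quietly summarize `varlist' if `touse'
    tempname m
    scalar `m' = r(mean)
    local n = r(N)
    * Post e(sample) and e(N), then store the statistic in e(mean)
    ereturn post, esample(`touse') obs(`n')
    ereturn scalar mean = `m'
end

use resample, clear
set seed 1423567
bootstrap e(mean) e(N), reps(5): mymean y
```

Under that assumption, the nowarn option is no longer needed, and e(N) should equal 9 in every replication, just as in the regress example.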