
Why does bootstrap give a warning message for non-eclass commands?

Title   Resampling and missing values
Author Jeff Pitblado, StataCorp
Date August 2001; updated July 2005

When bootstrapping statistics on data with missing values, bootstrap may produce misleading or erroneous bias and variance statistics unless the command is an eclass command that generates e(sample). To better explain the problem, here is an example.

Consider the following dataset with one missing value:

 . clear

 . set obs 10
 obs was 0, now 10

 . set seed 570971

 . generate x = uniform()

 . generate y = invnormal(uniform())

 . replace y = . in 5
 (1 real change made, 1 to missing)

 . save resample, replace
 file resample.dta saved

 . list

      +----------------------+
      |        x           y |
      |----------------------|
   1. | .7503739    -.621165 |
   2. | .6177279    .4850219 |
   3. |  .989426   -1.084084 |
   4. | .4899037    -1.27354 |
   5. | .7327343           . |
      |----------------------|
   6. | .9458812    1.022817 |
   7. | .0838971    .2310362 |
   8. | .4090274    .8443562 |
   9. | .9312586   -.0218735 |
  10. | .8493695   -.6778926 |
      +----------------------+

It is clear in the following output that only 9 values are used to calculate the sample standard deviation (SD) of y.

 . summarize y
    
     Variable |     Obs        Mean   Std. Dev.       Min        Max
 -------------+-----------------------------------------------------
            y |       9   -.1217026    .833473   -1.27354   1.022817

The bootstrap command below saves its replicates in sum.dta. Running describe on that dataset shows that _bs_1 contains the bootstrap observations of r(mean) and that _bs_2 contains the bootstrap observations of r(N).

 . set seed 1423567
    
 . bootstrap r(mean) r(N), reps(5) saving(sum, replace) nowarn: summarize y
 (running summarize on estimation sample)

 Bootstrap replications (5)
 ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
 .....

 Bootstrap results                               Number of obs      =        10
                                                 Replications       =         5

       command:  summarize y
         _bs_1:  r(mean)
         _bs_2:  r(N)

 ------------------------------------------------------------------------------
              |   Observed   Bootstrap                         Normal-based
              |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
        _bs_1 |  -.1217026   .4225559    -0.29   0.773    -.9498969    .7064918
        _bs_2 |          9    1.67332     5.38   0.000     5.720353    12.27965
 ------------------------------------------------------------------------------

 . describe using sum

 Contains data                                 bootstrap: summarize
   obs:             5                          2 Jul 2005 12:02
  vars:             2                          
  size:            60                          
 -------------------------------------------------------------------------------
               storage  display     value
 variable name   type   format      label      variable label
 -------------------------------------------------------------------------------
 _bs_1           float  %9.0g                  r(mean)
 _bs_2           float  %9.0g                  r(N)
 -------------------------------------------------------------------------------
 Sorted by:  
     
 . use sum, clear
 (bootstrap: summarize)
 
 . list

      +-------------------+
      |     _bs_1   _bs_2 |
      |-------------------|
   1. |  .3876454       9 |
   2. | -.6965898       6 |
   3. |  .1137314      10 |
   4. |  -.381191       8 |
   5. | -.2104959      10 |
      +-------------------+
 

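The above listing of the bootstrap data reveals the problem: not all of the bootstrap samples contained 9 observations. bootstrap resampled all 10 observations, including the one with the missing value of y, so the number of nonmissing values varied from replicate to replicate. This problem is easily fixed for this example because we can drop the observations that have a missing value from the original dataset before using bootstrap.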

  . use resample, clear

  . drop if y == .
  (1 observation deleted)

  . list

       +----------------------+
       |        x           y |
       |----------------------|
    1. | .7503739    -.621165 |
    2. | .6177279    .4850219 |
    3. |  .989426   -1.084084 |
    4. | .4899037    -1.27354 |
    5. | .9458812    1.022817 |
       |----------------------|
    6. | .0838971    .2310362 |
    7. | .4090274    .8443562 |
    8. | .9312586   -.0218735 |
    9. | .8493695   -.6778926 |
       +----------------------+
  
  . set seed 1423567

  . bootstrap r(mean) r(N), reps(5) saving(sum, replace) nowarn: summarize y
  (running summarize on estimation sample)

  Bootstrap replications (5)
  ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
  .....

  Bootstrap results                               Number of obs      =         9
                                                  Replications       =         5

        command:  summarize y
          _bs_1:  r(mean)
          _bs_2:  r(N)

  ------------------------------------------------------------------------------
               |   Observed   Bootstrap                         Normal-based
               |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
  -------------+----------------------------------------------------------------
         _bs_1 |  -.1217026   .2345252    -0.52   0.604    -.5813635    .3379584
         _bs_2 |          9          .        .       .            .           .
  ------------------------------------------------------------------------------

  . use sum, clear
  (bootstrap: summarize)

  . list

       +-------------------+
       |     _bs_1   _bs_2 |
       |-------------------|
    1. | -.1850747       9 |
    2. | -.4956241       9 |
    3. | -.1272637       9 |
    4. |  .1650546       9 |
    5. | -.1634492       9 |
       +-------------------+

In the examples above, I used the nowarn option on bootstrap to suppress the warning message it issues when no e(sample) is available.
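
The fix above can also be written with the missing() function and checked without reading the listing by eye. The following is a minimal sketch, not part of the original example: it rebuilds a small dataset, adds a hypothetical extended missing value (.a) to show that y == . would not catch it, and saves the replicates under the illustrative filename sumcheck so that sum.dta is left alone.

```stata
* Sketch: drop missing values before bootstrapping a non-eclass command
clear
set obs 10
set seed 570971
generate y = invnormal(uniform())
replace y = .  in 5
replace y = .a in 7            // hypothetical extended missing value

count if y == .                // counts only the system missing value
count if missing(y)            // counts both . and .a

drop if missing(y)             // leaves 8 complete observations

set seed 1423567
bootstrap r(mean) r(N), reps(5) saving(sumcheck, replace) nowarn: summarize y

* Every replicate should now report the same number of observations
use sumcheck, clear
assert _bs_2 == 8
```

If the assert fails, some replicates were computed on fewer observations than others, which is the symptom shown in the first listing.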

bootstrap will not produce a warning message when an estimation command (eclass) that generates e(sample) is supplied. Here e(sample) provides bootstrap with all the information it needs to keep unused observations out of the bootstrap samples. As with the mean of y, it is clear from the following output that only 9 observations are used to estimate the coefficient on the predictor in a simple linear regression. The coefficient is saved in _b[x], and the number of observations used in the estimation is saved in e(N).

 . use resample, clear

 . regress y x

       Source |       SS       df       MS              Number of obs =       9
 -------------+------------------------------           F(  1,     7) =    0.27
        Model |   .20640433     1   .20640433           Prob > F      =  0.6193
     Residual |  5.35101354     7  .764430506           R-squared     =  0.0371
 -------------+------------------------------           Adj R-squared = -0.1004
        Total |  5.55741787     8  .694677234           Root MSE      =  .87432

 ------------------------------------------------------------------------------
            y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
            x |  -.5311304   1.022141    -0.52   0.619    -2.948109    1.885849
        _cons |   .2363303   .7481222     0.32   0.761    -1.532698    2.005358
 ------------------------------------------------------------------------------

 . set seed 1423567

 . bootstrap _b[x] e(N), reps(5) saving(reg, replace): regress y x
 (running regress on estimation sample)

 Bootstrap replications (5)
 ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
 .....

 Linear regression                               Number of obs      =         9
                                                 Replications       =         5
 
       command:  regress y x
         _bs_1:  _b[x]
         _bs_2:  e(N)

 ------------------------------------------------------------------------------
              |   Observed   Bootstrap                         Normal-based
              |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
 -------------+----------------------------------------------------------------
        _bs_1 |  -.5311304   1.652094    -0.32   0.748    -3.769176    2.706915
        _bs_2 |          9          .        .       .            .           .
 ------------------------------------------------------------------------------

 . use reg, clear
 (bootstrap: regress)

 . list

      +-------------------+
      |     _bs_1   _bs_2 |
      |-------------------|
   1. | -1.860403       9 |
   2. | -1.788892       9 |
   3. |  2.142807       9 |
   4. | -.7537238       9 |
   5. | -1.248657       9 |
      +-------------------+
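
The same idea can be applied to a statistic that comes from a non-eclass command by wrapping it in a small eclass program that posts e(sample). The following is a minimal sketch under stated assumptions: mymean is a hypothetical program name, the filename mymean_bs is illustrative, and the sketch assumes that bootstrap treats a user-written eclass program that sets e(sample) the same way it treats regress.

```stata
* Sketch: a hypothetical eclass wrapper around summarize that posts e(sample)
capture program drop mymean
program define mymean, eclass
    version 9
    syntax varname [if] [in]
    marksample touse                  // 1 for observations with nonmissing data
    quietly summarize `varlist' if `touse'
    tempname m
    scalar `m' = r(mean)
    local n = r(N)
    ereturn post, esample(`touse') obs(`n')   // sets e(sample) and e(N)
    ereturn scalar mean = `m'
end

* Rebuild a dataset with one missing value and bootstrap the wrapped command
clear
set obs 10
set seed 570971
generate y = invnormal(uniform())
replace y = . in 5

set seed 1423567
bootstrap e(mean) e(N), reps(5) saving(mymean_bs, replace): mymean y
```

With this wrapper there is no need to drop observations by hand, because marksample excludes the observation with the missing value and ereturn post records that choice in e(sample).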