FAQ: Resampling and missing values


Why does bootstrap give a warning message for non-eclass commands?

Title		Resampling and missing values
Author		Jeff Pitblado, StataCorp

When bootstrapping statistics on data with missing values, bootstrap may produce misleading or erroneous bias and variance statistics unless the command is an eclass (estimation) command that sets e(sample). Without e(sample), bootstrap cannot tell which observations the command actually used. An example makes the problem concrete.

Consider the following dataset with one missing value:

. clear

. set obs 10
Number of observations (_N) was 0, now 10.

. set seed 570971

. generate x = uniform()

. generate y = invnormal(uniform())

. replace y = . in 5
(1 real change made, 1 to missing)

. save resample, replace
(file resample.dta not found)
file resample.dta saved

. list


             x           y 
  1.  .0901624   -.8072783 
  2.  .8839354    .0117225 
  3.   .423627    .6715007 
  4.  .8497756    -.026581 
  5.  .4759649           . 
      
  6.  .3587709   -.6098545 
  7.  .2387148   -2.177713 
  8.   .915678    .6642656 
  9.  .4609539    .9534492 
 10.  .6992906    -1.15695

It is clear in the following output that only 9 values are used to calculate the sample standard deviation (SD) of y.

. summarize y


     Variable          Obs        Mean    Std. dev.       Min        Max
    
            y            9    -.275271    1.013946  -2.177713   .9534492
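
Before bootstrapping, it can help to confirm how many observations a command will silently skip. The following is a minimal sketch using the resample.dta file saved above; it relies only on count and the missing() function.

```stata
use resample, clear

* total number of observations in the dataset
count

* observations with a missing value for y; summarize skips these
count if missing(y)
```

If the second count is not zero, the number of observations a command uses can differ from the number of observations in the dataset.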

Next, we bootstrap r(mean) and r(N) from summarize and save the bootstrap results in sum.dta. Running describe on that dataset shows that _bs_1 contains the bootstrap observations of r(mean) and _bs_2 contains the bootstrap observations of r(N).

. set seed 1423567

. bootstrap r(mean) r(N), reps(5) saving(sum, replace) nowarn: summarize y
(running summarize on estimation sample)
(file sum.dta not found)

Bootstrap replications (5): ..... done

Bootstrap results                                           Number of obs = 10
                                                            Replications  =  5

      Command: summarize y
        _bs_1: r(mean)
        _bs_2: r(N)



                 Observed   Bootstrap                         Normal-based 
               coefficient  std. err.      z    P>|z|     [95% conf. interval]
   
       _bs_1     -.275271   .1767023    -1.56   0.119    -.6216012    .0710592
       _bs_2            9     .83666    10.76   0.000     7.360176    10.63982



. describe using sum

Contains data                                 bootstrap: summarize
 Observations:             5                  1 Aug 2023 13:34
    Variables:             2



Variable      Storage   Display    Value
    name         type    format    label      Variable label
 
_bs_1           float   %9.0g                 r(mean)
_bs_2           float   %9.0g                 r(N)

Sorted by:


. use sum, clear
(bootstrap: summarize)

. list


          _bs_1   _bs_2 
  1.  -.0924903      10 
  2.   .0861323      10 
  3.   -.088269       9 
  4.  -.4005653       8 
  5.  -.0740297       9
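
The values of _bs_2 vary because each bootstrap sample draws 10 observations with replacement from the 10 in the dataset, so the one observation with missing y can be drawn zero, one, or several times. Under that resampling scheme, the number of copies drawn follows a binomial distribution with 10 trials and success probability 1/10. A small sketch of the resulting probabilities, using Stata's binomialp() function:

```stata
* probability the incomplete observation is never drawn, so r(N) = 10
display binomialp(10, 0, 0.1)

* probability it is drawn exactly once, so r(N) = 9
display binomialp(10, 1, 0.1)

* probability it is drawn two or more times, so r(N) <= 8
display 1 - binomialp(10, 0, 0.1) - binomialp(10, 1, 0.1)
```

These work out to roughly 0.35, 0.39, and 0.26, so a replication matches the original count of 9 less than half the time.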

The above listing of the bootstrap data reveals the problem: the number of observations used, r(N), varies from one bootstrap sample to the next, and only two of the five samples used 9 observations, as the original data did. This problem is easily fixed for this example because we can drop the observations that have a missing value from the original dataset before using bootstrap.

. use resample, clear

. drop if y == .
(1 observation deleted)

. list


             x           y 
  1.  .0901624   -.8072783 
  2.  .8839354    .0117225 
  3.   .423627    .6715007 
  4.  .8497756    -.026581 
  5.  .3587709   -.6098545 
      
  6.  .2387148   -2.177713 
  7.   .915678    .6642656 
  8.  .4609539    .9534492 
  9.  .6992906    -1.15695 


. set seed 1423567

. bootstrap r(mean) r(N), reps(5) saving(sum, replace) nowarn: summarize y
(running summarize on estimation sample)

Bootstrap replications (5): ..... done

Bootstrap results                                            Number of obs = 9
                                                             Replications  = 5

      Command: summarize y
        _bs_1: r(mean)
        _bs_2: r(N)



                 Observed   Bootstrap                         Normal-based 
               coefficient  std. err.      z    P>|z|     [95% conf. interval]
   
       _bs_1     -.275271   .2803826    -0.98   0.326    -.8248108    .2742688
       _bs_2            9          .        .       .            .           .



. use sum, clear
(bootstrap: summarize)

. list


          _bs_1   _bs_2 
  1.   .0178111       9 
  2.  -.5203212       9 
  3.   .1150261       9 
  4.    .092199       9 
  5.  -.3069329       9
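
The same preparation carries over to non-eclass commands that use more than one variable. The following is a minimal sketch, assuming (for illustration only) that the statistic of interest is the correlation of x and y from correlate, which returns r(rho) and r(N). It uses missing(), which is true if any of its arguments is missing and, unlike the comparison y == ., also catches extended missing values such as .a.

```stata
use resample, clear

* drop every observation the command would skip
drop if missing(x, y)

* stop with an error if any missing value remains
assert !missing(x, y)

set seed 1423567
bootstrap r(rho) r(N), reps(5) nowarn: correlate x y
```

With the incomplete observations removed, every bootstrap sample has the same number of usable observations, so r(N) should be constant across replications.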

In the examples above, I used the nowarn option on bootstrap to suppress the warning message it issues when no e(sample) is available.

bootstrap will not produce a warning message when an estimation command (eclass) that generates e(sample) is supplied. Here, e(sample) provides bootstrap with all the information it needs to keep unused observations out of the bootstrap samples. As with the mean of y, it is clear from the following output that only 9 observations are used to estimate the coefficient on the predictor in a simple linear regression. The coefficient is stored in _b[x], and the number of observations used in the estimation is stored in e(N).

. use resample, clear

. regress y x


      Source         SS           df       MS       Number of obs   =         9
                                                    F(1, 7)         =      1.60
       Model    1.53022378         1  1.53022378    Prob > F        =    0.2464
    Residual    6.69446954         7  .956352791    R-squared       =    0.1861
                                                    Adj R-squared   =    0.0698
       Total    8.22469332         8  1.02808666    Root MSE        =    .97793




           y   Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
   
           x     1.451699   1.147646     1.26   0.246    -1.262054    4.165451
       _cons    -1.069013   .7071156    -1.51   0.174    -2.741075    .6030498



. set seed 1423567

. bootstrap _b[x] e(N), reps(5) saving(reg, replace): regress y x
(running regress on estimation sample)
(file reg.dta not found)

Bootstrap replications (5): ..... done

Linear regression                                            Number of obs = 9
                                                             Replications  = 5

      Command: regress y x
        _bs_1: _b[x]
        _bs_2: e(N)



                 Observed   Bootstrap                         Normal-based
               coefficient  std. err.      z    P>|z|     [95% conf. interval]
   
       _bs_1     1.451699   1.172467     1.24   0.216    -.8462939    3.749691
       _bs_2            9          .        .       .            .           .



. use reg, clear
(bootstrap: regress)

. list


          _bs_1   _bs_2 
  1.  -.5315873       9 
  2.   2.245691       9 
  3.   .9832834       9 
  4.   1.318368       9 
  5.   2.373077       9
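
When dropping observations by hand is inconvenient, another option is to wrap the non-eclass command in a small eclass program that sets e(sample) itself. The following is a sketch, not part of the original FAQ: the program name mysummarize and the stored result e(mean) are hypothetical names chosen for illustration, and it assumes bootstrap treats the wrapper like any other estimation command that sets e(sample).

```stata
capture program drop mysummarize
program define mysummarize, eclass
    syntax varname(numeric) [if] [in]

    * flag observations that satisfy if/in and are nonmissing on the variable
    marksample touse

    quietly summarize `varlist' if `touse'

    * copy the r() results before posting to e()
    tempname mean
    scalar `mean' = r(mean)
    local n = r(N)

    * post e(sample) and e(N); esample() consumes the touse variable
    ereturn post, esample(`touse') obs(`n')
    ereturn scalar mean = `mean'
end

use resample, clear
set seed 1423567
bootstrap e(mean) e(N), reps(5): mysummarize y
```

The design choice here is to let marksample decide which observations count, so the rule for excluding missing values lives in one place rather than in a separate drop step before each bootstrap call.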
