Re: st: Re: Bootstrapping with unbalanced panel

From	Benjamin M Miller <[email protected]>
To	[email protected]
Subject	Re: st: Re: Bootstrapping with unbalanced panel
Date	Fri, 9 Aug 2013 20:16:51 -0700
I'm new to statalist, so hopefully the below is appropriately documented.

I just finished dealing with a similar issue; there are related
problems in several command files.  The -bsample- command may not
correctly specify -idcluster()-, but users of the -bootstrap- command
will continue to have problems even if that is resolved.

There have been many related threads on problems with the -cluster-
and -idcluster- options of the -bootstrap- command when using panel
data (e.g., http://www.stata.com/statalist/archive/2010-06/msg01295.html,
http://www.stata.com/statalist/archive/2006-05/msg00188.html,
http://www.stata.com/statalist/archive/2011-04/msg01348.html,
http://www.stata.com/statalist/archive/2010-12/msg00654.html).
Hopefully the below explanation sheds some light on these issues.

In a nutshell, the problem is this: when bootstrapping declared panel
data, each resample requires the panel structure of the data to be
re-declared appropriately.  -bootstrap- calls -_loop_bs- to loop over
the sampling process, which in turn calls -bsample- to do the actual
sampling.  Even if -bsample- creates the correct -idcluster()-
variable, -_loop_bs- re-declares the panel structure with the original
panel variable and not the new variable created by -bsample-.  The
result is, of course, "repeated time values within panel - the most
likely cause for this error is misspecifying the cluster(),
idcluster(), or group() option".


Here's some documentation:

I created a test dataset with two random variables X and Y
(distributed U(0,1), but that's unimportant).  For a panel structure,
there are ten individuals (ID 1-10) with ten observations each (Year
2000 - 2009).  Hence this dataset has 100 observations and looks like
this:

Year    ID    X           Y
2000    1     0.984563    0.9534
2001    1     0.596068    0.67932
...
2009    1     0.363387    0.483985
2000    2     0.636904    0.89323
...
2008    10    0.41201     0.38558
2009    10    0.077976    0.231712
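
For anyone who wants to reproduce this, a dataset with the same shape
can be generated along the following lines.  This is a sketch, not the
code used for this post: the seed is arbitrary, so the values of X and
Y will differ from those listed above.

```stata
* Sketch: 10 panels (ID 1-10) x 10 years (2000-2009), X and Y ~ U(0,1)
clear
set seed 12345                        // arbitrary seed (assumption)
set obs 100
generate ID = ceil(_n/10)             // obs 1-10 -> ID 1, 11-20 -> ID 2, ...
bysort ID: generate Year = 1999 + _n  // years 2000-2009 within each ID
generate X = runiform()
generate Y = runiform()
```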

Now we start running some bootstrap commands.  I've set the number of
repetitions to 2 because more is unnecessary for this point.  These
three versions of -bootstrap- work just fine (I'll only show output
for the last one):

. bootstrap, reps(2): reg Y X

. bootstrap, reps(2) cluster(ID) idcluster(newID): reg Y X

. xtset ID Year
       panel variable:  ID (strongly balanced)
        time variable:  Year, 2000 to 2009
                delta:  1 unit

. bootstrap, reps(2): reg Y X
(running regress on estimation sample)

Bootstrap replications (2)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..

Linear regression                               Number of obs      =       100
                                                Replications       =         2
                                                Wald chi2(1)       =      0.00
                                                Prob > chi2        =    0.9473
                                                R-squared          =    0.0001
                                                Adj R-squared      =   -0.0101
                                                Root MSE           =    0.2874

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
           Y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           X |  -.0100076   .1514095    -0.07   0.947    -.3067647    .2867495
       _cons |   .5055374   .0887203     5.70   0.000     .3316488     .679426
------------------------------------------------------------------------------


Now let's keep the panel structure, but also cluster at the panel
variable level.  Because we will inevitably resample some clusters, we
use -idcluster(newID)- to declare that a new panel variable, called
"newID", should be created for each resample.  This variable should
assign unique values to duplicate clusters.  However, we find:

. xtset ID Year
       panel variable:  ID (strongly balanced)
        time variable:  Year, 2000 to 2009
                delta:  1 unit

. bootstrap, reps(2) cluster(ID) idcluster(newID): reg Y X
(running regress on estimation sample)

Bootstrap replications (2)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
repeated time values within panel
the most likely cause for this error is misspecifying the cluster(),
idcluster(), or group() option
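
The same failure can be seen by hand, outside -bootstrap-, by drawing
one cluster resample and trying to declare the panel both ways.  The
sketch below assumes the test data described above are in memory and
that -bsample- fills in newID correctly in your version of Stata; it
only illustrates the mechanism and is not what -_loop_bs- does
internally.

```stata
preserve
* Resample whole panels with replacement; newID labels each drawn copy
bsample, cluster(ID) idcluster(newID)
* A panel drawn more than once repeats its (ID, Year) pairs
duplicates report ID Year
* Declaring the panel on the original ID fails if any panel was drawn twice
capture noisily xtset ID Year
* Declaring it on newID should work if newID separates the duplicate copies
xtset newID Year
restore
```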


Here's a good question: why didn't we get complaints of repeated time
values in case three (the one with declared panel data but without
clusters)?  We still had declared panel data, and we should still
have had repeated time values within panel.  The answer is as follows:
-_loop_bs- does declare panel data using the original panel variable
and not the one you named in -idcluster()-.  However, -bootstrap-
only passes the names of the time and panel variables to -_loop_bs-
when the -cluster()- option is specified.  When -cluster()- is not
specified, the sampling routine doesn't know it is working with panel
data.  Hence it doesn't complain about repeated time values, because
it never declares the resample to be panel data.  This means you
can't use things like lag operators, even on declared panel data:

. xtset ID Year
       panel variable:  ID (strongly balanced)
        time variable:  Year, 2000 to 2009
                delta:  1 unit

. bootstrap, reps(2): reg Y X L.X
time-series operators are not allowed with bootstrap without panels, see tsset

I fixed this by creating -mybootstrap- which always passes panel
information to -my_loop_bs- (How does/should one share new or edited
.ado files?  I assume most users don't want to replicate this
editing.).  -my_loop_bs- then sets the variable specified in
-idcluster()- to uniquely identify duplicate clusters and uses that as
the panel variable for each re-sampling.  Now the -idcluster()- option
is required for all panel data, and this seems to work.
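
For readers who would rather not edit .ado files, the per-replication
logic described above (resample clusters, give each drawn copy a
unique id, re-declare the panel on that id, then estimate) can be
approximated with -bsample- and -simulate-.  This is a rough sketch
and not the -mybootstrap-/-my_loop_bs- code: the program name one_rep,
the seed, and the number of replications are placeholders, and it
assumes -bsample- builds newID correctly.

```stata
capture program drop one_rep
program define one_rep, rclass
    preserve
    * Draw whole panels with replacement; newID is unique per drawn copy
    bsample, cluster(ID) idcluster(newID)
    * Re-declare the panel on the new identifier so L. operators are valid
    xtset newID Year
    quietly regress Y X L.X
    return scalar b_X  = _b[X]
    return scalar b_LX = _b[L.X]
    restore
end

* Collect the replicate coefficients (hypothetical reps and seed)
simulate b_X=r(b_X) b_LX=r(b_LX), reps(50) seed(12345): one_rep
summarize b_X b_LX
```

The standard deviations reported by -summarize- are then the
cluster-bootstrap standard error estimates for the two coefficients.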

. xtset ID Year
       panel variable:  ID (strongly balanced)
        time variable:  Year, 2000 to 2009
                delta:  1 unit

. mybootstrap, reps(2) cluster(ID) idcluster(newID): reg Y X
(running regress on estimation sample)

Bootstrap replications (2)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..

Linear regression                               Number of obs      =       100
                                                Replications       =         2
                                                Wald chi2(1)       =      0.33
                                                Prob > chi2        =    0.5660
                                                R-squared          =    0.0001
                                                Adj R-squared      =   -0.0101
                                                Root MSE           =    0.2874

                                     (Replications based on 10 clusters in ID)
------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
           Y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           X |  -.0100076   .0174344    -0.57   0.566    -.0441783    .0241631
       _cons |   .5055374    .020016    25.26   0.000     .4663068     .544768
------------------------------------------------------------------------------

. mybootstrap, reps(2) idcluster(newID): reg Y X L.X
(running regress on estimation sample)

Bootstrap replications (2)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..

Linear regression                               Number of obs      =        90
                                                Replications       =         2
                                                Wald chi2(1)       =         .
                                                Prob > chi2        =         .
                                                R-squared          =    0.0034
                                                Adj R-squared      =   -0.0195
                                                Root MSE           =    0.2903

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
           Y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           X |
         --. |   .0555275   .0914356     0.61   0.544    -.1236831     .234738
         L1. |  -.0152741   .3603821    -0.04   0.966      -.72161    .6910618
             |
       _cons |   .4729284    .265143     1.78   0.074    -.0467423    .9925992
------------------------------------------------------------------------------


This solution sweeps a couple more complex questions under the rug.

First, if we use an -idcluster()- approach on a sample that was not
selected at the cluster level (such as the lag example), we'd be
turning a balanced panel into an unbalanced panel, or an unbalanced
panel into a "less" balanced panel.  My intuition says that because
the lags are missing at random, the resulting standard errors should
be fine.  But I haven't thought about it deeply.

Second, even after all these fixes, you will still be returned error
messages when your regression includes panel-level fixed effects or
any other set of variables which will necessarily include at least one
variable with no observations when some observations are not sampled.
For good reason -bootstrap- does not return standard errors when the
independent variables have changed.  You *can* still get
asymptotically accurate bootstrapped standard errors in this case, but
the edits to .ado files are more complex.  If there is demand for
that, I can write something up (I have a clunky but working version,
because that scenario is exactly what made me dig through all those
.ado files).


Hope that helps someone,
Ben