From: Benjamin M Miller <BenMillerUCSD@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Re: Bootstrapping with unbalanced panel
Date: Fri, 9 Aug 2013 20:16:51 -0700
I'm new to Statalist, so hopefully the below is appropriately documented.

I just finished dealing with a similar issue; there are related problems in several command files. The -bsample- command may not correctly specify -idcluster()-, but users of the -bootstrap- command will continue to have problems even if that is resolved. There have been many related threads on problems with the -cluster()- and -idcluster()- options of the -bootstrap- command when using panel data, e.g.

http://www.stata.com/statalist/archive/2010-06/msg01295.html
http://www.stata.com/statalist/archive/2006-05/msg00188.html
http://www.stata.com/statalist/archive/2011-04/msg01348.html
http://www.stata.com/statalist/archive/2010-12/msg00654.html

Hopefully the explanation below sheds some light on these issues.

In a nutshell, the problem is this: when bootstrapping declared panel data, each resample requires the panel structure of the data to be re-declared appropriately. -bootstrap- calls -_loop_bs- to loop over the sampling process, which in turn calls -bsample- to do the actual sampling. Even if -bsample- creates the correct -idcluster()- variable, -_loop_bs- declares the panel structure with the original panel variable and not with the new variable created by -bsample-. The result is, of course,

    repeated time values within panel
    the most likely cause for this error is misspecifying the cluster(), idcluster(), or group() option

Here's some documentation. I created a test dataset with two random variables X and Y (distributed U(0,1), but that's unimportant). For the panel structure, there are ten individuals (ID 1-10), each with ten observations (Year 2000-2009). Hence this dataset has 100 observations and looks like this:

    Year   ID          X          Y
    2000    1   0.984563     0.9534
    2001    1   0.596068    0.67932
    ...
    2009    1   0.363387   0.483985
    2000    2   0.636904    0.89323
    ...
    2008   10    0.41201    0.38558
    2009   10   0.077976   0.231712

Now we start running some bootstrap commands.
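For anyone who wants to follow along, a dataset with this layout can be rebuilt roughly as below. This is only a sketch: the seed is an arbitrary assumption, so the X and Y values will not match the listing above. The second half does one replication by hand to show the re-declaration problem described in the nutshell paragraph: resampled panels have to be relabeled (here via -bsample-'s idcluster() option) before they can be -xtset- again.

```stata
* Sketch: a 10 x 10 balanced panel like the one described above.
* The seed is an assumption, so X and Y will differ from the post.
clear
set seed 12345
set obs 10
generate ID = _n
expand 10
bysort ID: generate Year = 1999 + _n    // Year runs 2000-2009 within each ID
generate X = runiform()
generate Y = runiform()
xtset ID Year

* One replication by hand, WITHOUT relabeling the drawn panels.
* With 10 panels drawn with replacement, some ID is almost surely drawn
* twice, so re-declaring the panel on ID is expected to fail.
preserve
bsample, cluster(ID)
capture noisily xtset ID Year
restore

* Same again, but let -bsample- give each drawn panel a unique label.
preserve
bsample, cluster(ID) idcluster(newID)
xtset newID Year                         // newID-Year pairs are unique
regress Y X L.X                          // lag operators now work
restore
```

Run this from a do-file (the // comments are not understood at the interactive prompt).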
I've set the number of repetitions to 2 because more is unnecessary for this point. These three versions of -bootstrap- work just fine (I'll only show output for the last one):

. bootstrap, reps(2): reg Y X

. bootstrap, reps(2) cluster(ID) idcluster(newID): reg Y X

. xtset ID Year
       panel variable:  ID (strongly balanced)
        time variable:  Year, 2000 to 2009
                delta:  1 unit

. bootstrap, reps(2): reg Y X
(running regress on estimation sample)

Bootstrap replications (2)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..

Linear regression                               Number of obs      =       100
                                                Replications       =         2
                                                Wald chi2(1)       =      0.00
                                                Prob > chi2        =    0.9473
                                                R-squared          =    0.0001
                                                Adj R-squared      =   -0.0101
                                                Root MSE           =    0.2874

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
           Y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           X |  -.0100076   .1514095    -0.07   0.947    -.3067647    .2867495
       _cons |   .5055374   .0887203     5.70   0.000     .3316488     .679426
------------------------------------------------------------------------------

Now let's keep the panel structure, but also cluster at the panel variable level. Because we will inevitably draw some clusters more than once, we use -idcluster(newID)- to declare that a new panel variable, called "newID", should be created for each resample. This variable should assign duplicate clusters unique values. However, we find

. xtset ID Year
       panel variable:  ID (strongly balanced)
        time variable:  Year, 2000 to 2009
                delta:  1 unit

. bootstrap, reps(2) cluster(ID) idcluster(newID): reg Y X
(running regress on estimation sample)

Bootstrap replications (2)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
repeated time values within panel
the most likely cause for this error is misspecifying the cluster(), idcluster(), or group() option

Here's a good question: why didn't we get complaints of repeated time values in case three (the one with declared panel data but without clusters)? We still had declared panel data, and we should still have had repeated time values within panel.

The answer is as follows: -_loop_bs- does declare panel data using the original panel variable and not the one you told it to use in -idcluster()-. However, -bootstrap- only passes the names of the time and panel variables to -_loop_bs- when the -cluster()- option is declared. When -cluster()- is not declared, the sampling routine doesn't know it is working with panel data. Hence it doesn't complain about repeated time values, because it never declares the resample to be panel data. This means you can't use things like lag operators, even on declared panel data:

. xtset ID Year
       panel variable:  ID (strongly balanced)
        time variable:  Year, 2000 to 2009
                delta:  1 unit

. bootstrap, reps(2): reg Y X L.X
time-series operators are not allowed with bootstrap without panels, see tsset

I fixed this by creating -mybootstrap-, which always passes panel information to -my_loop_bs-. (How does/should one share new or edited .ado files? I assume most users don't want to replicate this editing.) -my_loop_bs- then sets the variable specified in -idcluster()- to uniquely identify duplicate clusters and uses that as the panel variable for each resample. Now the -idcluster()- option is required for all panel data, and this seems to work.

. xtset ID Year
       panel variable:  ID (strongly balanced)
        time variable:  Year, 2000 to 2009
                delta:  1 unit

. mybootstrap, reps(2) cluster(ID) idcluster(newID): reg Y X
(running regress on estimation sample)

Bootstrap replications (2)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..

Linear regression                               Number of obs      =       100
                                                Replications       =         2
                                                Wald chi2(1)       =      0.33
                                                Prob > chi2        =    0.5660
                                                R-squared          =    0.0001
                                                Adj R-squared      =   -0.0101
                                                Root MSE           =    0.2874

                                     (Replications based on 10 clusters in ID)
------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
           Y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           X |  -.0100076   .0174344    -0.57   0.566    -.0441783    .0241631
       _cons |   .5055374    .020016    25.26   0.000     .4663068     .544768
------------------------------------------------------------------------------

. mybootstrap, reps(2) idcluster(newID): reg Y X L.X
(running regress on estimation sample)

Bootstrap replications (2)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..

Linear regression                               Number of obs      =        90
                                                Replications       =         2
                                                Wald chi2(1)       =         .
                                                Prob > chi2        =         .
                                                R-squared          =    0.0034
                                                Adj R-squared      =   -0.0195
                                                Root MSE           =    0.2903

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
           Y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           X |
         --. |   .0555275   .0914356     0.61   0.544    -.1236831     .234738
         L1. |  -.0152741   .3603821    -0.04   0.966      -.72161    .6910618
             |
       _cons |   .4729284    .265143     1.78   0.074    -.0467423    .9925992
------------------------------------------------------------------------------

This solution sweeps a couple of more complex questions under the rug.

First, if we use an -idcluster()- approach on a sample that was not selected at the cluster level (such as the lag example), we'd be turning a balanced panel into an unbalanced panel, or an unbalanced panel into a "less" balanced panel.
My intuition says that because the lags are missing at random, the resulting standard errors should be fine, but I haven't thought about it deeply.

Second, even after all these fixes, you will still get error messages when your regression includes panel-level fixed effects or any other set of variables which will necessarily include at least one variable with no observations when some observations are not sampled. For good reason, -bootstrap- does not return standard errors when the set of independent variables has changed. You *can* still get asymptotically accurate bootstrapped standard errors in this case, but the edits to the .ado files are more complex. If there is demand for that, I can write something up (I have a clunky but working version, because that scenario is exactly what made me dig through all those .ado files).

Hope that helps someone,
Ben
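For readers who would rather not edit .ado files at all, the same relabeling idea can be written as a plain do-file loop around -bsample-. This is a minimal sketch, not the -mybootstrap- code described above: it assumes a panel with variables ID, Year, X and Y as in the test dataset, and the seed, the 50 replications, and the result names b_X and b_LX are arbitrary choices made for illustration.

```stata
* Sketch: cluster bootstrap by hand, re-declaring the panel in every resample.
* Data setup mirrors the post's test dataset (values depend on the assumed seed).
clear
set seed 12345
set obs 10
generate ID = _n
expand 10
bysort ID: generate Year = 1999 + _n
generate X = runiform()
generate Y = runiform()
xtset ID Year

tempname results
tempfile bootfile
postfile `results' b_X b_LX using `bootfile'

forvalues r = 1/50 {
    preserve
    * Draw whole panels with replacement; newID labels each draw uniquely.
    bsample, cluster(ID) idcluster(newID)
    * Re-declare the panel on the relabeled identifier so L. is well defined.
    quietly xtset newID Year
    quietly regress Y X L.X
    post `results' (_b[X]) (_b[L.X])
    restore
}
postclose `results'

* The standard deviations across replications are the bootstrap standard errors.
* Note: this replaces the data in memory with the replication results.
use `bootfile', clear
summarize b_X b_LX
```

The point of the design is simply that -xtset- is called inside the loop on newID rather than once on ID, which is the step the post argues -_loop_bs- gets wrong.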