
From: wgould@stata.com (William Gould)
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Bootstrapping panel data with certain constraints
Date: Fri, 09 Aug 2002 09:28:55 -0500

Eva Poen <Eva.Poen@student.unisg.ch> writes,

> [...] I am working on an analysis of some panel data. [...] the
> observations in each period [are not] independent [...] Observations
> (called "subjects") are organised in groups (of 3 or 4 people), which are
> constant over time. Subjects within the groups are dependent (because they
> strategically interact), but groups are independent.
>
> What I plan to do now is some sort of 'bootstrapping'. As I obviously cannot
> include all observations at a time in regressions, I want to draw random
> samples from my data in which only one subject of each group appears at a
> time, and do my analyses on theses independent observations. The goal is to
> repeat this over and over and then to compare results.
>
> Now two questions on this:
>
> 1) After reading manuals and FAQ's and trying a bit around I could not
>    find a possibility to do this with Stata's bootstrap capabilities. I
>    would be very happy if someone knew a solution to this very special
>    kind of bootstrapping programming issue.
>
> 2) I talked to several econometricians on this subject, but they could not
>    really tell if this procedure gives valid results on point estimates
>    and confidence intervals. Any comments on those statistical issues are
>    very much appreciated.

I believe what Eva proposes is statistically valid.


1. The logic of the bootstrap
------------------------------

The mechanics of the bootstrap work like this:

    M1.  With some dataset D, we estimate a model to obtain estimates b.

    M2.  To obtain standard errors for b, we repeatedly draw samples with
         replacement from D of the same sample size, reestimate the model
         to obtain estimates b_i for the i-th resampling, and calculate
         the standard deviations of the b_i.  We use those standard
         deviations as the standard errors for b.

The justification of the bootstrap step (M2) is

    J1.
         If we had access to the population P from which D was drawn,
         clearly we could repeatedly draw samples of size N from that,
         calculate standard deviations, and use those as standard errors.
         They would be the standard errors if we did that an infinite
         number of times.  We draw samples of that size because we want to
         evaluate the variance function at that sample size -- the sample
         size we used in the production of b.

    J2.  We do not really have to do (J1) an infinite number of times; a
         large number of times will yield approximate results that are
         very good.

    J3.  We can use D as a proxy for P under the assumption that D is
         large enough.

Justification J3 is important to appreciate.  D had better be large, because
what we really need for step M2 is the population P, and we are pretending
that P==D.  Note that, in performing the bootstrap, there is no accounting for
how well D approximates P.  That step is all handwaving.  The bootstrap
produces correct standard errors under the assumption that D==P.


Exercise 1
----------

We have dataset D1 of sample size N1 and we carry out steps M1 and M2.
Someone comes to us later and says they have a new dataset D2 (drawn from the
same population).  It has N2>N1 observations.  Can we reperform M2 using D2?
Should we?

Answer:  Yes to both, but in reperforming step M2 with D2, we must be careful
to draw N1 observations, not N2.  We want to evaluate the variance of the
estimator at a sample size of N1 because that is the sample size we used to
obtain b in step M1.  We should do this because N2 being larger than N1 means
that we can expect D2 to reflect the population P better than D1 did.

In fact, we can do something even better.  We could combine the two datasets
and reperform step M2 drawing samples of N1 observations from the N1+N2.
D1+D2 should be an even better proxy for the population.

In fact, we can do something even better.  We could go back and reperform
steps M1 and M2 using all N2 observations.
We would estimate on N2 observations and evaluate the variance function
(step M2) at N2 observations.

In fact, we can do something even better.  We could combine the two datasets
and reperform steps M1 and M2 using all N1+N2 observations.

But if for some reason we could not reperform M1, using a better proxy for P
can only make results more accurate.


Eva's problem
-------------

Eva has a dataset D with M independent groups and (say) 3 observations per
group, for a total dataset size of N=3*M.  I will assume a fixed 3
observations per group to make the notation simpler, but nothing below hinges
on the fact that I have fixed the number of observations per group.

Eva is concerned about within-group correlation.  There are estimators that
would perhaps "handle" the problem, but they require assumptions, and Eva is
so concerned about within-group correlation that she is willing to give up
efficiency to rid herself of the problem.  She says:  I will sample one
subject from each group and estimate my model on N/3 observations.

Fine.  Let D1 be the dataset drawn from D on which Eva performs her
estimation; N1=N/3.

Eva now wants to calculate the bootstrap variance estimate for the estimate of
b that she obtains.  The standard bootstrap way to do this would be to
repeatedly resample N/3 observations from D1.  That will yield fine results
under the assumption that D1 is large enough to reflect the population of
groups.

Whether we meet that assumption is an interesting question.  Even if D1 had an
infinite number of observations, it would still not equal P because Eva has
told us that P has multiple observations per group.  However, let's consider
two extremes:  the correlation within group is (+/-)1 and the correlation
within group is 0.  In the first case, there is no extra information in adding
observations within group, and so a sample of one observation per group is
sufficient.
In the second case, observations within group are independent, and so, taking
the limit as N->infinity, a sample of N/3 is equivalent to a sample of N.

Eva in fact has a larger dataset D from which she could draw N/3 observations.
To substitute D for D1 in step M2, the right way to proceed is (1) draw a
sample of N/3 groups from D1 and (2) for each group selected, draw one of the
three observations available in D for the group.  This turns out to be
equivalent to simply drawing N/3 observations from D because of the fixed
number of subjects per group that I assumed.  If the number of subjects per
group varies, we must use the two-step sampling scheme.


Programming technique
---------------------

Eva cannot use -bs- or -bstrap-.  Eva will have to build her own bootstrap
estimator.

Eva formed D1 by selecting one subject per group, so we know that D1 and D
have the same groups.  Thus, to form our bootstrap sample, we can start with
D, cluster sample the groups, and then select one subject from each group.
That is, with D in memory, we will

        bsample, cluster(group) idcluster(newgroup)
        gen u = uniform()
        sort newgroup u
        by newgroup: keep if _n==1

The rest of the program is the "standard stuff" to loop over replications,
perform the estimates, and post the results:

        program define myboot
                args nreps
                postfile myres b1 b2 b3 ... using myres.dta, replace
                forvalues i=1(1)`nreps' {
                        qui use D, clear
                        * resample whole groups with replacement
                        qui bsample, cluster(group) idcluster(newgroup)
                        * keep one randomly chosen subject per resampled group
                        qui gen u = uniform()
                        sort newgroup u
                        qui by newgroup: keep if _n==1
                        qui <perform estimation>
                        post myres (_b[v1]) (_b[v2]) (_b[v3]) ...
                }
                postclose myres
                use myres, clear
                summarize
        end

-- Bill
wgould@stata.com
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

