The Stata listserver

Re: st: Bootstrapping panel data with certain constraints

From (William Gould)
Subject   Re: st: Bootstrapping panel data with certain constraints
Date   Fri, 09 Aug 2002 09:28:55 -0500

Eva Poen <> writes, 

> [...] I am working on an analysis of some panel data.  [...]  the
> observations in each period [are not] independent [...]  Observations
> (called "subjects") are organised in groups (of 3 or 4 people), which are
> constant over time. Subjects within the groups are dependent (because they
> strategically interact), but groups are independent.
> What I plan to do now is some sort of 'bootstrapping'. As I obviously cannot
> include all observations at a time in regressions, I want to draw random
> samples from my data in which only one subject of each group appears at a
> time, and do my analyses on theses independent observations. The goal is to
> repeat this over and over and then to compare results.
> Now two questions on this:
>   1) After reading manuals and FAQ's and trying a bit around I could not
>      find a possibility to do this with Stata's bootstrap capabilities. I
>      would be very happy if someone knew a solution to this very special
>      kind of bootstrapping programming issue.
>   2) I talked to several econometricians on this subject, but they could not
>      really tell if this procedure gives valid results on point estimates
>      and confidence intervals. Any comments on those statistical issues are
>      very much appreciated.

I believe what Eva proposes is statistically valid.

1.  The logic of the bootstrap

The mechanics of the bootstrap work like this:

    M1.  With some dataset D, we estimate a model to obtain estimates b.

    M2.  To obtain standard errors for b, we repeatedly draw samples 
         with replacement from D of the same sample size, reestimate the model
         to obtain estimates b_i for the i-th resampling, and we calculate the
         standard deviations of b_i.  We use those standard deviations as the
         standard errors for b.
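
The mechanics of M1 and M2 can be sketched in a few lines of Python (a
hypothetical illustration, not Stata; the data and the "model" -- a simple
sample mean -- are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.normal(loc=5, scale=2, size=200)   # some dataset D
N = len(D)

b = D.mean()                               # M1: estimate b from D

# M2: repeatedly resample D with replacement at the same sample size N,
# re-estimate, and take the standard deviation across replications.
reps = 1000
b_i = np.empty(reps)
for i in range(reps):
    sample = rng.choice(D, size=N, replace=True)
    b_i[i] = sample.mean()

se = b_i.std(ddof=1)    # bootstrap standard error of b
```

For a sample mean the answer is known analytically (sd/sqrt(N)), which makes
this a convenient check on the mechanics.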

The justification of the bootstrap step (M2) is 

    J1.  If we had access to the population P from which D was drawn, clearly 
         we could repeatedly draw samples of size N from that, calculate 
         standard deviations, and use those as standard errors.  They would 
         be the standard errors if we did that an infinite number of times.
         We draw samples of that size because we want to evaluate the 
         variance function at that sample size -- the sample size we used 
         in the production of b.

    J2.  We do not really have to do (J1) an infinite number of times;
         a large number of times will yield approximate results that are 
         very good.

    J3.  We can use D as a proxy for P under the assumption that D is large
         enough to reflect P.

Justification J3 is important to appreciate.  D had better be large, because
what we really need for step M2 is the population P, and we are pretending
that D is a good stand-in for it.

Note that, in performing the bootstrap, there is no accounting for how well
D approximates P.  That step is all handwaving.  The bootstrap produces 
correct standard errors under the assumption that D==P.

Exercise 1

We have dataset D1 of sample size N1 and we carry out steps M1 and M2.
Someone comes to us later and says they have a new dataset D2 (drawn from 
the same population).  It has N2>N1 observations.  Can we reperform 
M2 using D2?  Should we?

Answer:  Yes to both, but in reperforming step M2 with D2, we must be careful
to draw N1 observations, not N2.  We want to evaluate the variance of the 
estimator at sample size N1 because that is the sample size we used to obtain
b in step M1.  We should do this because N2 being larger than N1 means that 
we can expect D2 to better reflect the population P than D1 did.

In fact, we can do something even better.  We could combine the two datasets
and reperform step M2 drawing samples of N1 observations from N1+N2.  D1+D2
should be an even better proxy for the population.

In fact, we can do something even better.  We could go back and reperform 
steps M1 and M2 using all N2 observations.  We would estimate on N2 
observations and evaluate the variance function (step M2) at N2 observations.

In fact, we can do something even better.  We could combine the two datasets
and reperform steps M1 and M2 using all N1+N2 observations.

But if for some reason we could not reperform M1, using a better proxy for 
P can only make results more accurate.
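
The point of Exercise 1 can be sketched in Python (a hypothetical
illustration with made-up data; N1 and N2 are arbitrary): with the combined
pool D1+D2 as a better proxy for P, each bootstrap draw is still of size N1,
the sample size used to obtain b.

```python
import numpy as np

rng = np.random.default_rng(2)
N1, N2 = 100, 300
D1 = rng.normal(size=N1)
D2 = rng.normal(size=N2)             # new data from the same population
pool = np.concatenate([D1, D2])      # D1+D2: a better proxy for P

# Each replication draws N1 observations, not N1+N2, because we want the
# variance of the estimator evaluated at the sample size used in step M1.
b_i = [rng.choice(pool, size=N1, replace=True).mean() for _ in range(1000)]
se = np.std(b_i, ddof=1)             # approximates sd(mean) at n = N1
```

Here the estimator is a mean of standard-normal draws, so se should come out
near 1/sqrt(N1), not 1/sqrt(N1+N2).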

Eva's problem

Eva has a dataset D with M independent groups and (say) 3 observations per 
group, for a total dataset size of N=3*M.  I will assume a fixed 3 observations
per group to make the notation simpler, but nothing below hinges on the 
number of observations per group being fixed.

Eva is concerned about within-group correlation.  There are estimators that 
would perhaps "handle" the problem, but they require assumptions, and Eva 
is so concerned about within-group correlation that she is willing to give 
up efficiency to rid herself of the problem.  She says:  I will sample 
one subject from each group and estimate my model on N/3 observations.

Fine.  Let D1 be the dataset drawn from D on which Eva performs her
estimation; N1=N/3.

Eva now wants to calculate the bootstrap variance estimate for the estimate of
b that she obtains.  The standard bootstrap way to do this would be to
repeatedly resample N/3 observations from D1.  That will yield fine results
under the assumption that D1 is large enough to reflect the population P.

Whether we meet that assumption is an interesting question.  Even if D1 had an
infinite number of observations it would still not equal P because Eva has
told us that P has multiple observations per group.  
However, let's consider two extremes:  the within-group correlation is +/-1,
and the within-group correlation is 0.  In the first case, there is no 
extra information in adding observations within a group, and so a sample of 
one observation per group is sufficient.  In the second case, observations 
within a group are independent, and so, taking the limit as N->infinity, 
a sample of N/3 is as good as a sample of N.  

Eva in fact has a larger dataset D from which she could draw N/3 observations.
To substitute D for D1 in step M2, the right way to proceed is (1) draw a
sample of N/3 groups (with replacement) from D and (2) for each group
selected, draw one of the
three observations available in D for the group.  This turns out to be
equivalent to simply drawing N/3 observations from D because of the fixed
number of subjects per group that I assumed.  If the number of subjects per
group varies, we must use the two-step sampling scheme.
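
The two-step scheme can be sketched in Python (a hypothetical illustration;
the group structure and data are made up).  Group sizes vary here -- 3 or 4
subjects per group -- which is exactly the case that requires the two steps:

```python
import numpy as np

rng = np.random.default_rng(3)

# toy panel: group id -> that group's subject observations (3 or 4 each)
groups = {g: rng.normal(size=rng.integers(3, 5)).tolist() for g in range(50)}
M = len(groups)                      # number of independent groups

def one_bootstrap_sample(groups, rng):
    # step 1: cluster-sample M groups with replacement
    drawn = rng.choice(list(groups), size=len(groups), replace=True)
    # step 2: for each drawn group, pick one of its subjects at random
    return [rng.choice(groups[g]) for g in drawn]

sample = one_bootstrap_sample(groups, rng)
# one observation per drawn group, so the sample size is always M
```

Because subjects are drawn within the group chosen in step 1, unequal group
sizes do not distort each subject's probability of selection.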

Programming technique

Eva cannot use -bs- or -bstrap-.  Eva will have to build her own bootstrap
estimator.  Eva formed D1 by selecting one subject per group, so we know 
that D1 and D have the same groups.  Thus, to form our bootstrap sample, 
we can start with D, cluster sample the groups, and then select one subject
from each group.  That is, with D in memory, we will

        bsample, cluster(group) idcluster(newgroup)
        gen u = uniform() 
        sort newgroup u 
        by newgroup: keep if _n==1

The rest of the program is the "standard stuff" to loop over replications, 
perform the estimates, and post the results:

        program define myboot
                args nreps 

                postfile myres b1 b2 b3 ... using myres.dta, replace

                forvalues i=1(1)`nreps' {
                        qui use D, clear 
                        qui bsample, cluster(group) idcluster(newgroup)
                        qui gen u = uniform()
                        sort newgroup u 
                        qui by newgroup: keep if _n==1
                        qui <perform estimation>
                        post myres (_b[v1]) (_b[v2]) (_b[v3]) ...
                }
                postclose myres 
                use myres, clear 
        end

-- Bill
