How do I obtain bootstrapped standard errors with panel data?
Title: Bootstrap with panel data
Author: Gustavo Sanchez, StataCorp
Date: November 2005, minor revisions July 2011
In general, the bootstrap is used in statistics as a resampling method to
approximate standard errors, confidence intervals, and p-values for
test statistics, based on the sample data. This method is particularly
helpful when the theoretical distribution of the test statistic is unknown.
In Stata, you can use the
bootstrap
command or the vce(bootstrap) option
(available for many estimation commands) to bootstrap the standard errors of
the parameter estimates. We recommend using the
vce() option whenever possible because it
already accounts for the specific characteristics of the data. This
matters most for panel data, where the bootstrap must resample
entire panels rather than individual records.
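As a minimal sketch of the recommended form, the command below requests bootstrap standard errors for all coefficients of a fixed-effects model on the nlswork data used in the examples of this FAQ. The reps(50) and seed(123) values are arbitrary choices for illustration, not recommendations.

```stata
* Sketch: bootstrap standard errors through the vce() option.
* reps(50) and seed(123) are illustrative values only.
use http://www.stata-press.com/data/r12/nlswork, clear
xtset idcode
xtreg ln_wage wks_work age tenure ttl_exp, fe vce(bootstrap, reps(50) seed(123))
```

Because the data are xtset by idcode, no cluster options need to be typed; the panel structure is taken from the xtset declaration.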
In the vce() option we can include all the
specifications we would regularly include in the
bootstrap command. For example, if we need to
perform a test on a linear combination of some of the coefficients of the
regression model, we can directly incorporate the linear combination
expression into vce(). The example below
bootstraps the standard error of the difference between the
coefficients on age and
wks_work in a fixed-effects regression of
ln_wage:
. use http://www.stata-press.com/data/r12/nlswork
(National Longitudinal Survey. Young Women 14-26 years of age in 1968)
. xtset idcode
. xtreg ln_wage wks_work age tenure ttl_exp, fe
> vce(bootstrap (_b[age] - _b[wks_work]),rep(10) seed(123))
(running xtreg on estimation sample)
Bootstrap replications (10)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..........
Bootstrap results                               Number of obs      =     27408
                                                Replications       =        10
      command:  xtreg ln_wage wks_work age tenure ttl_exp, fe
        _bs_1:  _b[age] - _b[wks_work]
                               (Replications based on 4674 clusters in idcode)
------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _bs_1 |  -.0056473   .0008539    -6.61   0.000    -.0073209   -.0039736
------------------------------------------------------------------------------
As we mentioned above, we can get the same results with the
bootstrap command. However, by using the
vce() option, we do not have to explicitly specify
the panel-data characteristics of our dataset.
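For comparison, here is a do-file sketch of how the same request might look with the bootstrap prefix. It assumes the panel structure has to be spelled out by hand: newid is a new variable created for this purpose, diff is a hypothetical label for the expression, and the data are xtset on newid so that xtreg treats each resampled copy of a panel as a separate panel.

```stata
* Sketch: the bootstrap prefix with the panel structure given explicitly.
* newid and diff are illustrative names, not part of the dataset.
use http://www.stata-press.com/data/r12/nlswork, clear
generate newid = idcode
xtset newid
bootstrap diff=(_b[age] - _b[wks_work]), reps(10) seed(123) ///
    cluster(idcode) idcluster(newid):                       ///
    xtreg ln_wage wks_work age tenure ttl_exp, fe
```

The extra typing (generate, xtset, cluster(), idcluster()) is exactly what vce(bootstrap) spares us.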
With user-written commands or with non-estimation commands, we need to use
bootstrap because there is no equivalent to the
vce() option. The example below shows the
bootstrap results for the ratio of the means of the first differences of two
variables (ttl_exp and
hours). We need to tell the command that we are
dealing with panel data and that, therefore, each random draw must
correspond to a whole panel. Moreover, repeated selections of the same panel
within one bootstrapped sample should be internally treated as different
panels.
Let’s first write a program that computes the ratio of the means of
the first differences of two variables:
. program my_xtboot,rclass
1. summarize d.`1',meanonly
2. scalar mean`1'=r(mean)
3. summarize d.`2',meanonly
4. scalar mean`2'=r(mean)
5. return scalar ratio=scalar(mean`1')/scalar(mean`2')
6. end
Next let’s create the cluster identifier variable for the
bootstrapped panels and declare it as the panel variable, and then
keep only those observations that have no missing values for the
variables of interest.
. generate newid = idcode
. tsset newid year
panel variable: newid (unbalanced)
time variable: year, 68 to 88, but with gaps
delta: 1 unit
. generate sample=1-missing(ttl_exp,hours)
. keep if sample
(67 observations deleted)
Finally, we perform the simulation, specifying the panel characteristics of
the dataset:
. bootstrap ratio=r(ratio),rep(10) seed(123)
> cluster(idcode) idcluster(newid) nowarn:my_xtboot ttl_exp hours
(running my_xtboot on estimation sample)
Bootstrap replications (10)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..........
Bootstrap results                               Number of obs      =     28467
                                                Replications       =        10
      command:  my_xtboot ttl_exp hours
        ratio:  r(ratio)
                               (Replications based on 4710 clusters in idcode)
------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       ratio |   2.830833   1.838615     1.54   0.124    -.7727853    6.434452
------------------------------------------------------------------------------
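The log above cannot be pasted directly into a do-file, so here is a sketch of the same steps in do-file form. It is a variation rather than a transcript: the argument names num and den are hypothetical, a tempname scalar replaces the named scalars so that nothing is left in memory after the program runs, and the sample marker is folded into the keep command.

```stata
* Do-file sketch of the steps above (a variation, not a Stata log)
use http://www.stata-press.com/data/r12/nlswork, clear

capture program drop my_xtboot
program my_xtboot, rclass
    args num den                    // hypothetical argument names
    tempname mnum                   // temporary scalar for the first mean
    summarize d.`num', meanonly
    scalar `mnum' = r(mean)
    summarize d.`den', meanonly
    return scalar ratio = `mnum'/r(mean)
end

* Identifier that bootstrap overwrites for each resampled panel
generate newid = idcode
tsset newid year

* Drop observations with missing values in the variables of interest
keep if !missing(ttl_exp, hours)

bootstrap ratio=r(ratio), reps(10) seed(123) ///
    cluster(idcode) idcluster(newid) nowarn: my_xtboot ttl_exp hours
```

The d. operator inside the program relies on the data being tsset on newid, which is why the tsset command has to come before bootstrap.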
There are two cluster options in the bootstrap
command line. The first option, cluster(idcode),
identifies the original panel variable in the dataset, whereas the second,
idcluster(newid), creates a unique identifier
for each of the selected clusters (panels in this case). Thus, if some panels
are selected more than once, the temporary variable
newid assigns a different ID number to
each resampled panel. If the two cluster options are omitted,
bootstrap will not take into account the panel
structure of the data; rather, it will construct the simulated samples by
randomly selecting individual observations from the pooled data.
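To see what the two options do, one cluster draw can be reproduced by hand with the bsample command. This is only a sketch of a single resample, not a description of what bootstrap does internally; the variable names onecopy, draws, and onepanel are made up for the illustration.

```stata
* Sketch: one cluster bootstrap draw done by hand with bsample
use http://www.stata-press.com/data/r12/nlswork, clear
keep idcode year
set seed 123

* Resample whole panels; newid numbers each drawn copy separately
bsample, cluster(idcode) idcluster(newid)

* Count how many times each original panel was drawn
egen byte onecopy = tag(idcode newid)
egen draws = total(onecopy), by(idcode)
egen byte onepanel = tag(idcode)
tabulate draws if onepanel

* newid and year jointly identify the observations, so the
* resampled data can be declared as a panel again
isid newid year, missok
tsset newid year
```

Any panel with draws greater than 1 appears several times under the same idcode, which is why the time-series operators need newid rather than idcode as the panel variable.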