**Title:** Bootstrap with panel data

**Author:** Gustavo Sanchez, StataCorp

In general, the bootstrap is used in statistics as a resampling method to
approximate standard errors, confidence intervals, and *p*-values for
test statistics, based on the sample data. This method is particularly
useful when the theoretical distribution of the test statistic is unknown.
In Stata, you can use the **bootstrap**
command or the **vce(bootstrap)** option
(available for many estimation commands) to bootstrap the standard errors of
the parameter estimates. We recommend using the
**vce()** option whenever possible because it
already accounts for the specific characteristics of the data. This
matters most for panel data, where the bootstrap cannot resample
individual records at random but must instead resample whole panels.
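
To make the distinction concrete, here is a minimal sketch, written in Python rather than Stata, of drawing a bootstrap sample by panel instead of by record. The function name `resample_panels` and the toy data are hypothetical and serve only as illustration; this is not how Stata implements the option internally.

```python
import random
from collections import defaultdict

def resample_panels(rows, panel_key, rng):
    """Draw whole panels with replacement and return all of their rows."""
    panels = defaultdict(list)
    for row in rows:
        panels[row[panel_key]].append(row)
    ids = list(panels)
    # One draw per original panel; a panel may be drawn more than once.
    drawn = rng.choices(ids, k=len(ids))
    # Every drawn panel contributes its complete set of records.
    return [row for pid in drawn for row in panels[pid]]

# Toy data (made up): 3 panels observed in 2 years each.
rows = [{"id": i, "year": y, "x": 10 * i + y} for i in (1, 2, 3) for y in (1, 2)]
sample = resample_panels(rows, "id", random.Random(123))
print(sorted({r["id"] for r in sample}))
```

Because a panel is either drawn in full or not at all, the within-panel time structure survives in every bootstrap sample, which would not be the case if single records were drawn from the pooled data.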

In the **vce()** option we can include all the
specifications we would regularly include in the
**bootstrap** command. For example, if we need to
perform a test on a linear combination of some of the coefficients of the
regression model, we can directly incorporate the linear combination
expression into **vce()**. The example below
shows the bootstrap standard error of the difference between the
coefficients on **age** and
**wks_work** in a fixed-effects regression of
**ln_wage**:

| | Observed Coef. | Bootstrap Std. Err. | z | P>\|z\| | Normal-based 95% CI, lower | Normal-based 95% CI, upper |
|---|---|---|---|---|---|---|
| _bs_1 | -.0056473 | .0011328 | -4.99 | 0.000 | -.0078675 | -.003427 |

As we mentioned above, we can get the same results with the
**bootstrap** command. However, by using the
**vce()** option, we do not have to explicitly specify
the panel-data characteristics of our dataset.

With community-contributed commands or with non-estimation commands, we need to use
**bootstrap** because there is no equivalent to the
**vce()** option. The example below shows the
bootstrap results for the ratio of the means of the first differences of two
variables (**ttl_exp** and
**hours**). We need to let the command know we are
dealing with panel data and, therefore, each random selection must
correspond to a panel. Moreover, repeated selections of the same panel
within one bootstrapped sample should be internally treated as different
panels.

Let’s first write a program that computes the ratio of the means of two variables:

Next, let’s create and set the cluster identifier variables for the bootstrapped panels, and then mark the sample to keep only the observations with no missing values for the variables of interest.

Finally, we perform the simulation, specifying the panel characteristics of the dataset:

| | Observed Coef. | Bootstrap Std. Err. | z | P>\|z\| | Normal-based 95% CI, lower | Normal-based 95% CI, upper |
|---|---|---|---|---|---|---|
| ratio | 2.830833 | 1.542854 | 1.83 | 0.067 | -.1931047 | 5.854771 |
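
The normal-based columns in the ratio row above follow directly from the observed ratio and its bootstrap standard error. The short Python check below, a sketch rather than Stata code, recomputes them from the two reported numbers, so small differences in the last digits can arise from rounding in the displayed values.

```python
from statistics import NormalDist

# Values copied from the ratio row of the bootstrap output above.
coef, se = 2.830833, 1.542854

std_normal = NormalDist()            # standard normal distribution
z = coef / se                        # z statistic
p = 2 * (1 - std_normal.cdf(abs(z))) # two-sided p-value
crit = std_normal.inv_cdf(0.975)     # critical value for a 95% interval
lower, upper = coef - crit * se, coef + crit * se

print(f"z = {z:.2f}, p = {p:.3f}, 95% CI = [{lower:.7f}, {upper:.6f}]")
```

The interval includes zero, which is consistent with a p-value above 0.05 for this ratio.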

There are two cluster options in the **bootstrap**
command line. The first option, **cluster(idcode)**,
identifies the original panel variable in the dataset, whereas the second,
**idcluster(newid)**, creates a unique identifier
for each of the selected clusters (panels in this case). Thus if some panels
were selected more than once, the temporary variable
**newid** would assign a different ID number to
each resampled panel. If the two cluster options are omitted,
**bootstrap** will not take into account the panel
structure of the data; rather, it will construct the simulated samples by
randomly selecting individual observations from the pooled data.
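
The role of the second identifier can be sketched outside Stata as well. In the Python illustration below, the function names and the toy panels are hypothetical: each drawn panel receives a fresh sequential ID, so a panel drawn twice is treated as two separate panels when first differences are computed within panel.

```python
import random

def resample_with_new_ids(panels, rng):
    """Draw panels with replacement; give every draw its own new ID."""
    ids = sorted(panels)
    drawn = rng.choices(ids, k=len(ids))
    # new_id plays the part of the relabeled cluster identifier.
    return [(new_id, orig_id, panels[orig_id])
            for new_id, orig_id in enumerate(drawn, start=1)]

def first_differences(values):
    """Differences between consecutive observations within one panel."""
    return [b - a for a, b in zip(values, values[1:])]

# Toy panels (made up): original ID -> time-ordered values of one variable.
panels = {101: [1.0, 2.5, 4.0], 102: [3.0, 3.5], 103: [0.5, 1.5, 2.0]}
sample = resample_with_new_ids(panels, random.Random(2024))
new_ids = [new_id for new_id, _, _ in sample]
diffs = {new_id: first_differences(vals) for new_id, _, vals in sample}
print(new_ids, [orig for _, orig, _ in sample])
```

Without the relabeling step, two copies of the same original panel would share one ID, and any within-panel operation such as a first difference would see duplicated time periods under that ID.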