Robust inference for linear models

Highlights

Multiway cluster–robust standard errors
HC2 standard errors:
- Degrees-of-freedom adjustment
- Cluster–robust
- Cluster–robust and degrees-of-freedom adjustment
Wild cluster bootstrap confidence intervals and p-values

Stata 18 offers more precise standard errors and confidence intervals (CIs) for three commonly used linear models in Stata: regress, areg, and xtreg, fe.

Small number of clusters? Uneven number of observations per cluster? Use HC2 with degrees-of-freedom adjustment, option vce(hc2 ..., dfadjust), or wild cluster bootstrap to obtain valid inference.

Multiple nonnested clusters? Use multiway clustering, option vce(cluster group1 group2 ... groupk), to account for potential correlation of observations within different clusters.

Let's see it work

We have a panel of individuals and would like to study the effect of belonging to a union on the log of wages ln_wage. We control for whether the individual has a college degree collgrad, for length of job tenure, and for time fixed effects.

We compare several methods of computing standard errors: robust, cluster–robust, cluster–robust HC2 with degrees-of-freedom adjustment, and two-way clustering. The second and third methods account for correlation within industry. The last method accounts for correlation both within individual (idcode) and within industry (ind_code). Our example has only 12 industry clusters, which strains the asymptotic approximation behind cluster–robust inference, because that approximation requires the number of clusters to grow with the sample size. We restrict our sample to observations where the industry code ind_code is available, and we store the results of each estimation. We type

. webuse nlswork
(National Longitudinal Survey of Young Women, 14-24 years old in 1968)

. keep if ind_code!=.
(341 observations deleted)

. quietly regress ln_wage tenure union collgrad i.year, vce(robust)

. estimates store robust

. quietly regress ln_wage tenure union collgrad i.year, vce(cluster ind_code)

. estimates store cluster

. quietly regress ln_wage tenure union collgrad i.year, vce(hc2 ind_code, dfadjust)

. estimates store HC2

. quietly regress ln_wage tenure union collgrad i.year, vce(cluster idcode ind_code)

. estimates store multiway
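
With so few clusters, it can help to check how the estimation sample is spread across industries before comparing results. A minimal check, which is our addition and not part of the original example, tabulates ind_code over the observations used by the most recent regression:

. * count observations per industry in the last estimation sample
. tabulate ind_code if e(sample)

The number of rows gives the number of clusters, and the frequencies show how uneven the cluster sizes are.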

Instead of looking at all the regression output tables, we combine them into an estimates table by using etable.

. etable, estimates(robust cluster HC2 multiway) ///
>         cstat(_r_ci, nformat(%9.4f)) ///
>         column(estimates) keep(union) ///
>         novarlabel nofvlab center ///
>         export(setable.html, replace) ///
>         title(Confidence-intervals comparison)

Confidence-intervals comparison

                              robust             cluster                HC2               multiway
union                    [0.1352  0.1622]    [0.0422  0.2553]    [-0.0097  0.3072]    [0.0420  0.2554]
Number of observations        18925               18925                18925               18925
(collection ETable exported to file setable.html)

We asked etable to use the estimates we stored and to present only the CIs, cstat(_r_ci, ...), for the coefficient on union, keep(union). We then export the table to the .html table you see on this page, export(setable.html, replace).

The CIs are narrowest with robust standard errors and widest with HC2 standard errors with degrees-of-freedom adjustment. In the latter case, 0 is inside the CI, which suggests we should be careful when interpreting the effect of belonging to a union on wages. This is in contrast with the conclusion we would have made had we used only robust standard errors. Finally, there appears to be little difference between clustering at the industry level and two-way clustering at the individual and industry levels.
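
The CIs differ because the standard errors differ and, for HC2 with dfadjust, so do the degrees of freedom. As a sketch that goes beyond the original example, estimates table lists the point estimate and standard error of union for each stored result, and estimates replay redisplays the full output of one of them (here HC2):

. * coefficient and standard error on union for each stored model
. estimates table robust cluster HC2 multiway, keep(union) b(%9.4f) se(%9.4f)

. * redisplay the complete regression output stored as HC2
. estimates replay HC2

The point estimate is the same in all four columns because the model and sample are unchanged; only the standard errors vary.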

We can also use wild cluster bootstrap to account for a small number of clusters and an unequal number of observations per cluster. It is implemented in the new wildbootstrap command. We describe this feature in detail in Wild cluster bootstrap, but let's also use it here for comparison.

. wildbootstrap regress ln_wage tenure union collgrad i.year, ///
>         cluster(ind_code) coefficients(union) rseed(111)

wildbootstrap calls regress, so after it is done, you can still access the regress results. In addition, wildbootstrap computes a wild cluster bootstrap p-value for the null hypothesis that a coefficient is 0, along with the corresponding CI. By default, it does this for all coefficients, but you may select which ones you would like to study. We focus on union. Because we are resampling at the cluster level, we specify the ind_code variable in cluster(), and we set a seed for reproducibility.

. wildbootstrap regress ln_wage tenure union collgrad i.year, ///
>         cluster(ind_code) coefficients(union) rseed(111)

Performing 1,000 replications for p-value for constraint
  union = 0 ...
Computing confidence interval for union 
  Lower bound: .........10.........20....... done (27)
  Upper bound: .........10.........20..... done (25)

Wild cluster bootstrap                            Number of obs      = 18,925
Linear regression                                 Number of clusters =     12
                                                  Cluster size:
Cluster variable: ind_code                                       min =     37
Error weight: Rademacher                                         avg = 1577.1
                                                                 max =   6296


     ln_wage     Estimate      t   p-value     [95% conf. interval]

constraint
   union = 0     .1487097   3.07     0.048     .0558148    .3660002

The CI reported by wildbootstrap is almost as wide as the one obtained with HC2 standard errors. Although 0 is not in the CI and the p-value is 0.048, the width of the interval indicates considerable uncertainty about the size of the effect.
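
As a sensitivity check, one could rerun the bootstrap with different error weights and more replications. The sketch below assumes the errorweight() and reps() options of wildbootstrap; confirm the exact option names and allowed values with help wildbootstrap before relying on it. Webb six-point weights are often suggested when the number of clusters is small, because with 12 clusters Rademacher weights allow only 2^12 = 4,096 distinct sign patterns.

. * same model with Webb weights and 2,000 replications
. * (errorweight() and reps() are assumed here; see help wildbootstrap)
. wildbootstrap regress ln_wage tenure union collgrad i.year, ///
>         cluster(ind_code) coefficients(union) rseed(111) ///
>         errorweight(webb) reps(2000)

If the p-value and CI change little across weight choices, the conclusion above does not hinge on the default Rademacher weights.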

Tell me more

Also see Wild cluster bootstrap.

View all the new features in Stata 18 and, in particular, New in linear models.
