Home  /  Resources & support  /  FAQs  /  Small-sample adjustments to sandwich estimate of variance
Note: This question was originally posed on Statalist.

What are some of the small-sample adjustments to the sandwich estimate of variance?

Title   Small-sample adjustments to the sandwich estimate of variance
Author James Hardin, Texas A&M University

Roger Newson wrote:

Al Feiveson is quoted as saying,

When one runs a Stata program to fit a marginal model to clustered data using robust standard errors, statistical inference on each parameter estimate is reported in terms of a p-value calculated under the assumption that the z value (estimated coefficient divided by its robust-estimated standard error) has a standard normal distribution under the null hypothesis of a zero coefficient.

I realize that all sorts of disclaimers on robust variance estimation abound because proven properties are only “asymptotic”; however, considering the manner in which the robust variance estimator is calculated, it seems to me that if the number of groups, say, N, is not large (say, less than 50), the null distribution of a z value would be better approximated by a t distribution with N − 1 degrees of freedom—especially if the cluster sizes are all equal (if they are not equal, then using a normal distribution for z would probably be even worse).

This variance estimate has an interesting history. It provides an asymptotically consistent estimate without making distributional assumptions, even if the assumed underlying model is incorrect. It has become so common that its estimation is no longer considered much of an issue. Stata has a long history with this particular variance estimate largely because of an interesting academic genealogy.

Huber is generally credited with the first mention of this variance estimate (1967), and White is also credited for his independent description (1982). Bill Rogers, one of the original Stata employees, was a student of Huber. Rogers saw the utility of this estimate and set about including it wherever possible in Stata. Longtime Stata users will recall the old hreg, hlogit, hprobit, etc., commands in which the inclusion of the sandwich estimate of variance begat new commands where Rogers prepended an ‘h’ to the associated estimation command (I assume) in honor of his advisor.

In fact, Rogers is credited (at least by StataCorp) with the first mention of the modified sandwich estimate of variance. The modified sandwich estimate of variance is usually called the “robust cluster” variance estimate by Stata users. It takes advantage of the fact that while the observations are not independent, there are sets (clusters) of observations that are independent. By summing over the clusters, a modified sandwich estimate of variance may be constructed using the independent sums such that the resulting estimate is robust to within-cluster correlation. This robustness does not depend on any particular form of within-cluster correlation. Rogers claimed that the extension was obvious and probably known to Huber. There is no specific paper I know of that focuses on a detailed derivation/critique of this generalization.

Newson goes on to point out (in regard to the usual and modified sandwich estimates of variance) that:

Al has raised an important issue regarding Huber variances in general, namely that their robustness to heteroskedasticity, overdispersion, underdispersion and specification are purchased, in part, at the price of being non-robust to finiteness of the number of clusters in the sample. The assumption of a normal distribution for z is, in general, over-optimistic, in that it assumes that the variance estimate is not itself subject to sampling error. That is to say, the quantity

t=(thetahat-theta)/SEhat(thetahat)

(where theta is the parameter, thetahat is the estimate, and SEhat(thetahat) is the estimated standard error of thetahat) is assumed to be equal to the true z, defined as

z=(thetahat-theta)/SE(thetahat)

where SE(thetahat) is the true standard deviation of the sampling distribution of thetahat. As the reciprocal is a convex function, the mean magnitude of t will be greater than the mean magnitude of z if the variance estimate is unbiased, leading to confidence intervals that are systematically too narrow if the sample number is finite.

The introduction of the t distribution by Gosset (1908) was an attempt to correct for the sampling error of the standard error itself, in the case where the observations are themselves sampled independently from a Gaussian population. The “degrees of freedom” introduced by Gosset is really a statistical shorthand for “twice the inverse squared coefficient of variation of the variance estimate itself”. Therefore, the more the variance estimator itself is subject to sampling error, the fewer degrees of freedom you are entitled to use. In the case of the unequal-variance t test (a simple case of using Huber variances), the sampling error of the sample standard error is usually greater than in the equal-variance case, and the Satterthwaite degrees of freedom formula (Satterthwaite 1946) is intended to correct for this. (See also [R] ttest.)

Since the sandwich estimate of variance has found such favor across disciplines, recent research has focused on the behavior of the asymptotically consistent estimate in small samples. Generally, there are two approaches to improving the behavior of the sandwich estimate of variance in small samples:

  1. Scalar adjustment of the variance estimate
  2. Distributional changes in construction of confidence intervals

Stata has a history for providing (1), but not for providing (2). It benefits us to emphasize that both of these adjustments are ad hoc. The goal, after all, is to improve the small sample behavior of an asymptotically justified estimator. Mostly, researchers have applied both (1) and (2) as listed in the table below.

There are two questions to consider here:

  • How does the sandwich estimate of variance compare to the usual parametric estimate?
  • Does inefficiency have any impact on inference for small samples?

The discussion so far has focused on the second question. The first question is addressed in Kauermann and Carroll (2001), which states that the sandwich estimate of variance for simple linear regression when estimating a slope parameter has an asymptotic inefficiency equal to the inverse of the kurtosis of the design values. For example, if the design values are generated according to a Laplace distribution, then the sandwich method has asymptotic efficiency equal to 1/6 of the usual estimate when the model holds. So, let’s take just that case. How should we attempt to improve the small-sample behavior?

  • Should we scalar adjust the variance estimate using the sample kurtosis?
  • Should we follow the arguments of Newson and Feiveson and acknowledge that the practice of using the maximum likelihood estimate of the variance and normal (instead of t) percentiles makes little sense? As such, we should alter the confidence interval construction by using the t distribution with some sort of adjustment for the degrees of freedom.

Most people have focused on the second question (just as Feiveson suspected and Newson supported):

In the more general case of generalized linear models, Al’s idea of using the degrees of freedom N − 1 will presumably alter the confidence interval width in the right direction. My rglm program (Newson 1999) allows you to use the t distribution with Huber variances for generalized linear models. In this case, the degrees of freedom are Np, where N is the number of clusters and p is the number of parameters, and this will compensate a little bit more than Al’s method. However, I would expect both degrees of freedom formulae, in general, either to over-compensate or to under-compensate. I have since worked out a more general quasi-Satterthwaite degrees of freedom formula for generalized linear models. However, I have not got around to testing my formula in any way, and would like to see in detail what improvements to glm have already been included in Stata 7 before I start any projects of my own in that direction. I am also looking forward to the appearance of Hardin and Hilbe’s book Generalized Linear Models and Extensions, which is forthcoming from Stata Press. (This book is now available; see stata.com/bookstore/generalized-linear-models-and-extensions.)

Using the t distribution instead of the normal and adjusting the degrees of freedom has been done in the literature. Often this is combined with a scalar adjustment to the variance matrix. Commonly, the degrees of freedom np

n = number of observations in the usual sandwich or number of clusters in the modified sandwich
p = length of beta (number of parameters)

Some applications that have been tried are

             Scalar adjustment                  Degrees of freedom
             -----------------                  ------------------
        0)   n/(n-p)                            Z
        1)   n/(n-p)                            t(n-p)
        2)   n/(n-p)                            t((n-p)/kappa)
        3)   n/(n-2-p)                          t(n-2-p)
        4)   n/(n-p)                            F(1,n-p)

I believe that Feiveson was simulating results for (1). This is a reasonable adjustment (possibly universally). (2) calculates a degrees of freedom adjustment that is equal to (1) divided by the sample kurtosis of the design values. This results in a different degrees of freedom for each parameter. (3) is an alternative to (1) that is applied to the unbiased sandwich estimate of variance. The unbiased sandwich estimate adjusts the scores used in the middle of the sandwich estimate of variance to address the biasedness of the residuals. Stata has had this for linear regression (the hc2 option since version 6, I believe) and for the generalized linear model since version 7 (discussed in Hardin & Hilbe). (4) attempts to modify likelihood-ratio tests for consistency (Kent 1982). I am sure that there are other combinations to the above.

Feiveson’s original question basically asked if it were reasonable to apply (1) in the case of xtgee. The answer is yes. I would probably choose (1) as a global adjustment absent further investigation of any particular model. However, I do not know if the answer is still yes if the question is changed to whether the adjustment is the best adjustment.

In general, Stata uses (0).

The discussion thus far has not focused on what Stata should or could do, so let’s examine that.

Does StataCorp have a responsibility to protect users from themselves? This is a much-debated topic in-house. StataCorp, in fact, has a history of trying to protect us when it can do so without preventing us from accomplishing our goals. Note the use of a different command from generating values for replacing values. There are sometimes notes in our output in regard to completely determined observations. However, they cannot protect us in this case any more than they can protect us when we calculate bootstrap variance estimates. There is an art and a science to the sandwich estimate of variance (and the bootstrap variance estimate). The science is easy and admirably performed by Stata. The art, however, absolutely requires us to think about the application. Stata cannot address the appropriateness of these estimators in every application.

If StataCorp were to support the use of the t distribution in the estimation results, what degrees of freedom should it use? Would its choice imply some sort of approval on the part of StataCorp over the other choices? If so, does this open them up to scientific criticism? Could they, instead, offer a t() option that required specification of degrees of freedom from the user? This seems reasonable to me. Conservatively, they could make the default 10,000 or some such large number so as to not effectively change the default from the normal and makes explicit the responsibility of the user—nothing happens that wasn’t explicitly requested.

In fact, Joe Hilbe requested this very idea when we were writing the glm command for Stata 7. I liked the idea, but in the rush of getting everything together, I let the idea slide, as I never got around to discussing it with StataCorp. I was concerned over the possible implications for adding the option to the rest of Stata’s appropriate estimation commands. I feared that this was too great a task (programming and documenting) for a last-minute request of StataCorp given their focus on the new release.

To summarize, the sandwich estimate of variance is, in general, biased downward. It is adversely affected by leverage points and its asymptotic efficiency is (at least for linear regression) a function of the kurtosis of the design points.

I do not believe that there is one correct adjustment for the sandwich estimate of variance for all models that could be globally implemented. Case-by-case adjustments could be offered from users (the Stata Journal seems the appropriate vehicle). Newson’s notice of his work for GLM, for example, is received with keen interest.

I am sure that StataCorp would appreciate hearing from users as to what they want. While this does not bind them in any way, it might stir ideas of their own that are better than those already offered in regard to options and default behavior.

References

Gosset, W. S. (under the pseudonym of "Student"). 1908.
The probable error of a mean. Biometrika 6: 1–25.
Hardin, J. W., and J. Hilbe. 2001.
Generalized Linear Models and Extensions. College Station, TX: Stata Press.
Huber, P. J. 1967.
The behavior of maximum likelihood estimates under nonstandard conditions. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. 221–223. Berkeley, CA: University of California Press.
Kauermann, G., and R. J. Carroll. 2001.
A note on the efficiency of sandwich covariance matrix estimation. Journal of the American Statistical Association 96: 1387–1396.
Kent, J. T. 1982.
Robust properties of likelihood ratio tests. Biometrika 69: 19–27.
Newson, R. 1999.
sg114: rglm—Robust variance estimates for generalized linear models. Stata Technical Bulletin 50: 27–33. Reprinted in Stata Technical Bulletin Reprints, vol. 9, pp. 181–190.
Satterthwaite, F. E. 1946.
An approximate distribution of estimates of variance components. Biometrics Bulletin 2: 110–114.
White, H. 1982.
Maximum likelihood estimation of misspecified models. Econometrica 50: 1–25.