This question was originally posed on Statalist.

**Title:** Small-sample adjustments to the sandwich estimate of variance

**Author:** James Hardin, Texas A&M University

**Date:** March 2001; minor revisions April 2015

Roger Newson wrote:

Al Feiveson is quoted as saying,

When one runs a Stata program to fit a marginal model to clustered data
using robust standard errors, statistical inference on each parameter
estimate is reported in terms of a *p*-value calculated under the
assumption that the *z* value (estimated coefficient
divided by its robust-estimated standard error) has a standard normal
distribution under the null hypothesis of a zero coefficient.

I realize that all sorts of disclaimers on robust variance estimation abound
because proven properties are only “asymptotic”; however,
considering the manner in which the robust variance estimator is calculated,
it seems to me that if the number of groups, say, *N*, is not large
(say, less than 50), the null distribution of a *z* value
would be better approximated by a *t* distribution with *N*
− 1 degrees of freedom—especially if the cluster sizes are all equal (if they
are not equal, then using a normal distribution for *z*
would probably be even worse).

This variance estimate has an interesting history. It provides an asymptotically consistent estimate without making distributional assumptions, even if the assumed underlying model is incorrect. It has become so common that its estimation is no longer considered much of an issue. Stata has a long history with this particular variance estimate largely because of an interesting academic genealogy.

Huber is generally credited with the first mention of this variance estimate
(1967), and White is also credited with an independent description (1982).
Bill Rogers, one of the original Stata employees, was a student of Huber.
Rogers saw the utility of this estimate and set about including it wherever
possible in Stata. Longtime Stata users will recall the old
**hreg**, **hlogit**,
**hprobit**, etc., commands in which the inclusion of
the sandwich estimate of variance begat new commands where Rogers prepended
an ‘h’ to the associated estimation command (I assume) in honor
of his advisor.

In fact, Rogers is credited (at least by StataCorp) with the first mention of the modified sandwich estimate of variance. The modified sandwich estimate of variance is usually called the “robust cluster” variance estimate by Stata users. It takes advantage of the fact that while the observations are not independent, there are sets (clusters) of observations that are independent. By summing over the clusters, a modified sandwich estimate of variance may be constructed using the independent sums such that the resulting estimate is robust to within-cluster correlation. This robustness does not depend on any particular form of within-cluster correlation. Rogers claimed that the extension was obvious and probably known to Huber. There is no specific paper I know of that focuses on a detailed derivation/critique of this generalization.
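The construction described above is easy to sketch numerically. Below is a minimal numpy illustration of the modified ("robust cluster") sandwich estimate for OLS; the function name, simulated data, and seed are my own illustrative choices, not Stata's implementation.

```python
# A minimal sketch of the modified ("robust cluster") sandwich estimate for
# OLS: the "meat" is built from the independent per-cluster score sums.
import numpy as np

def cluster_sandwich(X, y, cluster):
    """OLS fit plus V = bread @ meat @ bread, meat from u_g = X_g' e_g."""
    beta = np.linalg.solve(X.T @ X, X.T @ y)       # OLS estimate
    e = y - X @ beta                               # residuals
    bread = np.linalg.inv(X.T @ X)
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(cluster):
        u_g = X[cluster == g].T @ e[cluster == g]  # score sum for cluster g
        meat += np.outer(u_g, u_g)                 # clusters are independent
    return beta, bread @ meat @ bread

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
cluster = np.repeat(np.arange(20), 10)             # 20 clusters of 10
# A shared cluster effect induces within-cluster correlation of no
# particular form; the estimate is robust to this.
y = X @ np.array([1.0, 0.5]) + rng.normal(size=20)[cluster] + rng.normal(size=n)
beta, V = cluster_sandwich(X, y, cluster)
se = np.sqrt(np.diag(V))
```

Because only the cluster sums enter the middle of the sandwich, no model for the within-cluster correlation is needed, which is exactly the robustness described above.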

Newson goes on to point out (in regard to the usual and modified sandwich estimates of variance) that:

Al has raised an important issue regarding Huber variances in general,
namely that their robustness to heteroskedasticity, overdispersion,
underdispersion and specification are purchased, in part, at the price of
being non-robust to finiteness of the number of clusters in the sample. The
assumption of a normal distribution for *z* is, in
general, over-optimistic, in that it assumes that the variance estimate is
not itself subject to sampling error. That is to say, the quantity

t = (thetahat - theta)/SEhat(thetahat)

(where theta is the parameter, thetahat is the estimate, and SEhat(thetahat)
is the estimated standard error of thetahat) is assumed to be equal to the true
*z*, defined as

z = (thetahat - theta)/SE(thetahat)

where SE(thetahat) is the true standard deviation of the sampling distribution of thetahat.
Because the reciprocal is a convex function, the mean magnitude of *t* will
be greater than the mean magnitude of *z* if the variance estimate is
unbiased, leading to confidence intervals that are systematically too narrow
in finite samples.
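This convexity argument shows up immediately in simulation. The sketch below (Gaussian data with a sample size of 10 and 20,000 replications, all arbitrary choices of mine) compares the statistic computed with the estimated standard error against the ideal statistic computed with the true one.

```python
# When the standard error must itself be estimated, E|t| exceeds E|z|,
# so normal-percentile intervals are systematically too narrow.
import numpy as np

rng = np.random.default_rng(1)
n, reps = 10, 20_000
x = rng.normal(size=(reps, n))                 # theta = 0, SD = 1
thetahat = x.mean(axis=1)
true_se = 1.0 / np.sqrt(n)                     # SE(thetahat), known here
sehat = x.std(axis=1, ddof=1) / np.sqrt(n)     # SEhat(thetahat), estimated
t = thetahat / sehat                           # statistic actually computed
z = thetahat / true_se                         # ideal statistic
mean_abs_t, mean_abs_z = np.abs(t).mean(), np.abs(z).mean()
```

Here *t* follows Gosset's t distribution with 9 degrees of freedom, and its mean magnitude is visibly larger than that of the standard normal *z*.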

The introduction of the *t* distribution by Gosset (1908) was an
attempt to correct for the sampling error of the standard error itself, in
the case where the observations are themselves sampled independently from a
Gaussian population. The “degrees of freedom” introduced by Gosset is
really a statistical shorthand for “twice the inverse squared coefficient of
variation of the variance estimate itself”. Therefore, the more the
variance estimator itself is subject to sampling error, the fewer degrees of
freedom you are entitled to use. In the case of the unequal-variance
*t* test (a simple case of using Huber variances), the sampling error
of the sample standard error is usually greater than in the equal-variance case, and the
Satterthwaite degrees of freedom formula (Satterthwaite 1946) is intended to
correct for this. (See also
[R] **ttest**.)
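The Satterthwaite formula referred to here is simple to compute directly. In the sketch below the group standard deviations and sizes are made-up inputs for illustration.

```python
# The Welch–Satterthwaite approximate degrees of freedom for the
# unequal-variance t test.
def satterthwaite_df(s1, n1, s2, n2):
    v1, v2 = s1**2 / n1, s2**2 / n2  # per-group variance of the mean
    return (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# Equal variances and sizes recover the pooled df n1 + n2 - 2 exactly;
# very unequal variances pull the df down toward min(n1, n2) - 1.
df_equal = satterthwaite_df(1.0, 15, 1.0, 15)    # exactly 28
df_unequal = satterthwaite_df(3.0, 15, 0.5, 15)  # close to 14
```

The noisier the variance estimate, the fewer degrees of freedom the formula grants, matching the "twice the inverse squared coefficient of variation" interpretation above.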

Since the sandwich estimate of variance has found such favor across disciplines, recent research has focused on the behavior of the asymptotically consistent estimate in small samples. Generally, there are two approaches to improving the behavior of the sandwich estimate of variance in small samples:

1. Scalar adjustment of the variance estimate
2. Distributional changes in the construction of confidence intervals

Stata has a history of providing (1), but not of providing (2). It is worth emphasizing that both of these adjustments are ad hoc; the goal, after all, is to improve the small-sample behavior of an asymptotically justified estimator. Mostly, researchers have applied (1) and (2) together, as listed in the table below.

There are two questions to consider here:

- How does the sandwich estimate of variance compare to the usual parametric estimate?
- Does inefficiency have any impact on inference for small samples?

The discussion so far has focused on the second question. The first question is addressed in Kauermann and Carroll (2001), which shows that the sandwich estimate of variance for a slope parameter in simple linear regression has an asymptotic efficiency equal to the inverse of the kurtosis of the design values. For example, if the design values are generated according to a Laplace distribution (kurtosis 6), then the sandwich method has an asymptotic efficiency of 1/6 relative to the usual estimate when the model holds. So, let’s take just that case. How should we attempt to improve the small-sample behavior?

- Should we scalar-adjust the variance estimate using the sample kurtosis?
- Should we follow the arguments of Newson and Feiveson
and acknowledge that using the maximum likelihood
estimate of the variance with normal (instead of
*t*) percentiles makes little sense? If so, we should alter the confidence interval construction by using the *t* distribution with some adjustment for the degrees of freedom.
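As a quick numeric check of the Kauermann and Carroll figure cited above (the seed and sample size here are arbitrary choices), the sample kurtosis of Laplace draws is close to 6, so the implied relative efficiency of the sandwich slope estimate is close to 1/6:

```python
# Laplace design values have kurtosis 6; by the Kauermann–Carroll result,
# the sandwich slope estimate then has asymptotic efficiency about 1/6
# relative to the usual estimate when the model holds.
import numpy as np

rng = np.random.default_rng(2)
x = rng.laplace(size=200_000)
c = x - x.mean()
kurtosis = np.mean(c**4) / np.mean(c**2)**2  # close to 6
efficiency = 1.0 / kurtosis                  # close to 1/6
```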

Most people have focused on the second question (just as Feiveson suspected and Newson supported):

In the more general case of generalized linear models, Al’s idea of
using the degrees of freedom *N* − 1 will
presumably alter the confidence interval width in the right direction. My
**rglm** program (Newson 1999) allows you to use the
*t* distribution with Huber variances for generalized
linear models. In this case, the degrees of freedom are
*N* − *p*, where
*N* is the number of clusters and
*p* is the number of parameters, and this will
compensate a little bit more than Al’s method. However, I would expect
both degrees of freedom formulae, in general, either to over-compensate or
to under-compensate. I have since worked out a more general
quasi-Satterthwaite degrees of freedom formula for generalized linear
models. However, I have not got around to testing my formula in any way, and
would like to see in detail what improvements to
**glm** have already been included in Stata 7 before I
start any projects of my own in that direction. I am also looking forward to
the appearance of Hardin and Hilbe’s book *Generalized Linear Models
and Extensions*, which is forthcoming from Stata Press. (This book is
now available; see
stata.com/bookstore/generalized-linear-models-and-extensions.)

Using the *t* distribution instead of the normal and
adjusting the degrees of freedom has been done in the literature. Often
this is combined with a scalar adjustment to the variance matrix. Commonly,
the degrees of freedom are *n* − *p*, where

*n* = number of observations in the usual sandwich or
number of clusters in the modified sandwich

*p* = length of beta (number of parameters)

Some applications that have been tried are

|    | Scalar adjustment | Degrees of freedom |
|----|-------------------|--------------------|
| 0) | n/(n-p)           | Z                  |
| 1) | n/(n-p)           | t(n-p)             |
| 2) | n/(n-p)           | t((n-p)/kappa)     |
| 3) | n/(n-2-p)         | t(n-2-p)           |
| 4) | n/(n-p)           | F(1, n-p)          |

I believe that Feiveson was simulating results for (1). This is a
reasonable adjustment (possibly universally). (2) calculates a degrees of
freedom adjustment that is equal to (1) divided by the sample kurtosis of
the design values. This results in a different degrees of freedom for each
parameter. (3) is an alternative to (1) that is applied to the unbiased
sandwich estimate of variance. The unbiased sandwich estimate adjusts the
scores used in the middle of the sandwich to address
the bias of the residuals. Stata has had this for linear regression
(the **hc2** option, since version 6, I believe)
and for the generalized linear model since version 7 (discussed in Hardin and
Hilbe). (4) attempts to modify likelihood-ratio tests for consistency (Kent
1982). I am sure that there are other combinations of the above.

Feiveson’s original question basically asked whether it is reasonable to apply (1) in the case of **xtgee**. The answer is yes. I would probably choose (1) as a global adjustment absent further investigation of any particular model. However, I do not know whether the answer is still yes if the question is changed to whether (1) is the best adjustment.
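Adjustment (1) itself is one line of arithmetic. In the sketch below the robust standard error, the cluster count, and the parameter count are made-up inputs, not output from any Stata command.

```python
# Adjustment (1): inflate the robust variance by n/(n - p) and take
# percentiles from t(n - p) instead of the standard normal.
import math

def adjust(se_robust, n, p):
    """Return the adjusted standard error and degrees of freedom."""
    return se_robust * math.sqrt(n / (n - p)), n - p

# With n = 20 clusters and p = 3 parameters, the variance is scaled by
# 20/17, and the interval uses t(17) percentiles (about 2.11 versus the
# normal 1.96), so both pieces widen the confidence interval.
se_adj, df = adjust(0.10, 20, 3)
```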

In general, Stata uses (0).

The discussion thus far has not focused on what Stata should or could do, so let’s examine that.

Does StataCorp have a responsibility to protect users from
themselves? This is a much-debated topic in-house. StataCorp, in fact, has
a history of trying to protect us when it can do so without preventing us
from accomplishing our goals. Note the use of a different command for
replacing values than for generating them. There are sometimes notes in our
output in regard to completely determined observations. However, they
cannot protect us in this case any more than they
can protect us when we calculate bootstrap variance estimates. There is an
art and a science to the sandwich estimate of variance (and the bootstrap
variance estimate). The science is easy and admirably performed by Stata.
The art, however, absolutely **requires** us to think about the
application. Stata cannot address the appropriateness of these estimators
in every application.

If StataCorp were to support the use of the *t*
distribution in the estimation results, what degrees of freedom should it
use? Would its choice imply some sort of approval on the part of StataCorp
over the other choices? If so, does this open them up to scientific
criticism? Could they, instead, offer a **t()**
option that required specification of degrees of freedom from the user?
This seems reasonable to me. Conservatively, they could make the default
10,000 or some similarly large number so as not to effectively change the
default away from the normal, making the user's responsibility
explicit: nothing happens that wasn't explicitly requested.

In fact, Joe Hilbe requested this very idea when we were writing the
**glm** command for Stata 7. I liked the idea, but in
the rush of getting everything together, I let the idea slide, as I never
got around to discussing it with StataCorp. I was concerned over the
possible implications for adding the option to the rest of Stata’s
appropriate estimation commands. I feared that this was too great a task
(programming and documenting) for a last-minute request of StataCorp given
their focus on the new release.

To summarize, the sandwich estimate of variance is, in general, biased downward. It is adversely affected by leverage points and its asymptotic efficiency is (at least for linear regression) a function of the kurtosis of the design points.

I do not believe that there is one correct adjustment for the sandwich estimate of variance for all models that could be globally implemented. Case-by-case adjustments could be offered by users (the Stata Journal seems the appropriate vehicle). Newson’s notice of his work for GLM, for example, is received with keen interest.

I am sure that StataCorp would appreciate hearing from users as to what they want. While this does not bind them in any way, it might stir ideas of their own that are better than those already offered in regard to options and default behavior.

Gosset, W. S. (under the pseudonym of "Student"). 1908. The probable error of a mean. *Biometrika* 6: 1–25.

Hardin, J. W., and J. Hilbe. 2001. *Generalized Linear Models and Extensions*. College Station, TX: Stata Press.

Huber, P. J. 1967. The behavior of maximum likelihood estimates under nonstandard conditions. In *Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability*, 221–223. Berkeley, CA: University of California Press.

Kauermann, G., and R. J. Carroll. 2001. A note on the efficiency of sandwich covariance matrix estimation. *Journal of the American Statistical Association* 96: 1387–1396.

Kent, J. T. 1982. Robust properties of likelihood ratio tests. *Biometrika* 69: 19–27.

Newson, R. 1999. sg114: rglm—Robust variance estimates for generalized linear models. *Stata Technical Bulletin* 50: 27–33. Reprinted in *Stata Technical Bulletin Reprints*, vol. 9, pp. 181–190.

Satterthwaite, F. E. 1946. An approximate distribution of estimates of variance components. *Biometrics Bulletin* 2: 110–114.

White, H. 1982. Maximum likelihood estimation of misspecified models. *Econometrica* 50: 1–25.