| Title | Advantages of the robust variance estimator |
| --- | --- |
| Author | Bill Sribney, StataCorp |
| Date | January 1998 |

I once overheard a famous statistician say the robust variance estimator for (unclustered) logistic regression is stupid. His reason was that if the outcome variable is binary then it’s got to be a Bernoulli distribution. It’s not like linear regression with data that stand a good chance of being nonnormal and heteroskedastic.

Well, it’s not as simple as this; there are a bunch of subtle nuances. Let me lay them out here. I’m sure the famous statistician is aware of them, but they don’t necessarily lead to his conclusion.

It’s true that it has got to be a Bernoulli distribution. That
is, if **Y_{i}** is a random variable for the outcome of the
*i*th observation, then

P(Y_{i} = 1) = p_{i}

or, equivalently,

E(Y_{i}) = p_{i}

This *has* to be true. Note how I indexed the RHS by **i**. The
term **p_{i}** is dependent on the observation **i**.

In logistic regression, we model **p_{i}** with a likelihood that
assumes

logit(p_{i}) = x_{i}*b

So these are our assumptions:

1. The **logit** link function is correct.
2. **logit(p_{i}) = x_{i}*b**; i.e., the relation is linear and all necessary predictors are in the model; i.e., the model is correctly specified.
3. The **i** = 1, ..., N observations are independent.
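To make the link function concrete, here is a tiny pure-Python sketch (the linear predictors are made up for illustration, not from any particular dataset) of how the logit link maps **x_{i}*b** to **p_{i}**:

```python
import math

def inv_logit(z):
    # Inverse of the logit link: maps a linear predictor x_i*b on
    # (-inf, inf) to a probability p_i in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Made-up linear predictors x_i*b for three observations;
# each observation i gets its own p_i
linear_predictors = [-2.0, 0.0, 2.0]
probs = [inv_logit(z) for z in linear_predictors]
```

Since logit(0.5) = 0, a linear predictor of zero gives **p_{i}** = 0.5 exactly.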

The robust variance estimator is robust to violations of assumptions (1) and (2). It does require (3), but you can specify clusters and then assume only independence of the clusters if you wish.

The MLE is also quite robust to (1) being wrong. If the link function is really probit and you estimate a logit, everything’s almost always fine.

Now if (2) is wrong, strictly speaking, you are in trouble with the
interpretation of the point estimates of your model, never mind the variance
estimates. Imagine **logit(p_{i})** is truly quadratic in one of the
predictors, but you fit a model that is linear in that predictor; the slope
you estimate is then some sort of average effect that is difficult to
interpret.

Well, the robust variance estimator will do a good job of giving you variance estimates and confidence intervals for this problematic case of a misspecified model. That is, if one imagines resampling the data and each time fitting the same misspecified model, then you get good coverage probabilities with respect to the “true” population parameters of the misspecified model, i.e., the best-fitting straight line in the population to something that’s not straight.

On the other hand, if you have confidence that your model is not misspecified, then the ML variance estimator is theoretically more efficient.

In summary, my personal advice (and I have respect for conflicting opinions) is:

- I never worry about whether (1) is true. I assume the **logit** link is OK.
- If I think the model is reasonably specified, I use the ML variance estimator for logistic regression.
- Only if I have good reason to believe that the model is poorly specified would I use the robust variance estimator. That is, if the model fails goodness-of-fit tests, etc. Sometimes one just has to live with missing predictors and badly fitting models because data were collected for only a few predictors. In this case, I’d use the robust variance estimator.

And, obviously, I’d use the robust variance estimator if I had clustered data.

This recommendation is in contrast to the advice I’d give for linear
regression for which I’d say *always* use the robust variance
estimator.

There are also other theoretical reasons to be keener on the robust variance
estimator for linear regression than for general ML models. The robust
variance estimator uses a one-term Taylor series approximation. In linear
regression, the coefficient estimates, **b**, are a linear function of
**y**; namely,

b = (x'x)^-1 x'y

Thus the one-term Taylor series is exact and not an approximation.

For logistic regression and other MLEs, the ignored higher-order terms in the Taylor series are nonzero. So it’s truly an approximation in these cases.
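The contrast between the two estimators can be sketched numerically. Below is a minimal pure-Python illustration, with invented data, of the model-based and robust (sandwich, HC0-style) variance estimates for the slope in simple linear regression; the robust formula here is the sandwich (X'X)^-1 X' diag(e^2) X (X'X)^-1 reduced to its slope element.

```python
def ols_slope_variances(x, y):
    # Simple linear regression y = a + b*x, returning the slope plus
    # two variance estimates for it
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    e = [yi - (a + b * xi) for xi, yi in zip(x, y)]  # residuals
    # Model-based variance: assumes homoskedastic errors with common
    # variance s^2
    s2 = sum(ei ** 2 for ei in e) / (n - 2)
    v_model = s2 / sxx
    # Robust (HC0) variance: each squared residual enters individually,
    # so no common-variance assumption is needed
    v_robust = sum(((xi - xbar) ** 2) * ei ** 2
                   for xi, ei in zip(x, e)) / sxx ** 2
    return b, v_model, v_robust

# Invented data for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 2.3, 2.8, 4.5, 4.9, 6.4]
b, v_model, v_robust = ols_slope_variances(x, y)
```

When the errors really are homoskedastic, the two variance estimates tend to agree; they diverge as heteroskedasticity or misspecification grows.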

One can come up with a robust variance estimator that uses a second-order correction. It’s been worked out for logistic regression, and it will likely be implemented in Stata at some point.

What are the “true” population parameters for which the robust variance estimator gives good coverage properties?

First let me assume a linear regression model, and later I’ll discuss MLEs.

Consider the entire population from which the sample is drawn. Let **X**
be the matrix of independent variables and **Y** the vector of dependent
variables for the *entire population*. Then consider

B = (X'X)^-1 X'Y

The parameter **B** is the coefficient vector for the linear model for
the entire population. **Y** may be linear in **X** or it may not.
**B** is simply the best least-squares coefficients for the entire
population. **B** is what I was referring to when I said “the
‘true’ population parameters” in my above explanation.

The parameter **B** is what the robust variance estimator considers you
to be estimating. The sample estimate is

b = (x'x)^-1 x'y

where **x** is the matrix of the sample values of the independent
variables and **y** is the vector of sample dependent variables. If
there were sampling weights, the above equation would have weights added in
the appropriate places.

Anyhow, **b** is an estimate of **B**. The robust variance estimator
estimates **V(b)** such that nominal (1 − alpha) confidence
intervals constructed from it have **B** in the interval about (1 −
alpha) of the time if one were to repeatedly resample from this population.
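To make the resampling story concrete, here is a small pure-Python simulation with an invented population: the population relation is deliberately quadratic, we fit a straight line anyway, and we check how often the robust-variance confidence interval covers the population least-squares slope **B**.

```python
import random

def slope_and_robust_se(x, y):
    # Slope of the least-squares line and its robust (HC0) standard error
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    e = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    v_robust = sum(((xi - xbar) ** 2) * ei ** 2
                   for xi, ei in zip(x, e)) / sxx ** 2
    return b, v_robust ** 0.5

# "Entire population": Y is quadratic in X, so the linear model is
# misspecified on purpose
X = [i / 10.0 for i in range(100)]
Y = [xi ** 2 for xi in X]
B, _ = slope_and_robust_se(X, Y)  # best-fitting straight line in the population

rng = random.Random(12345)
covered = 0
reps = 500
for _ in range(reps):
    # Resample the population and refit the same misspecified model
    idx = [rng.randrange(len(X)) for _ in range(60)]
    b, se = slope_and_robust_se([X[i] for i in idx], [Y[i] for i in idx])
    if b - 1.96 * se <= B <= b + 1.96 * se:
        covered += 1
coverage = covered / reps  # should land near the nominal 0.95
```

Even though no straight line describes this population, the nominal 95% intervals built from the robust variance cover **B** at roughly the advertised rate.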

If **Y** is not linear in **X** because of incorrect functional form
or missing predictors, then the interpretation of **B** is problematic.
**B** can be considered to be the best least-squares linear fit for this
set of predictors. **b** and **V(b)** are “robust to
misspecification” in that **b** estimates **B** and that
**V(b)** is a valid estimate of the variance of **b** even though
misspecification is present.

This theory requires no assumptions about the distribution of **Y**.

Contrast this with the case of OLS estimates, which do not give valid
variance estimates, in general, under misspecification, and which do require
distributional assumptions on **Y**—i.e., normality and
homoskedasticity.

For ML models, consider **L(B; Y, X)**, an arbitrary likelihood function
with data **Y**, **X** for the entire population. Let **B*** give
the maximum of **L(B; Y, X)**. The sample estimate **b*** is the
maximum of **L(b; y, x)**. If there are weights, we add weights to the
likelihood function so that **L(b; y, x)** estimates **L(B; Y, X)**.
Because **L(b; y, x)** estimates **L(B; Y, X)**, **b*** estimates
**B***. The robust variance estimator produces correct variance
estimates **V(b*)** for **b*** in the same sense discussed above:
nominal (1 − alpha) confidence intervals constructed from it have
**B*** in the interval about (1 − alpha) of the time if one were to
repeatedly resample from this population.

**L(B; Y, X)** is not necessarily the true likelihood for the population;
i.e., it is not necessarily the correct distribution of **Y|X**. The
theory doesn’t require it; it can be any function.

Standard MLE theory, on the other hand, requires **L(b; y, x)** to be the
true distribution function for the sample.

When there are sampling weights or clustering, **L(b; y, x)** is in no
sense a valid likelihood; it’s clearly not the distribution of the
sample when there are weights or cluster sampling. **L(b; y, x)** merely
has to estimate the arbitrary **L(B; Y, X)** for our theory to hold.
This is why the survey theorists call **L(b; y, x)** a pseudolikelihood,
and it’s also why you can’t do standard likelihood ratio tests
with it.
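A pseudolikelihood of this kind is easy to write down. Here is a minimal pure-Python sketch of a weighted log likelihood for logistic regression, with invented data and weights; with the **w_{i}** present it is clearly not the distribution of the sample, merely an estimate of the population criterion.

```python
import math

def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))

def weighted_loglik(b0, b1, x, y, w):
    # Pseudo log likelihood: each observation's Bernoulli log-likelihood
    # contribution is scaled by its sampling weight w_i
    total = 0.0
    for xi, yi, wi in zip(x, y, w):
        p = inv_logit(b0 + b1 * xi)
        total += wi * (yi * math.log(p) + (1 - yi) * math.log(1 - p))
    return total

# Invented data and hypothetical sampling weights
x = [0.1, 0.5, 1.2, 2.0]
y = [0, 0, 1, 1]
w = [2.0, 1.0, 1.0, 3.0]
ll = weighted_loglik(-1.0, 1.0, x, y, w)
```

Maximizing this function gives the pseudo-MLE; but because it is not a true likelihood, differences of its values across nested models are not chi-squared, which is why standard likelihood-ratio tests are off the table.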

However, if **L(B; Y, X)** is not close to the true distribution, its
interpretation is problematic, just as in the case of a misspecified linear
regression.