What are the advantages of using the robust variance estimator over
the standard maximum-likelihood variance estimator in logistic regression?
Title:  Advantages of the robust variance estimator
Author: Bill Sribney, StataCorp
Date:   January 1998
I once overheard a famous statistician say the robust variance
estimator for (unclustered) logistic regression is stupid. His reason was
that if the outcome variable is binary then it’s got to be a Bernoulli
distribution. It’s not like linear regression with data that stand a
good chance of being nonnormal and heteroskedastic.
Well, it’s not as simple as this; there are a bunch of subtle
issues involved. Let me lay them out here. I’m sure the famous
statistician is aware of them, but they don’t necessarily lead to his
conclusion.
It’s true that it has got to be a Bernoulli distribution. That
is, if Y_i is a random variable for the outcome of the
i-th unit, then

P(Y_i = 1) = p_i

or, equivalently,

E(Y_i) = p_i

This has to be true. Note how I indexed the RHS by i. The
term p_i is dependent on i. It’s certainly
not true in general that P(Y_i = 1) = p, where p is a
constant independent of i.
In logistic regression, we model p_i with a likelihood that
assumes

logit(p_i) = x_i*b

So these are our assumptions:

1. The logit link function is correct.
2. logit(p_i) = x_i*b; i.e., that the relation is linear and all
   necessary predictors are in the model; i.e., the model is correctly
   specified.
3. The i = 1, ..., N observations are independent.
The robust variance estimator is robust to violations of assumptions (1) and
(2). It does require (3), but you can specify clusters and just assume
independence of the clusters if you wish.
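To make the quantities in this discussion concrete, here is a minimal numpy
sketch (an illustration only; it is not Stata code and not Stata’s exact
implementation, which adds small-sample corrections) of a logit fit together
with the standard ML variance estimate, the robust sandwich estimate, and a
cluster-robust version that sums score contributions within clusters:

    import numpy as np

    def logit_fit(X, y, n_iter=25):
        """Fit a logistic regression by Newton-Raphson; returns the MLE b."""
        b = np.zeros(X.shape[1])
        for _ in range(n_iter):
            p = 1.0 / (1.0 + np.exp(-X @ b))
            H = X.T @ ((p * (1.0 - p))[:, None] * X)    # information matrix X'WX
            b = b + np.linalg.solve(H, X.T @ (y - p))   # Newton step using the score
        return b

    def logit_variances(X, y, b, cluster=None):
        """Model-based (ML) variance and robust sandwich variance for the logit MLE b."""
        p = 1.0 / (1.0 + np.exp(-X @ b))
        A_inv = np.linalg.inv(X.T @ ((p * (1.0 - p))[:, None] * X))
        U = X * (y - p)[:, None]                        # score contribution of each observation
        if cluster is not None:
            # cluster-robust: sum the score contributions within each cluster first
            U = np.array([U[cluster == g].sum(axis=0) for g in np.unique(cluster)])
        V_ml = A_inv                                    # standard ML variance: inverse information
        V_robust = A_inv @ (U.T @ U) @ A_inv            # sandwich estimator (no small-sample factor)
        return V_ml, V_robust

    # illustrative use on simulated data
    rng = np.random.default_rng(1)
    n = 1000
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ np.array([-0.5, 1.0])))))
    b = logit_fit(X, y)
    V_ml, V_robust = logit_variances(X, y, b)
    print(np.sqrt(np.diag(V_ml)), np.sqrt(np.diag(V_robust)))

When the model is correctly specified, the two sets of standard errors will
usually be close; the differences matter under failures of (1) and (2), or
with clustering.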
The MLE is also quite robust to (1) being wrong. If the link function is
really probit and you estimate a logit, everything’s almost always
fine.
Now if (2) is wrong, strictly speaking, you are in trouble with the
interpretation of the point estimates of your model, never mind the variance
estimates. Imagine logit(p_i) is truly quadratic in
x_i, but you fit a model that is linear in x_i.
What’s the interpretation of the coefficient? It’s the best fit
of a straight line to something that’s not straight! It’s got
some meaning, but it’s somewhat problematic.
Well, the robust variance estimator will do a good job of giving you
variance estimates and confidence intervals for this problematic case of a
misspecified model. That is, if one imagines resampling the data and each
time fitting the same misspecified model, then you get good coverage
probabilities with respect to the “true” population parameters
of the misspecified model, i.e., the best-fitting straight line in the
population to something that’s not straight.
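Here is a rough simulation sketch of that claim (all of the numbers and
functional forms below are made up purely for illustration): the true logit
is quadratic in x, we deliberately fit a model that is linear in x,
approximate the population best-fitting linear coefficients B by fitting the
misspecified model to a very large sample, and then count how often nominal
95% intervals based on the ML and robust variance estimates cover the slope
of B.

    import numpy as np

    rng = np.random.default_rng(0)

    def fit(X, y, n_iter=30):
        """Newton-Raphson logit MLE plus ML and sandwich variance estimates."""
        b = np.zeros(X.shape[1])
        for _ in range(n_iter):
            p = 1.0 / (1.0 + np.exp(-X @ b))
            H = X.T @ ((p * (1.0 - p))[:, None] * X)
            b = b + np.linalg.solve(H, X.T @ (y - p))
        p = 1.0 / (1.0 + np.exp(-X @ b))
        A_inv = np.linalg.inv(X.T @ ((p * (1.0 - p))[:, None] * X))
        U = X * (y - p)[:, None]
        return b, A_inv, A_inv @ (U.T @ U) @ A_inv      # b, V_ml, V_robust

    def draw(n):
        """Truth: logit(p) is quadratic in x; the fitted model will be linear in x."""
        x = rng.uniform(-2, 2, n)
        p = 1.0 / (1.0 + np.exp(-(0.5 + x - 0.8 * x**2)))
        return np.column_stack([np.ones(n), x]), rng.binomial(1, p)

    # Approximate the population best-fitting linear-logit coefficients B by
    # fitting the misspecified (linear) model to a very large sample.
    X_big, y_big = draw(500_000)
    B = fit(X_big, y_big)[0]

    n, reps, z = 500, 2000, 1.96
    hits_ml = hits_robust = 0
    for _ in range(reps):
        X, y = draw(n)
        b, V_ml, V_robust = fit(X, y)
        hits_ml += abs(b[1] - B[1]) <= z * np.sqrt(V_ml[1, 1])
        hits_robust += abs(b[1] - B[1]) <= z * np.sqrt(V_robust[1, 1])

    print("coverage of the population slope B[1]: ML",
          hits_ml / reps, "  robust", hits_robust / reps)

Under this kind of misspecification, one would expect the robust intervals
to come closer to the nominal 95% coverage of B than the ML intervals, which
is the sense of “good coverage probabilities” above.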
On the other hand, if you have confidence that your model is not
misspecified, then the ML variance estimator is theoretically more
efficient.
Advice
In summary, my personal advice (and I have respect for conflicting opinions)
is
- I never worry about whether (1) is true. I assume the logit
link is OK.
- If I think the model is reasonably specified, I use the ML
variance estimator for logistic regression.
- Only if I have good reason to believe that the model is poorly
specified would I use the robust variance estimator. That is, if the
model fails goodness-of-fit tests, etc. Sometimes one just has to live
with missing predictors and badly fitting models because data were
collected for only a few predictors. In this case, I’d use the
robust variance estimator.
And, obviously, I’d use the robust variance estimator if I had
clustered data.
This recommendation is in contrast to the advice I’d give for linear
regression for which I’d say always use the robust variance
estimator.
The robust variance estimator is only approximate for ML models
There are also other theoretical reasons to be keener on the robust variance
estimator for linear regression than for general ML models. The robust
variance estimator uses a one-term Taylor series approximation. In linear
regression, the coefficient estimates, b, are a linear function of
y; namely,
b = (X'X)^-1 X'y
Thus the one-term Taylor series is exact and not an approximation.
For logistic regression and other MLEs, the ignored higher-order terms in
the Taylor series are nonzero. So it’s truly an approximation in
these cases.
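In symbols (my own notation, not anything from the original FAQ): the MLE b
solves the score equations, and the robust estimator comes from a one-term
expansion of the score about the target parameter B,

    0 = \sum_i u_i(b) \approx \sum_i u_i(B) - A\,(b - B)
        \quad\Longrightarrow\quad
        b - B \approx A^{-1} \sum_i u_i(B),
    \qquad
    \widehat{V}(b) = A^{-1} \Big( \sum_i u_i u_i' \Big) A^{-1},

where u_i is the score contribution of observation i and A is minus the
Hessian of the log likelihood. For linear regression the score is linear in
b, so the expansion drops nothing; for logistic regression it is not, which
is the sense in which the sandwich is only an approximation.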
One can come up with a robust variance estimator that uses a second-order
correction. It’s been worked out for logistic regression, and it will
likely be implemented in Stata at some point.
Follow-up question
What are the “true” population parameters for which the robust
variance estimator gives good coverage properties?
Linear regression model
First let me assume a linear regression model, and later I’ll discuss MLEs.
Consider the entire population from which the sample is drawn. Let X
be the matrix of independent variables and Y the vector of dependent
variables for the entire population. Then consider
B = (X'X)^-1 X'Y
The parameter B is the coefficient vector for the linear model for
the entire population. Y may be linear in X or it may not.
B is simply the best least-squares coefficients for the entire
population. B is what I was referring to when I said “the
‘true’ population parameters” in my above explanation.
The parameter B is what the robust variance estimator considers you
to be estimating. The sample estimate is
b = (x'x)^-1 x'y
where x is the matrix of the sample values of the independent
variables and y is the vector of sample dependent variables. If
there were sampling weights, the above equation would have weights added in
the appropriate places.
Anyhow, b is an estimate of B. The robust variance estimator
estimates V(b) such that nominal (1 − alpha) confidence
intervals constructed from it have B in the interval about (1 −
alpha) of the time if one were to repeatedly resample from this population.
If Y is not linear in X because of incorrect functional form
or missing predictors, then the interpretation of B is problematic.
B can be considered to be the best least-squares linear fit for this
set of predictors. b and V(b) are “robust to
misspecification” in that b estimates B and that
V(b) is a valid estimate of the variance of b even though
misspecification is present.
This theory requires no assumptions about the distribution of Y.
Contrast this with the conventional OLS variance estimator, which does not
give valid variance estimates, in general, under misspecification, and which
does require distributional assumptions on Y, i.e., normality and
homoskedasticity.
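A small numpy sketch of exactly this setup (the population and its
nonlinear, heteroskedastic relationship are invented for illustration):
build a finite population in which Y is not linear in X, compute
B = (X'X)^-1 X'Y directly, then draw repeated simple random samples and
check how often intervals based on the robust (sandwich) variance estimate
of b cover the slope of B.

    import numpy as np

    rng = np.random.default_rng(2)

    # A finite "population" in which Y is deliberately not linear in X
    N = 200_000
    x = rng.uniform(0, 3, N)
    Y = np.sin(x) + rng.normal(scale=0.3 * (1 + x), size=N)   # nonlinear mean, heteroskedastic
    X = np.column_stack([np.ones(N), x])
    B = np.linalg.solve(X.T @ X, X.T @ Y)                     # best least-squares fit in the population

    n, reps, z, hits = 400, 2000, 1.96, 0
    for _ in range(reps):
        idx = rng.choice(N, size=n, replace=False)            # a simple random sample
        xs, ys = X[idx], Y[idx]
        xtx_inv = np.linalg.inv(xs.T @ xs)
        b = xtx_inv @ xs.T @ ys                               # sample estimate b = (x'x)^-1 x'y
        e = ys - xs @ b
        V_robust = xtx_inv @ (xs.T @ (e[:, None] ** 2 * xs)) @ xtx_inv   # sandwich (HC0-style)
        hits += abs(b[1] - B[1]) <= z * np.sqrt(V_robust[1, 1])

    print("robust-CI coverage of the population slope B[1]:", hits / reps)

Running the same check with the conventional OLS variance estimate in place
of the sandwich would illustrate the contrast drawn above.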
ML models
For ML models, consider L(B; Y, X), an arbitrary likelihood function
with data Y, X for the entire population. Let B* be the value
that maximizes L(B; Y, X). The sample estimate b* maximizes
L(b; y, x). If there are weights, we add weights to the
likelihood function so that L(b; y, x) estimates L(B; Y, X).
Because L(b; y, x) estimates L(B; Y, X), b* estimates
B*. The robust variance estimator produces correct variance
estimates V(b*) for b* in the same sense discussed above:
nominal (1 − alpha) confidence intervals constructed from it have
B* in the interval about (1 − alpha) of the time if one were to
repeatedly resample from this population.
L(B; Y, X) is not necessarily the true likelihood for the population;
i.e., it is not necessarily the correct distribution of Y|X. The
theory doesn’t require it; it can be any function.
Standard MLE theory, on the other hand, requires L(b; y, x) to be the
true distribution function for the sample.
When there are sampling weights or clustering, L(b; y, x) is in no
sense a valid likelihood; it’s clearly not the distribution of the
sample. L(b; y, x) merely
has to estimate the arbitrary L(B; Y, X) for our theory to hold.
This is why the survey theorists call L(b; y, x) a pseudolikelihood,
and it’s also why you can’t do standard likelihood ratio tests
with it.
However, if L(B; Y, X) is not close to the true distribution, its
interpretation is problematic, just as in the case of a misspecified linear
regression.
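To make the pseudolikelihood point concrete, here is a sketch (again
simplified: it ignores strata, finite-population corrections, and the
small-sample factors a survey package would apply) of pseudo-ML logit with
sampling weights w. The point estimates solve weighted score equations, the
variance to report is the weighted sandwich, and the weighted “log
likelihood” being maximized is not the log of any sampling distribution,
which is why likelihood-ratio tests based on it are not valid.

    import numpy as np

    def logit_pseudo_ml(X, y, w, n_iter=25):
        """Pseudo-ML logit with sampling weights w: weighted scores plus the weighted sandwich."""
        b = np.zeros(X.shape[1])
        for _ in range(n_iter):
            p = 1.0 / (1.0 + np.exp(-X @ b))
            A = X.T @ ((w * p * (1.0 - p))[:, None] * X)       # weighted information
            b = b + np.linalg.solve(A, X.T @ (w * (y - p)))    # weighted score step
        p = 1.0 / (1.0 + np.exp(-X @ b))
        A_inv = np.linalg.inv(X.T @ ((w * p * (1.0 - p))[:, None] * X))
        U = X * (w * (y - p))[:, None]                         # weighted score contributions
        V_robust = A_inv @ (U.T @ U) @ A_inv                   # the sandwich is the variance to report
        return b, V_robust

    # illustrative use: keep all y == 1 cases but only about 25% of the y == 0 cases,
    # and weight the retained controls by the inverse of their sampling probability
    rng = np.random.default_rng(3)
    n = 5000
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ np.array([-1.0, 0.8])))))
    keep = (y == 1) | (rng.random(n) < 0.25)
    w = np.where(y[keep] == 1, 1.0, 4.0)
    b, V_robust = logit_pseudo_ml(X[keep], y[keep], w)
    print(b, np.sqrt(np.diag(V_robust)))

The weighted objective estimates the population quantity L(B; Y, X), which
is the sense in which b* estimates B*.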