Title | Maximum likelihood estimation with vce(cluster clustvar) |
Author | Bill Sribney, StataCorp |
No, they are not true maximum likelihood estimates.
Traditional maximum likelihood theory requires that the likelihood function be the distribution function for the sample.
When you have clustering, the observations are no longer independent; thus the joint distribution function for the sample is no longer the product of the distribution functions for each observation. That is, the joint distribution f(Y) is not
$$\prod_{i=1}^{n} f_i(y_i)$$
Thus
$$\sum_{i=1}^{n} \log f_i(y_i)$$
is not the true log-likelihood for the sample.
Unless one fully parameterizes the correlation within clusters (as in, say, a random-effects probit), one cannot write down the true likelihood for the sample.
The robust estimator used by probit, vce(cluster clustvar) and by svy: probit does not assume any particular model for the within-cluster correlation. Instead, these commands merely assume that the values of b that maximize
$$\sum_{i=1}^{n} \log f_i(b; y_i)$$
(call them bhat) are a reasonable estimate of the true b.
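To make this step concrete, here is a minimal numerical sketch in Python with NumPy (not Stata code). It assumes a logit form for f_i and simulated clustered data; the helper names pseudo_loglik and fit_logit and every numeric value are hypothetical. It computes bhat by maximizing the sum of the individual log densities while ignoring the within-cluster correlation.

```python
import numpy as np

def pseudo_loglik(b, X, y):
    """Sum over observations of log f_i(b; y_i), assuming a logit model."""
    xb = X @ b
    # log f_i = y_i * xb_i - log(1 + exp(xb_i)), in a numerically stable form
    return float(np.sum(y * xb - np.logaddexp(0.0, xb)))

def fit_logit(X, y, iters=25):
    """Newton-Raphson maximization of the pseudo-log-likelihood."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ b)))
        score = X.T @ (y - p)                          # gradient
        info = (X * (p * (1.0 - p))[:, None]).T @ X    # minus the Hessian
        b = b + np.linalg.solve(info, score)
    return b

# Simulated data (an assumption for illustration): 50 clusters of 20
# observations, with a shared shock u that correlates outcomes within a cluster.
rng = np.random.default_rng(12345)
n_clusters, m = 50, 20
cluster = np.repeat(np.arange(n_clusters), m)
u = rng.normal(0.0, 1.0, n_clusters)[cluster]
x = rng.normal(size=n_clusters * m)
X = np.column_stack([np.ones_like(x), x])
p_true = 1.0 / (1.0 + np.exp(-(-0.5 + 1.0 * x + u)))
y = (rng.uniform(size=x.size) < p_true).astype(float)

# The fit never looks at `cluster`: the correlation is simply ignored.
bhat = fit_logit(X, y)
print("bhat:", bhat)
print("pseudo-log-likelihood at bhat:", pseudo_loglik(bhat, X, y))
```

The point of the sketch is only that bhat is well defined as the maximizer of this sum even though the sum is not the log of the joint density of the sample.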
At this point in this discussion, the key question to ask is, What is the true b that is being estimated? It is the values of b that maximize
$$\sum_{i=1}^{N} \log f_i(b; y_i)$$
where now the sum is over all individuals (i = 1,...,N) in the population from which the sample was drawn. That is, the true b is the solution of the maximum likelihood equation that we would have if we had data on all individuals in the population.
We are justified in using bhat as an estimate for the true b if
$$\sum_{i=1}^{n} \log f_i(b; y_i)$$
is a good estimate for
$$\sum_{i=1}^{N} \log f_i(b; y_i)$$
which is a reasonable assumption, even if we have clustering.
If we have sampling weights, w_{i}, then we get bhat as the solution to
$$\sum_{i=1}^{n} w_i \log f_i(b; y_i)$$
since it is reasonable to assume
$$\sum_{i=1}^{n} w_i \log f_i(b; y_i)$$
is a good estimate for
$$\sum_{i=1}^{N} \log f_i(b; y_i)$$
Since the likelihood used to derive bhat in the case of clustering or sampling weights is not a true likelihood, it is called a pseudolikelihood.
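The claim that the weighted sample sum estimates the population sum can be checked numerically. The following Python sketch rests on assumptions that are not part of the discussion above: a simulated finite population, a logit form for f_i, and a sampling scheme in which the inclusion probability depends on the outcome, with sampling weights equal to the inverse of that probability. All names and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2024)

def log_f(b, X, y):
    """Per-observation log f_i(b; y_i) for a logit model."""
    xb = X @ b
    return y * xb - np.logaddexp(0.0, xb)

# Hypothetical finite population of N individuals.
N = 200_000
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
b = np.array([-0.5, 1.0])   # any fixed b works for this comparison
y = (rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-(X @ b)))).astype(float)

# Unequal-probability sampling: individuals with y = 1 are oversampled.
pi = np.where(y == 1.0, 0.20, 0.05)
sampled = rng.uniform(size=N) < pi
w = 1.0 / pi[sampled]       # sampling weights w_i = 1 / inclusion probability

population_sum = log_f(b, X, y).sum()
weighted_sample_sum = (w * log_f(b, X[sampled], y[sampled])).sum()

print("population sum:     ", population_sum)
print("weighted sample sum:", weighted_sample_sum)
```

Under this design the two printed values should be close, and that closeness is what justifies maximizing the weighted sum in place of the unobservable population sum.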
The variance estimates are now computed using sampling theory. That is, we ask: if the sample were drawn again and again using the same scheme (i.e., clustered or weighted) and bhat were mechanically computed each time as the maximum of the pseudolikelihood, what would the variance of bhat be?
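One common way to answer that question without actually redrawing the sample is the linearization ("sandwich") estimator, which sums the score vectors within each cluster before forming their outer products. The Python sketch below illustrates this under stated assumptions: a logit model, simulated data, and a G/(G-1) small-sample multiplier, where G is the number of clusters. The multiplier and all function names are assumptions made for illustration, not a transcription of Stata's internals.

```python
import numpy as np

def fit_logit(X, y, iters=25):
    """Maximize the logit pseudo-log-likelihood by Newton-Raphson."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ b)))
        info = (X * (p * (1.0 - p))[:, None]).T @ X    # minus the Hessian
        b = b + np.linalg.solve(info, X.T @ (y - p))
    return b

def cluster_robust_vce(X, y, b, cluster):
    """Sandwich variance estimate with scores summed within clusters."""
    p = 1.0 / (1.0 + np.exp(-(X @ b)))
    scores = X * (y - p)[:, None]                      # one score row per observation
    info = (X * (p * (1.0 - p))[:, None]).T @ X
    ids = np.unique(cluster)
    S = np.vstack([scores[cluster == g].sum(axis=0) for g in ids])
    G = len(ids)
    meat = (G / (G - 1.0)) * (S.T @ S)                 # assumed small-sample multiplier
    info_inv = np.linalg.inv(info)
    return info_inv @ meat @ info_inv

# Simulated clustered data (hypothetical): 60 clusters of 15 observations,
# a cluster-level regressor x, and a shared within-cluster shock u.
rng = np.random.default_rng(7)
n_clusters, m = 60, 15
cluster = np.repeat(np.arange(n_clusters), m)
u = rng.normal(0.0, 1.0, n_clusters)[cluster]
x = rng.normal(size=n_clusters)[cluster]
X = np.column_stack([np.ones(x.size), x])
p_true = 1.0 / (1.0 + np.exp(-(0.3 + 0.8 * x + u)))
y = (rng.uniform(size=x.size) < p_true).astype(float)

bhat = fit_logit(X, y)
V_cluster = cluster_robust_vce(X, y, bhat, cluster)
# Treating every observation as its own cluster gives the unclustered robust estimate.
V_indep = cluster_robust_vce(X, y, bhat, np.arange(y.size))

print("clustered SEs:  ", np.sqrt(np.diag(V_cluster)))
print("unclustered SEs:", np.sqrt(np.diag(V_indep)))
```

Both calls use the same bhat; only the grouping of the scores changes, which is why clustering affects the standard errors but not the point estimates.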
Since traditional likelihood theory cannot be invoked for clustering or weighted sampling, one should not use traditional likelihood-ratio tests in these cases.
Is there a difference between the estimates produced by the svy: probit command and probit, vce(cluster clustvar) (and, similarly, between svy: logit, with psu variable specified in svyset and logit, vce(cluster clustvar))?
The point estimates and variance estimates are always the same.
The commands differ only in some small details. svy: probit and svy: logit use t statistics, whereas probit, vce(cluster clustvar) and logit, vce(cluster clustvar) use z statistics. The degrees of freedom for the t in svy: probit and svy: logit are the number of clusters (PSUs) minus the number of strata (one if unstratified). Strictly speaking, svy: probit and svy: logit are doing things right, but the difference matters only if you have a small number of clusters (say <40).
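As a worked illustration with hypothetical numbers (not taken from any dataset), suppose the design has 30 PSUs in a single stratum. The degrees of freedom d and the two-sided 95% critical values are then approximately

$$d = n_{\mathrm{PSU}} - n_{\mathrm{strata}} = 30 - 1 = 29, \qquad t_{0.975,\,29} \approx 2.045 \quad \text{versus} \quad z_{0.975} \approx 1.960,$$

so confidence intervals based on t would be roughly 4% wider than those based on z. With 200 PSUs in one stratum, the two critical values differ by less than 1%.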
svy: probit and svy: logit also use an adjusted Wald test for the model test. probit, vce(cluster clustvar) and logit, vce(cluster clustvar) use an ordinary Wald test. Again, this difference matters only if you have a small number of clusters.
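For reference, the adjusted Wald statistic is usually written as below. Here W is the ordinary Wald statistic for the k coefficients being tested and d is the design degrees of freedom (number of PSUs minus number of strata); the symbols W, k, and d are notation introduced for this sketch, which gives the standard textbook form rather than a quotation from the manual.

$$F = \frac{d - k + 1}{k\,d}\, W \ \text{ is referred to } F(k,\ d - k + 1) \quad \text{(adjusted)}, \qquad W \ \text{ is referred to } \chi^2(k) \quad \text{(ordinary)}.$$

When d is large relative to k, the factor (d - k + 1)/d is close to 1 and the two tests give nearly the same p-value.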
For a description of the variance estimator, see [SVY] variance estimation and [P] _robust in the Stata reference manuals.
Two standard references for this variance estimator as applied to pseudolikelihoods are