Re: st: Strange -robust- results with a singleton dummy


From   vwiggins@stata.com (Vince Wiggins, StataCorp)
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Strange -robust- results with a singleton dummy
Date   Mon, 30 Jun 2003 16:25:54 -0500

Mark Schaffer <M.E.Schaffer@hw.ac.uk> is estimating a model with an indicator
(dummy) variable that is 1 in only a single observation and 0 everywhere else
and he wants an explanation for some things he notices about the
variance-covariance matrix,

> I've encountered (via David Stromberg) a peculiar feature of
> regression with heteroskedastic-robust SEs when using dummy
> variables.
>
> If a dummy variable takes the value of 1 for a single observation,
> and zeros for the rest, some strange things happen:
>
> 1. The robust SEs still look quite plausible.
>
> 2. The F-stat is reported as missing.  There is a hyperlink for the
> missing F-stat in the regression output (Stata v7) but it doesn't
> mention the singleton dummy as a possible explanation.
>
> 3. The robust var-cov matrix is not of full rank.  Invert it and one
> of the row/columns becomes all zeros (but not necessarily the one
> corresponding to the singleton dummy).

Mark then goes on to ask 3 questions.  The questions help tell the story, so
let's take them in order.

> Does anybody have any ideas on how to interpret this?

Mechanically it is pretty easy to see what is happening.  The robust
covariance matrix is:

        V_robust = DGD

        where:
                D  is the negative inverse Hessian (the most often used
                   estimate of the covariance matrix).

                G  is the outer product of the score (or gradient) vectors
                   for each observation, often called the OPG. (Also, a
                   perfectly valid estimate of the covariance matrix and
                   typically used when estimating by BHHH.)


        G = g'g

        where:
                       d(L_i)
                g_ik = ------
                       d(B_k)

        and:
                L_i  is the quasi-likelihood of the ith observation
                B_k  is the kth coefficient

        So, g is an N by K matrix, where K is the number of parameters.

We have started from a quasi-likelihood using L_i, but we could have started
from the estimating equations (or normal equations) for OLS; it makes no
difference.
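
For the OLS case, the pieces above can be written out explicitly.  What
follows is only a sketch under assumptions not spelled out in the post: x_i is
the row of regressors for observation i, e_i is the OLS residual, scale
factors are dropped because they cancel in DGD, and any finite-sample
multiplier that software may apply is ignored.

```latex
\begin{aligned}
g_{ik} &= x_{ik}\, e_i, \qquad e_i = y_i - x_i b \\
G &= g'g = \sum_{i=1}^{N} e_i^{2}\, x_i' x_i \\
D &\propto (X'X)^{-1} \\
V_{\mathrm{robust}} &= DGD
  = (X'X)^{-1} \Bigl( \sum_{i=1}^{N} e_i^{2}\, x_i' x_i \Bigr) (X'X)^{-1}
\end{aligned}
```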

When we have an indicator variable that is 1 for a single observation and 0
everywhere else, the column vector g_k has a very distinctive pattern -- it is
all zeros.

              d(L_i)
        g_i = --------------  = 0  whenever the indicator is 0 
              d(B_indicator)       because B_indicator*0 is 0. 


              d(L_i)
        g_i = --------------  = 0  for the single observation where 
              d(B_indicator)       indicator=1 because the moment conditions
                                   for maximizing the quasi-likelihood are that
                                   the gradient for each coefficient is 0.
                                   Since all of the other observations have
                                   the indicator set to 0, only this
                                   observation contributes to the gradient and
                                   it is set to 0 by the moment condition in
                                   choosing B_indicator.

                                   Put another way, the scores for a
                                   coefficient (g_i) must sum to 0 and since
                                   the score is 0 when the variable is 0 and
                                   since the variable is non-zero for only one
                                   observation, the score for that
                                   observation must also be 0.

All of this means that the column of g corresponding to the indicator variable
is all 0 and thus G = g'g is not full rank, and thus V_robust=DGD is also not
full rank.
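
The argument above can be checked numerically.  The following is a minimal
sketch, not Stata's implementation: it assumes plain OLS, uses six made-up
observations with an indicator d that is 1 only in the last one, and works in
exact rational arithmetic so that "zero" and "rank" are exact.  The helper
names (transpose, matmul, inverse, rank) are hypothetical, and the scores are
taken as x_ik * e_i with scale factors dropped.

```python
from fractions import Fraction as F

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def inverse(A):
    """Gauss-Jordan inverse of a square, nonsingular matrix of Fractions."""
    n = len(A)
    M = [list(row) + [F(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]

def rank(A):
    """Exact rank by row reduction (entries are Fractions)."""
    M = [list(row) for row in A]
    rows, cols, rk = len(M), len(M[0]), 0
    for c in range(cols):
        p = next((r for r in range(rk, rows) if M[r][c] != 0), None)
        if p is None:
            continue
        M[rk], M[p] = M[p], M[rk]
        for r in range(rk + 1, rows):
            f = M[r][c] / M[rk][c]
            M[r] = [vr - f * vk for vr, vk in zip(M[r], M[rk])]
        rk += 1
    return rk

# Made-up data (an assumption for illustration only).
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 7, 10]
d = [0, 0, 0, 0, 0, 1]          # singleton indicator

# Columns of X: constant, x, indicator.
X = [[F(1), F(xi), F(di)] for xi, di in zip(x, d)]
Y = [[F(yi)] for yi in y]
Xt = transpose(X)

D = inverse(matmul(Xt, X))       # (X'X)^{-1}, i.e. D up to scale
b = matmul(D, matmul(Xt, Y))     # OLS coefficients
resid = [yi[0] - sum(xij * bj[0] for xij, bj in zip(xi, b))
         for xi, yi in zip(X, Y)]

g = [[xij * ei for xij in xi] for xi, ei in zip(X, resid)]  # N x K scores
G = matmul(transpose(g), g)      # outer product of the scores
V = matmul(matmul(D, G), D)      # robust covariance DGD, no small-sample factor

indicator_scores = [row[2] for row in g]
print("indicator scores:", [str(s) for s in indicator_scores])
print("rank(G) =", rank(G), " rank(V) =", rank(V), " parameters =", len(b))
```

With these made-up data the indicator's score column is exactly zero, and both
G and V have rank 2, one less than the three parameters.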

That means Stata cannot compute an overall model F-statistic because the rank
of the covariance matrix is not sufficient to test the hypothesis that all of
the coefficients are simultaneously 0.  This is what Mark noticed in his items
(2) and (3).


Mark's second question was,

> Are the robust SEs usable anyway?

Yes.  

We wrote everything in matrix notation because it is easier and because it
clearly shows why the covariance matrix is not full rank.  If, however, we
simply wrote out the formula for the SE of a single parameter (let's not) we
would see that it can be evaluated and is just a sum of specific element-wise
products from the elements of D and G.  That G is not full rank does not cause
us any problems in computing the SE for a single coefficient.
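
For readers who do want the formula, here is a sketch in the notation above.
The index m for the singleton indicator is a label introduced here, not part
of the original post.  Because the row and column of G belonging to the
indicator are zero, the terms involving m simply drop out of the sum, and what
remains is an ordinary quadratic form in the other scores.

```latex
\begin{aligned}
V_{jj} &= \sum_{a}\sum_{b} D_{ja}\, G_{ab}\, D_{bj}
        = \sum_{a \neq m}\sum_{b \neq m} D_{ja}\, G_{ab}\, D_{bj},
        \qquad \text{since } G_{ab} = 0 \text{ whenever } a = m \text{ or } b = m, \\
\mathrm{SE}(B_j) &= \sqrt{V_{jj}} .
\end{aligned}
```

Nothing in this expression requires G to be invertible, and it applies to
j = m as well.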

Intuitively, the lack of information from the gradients of the singleton
indicator variable does not cause a problem even when estimating the robust SE
for the indicator variable itself.  The gradients from the remaining
coefficients are leveraged to form that estimate.  It is not much different
from our ability to estimate a standard (non-robust) SE for the indicator even
though all of the information content of the single positive observation went
into the parameter estimate.


> Is the robust var-cov matrix still usable?

Mostly.

We can use the covariance matrix to test any joint hypothesis whose number of
restrictions does not exceed its rank.
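
To see where the rank enters, consider the usual Wald form for testing q
linear restrictions RB = r.  This is a textbook sketch, not a description of
what Stata does internally; R, r, q, and b (the estimated coefficients) are
notation introduced here.

```latex
W = (Rb - r)'\,\bigl(R\, V_{\mathrm{robust}}\, R'\bigr)^{-1}\,(Rb - r),
\qquad R \ \text{is} \ q \times K .
```

The middle matrix is q by q and must be inverted, and its rank can never
exceed the rank of V_robust, so q no larger than that rank is a necessary
condition for the statistic to be computable.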


Mark mentioned the link from the unreported F-statistic to an explanation of
why the statistic is not reported.  That link was created when we were
considering the issue of fewer clusters than parameters using a clustered
version of the robust variance estimator.  The link does not discuss the issue
that Mark and David uncovered.  We had not even considered the question of
singleton indicators or, as it turns out, ANY data and model that lead to
all-0 scores or, by extension, scores that are collinear.  The two cases, too
few clusters and singleton indicators, produce the same problem: a G matrix
that is not full rank.  We will update that link to be more complete.
Unfortunately, anyone who has read the nice clear discussion in the link and
has also read the above will realize that the discussion in the link is about
to become more complicated.

 
-- Vince
   vwiggins@stata.com

P.S. A little (unrelated) story
     --------------------------

You might wonder why I always use the word "indicator" to describe binary
regressors, even when the original poster called them "dummy" variables.  I
was once briefing a group of mainly military personnel about the implications
of some model.  About the third time I referred to the "colonel dummy"
everyone at the table broke out in laughter; everyone, that is, except the older
gentleman at the head of the table with lots of bars on his lapel and little
bird emblems on his shoulders.

I long ago forgot the subject of the talk, but I never forgot the lesson.
