Why does clogit sometimes report a coefficient but missing value for the standard error, confidence interval, etc.?

Why is there no intercept in the clogit model?

Why can’t I use covariates that are constant within panel?

Title   Within group collinearity in conditional logistic regression
Author William Gould, StataCorp
Date November 1999

The short answer is that the variable reported with the missing standard error is “within-group collinear” with other covariates in the model. You need to drop the within-group collinear variable and reestimate. You can verify that within-group collinearity is the problem by using fixed-effects regressions on the covariates.

All of that is explained below and, along the way, we also explain why clogit sometimes produces the messages “var dropped due to no within-group variance” and “var dropped due to collinearity”.

The contents of this FAQ are

1. The conditional logistic model
2. Model derivation
2.1 Notation
2.2 Intercept
2.3 Within-group constants
2.4 Collinearity
2.5 Within-group collinearity
3. Recommendation

1. The conditional logistic model

Conditional logistic regression is similar to ordinary logistic regression except that the data occur in groups,

    group 1:
            obs. 1  outcome=1   x1 = ...  x2 = ... 
            obs. 2  outcome=0   x1 = ...  x2 = ...
    group 2:
            obs. 3  outcome=1   x1 = ...  x2 = ...
            obs. 4  outcome=0   x1 = ...  x2 = ...
            obs. 5  outcome=0   x1 = ...  x2 = ...
    group 3:
            ...
    .
    .
    group G:
            ...

and we wish to condition on the number of positive outcomes within group. That is, we seek to fit a logistic model that explains why obs. 1 had a positive outcome in group 1 conditional on one of the observations in the group having a positive outcome.

In biostatistical applications, this need arises because researchers collect data on the sick and infected (the so-called positive outcomes) and then match those cases with controls who are not sick and infected. Thus the number of positive outcomes is not a random variable. Within each group, there had to be the observed number of positive outcomes because that is how the data were constructed.

Economists refer to this same model as the McFadden choice model. An individual is faced with an array of choices and must choose one.

Regardless of the justification, we are seeking to fit a model that explains why obs. 1 had a positive outcome in group 1, obs. 3 in group 2, and so on.

2. Model derivation

We assume that the unconditional probability of a positive outcome is given by the standard logit equation

    Pr(positive outcome) = G(x*b) 
                         = e^(x*b)/(1+e^(x*b))                      (1)

Equation (1) is not the appropriate probability for our data because it does not account for the conditioning. In the first group, for instance, we want

    Pr(obs. 1 positive and obs. 2 negative | one positive outcome) 

and that is easy enough to write down in terms of the unconditional probabilities. It is

                  Pr(1 positive)*Pr(2 negative)
    -------------------------------------------------------------    (2)
    Pr(1 positive)*Pr(2 negative) + Pr(1 negative)*Pr(2 positive)

From now on, when I write Pr(1 positive) and Pr(2 negative), etc., I mean the probability that observation 1 had a positive outcome, the probability that observation 2 had a negative outcome, and so on.

Substituting (1) into (2), we obtain

    Pr(1 positive and 2 negative | one positive outcome) 
    
                   e^(x1*b)
          =  --------------------                                    (3)
             e^(x1*b) + e^(x2*b)

So that is the model we seek to fit. (At least, that is the term for group 1 and there are similar terms for all the other groups. I have ignored the possibility of multiple positive outcomes within group because that just complicates things and is irrelevant to my point.)
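
The step from (2) to (3) can be checked numerically. What follows is a minimal sketch in Stata; the two linear predictors are hypothetical values chosen only for illustration, and any other values should give the same agreement.

```stata
* Hypothetical values of x1*b and x2*b for the two observations in group 1
scalar xb1 = 0.7
scalar xb2 = -0.4

* Unconditional probabilities of a positive outcome, equation (1)
scalar p1 = invlogit(xb1)
scalar p2 = invlogit(xb2)

* Equation (2): obs. 1 positive and obs. 2 negative, given one positive outcome
scalar eq2 = p1*(1 - p2) / (p1*(1 - p2) + (1 - p1)*p2)

* Equation (3): the same probability after substituting (1) and simplifying
scalar eq3 = exp(xb1) / (exp(xb1) + exp(xb2))

display "equation (2): " eq2
display "equation (3): " eq3
display "relative difference: " reldif(eq2, eq3)
```

The two displayed probabilities should agree up to rounding error, because the factors 1/(1+e^(x1*b)) and 1/(1+e^(x2*b)) appear in every term of (2) and cancel.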

2.1 Notation

In this FAQ, we will use the following mathematical notation. If you wish, you can skip to the next section and return here if our notation confuses you.

  • Pr(1 positive), Pr(2 negative), etc.
    Probability obs. 1 had a positive outcome,
    Probability obs. 2 had a negative outcome, etc.
  • Pr(1 positive and 2 negative | one positive outcome)
    Probability obs. 1 positive and obs. 2 negative given one positive outcome in the group.
  • e
    2.7182818...; we will write e^anything to mean exp(anything).
  • x
    Vector of values of explanatory variables for an observation.
  • x1, x2, etc.
    Vector of values of explanatory variables for obs. 1, obs. 2, etc.
  • b
    Vector of coefficients.
    x*b is thus the summed product of the explanatory variables with their respective coefficients.
  • var1, var2, etc.
    variables in the x vector.
  • var1_1, var1_2, var2_1, var2_2, etc.
    var1_1: value of var1 in obs. 1.
    var1_2: value of var1 in obs. 2.
    var2_1: value of var2 in obs. 1.
    var2_2: value of var2 in obs. 2.
  • a, b, c
    Scalars; elements of b.
    x*b = a + b*var1 + c*var2 + ...
    x1*b = a + b*var1_1 + c*var2_1 + ...
  • A, B, d
    More scalars.
  • G(x*b)
    Cumulative “logistic” distribution.

2.2 Intercept

Equation (3) has an unfortunate property. Let’s pretend that x, the vector of explanatory variables, includes var1 and var2. Thus our model of the probabilities is, from (1),

    Pr(positive outcome) = G(a + b*var1 + c*var2)
    
                              e^(a + b*var1 + c*var2)
                         = ---------------------------
                           1 + e^(a + b*var1 + c*var2)

Equation (3), the probability for the first group, is similarly

                        e^(a + b*var1_1 + c*var2_1)
         ----------------------------------------------------------
          e^(a + b*var1_1 + c*var2_1) + e^(a + b*var1_2 + c*var2_2)

        
                        e^a e^(b*var1_1) e^(c*var2_1)
    =    --------------------------------------------------------------
          e^a e^(b*var1_1) e^(c*var2_1) + e^a e^(b*var1_2) e^(c*var2_2)

                        e^(b*var1_1) e^(c*var2_1)
    =    ------------------------------------------------------
          e^(b*var1_1) e^(c*var2_1) + e^(b*var1_2) e^(c*var2_2)

where var1_1 and var1_2 are the values of var1 in observations 1 and 2, respectively.

e^a cancelled in the numerator and denominator. Whatever is the true value of the intercept, it plays no role in determining the conditional probabilities of positive outcomes within groups. a could be 0, −10, or 57.12, and it would make no difference.
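
As a rough numerical illustration of the cancellation, the sketch below evaluates the group 1 term with the intercept left in, for the three values of a mentioned above. The coefficients and covariate values are hypothetical, and the scalar names bval and cval stand in for b and c.

```stata
* Hypothetical coefficients and covariate values for the two observations in group 1
scalar bval = 0.5
scalar cval = -0.3
scalar var1_1 = 1.2
scalar var2_1 = 0.4
scalar var1_2 = -0.7
scalar var2_2 = 2.0

* Group 1 term with the intercept already cancelled
scalar num0 = exp(bval*var1_1 + cval*var2_1)
scalar den0 = num0 + exp(bval*var1_2 + cval*var2_2)
display "intercept cancelled: " num0/den0

* The same term with the intercept left in, for several values of a
foreach a of numlist 0 -10 57.12 {
    scalar num = exp(`a' + bval*var1_1 + cval*var2_1)
    scalar den = num + exp(`a' + bval*var1_2 + cval*var2_2)
    display "a = `a':  " num/den
}
```

Every line of output should show the same conditional probability, up to rounding error.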

Since a plays no role, we will not be able to estimate it. In our model for the unconditional probabilities, we have

Pr(positive outcome) = G(a + b*var1 + c*var2)
                         ^   ^        ^
                         |   |        |
                         |   can be estimated by conditional logistic
                         |
                 cannot be estimated
                 by conditional logistic

That’s too bad but most researchers do not care much about the intercept anyway.

2.3 Within-group constants

The problem, however, can be worse than that. Say that var2 is constant within group. Remember that our term for the first group is

                   e^(b*var1_1) e^(c*var2_1)
    ------------------------------------------------------
     e^(b*var1_1) e^(c*var2_1) + e^(b*var1_2) e^(c*var2_2)

If var2_1==var2_2 (var2 is equal for the first two observations), then e^(c*var2_1) cancels from the numerator and denominator and we are left with

            e^(b*var1_1) 
    -----------------------------
     e^(b*var1_1) + e^(b*var1_2) 

If this same cancellation occurs in groups 2, 3, ... (that is, if var2 takes a constant value within each group), then whatever the true value of c, it plays no role in our model. c could be anything and it would not change any part of our calculation. For this problem to arise, var2 does not have to be a single constant value across the whole dataset; it merely has to be constant within group.

So now, in our unconditional model, we have

    Pr(positive outcome) = G(a + b*var1 + c*var2)
                             ^            ^
                             |            |
                             |            cannot be estimated by conditional
                             |            logistic because var2 is constant
                             |            within group
                             |
                     cannot be estimated
                     by conditional logistic
                     because constant

None of this is very surprising. The conditional logistic model attempts to explain which observations within each group had positive outcomes, and things that do not vary within group play no role in the explanation. Moreover, there can be a real advantage in this. I may think that var2 belongs in the Pr(positive outcome) model but not know how it should be specified. Does var2 have a linear effect, c*var2; a quadratic effect, c*var2+d*var2^2; an effect in logs, c*ln(var2); or something else? In the conditional logistic model, if var2 is constant within group, it drops out no matter how the effect ought to be parameterized. This is a great advantage if my interest is in the effect of var1 and not var2.

All of this is a long explanation for why, when you fit a conditional logistic model, Stata sometimes says

    . clogit outcome var1 var2 var3 ..., group(id)
    note:  var2 dropped due to no within-group variance
    
    Iteration 0: ...
    ...
    (model without var2 reported)
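
To see this on data, here is a small simulated sketch. Everything in it is an assumption made for illustration: the seed, the 30 groups of 3 observations, and an outcome assigned at random with one positive per group. It also shows a pre-check you can run before clogit, namely the within-group standard deviation of each covariate. The exact wording of the note clogit prints may differ across Stata versions.

```stata
clear
set seed 2718
* 30 hypothetical groups, one row each to start
set obs 30
generate id = _n
* var2 is drawn once per group, so it is constant within group
generate double var2 = rnormal()
* three observations per group
expand 3
* var1 varies within group
generate double var1 = rnormal()
* exactly one positive outcome per group, assigned at random
generate double u = runiform()
bysort id (u): generate byte outcome = (_n == 1)

* Pre-check: within-group standard deviation of each covariate
bysort id: egen double sd_var1 = sd(var1)
bysort id: egen double sd_var2 = sd(var2)
summarize sd_var1 sd_var2

* var2 has no within-group variation, so clogit cannot estimate its coefficient
clogit outcome var1 var2, group(id)
```

A covariate whose within-group standard deviation is zero (up to rounding) in every group is a candidate for this note.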

2.4 Collinearity

I want to go back to our model

    Pr(positive outcome) = G(a + b*var1 + c*var2)

for which, in the first group,

    Pr(1 positive and 2 negative | one positive outcome) 
    
                           e^(b*var1_1) e^(c*var2_1)
         =  ------------------------------------------------------
             e^(b*var1_1) e^(c*var2_1) + e^(b*var1_2) e^(c*var2_2)

This time, let’s assume that var1 and var2 are collinear, meaning we can write

    var2 = A + B*var1

It will not surprise you to learn that we will not be able to estimate b and c separately. Substituting var2 = A + B*var1 into our formula for the conditional probability for group 1, we obtain

                     e^(b*var1_1) e^(c*(A+B*var1_1))
        ----------------------------------------------------------------
        e^(b*var1_1) e^(c*(A+B*var1_1)) + e^(b*var1_2) e^(c*(A+B*var1_2))

              e^(b*var1_1) e^(c*A) e^(c*B*var1_1)
    =   ------------------------------------------------------------------------
        e^(b*var1_1) e^(c*A) e^(c*B*var1_1) + e^(b*var1_2) e^(c*A) e^(c*B*var1_2)
   
                     e^(b*var1_1) e^(c*B*var1_1)
    =   -----------------------------------------------------
        e^(b*var1_1) e^(c*B*var1_1) + e^(b*var1_2) e^(c*B*var1_2)
   
                     e^((b+c*B)*var1_1)
    =   ---------------------------------------
        e^((b+c*B)*var1_1) + e^((b+c*B)*var1_2)

Let us write d = b+c*B. The term can then be written

           e^(d*var1_1)
    --------------------------
    e^(d*var1_1) + e^(d*var1_2)

This is just what the term would look like if we estimated on var1 alone. Thus to fit this model we could

  1. Estimate on var1 alone to obtain d.
  2. Solve d = b+c*B to obtain b and c.

The problem occurs in step 2. We have one equation and two unknowns (b and c).
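
The lack of identification can be illustrated numerically. In the sketch below, A, B, d, and the var1 values are all hypothetical. Several different (b, c) pairs that satisfy d = b+c*B are plugged into the group 1 term, and each should reproduce the probability obtained from d alone.

```stata
* Hypothetical constants in var2 = A + B*var1 and hypothetical var1 values
scalar A = 0.6
scalar B = -1
scalar var1_1 = 1.2
scalar var1_2 = -0.7
scalar var2_1 = A + B*var1_1
scalar var2_2 = A + B*var1_2

* Hypothetical value of the one identified quantity, d = b + c*B
scalar d = -0.2
scalar pd = exp(d*var1_1) / (exp(d*var1_1) + exp(d*var1_2))
display "using d alone: " pd

* Different (b, c) pairs that all satisfy d = b + c*B
foreach c of numlist 0 0.5 -3 {
    scalar bval = d - (`c')*B
    scalar num = exp(bval*var1_1 + (`c')*var2_1)
    scalar den = num + exp(bval*var1_2 + (`c')*var2_2)
    display "b = " bval ", c = `c':  " num/den
}
```

Because every pair yields the same conditional probability, the likelihood cannot distinguish among them.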

All of this is a long explanation for why, when you fit a conditional logistic model, Stata sometimes says

    . clogit outcome var1 var2 var3 ..., group(id)
    note:  var2 dropped due to collinearity
    
    Iteration 0: ...
    ...
    (model without var2 reported)

2.5 Within-group collinearity

The conditional logistic model is subject to another form of collinearity. As before, let us assume that

    Pr(positive outcome) = G(a + b*var1 + c*var2)

but this time var1 and var2 are *NOT* collinear,

    var2   *IS NOT EQUAL TO*   A + B*var1

Instead, however, let us assume that, for each group

    var2 = A_g + B*var1

That is, var1 and var2 are linearly related in the first group, linearly related in the second group, and so on. The coefficient B multiplying var1 is the same across groups, but the intercept A_g is allowed to differ from group to group.

If you go back through the algebra for the simple collinearity case, you will note that it is all applicable because only the within-group collinearity of var1 and var2 was used.

The final equation still holds. The conditional probability for the first group can be written

                  e^((b+c*B)*var1_1)
     ---------------------------------------
     e^((b+c*B)*var1_1) + e^((b+c*B)*var1_2)

and again, this is just what the term would look like if we estimated on var1 alone.

Here, however, Stata is not nearly so elegant. Stata does not notice this strange form of collinearity and so tries to estimate the model on var1 and var2. The result looks something like this:

 . clogit out var1 var2, group(id)

 Iteration 0:  log likelihood = -32.91737
 Iteration 1:  log likelihood =-32.850723
 Iteration 2:  log likelihood = -32.85072
 
 Conditional (fixed-effects) logistic regression         Number of obs =     90
                                                         chi2(2)       =   0.22
                                                         Prob > chi2   = 0.8979
 Log likelihood =  -32.85072                             Pseudo R2     = 0.0033
 
 ------------------------------------------------------------------------------
      out |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
 ---------+--------------------------------------------------------------------
     var1 |  -.1890361   .4671095     -0.405   0.686      -1.104554    .7264817
     var2 |   .0269112          .          .       .              .           .
 ------------------------------------------------------------------------------

clogit attempted to estimate separate coefficients b and c for var1 and var2 but, because there is really only one, the covariance matrix of the estimators turned singular.

There is actually no error in the output above; it is merely one solution in terms of b and c to d = b+c*B. It is the solution under the assumption c=.0269112.

I can verify my suspicion about the within-group collinearity using fixed-effects regression (xtreg with the fe option):

 . xtreg var2 var1, fe i(id)
    
 Fixed-effects (within) regression               Number of obs      =       900
 Group variable (i) : id                         Number of groups   =        30
 
 R-sq:  within  = 1.0000                         Obs per group: min =        30
        between = 0.1937                                        avg =      30.0
        overall = 0.0012                                        max =        30
 
                                                 F(1,869)           =         .
 corr(u_i, Xb)  = -0.0035                        Prob > F           =         .
 
 ------------------------------------------------------------------------------
     var2 |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
 ---------+--------------------------------------------------------------------
     var1 |         -1          .          .       .              .           .
    _cons |   .6437411          .          .       .              .           .
 ------------------------------------------------------------------------------
  sigma_u |   .4605791
  sigma_e |          0
      rho |          1   (fraction of variance due to u_i)
 ------------------------------------------------------------------------------
 F test that all u_i=0:     F(29,869) =        .              Prob > F =      .

I obtained an R-sq within of 1 and the model is var2 = A_g - 1*var1.
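
The contrast between ordinary and within-group collinearity can be reproduced on simulated data. This is only a sketch under assumed values (the seed, 30 groups of 3 observations, and B = -1); it is not the dataset used above. It uses xtset id followed by xtreg with the fe option, which plays the same role as the i(id) option shown in the output above.

```stata
clear
set seed 1999
* 30 hypothetical groups with a group-specific intercept A_g
set obs 30
generate id = _n
generate double A_g = rnormal()
* three observations per group
expand 3
generate double var1 = rnormal()
* within every group, var2 = A_g + B*var1 with B = -1
generate double var2 = A_g - var1

* Pooled regression: the fit is not perfect because A_g differs across groups
regress var2 var1

* Fixed-effects regression: the group intercepts are absorbed, so R-sq within is 1
xtset id
xtreg var2 var1, fe
display "R-sq within = " e(r2_w)
```

The pooled regression is why the ordinary collinearity check does not catch the problem, while the fixed-effects regression exposes it.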

Just to prove that all the formulas work, I refit the conditional logistic model and dropped var2, obtaining d = −.2159473:

 . clogit out var1, group(id)

 Iteration 0:  log likelihood =-32.939481
 Iteration 1:  log likelihood =-32.850725
 Iteration 2:  log likelihood = -32.85072
 
 Conditional (fixed-effects) logistic regression         Number of obs =     90
                                                         chi2(1)       =   0.22
                                                         Prob > chi2   = 0.6426
 Log likelihood =  -32.85072                             Pseudo R2     = 0.0033
 
 ------------------------------------------------------------------------------
      out |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
 ---------+--------------------------------------------------------------------
     var1 |  -.2159473   .4671095     -0.462   0.644      -1.131465    .6995704
 ------------------------------------------------------------------------------

Therefore,

                 b + c*B    = d
    −.1890361 + .0269112*B  =  −.2159473 
                .0269112*B  =  −.2159473 − (−.1890361)
                .0269112*B  =  −.0269112
                         B  =  −1

All the results are consistent.
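
The same arithmetic can be done in Stata. The sketch below simply copies the three estimates from the outputs above; with your own models, you could presumably substitute _b[var1] and _b[var2] saved after each fit, which is an assumption about your workflow rather than something shown here.

```stata
* Estimates copied from the two clogit outputs above
scalar b_hat = -.1890361
scalar c_hat = .0269112
scalar d_hat = -.2159473

* Solve d = b + c*B for B
scalar B_implied = (d_hat - b_hat) / c_hat
display "implied B = " B_implied
```

The implied B should match the coefficient on var1 in the fixed-effects regression, here approximately −1.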

3. Recommendation

If you suspect this kind of collinearity,

  1. Take the variable that was reported with a missing standard error (let’s call it var2) and regress it on all the other independent variables using fixed-effects regression, that is, xtreg with the fe option:
        . xtreg var2 ..., i(group) fe
    
  2. If you obtain an R-sq within of 1, then you do have within-group collinearity. You will have to admit that you cannot estimate the var2 effect. Refit your clogit model, omitting the variable.
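
When it is not obvious which covariate is involved, the check can be looped over all of them. The following is a sketch only: the simulated data, the variable names var1, var2, var3, and id, and the 30 groups of 3 observations are all assumptions standing in for your own data. Because it uses local macros, it is meant to be run from a do-file.

```stata
clear
set seed 42
* Simulated stand-in for your data: 30 groups of 3, with var2 = A_g - var1
set obs 30
generate id = _n
generate double A_g = rnormal()
expand 3
generate double var1 = rnormal()
generate double var3 = rnormal()
generate double var2 = A_g - var1
xtset id

* Regress each covariate on all the others with fixed effects
local covars var1 var2 var3
foreach v of local covars {
    local others : list covars - v
    quietly xtreg `v' `others', fe
    display "`v' on `others':  R-sq within = " %6.4f e(r2_w)
}
```

In this simulated example both var1 and var2 should show an R-sq within of 1, because the relationship is symmetric. Which of the two to omit from clogit is a modeling decision; the data cannot settle it.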

© Copyright 1996–2008 StataCorp LP