Why does clogit sometimes report a coefficient but missing values for the standard error, confidence interval, etc.?

Why is there no intercept in the clogit model?

Why can’t I use covariates that are constant within panel?

Title   Within group collinearity in conditional logistic regression
Author William Gould, StataCorp

The short answer is that the variable reported with the missing standard error is “within-group collinear” with other covariates in the model. You need to drop the within-group collinear variable and reestimate. You can verify within-group collinearity is the problem by using fixed-effects regressions on the covariates.

All of this is explained below and, along the way, we also explain why clogit sometimes produces the messages “var omitted because of no within-group variance” and “var omitted because of collinearity”.

The contents of this FAQ are

1. The conditional logistic model
2. Model derivation
2.1 Notation
2.2 Intercept
2.3 Within-group constants
2.4 Collinearity
2.5 Within-group collinearity
3. Recommendation

1. The conditional logistic model

Conditional logistic regression is similar to ordinary logistic regression except the data occur in groups,

    group 1:
            obs. 1  outcome=1   x1 = ...  x2 = ... 
            obs. 2  outcome=0   x1 = ...  x2 = ...
    group 2:
            obs. 3  outcome=1   x1 = ...  x2 = ...
            obs. 4  outcome=0   x1 = ...  x2 = ...
            obs. 5  outcome=0   x1 = ...  x2 = ...
    group 3:
            ...
    .
    .
    group G:
            ...

and we wish to condition on the number of positive outcomes within group. That is, we seek to fit a logistic model that explains why obs. 1 had a positive outcome in group 1 conditional on one of the observations in the group having a positive outcome.

In biostatistical applications, this need arises because researchers collect data on the sick and infected (the so-called positive outcomes), and then match those cases with controls who are not sick and infected. Thus the number of positive outcomes is not a random variable. Within each group, there had to be the observed number of positive outcomes because that is how the data were constructed.

Economists refer to this same model as the McFadden choice model. An individual is faced with an array of choices and must choose one.

Regardless of the justification, we are seeking to fit a model that explains why obs. 1 had a positive outcome in group 1, obs. 3 in group 2, and so on.

2. Model derivation

We assume the unconditional probability of a positive outcome is given by the standard logit equation

    Pr(positive outcome) = G(x*b) 
                         = e^(x*b)/(1+e^(x*b))                       (1)

Equation (1) is not the appropriate probability for our data because it does not account for the conditioning. In the first group, for instance, we want

    Pr(obs. 1 positive and obs. 2 negative | one positive outcome) 

and that is easy enough to write down in terms of the unconditional probabilities. It is

                  Pr(1 positive)*Pr(2 negative)
    -------------------------------------------------------------    (2)
    Pr(1 positive)*Pr(2 negative) + Pr(1 negative)*Pr(2 positive)

From now on, when I write Pr(1 positive) and Pr(2 negative), etc., I mean the probability that observation 1 had a positive outcome, the probability that observation 2 had a negative outcome, and so on.

Substituting (1) into (2), we obtain

    Pr(1 positive and 2 negative | one positive outcome) 
    
                   e^(x1*b)
          =  --------------------                                    (3)
             e^(x1*b) + e^(x2*b)

So that is the model we seek to fit. (At least, that is the term for group 1, and there are similar terms for all the other groups. I have ignored the possibility of multiple positive outcomes within group because that just complicates things and is irrelevant to my point.)
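
The step from (2) to (3) can be checked numerically. The following is a minimal sketch in Python (not Stata); the values of x1*b and x2*b are hypothetical and chosen only for illustration.

```python
import math

def logistic(z):
    """Unconditional probability of a positive outcome, equation (1)."""
    return math.exp(z) / (1.0 + math.exp(z))

def conditional_from_eq2(z1, z2):
    """Equation (2): Pr(1 positive and 2 negative | one positive outcome)."""
    p1, p2 = logistic(z1), logistic(z2)
    first = p1 * (1.0 - p2)    # obs. 1 positive, obs. 2 negative
    second = (1.0 - p1) * p2   # obs. 1 negative, obs. 2 positive
    return first / (first + second)

def conditional_from_eq3(z1, z2):
    """Equation (3): e^(x1*b) / (e^(x1*b) + e^(x2*b))."""
    return math.exp(z1) / (math.exp(z1) + math.exp(z2))

# Hypothetical values of x1*b and x2*b.
z1, z2 = 0.7, -0.4
eq2_value = conditional_from_eq2(z1, z2)
eq3_value = conditional_from_eq3(z1, z2)
print(eq2_value, eq3_value)
```

The two printed numbers agree up to rounding error because the factors 1/(1+e^(x1*b)) and 1/(1+e^(x2*b)) appear in every term of (2) and cancel.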

2.1 Notation

In this FAQ, we will use the following mathematical notation. If you wish, you can skip to the next section and return here if our notation confuses you.

  • Pr(1 positive), Pr(2 negative), etc.
    Probability obs. 1 had a positive outcome,
    Probability obs. 2 had a negative outcome, etc.
  • Pr(1 positive and 2 negative | one positive outcome)
    Probability obs. 1 positive and obs. 2 negative given one positive outcome in the group.
  • e
    2.7182818...; we will write e^anything to mean exp(anything).
  • x
    Vector of values of explanatory variables for an observation.
  • x1, x2, etc.
    Vector of values of explanatory variables for obs. 1, obs. 2, etc.
  • b
    Vector of coefficients.
    x*b is thus the summed product of the explanatory variables with their respective coefficients.
  • var1, var2, etc.
    variables in the x vector.
  • var1_1, var1_2, var2_1, var2_2, etc.
    var1_1: value of var1 in obs. 1.
    var1_2: value of var1 in obs. 2.
    var2_1: value of var2 in obs. 1.
    var2_2: value of var2 in obs. 2.
  • a, b, c
    Scalars; elements of b.
    x*b = a + b*var1 + c*var2 + ...
    x1*b = a + b*var1_1 + c*var2_1 + ...
  • A, B, d
    More scalars.
  • G(x*b)
    Cumulative “logistic” distribution.

2.2 Intercept

Equation (3) has an unfortunate property. Let’s pretend x, the vector of explanatory variables, includes var1 and var2. Thus our model of the probabilities is, from (1),

    Pr(positive outcome) = G(a + b*var1 + c*var2)
    
                              e^(a + b*var1 + c*var2)
                         = ---------------------------
                           1 + e^(a + b*var1 + c*var2)

Equation (3), the probability for the first group, is similarly

                        e^(a + b*var1_1 + c*var2_1)
         ----------------------------------------------------------
          e^(a + b*var1_1 + c*var2_1) + e^(a + b*var1_2 + c*var2_2)

        
                        e^a e^(b*var1_1) e^(c*var2_1)
    =    --------------------------------------------------------------
          e^a e^(b*var1_1) e^(c*var2_1) + e^a e^(b*var1_2) e^(c*var2_2)

                        e^(b*var1_1) e^(c*var2_1)
    =    ------------------------------------------------------
          e^(b*var1_1) e^(c*var2_1) + e^(b*var1_2) e^(c*var2_2)

where var1_1 and var1_2 are the values of var1 in observations 1 and 2, respectively.

The factor e^a cancels from the numerator and denominator. Whatever the true value of the intercept, it plays no role in determining the conditional probabilities of positive outcomes within groups. a could be 0, −10, or 57.12, and it would make no difference.

Since a plays no role, we will not be able to estimate it. In our model for the unconditional probabilities, we have

    Pr(positive outcome) = G(a + b*var1 + c*var2)
                             ^   ^        ^
                             |   |        |
                             |   can be estimated by conditional logistic
                             |
                     cannot be estimated
                     by conditional logistic

That’s too bad but most researchers do not care much about the intercept anyway.
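
To see the cancellation with numbers, here is a small Python sketch (not Stata). The covariate values and the coefficients b and c are hypothetical; only the intercept a is varied.

```python
import math

def group_term(a, b, c, obs1, obs2):
    """Conditional probability that obs. 1 is the positive outcome
    in a two-observation group, with the intercept a left in."""
    (var1_1, var2_1), (var1_2, var2_2) = obs1, obs2
    e1 = math.exp(a + b * var1_1 + c * var2_1)
    e2 = math.exp(a + b * var1_2 + c * var2_2)
    return e1 / (e1 + e2)

# Hypothetical (var1, var2) values for obs. 1 and obs. 2.
obs1 = (1.2, 0.5)
obs2 = (-0.3, 2.0)
b, c = 0.8, -0.5

# Same b and c, three very different intercepts.
terms = [group_term(a, b, c, obs1, obs2) for a in (0.0, -10.0, 57.12)]
print(terms)
```

All three entries of `terms` are the same up to rounding error, so the data carry no information about a.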

2.3 Within-group constants

The problem, however, can be worse than that. Say var2 is constant within group. Remember, our term for the first group is

                   e^(b*var1_1) e^(c*var2_1)
    ------------------------------------------------------
     e^(b*var1_1) e^(c*var2_1) + e^(b*var1_2) e^(c*var2_2)

If var2_1==var2_2 (var2 is equal for the first two observations), then e^(c*var2) cancels, and we are left with

            e^(b*var1_1) 
    -----------------------------
     e^(b*var1_1) + e^(b*var1_2) 

If this same cancellation occurs in groups 2, 3, ... (that is, if var2 is constant within every group), then whatever the true value of c, it plays no role in our model. c could be anything, and it would not change any part of our calculation. For this problem to arise, var2 does not have to be a single constant value across the data; it merely has to be constant within group.

So now, in our unconditional model, we have

    Pr(positive outcome) = G(a + b*var1 + c*var2)
                             ^            ^
                             |            |
                             |            cannot be estimated by conditional
                             |            logistic because var2 is constant
                             |            within group
                             |
                     cannot be estimated
                     by conditional logistic
                     because constant

None of this is very surprising. The conditional logistic model attempts to explain which observations within each group had positive outcomes, and things that do not vary within group play no role in the explanation. Moreover, there can be a real advantage in this. I may think that var2 belongs in the Pr(positive outcome) model but not know how it should be specified. Does var2 have a linear effect, c*var2, or should it be quadratic, c*var2+d*var2^2, or should it enter in logs, c*ln(var2), or something else? In the conditional logistic model, if var2 is constant within group, it drops out no matter how the effect ought to be parameterized. This is a great advantage if my interest is in the effect of var1 and not var2.
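
The following Python sketch (not Stata) illustrates the point with hypothetical data: var2 takes a different value in each group but is constant within group, and the group terms do not change as c changes.

```python
import math

def group_term(b, c, group):
    """Conditional probability that the first observation in the group
    is the one with the positive outcome."""
    weights = [math.exp(b * var1 + c * var2) for var1, var2 in group]
    return weights[0] / sum(weights)

# Hypothetical (var1, var2) pairs; var2 is 3.0 throughout group 1
# and -1.5 throughout group 2.
groups = [
    [(1.2, 3.0), (-0.3, 3.0)],
    [(0.4, -1.5), (0.9, -1.5), (2.1, -1.5)],
]
b = 0.8

# For each group, evaluate the term at three different values of c.
results = [[group_term(b, c, group) for c in (-2.0, 0.0, 5.0)] for group in groups]
for terms in results:
    print(terms)
```

Within each printed row the three numbers coincide up to rounding error, which is why no value of c fits the data better than any other.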

All of this is a long explanation for why, when you fit a conditional logistic model, Stata sometimes says

    . clogit outcome var1 var2 var3 ..., group(id)
    note:  var2 omitted because of no within-group variance
    
    Iteration 0: ...
    ...
    (model without var2 reported)

2.4 Collinearity

I want to go back to our model

    Pr(positive outcome) = G(a + b*var1 + c*var2)

for which, in the first group,

    Pr(1 positive and 2 negative | one positive outcome) 
    
                           e^(b*var1_1) e^(c*var2_1)
         =  ------------------------------------------------------
             e^(b*var1_1) e^(c*var2_1) + e^(b*var1_2) e^(c*var2_2)

This time, let’s assume that var1 and var2 are collinear, meaning we can write

    var2 = A + B*var1

It will not surprise you to learn that we will not be able to estimate b and c separately. Substituting var2 = A + B*var1 into our formula for the conditional probability for group 1, we obtain

                     e^(b*var1_1) e^(c*(A+B*var1_1))
        ----------------------------------------------------------------
        e^(b*var1_1) e^(c*(A+B*var1_1)) + e^(b*var1_2) e^(c*(A+B*var1_2))

              e^(b*var1_1) e^(c*A) e^(c*B*var1_1)
    =   ------------------------------------------------------------------------
        e^(b*var1_1) e^(c*A) e^(c*B*var1_1) + e^(b*var1_2) e^(c*A) e^(c*B*var1_2)
   
                     e^(b*var1_1) e^(c*B*var1_1)
    =   -----------------------------------------------------
        e^(b*var1_1) e^(c*B*var1_1) + e^(b*var1_2) e^(c*B*var1_2)
   
                     e^((b+c*B)*var1_1)
    =   ---------------------------------------
        e^((b+c*B)*var1_1) + e^((b+c*B)*var1_2)

Let us write d = b+c*B. The term can then be written

           e^(d*var1_1)
    --------------------------
    e^(d*var1_1) + e^(d*var1_2)

This is just what the term would look like if we estimated on var1 alone. Thus to fit this model we could

  1. Estimate on var1 alone to obtain d.
  2. Solve d = b+c*B to obtain b and c.

The problem occurs in step 2. We have one equation and two unknowns (b and c).
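
A numerical illustration, written as a Python sketch (not Stata) with hypothetical values of A, B, and var1: every pair (b, c) satisfying b + c*B = d produces the same group term, so the data cannot tell the pairs apart.

```python
import math

def group_term(b, c, var1_1, var1_2, A, B):
    """Group-1 term when var2 = A + B*var1 holds for both observations."""
    var2_1 = A + B * var1_1
    var2_2 = A + B * var1_2
    e1 = math.exp(b * var1_1 + c * var2_1)
    e2 = math.exp(b * var1_2 + c * var2_2)
    return e1 / (e1 + e2)

def var1_only_term(d, var1_1, var1_2):
    """Group-1 term for a model containing var1 alone, with coefficient d."""
    e1 = math.exp(d * var1_1)
    e2 = math.exp(d * var1_2)
    return e1 / (e1 + e2)

# Hypothetical values.
A, B = 1.0, 2.0
var1_1, var1_2 = 1.2, -0.3
d = 1.0

# Three (b, c) pairs that all satisfy b + c*B = d.
pairs = [(1.0, 0.0), (0.0, 0.5), (3.0, -1.0)]
terms = [group_term(b, c, var1_1, var1_2, A, B) for b, c in pairs]
reference = var1_only_term(d, var1_1, var1_2)
print(terms, reference)
```

Each entry of `terms` equals `reference` up to rounding error.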

All of this is a long explanation for why, when you fit a conditional logistic model, Stata sometimes says

    . clogit outcome var1 var2 var3 ..., group(id)
    note:  var2 omitted because of collinearity
    
    Iteration 0: ...
    ...
    (model without var2 reported)

2.5 Within-group collinearity

The conditional logistic model is subject to another form of collinearity. As before, let us assume

    Pr(positive outcome) = G(a + b*var1 + c*var2)

but this time var1 and var2 are *NOT* collinear,

    var2   *IS NOT EQUAL TO*   A + B*var1

Instead, however, let us assume that, for each group

    var2 = A_g + B*var1

That is, var1 and var2 are linearly related in the first group, linearly related in the second group, and so on. The coefficient B multiplying var1 is the same across groups, but the intercept A_g is allowed to differ from group to group.

If you go back through the algebra for the simple collinearity case, you will note that it is all applicable because only the within-group collinearity of var1 and var2 was used.

The final equation still holds. The conditional probability for the first group can be written

                  e^((b+c*B)*var1_1)
     ---------------------------------------
     e^((b+c*B)*var1_1) + e^((b+c*B)*var1_2)

and again, this is just what the term would look like if we estimated on var1 alone.
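
The same point can be made for the whole sample rather than one group. The Python sketch below (not Stata) uses hypothetical data in which var2 = A_g + B*var1 with a different A_g in each group, and it assumes for simplicity that the first observation in each group is the positive outcome. The conditional log likelihood is identical for every (b, c) with the same b + c*B.

```python
import math

def log_likelihood(b, c, groups):
    """Sum over groups of the log conditional probability that the first
    observation in each group is the positive outcome."""
    total = 0.0
    for group in groups:
        weights = [math.exp(b * var1 + c * var2) for var1, var2 in group]
        total += math.log(weights[0] / sum(weights))
    return total

# Hypothetical data: common slope B, group-specific intercepts A_g.
B = 2.0
intercepts = [1.0, -3.0, 0.5]
var1_by_group = [[1.2, -0.3], [0.4, 0.9, 2.1], [-1.0, 0.2]]
groups = [
    [(var1, A_g + B * var1) for var1 in var1_values]
    for A_g, var1_values in zip(intercepts, var1_by_group)
]

# Three (b, c) pairs that all satisfy b + c*B = 1.
pairs = [(1.0, 0.0), (0.0, 0.5), (3.0, -1.0)]
values = [log_likelihood(b, c, groups) for b, c in pairs]
print(values)
```

The three log-likelihood values agree up to rounding error: the likelihood is flat along the line b + c*B = d, so b and c cannot be determined separately even though var1 and var2 are not collinear in the pooled data.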

All of this is an explanation for why, when you fit a conditional logistic model, Stata sometimes says

     . clogit outcome var1 var2, group(id)
     note: var2 omitted because of collinearity.

     Iteration 0: ...

3. Recommendation

If you suspect this kind of collinearity,

  1. Take the variable that was dropped—let’s call it var2—and estimate a fixed-effects regression on all the other independent variables using xtreg with the fe option:
        . xtset id
        . xtreg var2 ..., fe
    
  2. If you obtain an R-sq within of 1, then you do have within-group collinearity. You will have to admit that you cannot estimate the var2 effect. Refit your clogit model, omitting the variable.
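
To illustrate what this check detects, here is a Python sketch (not Stata) of the within transformation that a fixed-effects regression is based on: subtract group means, then regress. The data are hypothetical, with var2 = A_g + B*var1, and the sketch handles only a single regressor. It is meant to convey the idea behind an R-sq within of 1, not to reproduce xtreg.

```python
from collections import defaultdict

def demean_within(values, ids):
    """Subtract each group's mean from its members (the within transformation)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for value, group_id in zip(values, ids):
        totals[group_id] += value
        counts[group_id] += 1
    return [v - totals[g] / counts[g] for v, g in zip(values, ids)]

def within_r_squared(y, x, ids):
    """R-squared and slope from regressing demeaned y on demeaned x."""
    y_dm, x_dm = demean_within(y, ids), demean_within(x, ids)
    slope = sum(xi * yi for xi, yi in zip(x_dm, y_dm)) / sum(xi * xi for xi in x_dm)
    ss_resid = sum((yi - slope * xi) ** 2 for xi, yi in zip(x_dm, y_dm))
    ss_total = sum(yi * yi for yi in y_dm)
    return 1.0 - ss_resid / ss_total, slope

# Hypothetical data: var2 = A_g + B*var1, with A_g differing by group.
ids = [1, 1, 2, 2, 2, 3, 3]
var1 = [1.2, -0.3, 0.4, 0.9, 2.1, -1.0, 0.2]
A_g = {1: 1.0, 2: -3.0, 3: 0.5}
B = 2.0
var2 = [A_g[g] + B * v for g, v in zip(ids, var1)]

r2_within, slope = within_r_squared(var2, var1, ids)
# Treating all observations as one group gives the ordinary pooled R-squared.
r2_pooled, _ = within_r_squared(var2, var1, [0] * len(ids))
print(r2_within, slope, r2_pooled)
```

In this example the within R-squared is 1 up to rounding error and the slope recovers B, while the pooled R-squared is well below 1. That is the signature of within-group collinearity: the variables are not collinear overall, but they are exactly collinear once group means are removed.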