Title:   Within-group collinearity in conditional logistic regression
Author:  William Gould, StataCorp
The short answer is that the variable reported with the missing standard error is “within-group collinear” with other covariates in the model. You need to drop the within-group collinear variable and reestimate. You can verify within-group collinearity is the problem by using fixed-effects regressions on the covariates.
All of this is explained below and, along the way, we also explain why clogit sometimes produces the messages “var omitted because of no within-group variance” and “var omitted because of collinearity”.
Conditional logistic regression is similar to ordinary logistic regression except the data occur in groups,
group 1:   obs. 1   outcome=1   x1 = ...   x2 = ...
           obs. 2   outcome=0   x1 = ...   x2 = ...

group 2:   obs. 3   outcome=1   x1 = ...   x2 = ...
           obs. 4   outcome=0   x1 = ...   x2 = ...
           obs. 5   outcome=0   x1 = ...   x2 = ...

group 3:   ...
  .
  .
group G:   ...
and we wish to condition on the number of positive outcomes within group. That is, we seek to fit a logistic model that explains why obs. 1 had a positive outcome in group 1 conditional on one of the observations in the group having a positive outcome.
In biostatistical applications, this need arises because researchers collect data on the sick and infected (the so-called positive outcomes) and then match those cases with controls who are neither sick nor infected. Thus the number of positive outcomes within each group is not a random variable: there had to be the observed number of positive outcomes because that is how the data were constructed.
Economists refer to this same model as the McFadden choice model. An individual is faced with an array of choices and must choose one.
Regardless of the justification, we are seeking to fit a model that explains why obs. 1 had a positive outcome in group 1, obs. 3 in group 2, and so on.
We assume the unconditional probability of a positive outcome is given by the standard logit equation
Pr(positive outcome) = G(x*b) = e^(x*b)/(1+e^(x*b)) (1)
Equation (1) is not the appropriate probability for our data because it does not account for the conditioning. In the first group, for instance, we want
Pr(obs. 1 positive and obs. 2 negative | one positive outcome)
and that is easy enough to write down in terms of the unconditional probabilities. It is
              Pr(1 positive)*Pr(2 negative)
-------------------------------------------------------------    (2)
Pr(1 positive)*Pr(2 negative) + Pr(1 negative)*Pr(2 positive)
From now on, when I write Pr(1 positive) and Pr(2 negative), etc., I mean the probability that observation 1 had a positive outcome, the probability that observation 2 had a negative outcome, and so on.
Substituting (1) into (2), we obtain
Pr(1 positive and 2 negative | one positive outcome)

            e^(x1*b)
  = ---------------------    (3)
    e^(x1*b) + e^(x2*b)
So that is the model we seek to fit. (At least, that is the term for group 1, and there are similar terms for all the other groups. I have ignored the possibility of multiple positive outcomes within group because that just complicates things and is irrelevant to my point.)
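To make this concrete, here is a minimal simulation sketch, not part of the original discussion: the variable names (group, x1, util, outcome), the seed, and the true coefficient b = 1 are all our own illustrative choices. It generates matched groups of two observations, designates the higher-utility observation in each group as the positive outcome (the random-utility formulation of the McFadden choice model), and fits the model with clogit:

    clear
    set seed 12345
    set obs 200                       // 100 groups of 2 observations each
    generate group = ceil(_n/2)
    generate x1 = rnormal()
    * utility = b*x1 + Gumbel noise, with b = 1; picking the
    * maximum-utility observation in each group yields the
    * conditional probabilities of equation (3)
    generate util = x1 - ln(-ln(runiform()))
    bysort group (util): generate outcome = (_n == _N)
    clogit outcome x1, group(group)

The estimated coefficient on x1 should be near 1, and no intercept is reported. The later snippets in this FAQ reuse this simulated dataset.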
Equation (3) has an unfortunate property. Let’s pretend x, the vector of explanatory variables, includes var1 and var2. Thus our model of the probabilities is, from (1),
Pr(positive outcome) = G(a + b*var1 + c*var2)

                          e^(a + b*var1 + c*var2)
                     = ---------------------------
                       1 + e^(a + b*var1 + c*var2)
Equation (3), the conditional probability for the first group, is then
            e^(a + b*var1_1 + c*var2_1)
---------------------------------------------------------
e^(a + b*var1_1 + c*var2_1) + e^(a + b*var1_2 + c*var2_2)

           e^a e^(b*var1_1) e^(c*var2_1)
  = --------------------------------------------------------------
    e^a e^(b*var1_1) e^(c*var2_1) + e^a e^(b*var1_2) e^(c*var2_2)

              e^(b*var1_1) e^(c*var2_1)
  = ------------------------------------------------------
    e^(b*var1_1) e^(c*var2_1) + e^(b*var1_2) e^(c*var2_2)
where var1_1 and var1_2 are the values of var1 in observations 1 and 2, respectively.
e^a cancels in the numerator and denominator. Whatever the true value of the intercept a, it plays no role in determining the conditional probabilities of positive outcomes within groups. a could be 0, −10, or 57.12, and it would make no difference.
Since a plays no role, we will not be able to estimate it. In our model for the unconditional probabilities, we have
Pr(positive outcome) = G(a + b*var1 + c*var2)
                         ^     ^        ^
                         |     |        |
                         |     +--------+-- can be estimated by
                         |                  conditional logistic
                         |
                         +-- cannot be estimated by conditional logistic
That’s too bad, but most researchers do not care much about the intercept anyway.
The problem, however, can be worse than that. Say var2 is constant within group. Remember, our term for the first group is
           e^(b*var1_1) e^(c*var2_1)
------------------------------------------------------
e^(b*var1_1) e^(c*var2_1) + e^(b*var1_2) e^(c*var2_2)
If var2_1==var2_2 (var2 is equal for the first two observations), then e^(c*var2) cancels, and we are left with
         e^(b*var1_1)
-----------------------------
e^(b*var1_1) + e^(b*var1_2)
If this same cancellation occurs in groups 2, 3, ..., that is, if var2 takes on a constant value within each group, then whatever the true value of c, it plays no role in our model. c could be anything, and it would not change any part of the calculation. For this problem to arise, var2 does not have to be a single constant value overall; it merely has to be constant within each group.
So now, in our unconditional model, we have
Pr(positive outcome) = G(a + b*var1 + c*var2)
                         ^              ^
                         |              |
                         |              +-- cannot be estimated by
                         |                  conditional logistic because
                         |                  var2 is constant within group
                         |
                         +-- cannot be estimated by conditional logistic
                             because constant
None of this is very surprising. The conditional logistic model attempts to explain which observations within each group had positive outcomes, and things that do not vary within group play no role in that explanation. Moreover, there can be a real advantage in this. I may think that var2 belongs in the Pr(positive outcome) model but not know how it should be specified. Does var2 have a linear effect c*var2, should it be quadratic c*var2 + d*var2^2, should it enter in logs c*ln(var2), or what? In the conditional logistic model, if var2 is constant within group, it drops out no matter how the effect ought to be parameterized. This is a great advantage if my interest is in the effect of var1 and not var2.
All of this is a long explanation for why, when you fit a conditional logistic model, Stata sometimes says
. clogit outcome var1 var2 var3 ..., group(id)
note: var2 omitted because of no within-group variance
Iteration 0:  ...
  ...
(model without var2 reported)
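Continuing the simulated dataset from the sketch above, we can provoke this note by adding a covariate that is constant within group; var2 and the use of the group mean of x1 are our illustrative choices:

    * the group mean of x1 is, by construction, constant within group
    egen var2 = mean(x1), by(group)
    clogit outcome x1 var2, group(group)
    * clogit drops var2 and reports the no-within-group-variance note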
I want to go back to our model
Pr(positive outcome) = G(a + b*var1 + c*var2)
for which, in the first group,
Pr(1 positive and 2 negative | one positive outcome)

              e^(b*var1_1) e^(c*var2_1)
  = ------------------------------------------------------
    e^(b*var1_1) e^(c*var2_1) + e^(b*var1_2) e^(c*var2_2)
This time, let’s assume that var1 and var2 are collinear, meaning we can write
var2 = A + B*var1
It will not surprise you to learn that we will not be able to estimate b and c. Substituting var2 = A + B*var1 into our formula for the conditional probability for group 1, we obtain
            e^(b*var1_1) e^(c*(A+B*var1_1))
-----------------------------------------------------------------
e^(b*var1_1) e^(c*(A+B*var1_1)) + e^(b*var1_2) e^(c*(A+B*var1_2))

              e^(b*var1_1) e^(c*A) e^(c*B*var1_1)
  = --------------------------------------------------------------------------
    e^(b*var1_1) e^(c*A) e^(c*B*var1_1) + e^(b*var1_2) e^(c*A) e^(c*B*var1_2)

                e^(b*var1_1) e^(c*B*var1_1)
  = ------------------------------------------------------------
    e^(b*var1_1) e^(c*B*var1_1) + e^(b*var1_2) e^(c*B*var1_2)

              e^((b+c*B)*var1_1)
  = ---------------------------------------
    e^((b+c*B)*var1_1) + e^((b+c*B)*var1_2)
Let us write d = b+c*B. The term can then be written
        e^(d*var1_1)
---------------------------
e^(d*var1_1) + e^(d*var1_2)
This is just what the term would look like if we estimated on var1 alone. Thus, to fit this model, we could

1. fit the conditional logistic model on var1 alone, obtaining an estimate of d, and then

2. back out estimates of b and c from d.

The problem occurs in step 2. We have one equation, d = b + c*B, and two unknowns, b and c. For instance, if B = 2 and the estimate of d is 1.5, then (b, c) = (1.5, 0), (0.5, 0.5), and (−0.5, 1) all fit the data equally well; nothing in the data can distinguish among them.
All of this is a long explanation for why, when you fit a conditional logistic model, Stata sometimes says
. clogit outcome var1 var2 var3 ..., group(id)
note: var2 omitted because of collinearity
Iteration 0:  ...
  ...
(model without var2 reported)
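Continuing the same simulated data, ordinary overall collinearity triggers this note as well; the constants A = 3 and B = 2 below are arbitrary:

    drop var2
    generate var2 = 3 + 2*x1          // var2 = A + B*var1, with A = 3, B = 2
    clogit outcome x1 var2, group(group)
    * clogit drops var2 and reports the collinearity note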
The conditional logistic model is subject to another form of collinearity. As before, let us assume
Pr(positive outcome) = G(a + b*var1 + c*var2)
but this time var1 and var2 are *NOT* collinear,
var2 *IS NOT EQUAL TO* A + B*var1
Instead, however, let us assume that, within each group g,

    var2 = A_g + B*var1

That is, var1 and var2 are linearly related in the first group, linearly related in the second group, and so on. The slope B multiplying var1 is the same across groups, but the intercept A_g is allowed to differ from group to group.
If you go back through the algebra for the simple collinearity case, you will see that it all still applies, because only the within-group relationship between var1 and var2 was used: just as e^(c*A) cancelled before, e^(c*A_g) cancels within each group.
The final equation still holds. The conditional probability for the first group can be written
          e^((b+c*B)*var1_1)
---------------------------------------
e^((b+c*B)*var1_1) + e^((b+c*B)*var1_2)
and again, this is just what the term would look like if we estimated on var1 alone.
All of this is an explanation for why, when you fit a conditional logistic model, Stata sometimes says
. clogit outcome var1 var2, group(id)
note: var2 omitted because of collinearity
Iteration 0:  ...
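Continuing the simulated data once more, here is a sketch of within-group collinearity: using the group identifier as the group-specific intercept A_g (an arbitrary choice for illustration), var2 is an exact linear function of x1 within every group yet is not collinear with x1 overall:

    drop var2
    generate var2 = group + 2*x1      // var2 = A_g + B*var1, with A_g = g, B = 2
    clogit outcome x1 var2, group(group)

Depending on the data and on whether clogit detects the problem numerically, var2 is either omitted with the collinearity note or retained with a coefficient whose standard error is reported as missing, the symptom described at the top of this FAQ.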
If clogit reports a missing standard error rather than omitting the variable, or if you otherwise suspect this kind of collinearity, you can check for it by fitting a fixed-effects regression of the suspect variable on the other covariates:

. xtset group
. xtreg var2 ..., fe

If the reported within R-squared is 1 (or nearly so, up to roundoff), var2 is within-group collinear with the other covariates; drop it and refit the conditional logistic model.
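For the simulated data above, the check looks like this; e(r2_w) is the within R-squared that xtreg stores after estimation:

    xtset group
    xtreg var2 x1, fe
    display e(r2_w)                   // 1 (up to roundoff) confirms that var2
                                      // is within-group collinear with x1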