Why does clogit sometimes report a coefficient but a missing value for the
standard error, confidence interval, etc.?
Why is there no intercept in the clogit model?
Why can’t I use covariates that are constant within panel?
Title: Within group collinearity in conditional logistic regression
Author: William Gould, StataCorp
Date: November 1999; updated July 2011
The short answer is that the variable reported with the missing standard
error is “within-group collinear” with other covariates in the
model. You need to drop the within-group collinear variable and reestimate.
You can verify within-group collinearity is the problem by using
fixed-effects regressions on the covariates.
All of this is explained below and, along the way, we also explain why
clogit sometimes
produces the messages “var omitted because of no within-group
variance” and “var omitted because of collinearity”.
The contents of this FAQ are
- 1. The conditional logistic model
- 2. Model derivation
- 2.1 Notation
- 2.2 Intercept
- 2.3 Within-group constants
- 2.4 Collinearity
- 2.5 Within-group collinearity
- 3. Recommendation
1. The conditional logistic model
Conditional logistic regression is similar to ordinary logistic regression
except the data occur in groups,
group 1:
obs. 1 outcome=1 x1 = ... x2 = ...
obs. 2 outcome=0 x1 = ... x2 = ...
group 2:
obs. 3 outcome=1 x1 = ... x2 = ...
obs. 4 outcome=0 x1 = ... x2 = ...
obs. 5 outcome=0 x1 = ... x2 = ...
group 3:
...
.
.
group G:
...
and we wish to condition on the number of positive outcomes within group.
That is, we seek to fit a logistic model that explains why obs. 1 had a
positive outcome in group 1 conditional on one of the observations in
the group having a positive outcome.
In biostatistical applications, this need arises because researchers collect
data on the sick and infected (the so-called positive outcomes), and then
match those cases with controls who are not sick and infected. Thus the
number of positive outcomes is not a random variable. Within each group,
there had to be the observed number of positive outcomes because that is how
the data were constructed.
Economists refer to this same model as the McFadden choice model. An
individual is faced with an array of choices and must choose one.
Regardless of the justification, we are seeking to fit a model that explains
why obs. 1 had a positive outcome in group 1, obs. 3 in group 2, and so on.
2. Model derivation
We assume the unconditional probability of a positive outcome is given
by the standard logit equation
Pr(positive outcome) = G(x*b)
= e^(x*b)/(1+e^(x*b)) (1)
Equation (1) is not the appropriate probability for our data because it does
not account for the conditioning. In the first group, for instance, we want
Pr(obs. 1 positive and obs. 2 negative | one positive outcome)
and that is easy enough to write down in terms of the unconditional
probabilities. It is
Pr(1 positive)*Pr(2 negative)
------------------------------------------------------------- (2)
Pr(1 positive)*Pr(2 negative) + Pr(1 negative)*Pr(2 positive)
From now on, when I write Pr(1 positive) and Pr(2 negative), etc., I mean
the probability that observation 1 had a positive outcome, the probability
that observation 2 had a negative outcome, and so on.
Substituting (1) into (2), we obtain
Pr(1 positive and 2 negative | one positive outcome)
e^(x1*b)
= -------------------- (3)
e^(x1*b) + e^(x2*b)
So that is the model we seek to fit. (At least, that is the term for group
1, and there are similar terms for all the other groups. I have ignored the
possibility of multiple positive outcomes within group because that just
complicates things and is irrelevant to my point.)
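The step from (2) to (3) can be checked numerically. The following is a minimal sketch, assuming made-up values for the linear predictors x1*b and x2*b; the scalar names xb1, xb2, p1, p2, eq2, and eq3 are hypothetical and not part of clogit. It uses Stata's invlogit() function for G() and is best run with no dataset in memory, so that no variable name shadows a scalar.

```stata
* Hypothetical linear predictors x1*b and x2*b for the two observations in group 1
scalar xb1 = 0.8
scalar xb2 = -0.3

* Unconditional probabilities from equation (1); invlogit(z) = e^z/(1+e^z)
scalar p1 = invlogit(xb1)
scalar p2 = invlogit(xb2)

* Equation (2): condition on exactly one positive outcome in the group
scalar eq2 = p1*(1-p2) / (p1*(1-p2) + (1-p1)*p2)

* Equation (3): the same probability after the 1+e^(x*b) terms cancel
scalar eq3 = exp(xb1) / (exp(xb1) + exp(xb2))

display "equation (2): " eq2
display "equation (3): " eq3
```

The two displayed values should agree, because every product in (2) shares the denominator (1+e^(x1*b))*(1+e^(x2*b)), which cancels.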
2.1 Notation
In this FAQ, we will use the following mathematical notation. If you wish,
you can skip to the next section and return here if
our notation confuses you.
- Pr(1 positive), Pr(2 negative), etc.: probability that obs. 1 had a positive outcome, probability that obs. 2 had a negative outcome, etc.
- Pr(1 positive and 2 negative | one positive outcome): probability that obs. 1 is positive and obs. 2 is negative, given one positive outcome in the group.
- e: 2.7182818...; we will write e^anything to mean exp(anything).
- x: vector of values of explanatory variables for an observation.
- x1, x2, etc.: vectors of values of explanatory variables for obs. 1, obs. 2, etc.
- b: vector of coefficients. x*b is thus the summed product of the explanatory variables with their respective coefficients.
- var1, var2, etc.: variables in the x vector.
- var1_1, var1_2, var2_1, var2_2, etc.: var1_1 is the value of var1 in obs. 1, var1_2 the value of var1 in obs. 2, var2_1 the value of var2 in obs. 1, and var2_2 the value of var2 in obs. 2.
- a, b, c: scalars; elements of b.
- x*b = a + b*var1 + c*var2 + ...
- x1*b = a + b*var1_1 + c*var2_1 + ...
- A, B, d: more scalars.
- G(x*b): cumulative “logistic” distribution.
2.2 Intercept
Equation (3) has an unfortunate property. Let’s pretend
x, the vector of explanatory variables, includes var1 and var2. Thus
our model of the probabilities is, from (1),
Pr(positive outcome) = G(a + b*var1 + c*var2)
e^(a + b*var1 + c*var2)
= ---------------------------
1 + e^(a + b*var1 + c*var2)
Equation (3), the probability for the first group, is similarly
e^(a + b*var1_1 + c*var2_1)
----------------------------------------------------------
e^(a + b*var1_1 + c*var2_1) + e^(a + b*var1_2 + c*var2_2)
e^a e^(b*var1_1) e^(c*var2_1)
= --------------------------------------------------------------
e^a e^(b*var1_1) e^(c*var2_1) + e^a e^(b*var1_2) e^(c*var2_2)
e^(b*var1_1) e^(c*var2_1)
= ------------------------------------------------------
e^(b*var1_1) e^(c*var2_1) + e^(b*var1_2) e^(c*var2_2)
where var1_1 and var1_2 are the values of var1 in observations 1 and 2,
respectively.
e^a cancelled in the numerator and denominator. Whatever is
the true value of the intercept, it plays no role in determining the
conditional probabilities of positive outcomes within groups. a
could be 0, −10, or 57.12, and it would make no difference.
Since a plays no role, we will not be able to estimate it. In our
model for the unconditional probabilities, we have
Pr(positive outcome) = G(a + b*var1 + c*var2)
a: cannot be estimated by conditional logistic
b, c: can be estimated by conditional logistic
That’s too bad but most researchers do not care much about the
intercept anyway.
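The cancellation of e^a can be seen numerically. This is a minimal sketch, assuming made-up values for b and for var1 in the two observations of group 1 (for brevity var2 is left out); the scalar names are hypothetical. Run it with no dataset in memory so that no variable name shadows a scalar.

```stata
* Hypothetical slope and values of var1 in obs. 1 and obs. 2 of group 1
scalar b = 0.5
scalar var1_1 = 2
scalar var1_2 = -1

* Conditional probability for group 1 under three different intercepts
foreach a of numlist 0 -10 57.12 {
    scalar p = exp(`a' + b*var1_1) / (exp(`a' + b*var1_1) + exp(`a' + b*var1_2))
    display "a = `a':  conditional probability = " p
}
```

All three lines should show the same probability, which is why no value of a is preferred by the conditional likelihood.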
2.3 Within-group constants
The problem, however, can be worse than that. Say var2 is constant
within group. Remember, our term for the first group is
e^(b*var1_1) e^(c*var2_1)
------------------------------------------------------
e^(b*var1_1) e^(c*var2_1) + e^(b*var1_2) e^(c*var2_2)
If var2_1==var2_2 (var2 is equal for the first two observations), then
e^(c*var2_1) cancels from the numerator and denominator, and we are left with
e^(b*var1_1)
-----------------------------
e^(b*var1_1) + e^(b*var1_2)
If this same cancellation occurs in groups 2, 3, ...—if var2 is a
constant value in each group—then whatever is the true value of
c, it plays no role in our model. c could be anything, and it
would not change any part of our calculation. For this problem to arise,
var2 does not have to be a single constant value, it merely has to be
constant within group.
So now, in our unconditional model, we have
Pr(positive outcome) = G(a + b*var1 + c*var2)
a: cannot be estimated by conditional logistic because constant
c: cannot be estimated by conditional logistic because var2 is constant within group
None of this is very surprising. The conditional logistic model attempts to
explain which observations within each group had positive outcomes, and
things that do not vary within group play no role in the explanation.
Moreover, there can be a real advantage in this. I may think that var2
belongs in the Pr(positive outcome) model but not know how it should be
specified. Does var2 have a linear effect, c*var2; a quadratic effect,
c*var2+d*var2^2; an effect in logs, c*ln(var2); or something else? In the
conditional logistic model, if var2 is
constant within group, it drops out no matter how the effect ought to be
parameterized. This is a great advantage if my interest is in the effect of
var1 and not var2.
All of this is a long explanation for why, when you fit a conditional
logistic model, Stata sometimes says
. clogit outcome var1 var2 var3 ..., group(id)
note: var2 omitted because of no within-group variance
Iteration 0: ...
...
(model without var2 reported)
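Before fitting, you can check for yourself whether a covariate varies within group. The following is a minimal sketch on a tiny made-up dataset; the variable names id, outcome, var1, and var2 and the helper variables var2_min and var2_max are hypothetical, and the block clears any data in memory.

```stata
clear
input id outcome var1 var2
1 1 0.5 3
1 0 1.2 3
2 1 2.0 7
2 0 0.1 7
2 0 1.5 7
end

* Smallest and largest value of var2 within each group
bysort id: egen double var2_min = min(var2)
bysort id: egen double var2_max = max(var2)

* Count observations in groups where var2 actually varies;
* a count of 0 means var2 has no within-group variance
count if var2_max > var2_min
```

Here var2 differs between groups (3 versus 7) but not within them, so the count is 0.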
2.4 Collinearity
I want to go back to our model
Pr(positive outcome) = G(a + b*var1 + c*var2)
for which, in the first group,
Pr(1 positive and 2 negative | one positive outcome)
e^(b*var1_1) e^(c*var2_1)
= ------------------------------------------------------
e^(b*var1_1) e^(c*var2_1) + e^(b*var1_2) e^(c*var2_2)
This time, let’s assume that var1 and var2 are collinear, meaning we
can write
var2 = A + B*var1
It will not surprise you to learn that we will not be able to estimate
b and c. Substituting var2 = A + B*var1 into
our formula for the conditional probability for group 1, we obtain
e^(b*var1_1) e^(c*(A+B*var1_1))
----------------------------------------------------------------
e^(b*var1_1) e^(c*(A+B*var1_1)) + e^(b*var1_2) e^(c*(A+B*var1_2))
e^(b*var1_1) e^(c*A) e^(c*B*var1_1)
= ------------------------------------------------------------------------
e^(b*var1_1) e^(c*A) e^(c*B*var1_1) + e^(b*var1_2) e^(c*A) e^(c*B*var1_2)
e^(b*var1_1) e^(c*B*var1_1)
= -----------------------------------------------------
e^(b*var1_1) e^(c*B*var1_1) + e^(b*var1_2) e^(c*B*var1_2)
e^((b+c*B)*var1_1)
= ---------------------------------------
e^((b+c*B)*var1_1) + e^((b+c*B)*var1_2)
Let us write d = b+c*B. The term can then be
written
e^(d*var1_1)
--------------------------
e^(d*var1_1) + e^(d*var1_2)
This is just what the term would look like if we estimated on var1 alone.
Thus to fit this model we could
- Step 1: Estimate on var1 alone to obtain d.
- Step 2: Solve d = b+c*B to obtain b and c.
The problem occurs in step 2. We have one equation and two unknowns
(b and c).
All of this is a long explanation for why, when you fit a conditional
logistic model, Stata sometimes says
. clogit outcome var1 var2 var3 ..., group(id)
note: var2 omitted because of collinearity
Iteration 0: ...
...
(model without var2 reported)
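The one-equation, two-unknowns problem can be illustrated numerically. This is a minimal sketch, assuming made-up values for A, B, and var1; the scalar names are hypothetical. Two different (b, c) pairs that share the same d = b+c*B yield the same conditional probability for group 1. Run it with no dataset in memory so that no variable name shadows a scalar.

```stata
* Hypothetical exact collinearity: var2 = A + B*var1
scalar A = 4
scalar B = 2
scalar var1_1 = 1.5
scalar var1_2 = -0.5
scalar var2_1 = A + B*var1_1
scalar var2_2 = A + B*var1_2

* Two different (b, c) pairs, both with d = b + c*B = 1
scalar b1 = 1
scalar c1 = 0
scalar b2 = -3
scalar c2 = 2

* Conditional probability for group 1 under each pair
scalar num1 = exp(b1*var1_1 + c1*var2_1)
scalar p_pair1 = num1 / (num1 + exp(b1*var1_2 + c1*var2_2))
scalar num2 = exp(b2*var1_1 + c2*var2_1)
scalar p_pair2 = num2 / (num2 + exp(b2*var1_2 + c2*var2_2))

display "b = 1,  c = 0: " p_pair1
display "b = -3, c = 2: " p_pair2
```

Because the data cannot distinguish between the two pairs, only d is identified.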
2.5 Within-group collinearity
The conditional logistic model is subject to another form of collinearity.
As before, let us assume
Pr(positive outcome) = G(a + b*var1 + c*var2)
but this time var1 and var2 are *NOT* collinear,
var2 *IS NOT EQUAL TO* A + B*var1
Instead, however, let us assume that, for each group
var2 = A_g + B*var1
That is, var1 and var2 are linearly related in the first group, linearly
related in the second group, and so on. The coefficient B
multiplying var1 is the same across groups, but the intercept A_g is
allowed to differ from group to group.
If you go back through the algebra for the simple collinearity case, you
will note that it is all applicable because only the within-group
collinearity of var1 and var2 was used.
The final equation still holds. The conditional probability for the first
group can be written
e^((b+c*B)*var1_1)
---------------------------------------
e^((b+c*B)*var1_1) + e^((b+c*B)*var1_2)
and again, this is just what the term would look like if we estimated on
var1 alone.
All of this is an explanation for why, when you fit a conditional
logistic model, Stata sometimes says
. clogit outcome var1 var2, group(id)
note: var2 omitted because of collinearity
Iteration 0: ...
3. Recommendation
If you suspect this kind of collinearity,
- Take the variable that was dropped—let’s call it var2—and
estimate a fixed-effects regression on all the other independent
variables using xtreg with the
fe option:
. xtreg var2 ..., i(group) fe
- If you obtain an R-sq within of 1, then you do have within-group
collinearity. You will have to admit that you cannot estimate
the var2 effect. Refit your clogit model, omitting the
variable.
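The check above can be tried on simulated data. The following is a minimal sketch, assuming a made-up dataset in which var2 = A_g + 2*var1 within each group; the names id, var1, var2, and A_g, the seed, and the sample size are all hypothetical, and the block clears any data in memory.

```stata
clear
set seed 1999
* Hypothetical data: 100 groups of 3 observations
set obs 300
generate id = ceil(_n/3)
generate double var1 = rnormal()

* Group-specific intercept A_g, constant within group
sort id
by id: generate double A_g = rnormal() if _n == 1
by id: replace A_g = A_g[1]

* var2 is within-group collinear with var1 (B = 2) but not collinear overall
generate double var2 = A_g + 2*var1

* Ordinary regression: R-squared is below 1, so there is no simple collinearity
regress var2 var1

* Fixed-effects regression: an R-sq within of 1 reveals within-group collinearity
xtreg var2 var1, i(id) fe
display "R-sq within = " e(r2_w)
```

In this sketch, the ordinary regression does not flag a problem, while the fixed-effects regression reports an R-sq within of 1 (up to rounding), which is the pattern described above.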