# Re: st: Mantel-Haenszel vs. clustered logistic - please help

 From Mark Schaffer <[email protected]> To [email protected], Constantine Daskalakis <[email protected]> Subject Re: st: Mantel-Haenszel vs. clustered logistic - please help Date Fri, 23 May 2003 21:48:00 +0100 (BST)

```
Hi everybody.  With respect to clustering,

Quoting Constantine Daskalakis <[email protected]>:

> > > Dear all,
> > >
> > > I am analyzing data from the 306 women at 4 outpatient
> > > clinics in Oklahoma. Each woman was asked if they
> > > performed monthly breast exams and other additional
> > > data (covariates) such as race was collected. We would
> > > like to characterize women according to these
> > > covariates. I am concerned that these women are
> > > from different clinics and would like to take this
> > > into account.
> > >
> > > My first thought was to perform a logistic regression
> > > clustering on clinic.
> >
> > > Summary:
> > >
> > > The raw OR:           1.97  (1.10,3.54) p= 0.0138
> > > Mantel-Haenszel OR:   2.94  (1.44,5.98) p= 0.0018
> > > Cluster logistic OR:  1.97  (0.77,5.06) p= 0.158
> > >
> > > Regards,
> > > Ricardo
>
> There are two issues here.
>
> 1. Do you need to control for clinic in your model? This is necessary
> if the clinics have varying rates of SBE (ie, clinic is a confounder).
>
> The logistic regressions do not control for this. The MH does. You can
> get comparable results by including appropriate dummy variables for
> clinic in your logistic models.
>
> Looking at your results, it seems that you do need to control for
> clinic.
>
> 2. Do you need to control for the clustering (ie, correlation of
> observations) within clinics?
>
> Usual logistic regression assumes that all observations are
> independent. This may not hold for observations that come from the
> same clinic (ie, the outcomes for two women of the same clinic may be
> more similar than the outcomes for two women from different clinics).
> The cluster/robust options account for this lack of independence.
>
> I don't agree with Mark that "clustering on clinic is not a good
> idea." It does not matter how many clusters you have, as long as you
> account for the within-cluster correlation. The robust variance
> certainly does not treat the data as 4 observations. This is how I'd
> explain it.
>
> Without clustering (ie, independence for all observations), you have
> 306 observations.
>
> With perfect clustering (ie, perfect correlation within each clinic,
> but independence across clinics), you have 4 observations.
>
> The cluster/robust option will essentially correctly fall somewhere in
> between, depending on the actual degree of correlation in your data.
> This is as it should be, no? You don't really have info from 306
> independent observations, but less (depending on how correlated they
> are).

This isn't how cluster-robust estimation works in Stata.  Sadly, I don't
have version 8, but precise statements of how cluster-robust SEs are
calculated can be found in the version 7 manuals under -regress-
(pp. 87-88) and _robust (pp. 242ff).

Under _robust, p. 243, we read the following:

"Clusters ... are independent, so we can sum the scores within a cluster to
create a 'super-observation' and then use the standard formula for a total
on these independent super-observations."

This is not the only way to deal with intra-group correlation, but it
certainly is the Stata cluster-robust approach.  4 clusters is 4 "super-
observations", i.e., not very many.

For an alternative discussion in the context of linear regression (blatant
self-promotion), see Baum, Schaffer and Stillman, "Instrumental Variables
and GMM", Stata Journal 3:1, 2003, Section 2.5, p. 10.  The usual
Eicker-Huber-White robust covariance matrix estimator is calculated using
the matrix product 1/N * X'*Omega_hat*X, where Omega_hat is a diagonal
matrix of squared residuals.  In the cluster-robust approach, the Omega_hat
matrix takes a block-diagonal form, with each block corresponding to a
cluster.  If you have only 4 clusters, this matrix has only 4 blocks.  Bad
news.

Alternatively, try the following experiment.  Using the auto data, run the
following regression:

regress price mpg weight length, cluster(foreign)

where foreign is a dummy variable, so there are only 2 clusters.

You will find that Stata will not report an F test statistic.  Click on the
hyperlink and you get an explanation why, including the following:

"The rank of the VCE is determined by the number of clusters, or PSU's and
strata in the survey case.  ... There is no mechanical problem with your
model, but you need to consider carefully whether any of the reported
standard errors mean anything."

To use wording like "you need to consider carefully" etc. is being much too
polite!  The reported SEs are simply nonsense.  Even if you don't get this
error message, they may still be nonsense.  Stata will also run a
regression on 40 observations and 35 explanatory variables without
providing an error message.  The consistency of cluster-robust SEs relies
on asymptotics, i.e., the number of observations and clusters going off to
infinity.  4 clusters is not very far down the road to infinity!

Personally, I think this problem is VERY easy to stumble into [decoded =
I've done it myself] and could do with much more highlighting in the
manuals, and in the on-line help and error messages.

--Mark

>
> Looking at your results, it seems that there's a non-trivial
> within-clinic correlation. -logit- uses the "independence" working
> correlation to correct for clustering. You can use -xtlogit- (with
> exchangeable correlation) to get more efficient estimates, ie, tighter
> standard errors.

```
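The two-cluster experiment described in the message can be sketched as a short do-file. This is only an illustration, not part of the original post: it assumes a Stata version that ships the auto data via `sysuse` (Stata 8 or later; on older versions `use auto` from the Stata directory plays the same role), and the extra runs without clustering are added here purely for comparison.

```stata
* Sketch: the same regression with conventional, robust and
* cluster-robust standard errors. foreign is a 0/1 dummy in the
* auto data, so clustering on it gives only 2 clusters.
sysuse auto, clear

* Conventional SEs (observations treated as independent)
regress price mpg weight length

* Eicker-Huber-White robust SEs (still no clustering)
regress price mpg weight length, robust

* Cluster-robust SEs built from only 2 "super-observations";
* this is the run for which Stata reports no F statistic
regress price mpg weight length, cluster(foreign)

* Number of clusters the cluster-robust VCE was built from
display e(N_clust)

* Residual degrees of freedom Stata uses for the t tests in this run
display e(df_r)
```

Comparing the three coefficient tables shows identical point estimates with different standard errors. The cluster-robust run is the one whose standard errors rest on a number of clusters far too small for the asymptotic argument to apply.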