
I was wrong about the changes that will take place when treatment is
designated as a stratum in the svy analysis. Not only will the degrees
of freedom change, so will the estimated SEs. Therefore the p-value
I gave for the GFR analysis will not be 0.052. Michael, what happens
when you recompute? In any case, I agree with Austin that, with four
clusters, no statistical inference is reliable.
Steve
On Sep 9, 2008, Steven Samuels wrote:
In your svyset statement, you made a mistake unrelated to the
downward bias of the cluster-robust SE: you must designate
treatment group as a stratum variable. That will make the
degrees of freedom = 2, and lead to a nominal p = 0.052.
Steve
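[In Stata, Steve's suggestion would presumably look something like the
sketch below; the variable names rsiteid (practice) and rcontrol
(treatment arm) are taken from Michael's output, but the exact svyset
call is an assumption, not something Steve posted:]

```stata
* Sketch only: declare the practices as PSUs and the treatment arm as
* a stratum, so design df = #PSUs - #strata = 4 - 2 = 2
svyset rsiteid, strata(rcontrol)
svy: regress gfr_achg rcontrol
```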
On Sep 9, 2008, at 4:32 PM, Michael I. Lichter wrote:
Thanks to Austin and Jeph for responding. In reply to Jeph ...
I think there are good reasons to avoid both. You don't say what
kinds of analyses you have, but see -ssc describe cltest- for some
tools and a reference for analyzing cluster-randomized outcomes
using adjustments to the standard chi-squared and t-tests.
Can you explain why to avoid both? Aren't they adjusting for the
same phenomenon, clustering of observations? I'll describe the
analyses, but that will take some background ...
This is a small trial of an intervention designed to promote
guideline-based diagnosis and treatment of patients with chronic
kidney disease (CKD). Four medical practices were selected and two
each were randomly assigned to control and intervention. (Yes, I
know that it is not recommended to do a CRT with fewer than 5
clusters per arm.) Primary indicators include glomerular filtration
rate (GFR) and whether or not patients with substandard GFR were
diagnosed during the trial period as having CKD. We predict stable
or rising GFR in intervention practices compared to falling GFR in
control practices, and higher rates of physician-diagnosed CKD in
intervention practices compared to control practices. The universe
of patients is those with substandard GFR levels prior to the
intervention.
For GFR, I was planning to regress pre/post absolute change in GFR
on a dummy for control vs. not. (I'd like to include covariates
like age and sex, but don't have the degrees of freedom.) In
partial answer to Austin's question about differences in results
between cluster() and svy, and also to ask about a problem with
clttest, I've included output below for this regression (1)
unclustered, (2) with the cluster() option, (3) with the svy
command, and (4) with clttest, which isn't a regression but does
essentially the same thing in this instance.
. reg gfr_achg rcontrol   /* unclustered */

      Source |       SS       df       MS              Number of obs =     159
-------------+------------------------------           F(  1,   157) =    0.30
       Model |  30.4456806     1  30.4456806           Prob > F      =  0.5834
    Residual |  15825.7933   157  100.801231           R-squared     =  0.0019
-------------+------------------------------           Adj R-squared = -0.0044
       Total |   15856.239   158  100.355943           Root MSE      =   10.04

------------------------------------------------------------------------------
    gfr_achg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    rcontrol |  -.9589666   1.744912    -0.55   0.583    -4.405498    2.487565
       _cons |  -.2553191   1.464482    -0.17   0.862    -3.147948     2.63731
------------------------------------------------------------------------------
. reg gfr_achg rcontrol, cluster(rsiteid)   /* clustered */

Linear regression                                      Number of obs =     159
                                                       F(  1,     3) =   17.62
                                                       Prob > F      =  0.0247
                                                       R-squared     =  0.0019
Number of clusters (rsiteid) = 4                       Root MSE      =   10.04

------------------------------------------------------------------------------
             |               Robust
    gfr_achg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    rcontrol |  -.9589666   .2284233    -4.20   0.025    -1.685911   -.2320217
       _cons |  -.2553191    .223962    -1.14   0.337    -.9680661    .4574278
------------------------------------------------------------------------------
. svy: reg gfr_achg rcontrol   /* survey */
(running regress on estimation sample)

Survey: Linear regression

Number of strata =       1                        Number of obs    =       159
Number of PSUs   =       4                        Population size  =       159
                                                  Design df        =         3
                                                  F(   1,      3)  =     17.74
                                                  Prob > F         =    0.0245
                                                  R-squared        =    0.0019

------------------------------------------------------------------------------
             |             Linearized
    gfr_achg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    rcontrol |  -.9589666   .2276993    -4.21   0.024    -1.683607   -.2343258
       _cons |  -.2553191   .2232521    -1.14   0.336     -.965807    .4551687
------------------------------------------------------------------------------
. estat effects

------------------------------------------------------------
             |             Linearized
    gfr_achg |      Coef.   Std. Err.       Deff       Deft
-------------+----------------------------------------------
    rcontrol |  -.9589666   .2276993     .012158    .110264
       _cons |  -.2553191   .2232521     .013712    .117098
------------------------------------------------------------
. clttest gfr_achg, by(rcontrol) cluster(rsiteid)   /* clustered t-test */

t-test adjusted for clustering
gfr_achg by rcontrol, clustered by rsiteid

Intracluster correlation = -0.0267

              N   Clusts      Mean        SE               95% CI
rcontrol=0   47        2   -0.2553    0.7924   [-10.3243,   9.8137]
rcontrol=1  112        2   -1.2143         .   [       .,        .]

Combined    159        2   -0.9308         .   [       .,        .]

Diff(0-1)   159        4    0.9590         .   [       .,        .]

Degrees freedom: 2

Ho: mean(diff) = 0

 Ha: mean(diff) < 0       Ha: mean(diff) ~= 0        Ha: mean(diff) > 0
     t =  2.1346              t =  2.1346                t =  2.1346
 P < t = 0.9168           P > |t| = 0.1664           P > t = 0.0832
Suggestions on why the t-test didn't work (it didn't calculate the SE)
would be welcome; it worked fine for a t-test of differences in the
post-GFR itself.
BTW, you might have noticed that the SEs are *smaller* in the
cluster/svy model compared to the unclustered model. That's because
the internal variation within clusters is much larger than the
differences between them; you can see this also in the deff and
deft being less than 1.0. Does this give me an excuse to treat the
data as unclustered?
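[As a rough check, the Kish approximation deff = 1 + (m - 1)*rho can be
inverted to see what ICC the observed design effect implies. This is a
back-of-envelope sketch, assuming roughly equal cluster sizes of
159/4, about 40 patients per practice:]

```stata
* Sketch: implied ICC from the observed design effect
* deff = 1 + (m - 1)*rho  =>  rho = (deff - 1)/(m - 1)
display (0.012158 - 1) / (159/4 - 1)    // roughly -0.025
```

[The implied ICC is negative and close in magnitude to what clttest
reports, consistent with within-cluster variation exceeding the
between-cluster variation for this contrast.]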
On the other hand, when I look at ckd2 (diagnosed with CKD) for
those not diagnosed before the start of the study (ckd1 == 0), I
get a substantial design effect:
. svy: logit ckd2 rcontrol if ckd1==0
(running logit on estimation sample)

Survey: Logistic regression

Number of strata =       1                        Number of obs    =       259
Number of PSUs   =       4                        Population size  =       259
                                                  Design df        =         3
                                                  F(   1,      3)  =     10.64
                                                  Prob > F         =    0.0471

------------------------------------------------------------------------------
             |             Linearized
        ckd2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    rcontrol |  -2.058052   .6309325    -3.26   0.047     -4.06596   -.0501428
       _cons |   1.015231   .4127449     2.46   0.091    -.2983078    2.328769
------------------------------------------------------------------------------
. estat effects

------------------------------------------------------------
             |             Linearized
        ckd2 |      Coef.   Std. Err.       Deff       Deft
-------------+----------------------------------------------
    rcontrol |  -2.058052   .6309325     4.61385    2.14799
       _cons |   1.015231   .4127449     3.11419    1.76471
------------------------------------------------------------
Does all that make sense?
Another preferred option is to use panel methods such as xtmixed
with the clusters specified as panels. Even if you don't have
covariates (and in an RCT you will need to make a case for
including them), these are often preferred.
This is preferred because ... ?
. xtmixed gfr_achg rcontrol

Mixed-effects REML regression                   Number of obs      =       159
                                                Wald chi2(1)       =      0.30
Log restricted-likelihood = -589.18999          Prob > chi2        =    0.5826

------------------------------------------------------------------------------
    gfr_achg |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    rcontrol |  -.9589666   1.744912    -0.55   0.583    -4.378931    2.460998
       _cons |  -.2553191   1.464482    -0.17   0.862    -3.125651    2.615013
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
                sd(Residual) |   10.03998   .5630142      8.994974     11.2064
------------------------------------------------------------------------------
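[One thing worth noting about the xtmixed run above: as typed, the
command has no random-effects part, which is why only sd(Residual) is
reported and the estimates match plain OLS. With the clusters
specified as panels, as Jeph suggests, the call would presumably look
like the sketch below; this is an assumed command, not one from the
thread:]

```stata
* Sketch: random intercept for each practice (untested on these data)
xtmixed gfr_achg rcontrol || rsiteid:
```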
More detail on your design might produce more detailed answers.
See above.
Hope this helps,
Jeph
It does help. Thanks!
Michael I. Lichter wrote:
Hello, friends. I have a question about the analysis of data from
cluster-randomized trials (CRTs). CRTs are experiments where
subjects are randomly assigned to conditions (control, treatment)
based on their group membership rather than being assigned
individually as is usually the case in randomized controlled
trials. In my study, the clusters are medical practices, so when
a medical practice is assigned to a condition, all of the
eligible patients therein are also assigned to the condition.
CRTs should be analyzed using methods that take account of the
clustering in the study design, of course.
My question is this: For CRTs, is there any statistical reason
for preferring the cluster() option on estimation commands (e.g.,
regress, logit) over the survey commands, or vice versa? I've
used both and the results are similar, but the survey commands
estimate larger standard errors. If the answer is that they're
both equally appropriate but produce different results because
they use somewhat different methods of estimation, that's fine.
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
*