Re: st: cluster() or svy? (analysis of cluster-randomized trials)
| From | Steven Samuels <[email protected]> | 
| To | [email protected] | 
| Subject | Re: st: cluster() or svy? (analysis of cluster-randomized trials) | 
| Date | Wed, 10 Sep 2008 08:52:31 -0400 | 
-
I was wrong about the changes that will take place when treatment is  
designated as a stratum in the -svy- analysis. Not only will degrees  
of freedom change, so will the estimated SE's. Therefore the p-value  
I gave for the GFR analysis will not be 0.052.  Michael, what happens  
when you recompute?  In any case, I agree with Austin that, with four  
clusters, no statistical inference is reliable.
-Steve
On Sep 9, 2008, Steven Samuels wrote:
In your -svyset- statement, you made a mistake unrelated to the  
downward bias of the cluster-robust SE:  you must designate  
treatment group as a -stratum- variable.  That will make the  
degrees of freedom = 2, and lead to a nominal p = 0.052.
-Steve
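The nominal figure can be checked against the t distribution. This is a minimal sketch, assuming the t statistic stays near 4.21 (the value in the -svy- output further down) and only the design degrees of freedom drop from 3 to 2. The follow-up at the top of this message notes that the SE changes as well, so this is not the corrected result. The -svyset- lines assume the trial data are in memory with the variable names used in this thread.

```stata
* ttail(df, t) returns the upper-tail probability of Student's t.
* One stratum, 4 PSUs: design df = 4 - 1 = 3
display 2*ttail(3, 4.21)
* Two strata, 4 PSUs: design df = 4 - 2 = 2 (roughly 0.052)
display 2*ttail(2, 4.21)

* Declare practices as PSUs and treatment arm as the stratum, then refit
svyset rsiteid, strata(rcontrol)
svy: regress gfr_achg rcontrol
```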
On Sep 9, 2008, at 4:32 PM, Michael I. Lichter wrote:
Thanks to Austin and Jeph for responding. In reply to Jeph ...
> I think there are good reasons to avoid both. You don't say what
> kinds of analyses you have, but see ssc describe cltest for some
> tools and a reference for analyzing cluster randomized outcomes
> using adjustments to the standard chi-2 and t-tests.
Can you explain why to avoid both? Aren't they adjusting for the  
same phenomenon--clustering of observations? I'll describe the  
analyses, but that will take some background ...
This is a small trial of an intervention designed to promote  
guideline-based diagnosis and treatment of patients with chronic  
kidney disease (CKD). Four medical practices were selected and two  
each were randomly assigned to control and intervention. (Yes, I  
know that it is not recommended to do CRT with fewer than 5  
clusters per arm.) Primary indicators include glomerular filtration  
rate (GFR) and whether or not patients with substandard GFR were  
diagnosed during the trial period as having CKD. We predict stable  
or rising GFR in intervention practices compared to falling GFR in  
control practices, and higher rates of physician-diagnosed CKD in  
intervention practices compared to control practices. The universe  
of patients is those with substandard GFR levels prior to the  
intervention.
For GFR, I was planning to regress pre/post absolute change in GFR  
on a dummy for control vs. not. (I'd like to include covariates  
like age and sex, but don't have the degrees of freedom). In  
partial answer to Austin's question about differences in results  
between cluster() and svy, and also to ask about a problem with  
clttest, I've included output below for this regression (1)  
unclustered, (2) with the cluster() option, (3) with the svy  
command, and (4) with clttest -- which isn't a regression but does  
essentially the same thing in this instance.
. reg gfr_achg rcontrol /* unclustered */
      Source |       SS       df       MS              Number of obs =     159
-------------+------------------------------           F(  1,   157) =    0.30
       Model |  30.4456806     1  30.4456806           Prob > F      =  0.5834
    Residual |  15825.7933   157  100.801231           R-squared     =  0.0019
-------------+------------------------------           Adj R-squared = -0.0044
       Total |   15856.239   158  100.355943           Root MSE      =   10.04
------------------------------------------------------------------------------
    gfr_achg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    rcontrol |  -.9589666   1.744912    -0.55   0.583    -4.405498    2.487565
       _cons |  -.2553191   1.464482    -0.17   0.862    -3.147948     2.63731
------------------------------------------------------------------------------
. reg gfr_achg rcontrol, cluster(rsiteid) /* clustered */
Linear regression                                      Number of obs =     159
                                                       F(  1,     3) =   17.62
                                                       Prob > F      =  0.0247
                                                       R-squared     =  0.0019
Number of clusters (rsiteid) = 4                       Root MSE      =   10.04
------------------------------------------------------------------------------
             |               Robust
    gfr_achg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    rcontrol |  -.9589666   .2284233    -4.20   0.025    -1.685911   -.2320217
       _cons |  -.2553191    .223962    -1.14   0.337    -.9680661    .4574278
------------------------------------------------------------------------------
. svy: reg gfr_achg rcontrol /* survey */
(running regress on estimation sample)
Survey: Linear regression
Number of strata   =         1                  Number of obs      =       159
Number of PSUs     =         4                  Population size    =       159
                                                Design df          =         3
                                                F(   1,      3)    =     17.74
                                                Prob > F           =    0.0245
                                                R-squared          =    0.0019
------------------------------------------------------------------------------
             |             Linearized
    gfr_achg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    rcontrol |  -.9589666   .2276993    -4.21   0.024    -1.683607   -.2343258
       _cons |  -.2553191   .2232521    -1.14   0.336     -.965807    .4551687
------------------------------------------------------------------------------
. estat effects
----------------------------------------------------------
            |             Linearized
   gfr_achg |      Coef.   Std. Err.       Deff      Deft
-------------+--------------------------------------------
   rcontrol |  -.9589666   .2276993     .012158   .110264
      _cons |  -.2553191   .2232521     .013712   .117098
----------------------------------------------------------
. clttest gfr_achg, by(rcontrol) cluster(rsiteid) /* clustered t-test */
t-test adjusted for clustering
gfr_achg by rcontrol, clustered by rsiteid
------------------------------------------------------------------------
 Intra-cluster correlation         =          -0.0267
------------------------------------------------------------------------
              N    Clusts    Mean           SE             95 % CI
rcontrol=0    47    2      -0.2553      0.7924       [-10.3243,   9.8137]
rcontrol=1   112    2      -1.2143           .       [       .,       .]
------------------------------------------------------------------------
Combined    159     2      -0.9308           .       [       .,       .]
------------------------------------------------------------------------
Diff(0-1)   159     4       0.9590           .       [       .,       .]
Degrees freedom:    2
                   Ho: mean(-) = mean(diff) = 0
 Ha: mean(diff) < 0         Ha: mean(diff) ~= 0        Ha: mean(diff) > 0
       t =   2.1346                t =   2.1346              t =   2.1346
   P < t =   0.9168          P > |t| =   0.1664          P > t =   0.0832
Suggestions on why the t-test didn't work (it didn't calculate SE)  
would be welcome--it worked fine for a t-test of differences in the  
post-GFR itself.
BTW, you might have noticed that the SEs are *smaller* in the  
cluster/svy model compared to the unclustered model. That's because  
the internal variation within clusters is much larger than the  
differences between them--you can see this also in the deff and  
deft being less than 1.0. Does this give me an excuse to treat the  
data as unclustered?
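One way to look at the within- versus between-practice variation directly is the ANOVA-based intraclass correlation. This is a sketch only, assuming the same dataset is in memory. -loneway- is Stata's built-in one-way ANOVA command, which also reports the intraclass correlation.

```stata
* One-way ANOVA of GFR change by practice, with intraclass correlation
loneway gfr_achg rsiteid
```

With only four practices the intraclass correlation is estimated very imprecisely, so a small or negative estimate is weak evidence that the clustering can be ignored.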
On the other hand, when I look at ckd2 (diagnosed with CKD) for  
those not diagnosed before the start of the study (ckd1 == 0), I  
get a substantial design effect:
. svy: logit ckd2 rcontrol if ckd1==0
(running logit on estimation sample)
Survey: Logistic regression
Number of strata   =         1                  Number of obs      =       259
Number of PSUs     =         4                  Population size    =       259
                                                Design df          =         3
                                                F(   1,      3)    =     10.64
                                                Prob > F           =    0.0471
------------------------------------------------------------------------------
             |             Linearized
        ckd2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    rcontrol |  -2.058052   .6309325    -3.26   0.047     -4.06596   -.0501428
       _cons |   1.015231   .4127449     2.46   0.091    -.2983078    2.328769
------------------------------------------------------------------------------
. estat effects
----------------------------------------------------------
            |             Linearized
       ckd2 |      Coef.   Std. Err.       Deff      Deft
-------------+--------------------------------------------
   rcontrol |  -2.058052   .6309325     4.61385   2.14799
      _cons |   1.015231   .4127449     3.11419   1.76471
----------------------------------------------------------
Does all that make sense?
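As a rough reading of those numbers: Deft is the square root of Deff, and dividing the sample size by Deff gives an approximate effective sample size. The sketch below only redoes that arithmetic with the values reported for rcontrol. The first line should come out near the reported Deft of 2.148, and the second near 56.

```stata
* Deft is the square root of Deff
display sqrt(4.61385)
* Approximate effective sample size implied by the design effect
display 259/4.61385
```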
> Another preferred option is to use panel methods such as -xtmixed-
> with the clusters specified as panels. Even if you don't have
> covariates (and in an RCT you will need to make a case for including
> them), these are often preferred.
This is preferred because ... ?
. xtmixed gfr_achg rcontrol
Mixed-effects REML regression                   Number of obs      =       159
                                                Wald chi2(1)       =      0.30
Log restricted-likelihood = -589.18999          Prob > chi2        =    0.5826
------------------------------------------------------------------------------
    gfr_achg |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    rcontrol |  -.9589666   1.744912    -0.55   0.583    -4.378931    2.460998
       _cons |  -.2553191   1.464482    -0.17   0.862    -3.125651    2.615013
------------------------------------------------------------------------------
------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
                sd(Residual) |   10.03998   .5630142      8.994974     11.2064
------------------------------------------------------------------------------
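Note that the call above has no random-effects equation, so the only variance parameter is sd(Residual) and the coefficients and standard errors match the unclustered regression. Below is a sketch of the specification with practices as the grouping level, assuming the same variables are in memory. The `|| rsiteid:` part is the addition.

```stata
* Random intercept for each practice (cluster)
xtmixed gfr_achg rcontrol || rsiteid:
```

With four practices the cluster-level variance is poorly estimated. Given the negative intra-cluster correlation reported by -clttest-, it may well be estimated at or near zero.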
> More detail on your design might produce more detailed answers.
See above.
> Hope this helps,
> Jeph
It does help. Thanks!
Michael I. Lichter wrote:
Hello, friends. I have a question about the analysis of data from  
cluster-randomized trials (CRTs). CRTs are experiments where  
subjects are randomly assigned to conditions (control, treatment)  
based on their group membership rather than being assigned  
individually as is usually the case in randomized controlled  
trials. In my study, the clusters are medical practices, so when  
a medical practice is assigned to a condition, all of the  
eligible patients therein are also assigned to the condition.  
CRTs should be analyzed using methods that take account of the  
clustering in the study design, of course.
My question is this: For CRTs, is there any statistical reason  
for preferring the cluster() option on estimation commands (e.g.,  
regress, logit) over the survey commands, or vice-versa? I've  
used both and the results are similar, but the survey commands  
estimate larger standard errors. If the answer is that they're  
both equally appropriate but produce different results because  
they use somewhat different methods of estimation, that's fine.
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/