|
This FAQ is relevant for users of Stata 9.
Why do Stata and SAS differ in the results they report for the
stratified generalized Wilcoxon test for time-to-event data?
|
Title
|
|
Stratified generalized Wilcoxon test
|
|
Author
|
Ronald Thisted, Departments of Statistics, Health Studies,
and Anesthesia & Critical Care, The University of Chicago
|
|
Date
|
January 1999; updated June 2005
|
This FAQ details the reason we believe Stata and SAS differ in the results
that they report for the stratified Wilcoxon test. The context is
time-to-event data from a stratified analysis such as a randomized
multicenter clinical trial comparing an active treatment with a control,
stratifying by (controlling for) site-specific differences in underlying
event-time distribution. Multicenter randomized clinical trials comparing
treatments often assess the treatment effect using the log-rank or a
generalized Wilcoxon test, stratified by study site.
Although SAS does not have a procedure for
calculating the standard stratified log-rank or Wilcoxon tests, PROC LIFETEST
can be used to construct an approximate statistical test of treatment effect
in the presence of stratification.
Stata, on the other hand, does provide the option strata()
to its sts test
command, which allows for the analysis of stratified data. It combines the
stratum-specific estimates according to the procedure described for the
standard stratified test by Collett (1994, section 2.7) and by Klein and
Moeschberger (1997, section 7.5).
A particular example
Using a small published clinical trial dataset, we demonstrate the
differences obtained by SAS and Stata and show that the use of PROC LIFETEST
produces a different statistical significance of the treatment effect
compared with the standard stratified test. The data, given by Lee (1992) and
quoted by Collett (1994, p. 47, table 2.8), are reproduced below:
_____________________________________________________
21 - 40 40 - 60 61+
_______________ _______________ _______________
BCG c. parvum BCG c. parvum BCG c. parvum
_____________________________________________________
19 27* 34* 8 10 25*
24* 21* 4 11* 5 8
8 18* 17* 23* 11*
17* 16* 12*
17* 7 15*
34* 12* 8*
24 8*
8
8*
_____________________________________________________
An Asterisk (*) indicates a censored observation.
These data have been entered into Stata:
. use http://www.stata.com/support/faqs/dta/parvum
. list agegrp time c if treat==0, sepby(agegrp)
+-------------------+
| agegrp time c |
|-------------------|
1. | 1 19 1 |
2. | 1 24 0 |
3. | 1 8 1 |
4. | 1 17 0 |
5. | 1 17 0 |
6. | 1 34 0 |
|-------------------|
16. | 2 34 0 |
17. | 2 4 1 |
18. | 2 17 0 |
|-------------------|
26. | 3 10 1 |
27. | 3 5 1 |
+-------------------+
. list agegrp time c if treat==1, sepby(agegrp)
+-------------------+
| agegrp time c |
|-------------------|
7. | 1 27 0 |
8. | 1 21 0 |
9. | 1 18 0 |
10. | 1 16 0 |
11. | 1 7 1 |
12. | 1 12 0 |
13. | 1 24 1 |
14. | 1 8 1 |
15. | 1 8 0 |
|-------------------|
19. | 2 8 1 |
20. | 2 11 0 |
21. | 2 23 0 |
22. | 2 12 0 |
23. | 2 15 0 |
24. | 2 8 0 |
25. | 2 8 0 |
|-------------------|
28. | 3 25 0 |
29. | 3 8 1 |
30. | 3 11 0 |
+-------------------+
The aim of the clinical trial was to compare the ability of two
immunotherapies (BCG and Cryptosporidium parvum) to prolong the life of
malignant melanoma patients. Thirty patients were randomized to one of the
two treatment groups and then further classified into three age groups. This
dataset is instructive because (a) it is small enough that the calculations
can be replicated by hand, (b) it includes a small number of tied event
times, and (c) the intermediate calculations are documented in
Collett’s book.
After st
setting the data, the command in Stata for comparing via the generalized
Wilcoxon test the survival experience of the two treatment groups without
stratification is sts test treat, wilcoxon, and correcting
(stratifying) for age is: sts test treat, wilcoxon strata(agegrp)
where treat is an indicator variable of treatment type and
agegrp is the stratification variable. The results of these two
commands are
. stset time, failure(c)
. sts test treat, wilcoxon
failure _d: c
analysis time _t: time
Wilcoxon (Breslow) test for equality of survivor functions
| Events Events Sum of
treat | observed expected ranks
------+--------------------------------------
0 | 5 3.71 34
1 | 5 6.29 -34
------+--------------------------------------
Total | 10 10.00 0
chi2(1) = 0.91
Pr>chi2 = 0.3397
. sts test treat, wilcoxon strata(agegrp)
failure _d: c
analysis time _t: time
Stratified Wilcoxon (Breslow) test for equality of survivor functions
| Events Events Sum of
treat | observed expected(*) ranks(*)
------+--------------------------------------
0 | 5 3.76 6
1 | 5 6.24 -6
------+--------------------------------------
Total | 10 10.00 0
(*) sum over calculations within agegrp
chi2(1) = 0.18
Pr>chi2 = 0.6726
These results match those reported by Collett (1994, 47–49).
Now let’s analyze these data using SAS’s PROC LIFETEST. Using treatment
group as the stratum designator and the STRATA statement, SAS produces
precisely the calculations that Stata performs without the strata()
option. That is, the SAS command
PROC LIFETEST;TIME*C(0);STRATA AGEGRP;
produces the same results as Stata’s
. sts test treat, wilcoxon
When there is a separate stratification variable (such as agegrp in our
example), the test statistics cannot be calculated directly using PROC
LIFETEST. Because SAS is so widely used for clinical trial data management
and analysis, the approach taken by users to test the treatment effect is
often to specify the stratification variable using the STRATA command, and
then to test the treatment effect with the TEST command, hoping the
effect of ties is not critical. For our example, this procedure yields
Univariate Chi-Squares for the WILCOXON Test
Test Standard Pr >
Variable Statistic Deviation Chi-Square Chi-Square
TREAT 0.9409 1.1889 0.6263 0.4287
The significance probability of 0.4287 is substantially different from the
0.6726 that Stata reports and Collett computes.
The reason for the discrepancy
The difference in the reported results is not due to programming errors.
Rather, the difference hinges on how the stratum-specific estimates are
calculated and combined to produce the overall statistic.
In the absence of strata, the log-rank and Wilcoxon statistics for the effect
of an active treatment to a control are calculated as
U = Σ cj(dj- wj)
where dj is the number of failures at the jth time
point, wj is the expected number of failures in, say, the active treatment
group, and cj = 1 for the log-rank test, and
cj=nj, the number at risk of failure at the
jth time point, for the Wilcoxon–Gehan–Breslow test (Kalbfleisch
and Prentice 1980, 234; Collett 1994, 43). The test statistic U2/V, where
V is the variance of the U, is compared with the chi-squared distribution with
1 df.
When there is a separate stratification variable (such as study site in
multicenter randomized clinical trials or age group in our previous example),
tests based on pooling data across all strata can be misleading because of the
possibility of stratum-specific differences in survival unrelated to
treatment. These differences often reflect differences in patient
population, diagnostic methods, or case-mix across the study sites and would
generally affect all treatments to an equal degree.
The rank-based statistics described above are easily generalized to the
stratified case. As described in detail by Collett (1994, 47) or Klein and
Moeschberger (1997, 205), the statistics Us and Vs are
calculated within each stratum s, and the test statistic ((Σ
Us)2)/Σ Vs is then compared with the
chi-squared distribution with 1 df. This is what Stata does when the
strata() option is specified.
Using Collett’s data from the previous section, let’s examine these
calculations in detail.
Stratum-
Manual specific
Calculation STATA SAS’s LIFETEST Chi-squared
Age ------------ ----------- ---------------- -------------
Stratum U V U V U VT STATA SAS
--------------------------------------------------------------------
21-40 -3. 155.615 -3. 155.615 -0.1804 0.7667 0.0578 0.0424
41-60 5. 35.000 5. 35.000 0.4545 0.2975 0.7143 0.6944
61+ 4. 11.000 4. 11.000 0.6667 0.3492 1.4545 1.2728
--------------------------------------------------------------------
Total 6. 201.615 6. 201.615 0.9409 1.4134
Chi-squared 0.1786 0.1786 0.6263
p-value 0.6726 0.4287
_______________________________________________________________
Let’s examine how SAS computes the overall test statistic. The U statistic
and variance (VT) presented in the table for each stratum are those
reported by PROC LIFETEST when the TEST statement is used to analyze each
stratum independently. In other words, this is what SAS reports if we run
PROC LIFETEST with the "TEST treatment" statement (no STRATA statement) on
each stratum by itself. The overall chi-squared and corresponding
p-values are those obtained when both the STRATA and the TEST
statements are used. We can see that this statistic can be obtained by
summing the stratum-specific values and then calculating the overall test
statistic as [Σ(U)]2/Σ(VT). This is exactly what SAS does
internally. So, SAS and Stata are constructing the overall test in the same
way by summing the U statistics across strata, summing the variance, and then
constructing the test statistic. The difference lies in what SAS and Stata
are summing.
Within each stratum, the chi-squared value from SAS is slightly smaller than
that from Stata; this is due to the slightly larger variance produced by the
estimator used by SAS. Consequently, within each stratum, SAS produces a
result that is (very slightly) more supportive of the null hypothesis of no
difference. However, the overall chi-squared statistic from SAS is more
than three times as large as the Gehan–Wilcoxon stratified chi-squared
produced by Stata, and is much LESS supportive of the null hypothesis than
is the Stata result.
According to the SAS manual, the TEST statement is intended to be used for
testing the association between a list of “numeric covariates” and the failure
time. It performs better, in terms of power, when there are few or no ties
and when the variable to be tested is continuous. For a
categorical variable, such as a stratification variable and tied event
times, the test may produce undesirable results. The test statistic is
obtained as the average of the statistics one would have obtained by breaking
the ties in all possible ways. This averaging leads to inflated variance
estimates, resulting in a more conservative test. The documentation for PROC
LIFETEST suggests that the effect of tied event times on the results of the
TEST command is likely to be small: “Unless the proportion of ties is large,
it is unlikely that this will be a problem” (SAS, 1048). This effect can be
seen directly in the last two columns of the table above. Although this may
be true when only one stratification variable is present, the difference can
be substantial when two or more strata variables are used. In their book, Klein
and Moeschberger (1997, 200) recommend against doing this analysis using the
TEST statement.
Unfortunately, SAS’s TEST statement when used with STRATA need not
produce a conservative test, as the example above clearly demonstrates.
The problem with SAS is that it implicitly uses an algorithm that gives
inappropriate weights to the different strata. In the example above, the
youngest age group has 15 observations, the oldest only 5. The SAS
stratified test, however, gives only one-third the weight to the youngest
age group that it gives to the oldest! To a close approximation, the U
values obtained by SAS are equal to the U values obtained from Stata divided
by N+1, where N is the stratum size, and the V values can be obtained by
dividing the Stata variances by (N+1)2. While this factor
cancels out when calculating a chi-squared statistics within strata, it has
the effect of introducing stratum weights that penalize the information from
the largest strata.
We are confident that Stata’s strata() option in the sts test
command produces the desired overall test statistic for testing the
difference in treatment groups in the presence of stratification.
References
-
Collett, D. 1994.
- Modeling Survival Data in Medical Research. London:
Chapman & Hall.
- Kalbfleisch, J. D. and R. L. Prentice. 2002.
- The Statistical Analysis
of Failure Time Data. 2nd Edition. New York: Wiley.
- Klein, J. P. and M. L. L. Moeschberger. 1997.
-
Survival Analysis. Techniques for Censored and
Truncated Data. New York: Springer.
- SAS Institute Inc. 1989.
- SAS/STAT User’s Guide, Version 6, Volume 2.
4th ed. Cary, NC: SAS Institute.
|