Title | Stratified generalized Wilcoxon test | |
Author | Ronald Thisted, Departments of Statistics, Health Studies, and Anesthesia & Critical Care, The University of Chicago |
This FAQ details the reason we believe Stata and SAS differ in the results that they report for the stratified Wilcoxon test. The context is time-to-event data from a stratified analysis such as a randomized multicenter clinical trial comparing an active treatment with a control, stratifying by (controlling for) site-specific differences in underlying event-time distribution. Multicenter randomized clinical trials comparing treatments often assess the treatment effect using the log-rank or a generalized Wilcoxon test, stratified by study site.
Although SAS does not have a procedure for calculating the standard stratified log-rank or Wilcoxon tests, PROC LIFETEST can be used to construct an approximate statistical test of treatment effect in the presence of stratification.
Stata, on the other hand, does provide the option strata() to its sts test command, which allows for the analysis of stratified data. It combines the stratum-specific estimates according to the procedure described for the standard stratified test by Collett (1994, section 2.7) and by Klein and Moeschberger (1997, section 7.5).
Using a small published clinical trial dataset, we demonstrate the differences obtained by SAS and Stata and show that the use of PROC LIFETEST produces a different statistical significance of the treatment effect compared with the standard stratified test. The data, given by Lee (1992) and quoted by Collett (1994, p. 47, table 2.8), are reproduced below:
_____________________________________________________ 21 - 40 40 - 60 61+ _______________ _______________ _______________ BCG c. parvum BCG c. parvum BCG c. parvum _____________________________________________________ 19 27* 34* 8 10 25* 24* 21* 4 11* 5 8 8 18* 17* 23* 11* 17* 16* 12* 17* 7 15* 34* 12* 8* 24 8* 8 8* _____________________________________________________ An Asterisk (*) indicates a censored observation.
These data have been entered into Stata:
. use http://www.stata.com/support/faqs/dta/parvum . list agegrp time c if treat==0, sepby(agegrp) +-------------------+ | agegrp time c | |-------------------| 1. | 1 19 1 | 2. | 1 24 0 | 3. | 1 8 1 | 4. | 1 17 0 | 5. | 1 17 0 | 6. | 1 34 0 | |-------------------| 16. | 2 34 0 | 17. | 2 4 1 | 18. | 2 17 0 | |-------------------| 26. | 3 10 1 | 27. | 3 5 1 | +-------------------+ . list agegrp time c if treat==1, sepby(agegrp) +-------------------+ | agegrp time c | |-------------------| 7. | 1 27 0 | 8. | 1 21 0 | 9. | 1 18 0 | 10. | 1 16 0 | 11. | 1 7 1 | 12. | 1 12 0 | 13. | 1 24 1 | 14. | 1 8 1 | 15. | 1 8 0 | |-------------------| 19. | 2 8 1 | 20. | 2 11 0 | 21. | 2 23 0 | 22. | 2 12 0 | 23. | 2 15 0 | 24. | 2 8 0 | 25. | 2 8 0 | |-------------------| 28. | 3 25 0 | 29. | 3 8 1 | 30. | 3 11 0 | +-------------------+
The aim of the clinical trial was to compare the ability of two immunotherapies (BCG and Cryptosporidium parvum) to prolong the life of malignant melanoma patients. Thirty patients were randomized to one of the two treatment groups and then further classified into three age groups. This dataset is instructive because (a) it is small enough that the calculations can be replicated by hand, (b) it includes a small number of tied event times, and (c) the intermediate calculations are documented in Collett’s book.
After st setting the data, the command in Stata for comparing via the generalized Wilcoxon test the survival experience of the two treatment groups without stratification is sts test treat, wilcoxon, and correcting (stratifying) for age is: sts test treat, wilcoxon strata(agegrp) where treat is an indicator variable of treatment type and agegrp is the stratification variable. The results of these two commands are
. stset time, failure(c) . sts test treat, wilcoxon failure _d: c analysis time _t: time Wilcoxon (Breslow) test for equality of survivor functions | Events Events Sum of treat | observed expected ranks ------+-------------------------------------- 0 | 5 3.71 34 1 | 5 6.29 -34 ------+-------------------------------------- Total | 10 10.00 0 chi2(1) = 0.91 Pr>chi2 = 0.3397 . sts test treat, wilcoxon strata(agegrp) failure _d: c analysis time _t: time Stratified Wilcoxon (Breslow) test for equality of survivor functions | Events Events Sum of treat | observed expected(*) ranks(*) ------+-------------------------------------- 0 | 5 3.76 6 1 | 5 6.24 -6 ------+-------------------------------------- Total | 10 10.00 0 (*) sum over calculations within agegrp chi2(1) = 0.18 Pr>chi2 = 0.6726
These results match those reported by Collett (1994, 47–49).
Now let’s analyze these data using SAS’s PROC LIFETEST. Using treatment group as the stratum designator and the STRATA statement, SAS produces precisely the calculations that Stata performs without the strata() option. That is, the SAS command
PROC LIFETEST;TIME*C(0);STRATA AGEGRP;
produces the same results as Stata’s
. sts test treat, wilcoxon
When there is a separate stratification variable (such as agegrp in our example), the test statistics cannot be calculated directly using PROC LIFETEST. Because SAS is so widely used for clinical trial data management and analysis, the approach taken by users to test the treatment effect is often to specify the stratification variable using the STRATA command, and then to test the treatment effect with the TEST command, hoping the effect of ties is not critical. For our example, this procedure yields
Univariate Chi-Squares for the WILCOXON Test Test Standard Pr > Variable Statistic Deviation Chi-Square Chi-Square TREAT 0.9409 1.1889 0.6263 0.4287
The significance probability of 0.4287 is substantially different from the 0.6726 that Stata reports and Collett computes.
The difference in the reported results is not due to programming errors. Rather, the difference hinges on how the stratum-specific estimates are calculated and combined to produce the overall statistic.
In the absence of strata, the log-rank and Wilcoxon statistics for the effect of an active treatment to a control are calculated as
U = Σ c_{j}(d_{j}- w_{j})
where d_{j} is the number of failures at the j_{th} time point, w_{j} is the expected number of failures in, say, the active treatment group, and c_{j} = 1 for the log-rank test, and c_{j}=n_{j}, the number at risk of failure at the j_{th} time point, for the Wilcoxon–Gehan–Breslow test (Kalbfleisch and Prentice 1980, 234; Collett 1994, 43). The test statistic U^{2}/V, where V is the variance of the U, is compared with the chi-squared distribution with 1 df.
When there is a separate stratification variable (such as study site in multicenter randomized clinical trials or age group in our previous example), tests based on pooling data across all strata can be misleading because of the possibility of stratum-specific differences in survival unrelated to treatment. These differences often reflect differences in patient population, diagnostic methods, or case-mix across the study sites and would generally affect all treatments to an equal degree.
The rank-based statistics described above are easily generalized to the stratified case. As described in detail by Collett (1994, 47) or Klein and Moeschberger (1997, 205), the statistics U_{s} and V_{s} are calculated within each stratum s, and the test statistic ((Σ U_{s})^{2})/Σ V_{s} is then compared with the chi-squared distribution with 1 df. This is what Stata does when the strata() option is specified.
Using Collett’s data from the previous section, let’s examine these calculations in detail.
Stratum- Manual specific Calculation STATA SAS’s LIFETEST Chi-squared Age ------------ ----------- ---------------- ------------- Stratum U V U V U VT STATA SAS -------------------------------------------------------------------- 21-40 -3. 155.615 -3. 155.615 -0.1804 0.7667 0.0578 0.0424 41-60 5. 35.000 5. 35.000 0.4545 0.2975 0.7143 0.6944 61+ 4. 11.000 4. 11.000 0.6667 0.3492 1.4545 1.2728 -------------------------------------------------------------------- Total 6. 201.615 6. 201.615 0.9409 1.4134 Chi-squared 0.1786 0.1786 0.6263 p-value 0.6726 0.4287 _______________________________________________________________
Let’s examine how SAS computes the overall test statistic. The U statistic and variance (VT) presented in the table for each stratum are those reported by PROC LIFETEST when the TEST statement is used to analyze each stratum independently. In other words, this is what SAS reports if we run PROC LIFETEST with the "TEST treatment" statement (no STRATA statement) on each stratum by itself. The overall chi-squared and corresponding p-values are those obtained when both the STRATA and the TEST statements are used. We can see that this statistic can be obtained by summing the stratum-specific values and then calculating the overall test statistic as [Σ(U)]^{2}/Σ(VT). This is exactly what SAS does internally. So, SAS and Stata are constructing the overall test in the same way by summing the U statistics across strata, summing the variance, and then constructing the test statistic. The difference lies in what SAS and Stata are summing.
Within each stratum, the chi-squared value from SAS is slightly smaller than that from Stata; this is due to the slightly larger variance produced by the estimator used by SAS. Consequently, within each stratum, SAS produces a result that is (very slightly) more supportive of the null hypothesis of no difference. However, the overall chi-squared statistic from SAS is more than three times as large as the Gehan–Wilcoxon stratified chi-squared produced by Stata, and is much LESS supportive of the null hypothesis than is the Stata result.
According to the SAS manual, the TEST statement is intended to be used for testing the association between a list of “numeric covariates” and the failure time. It performs better, in terms of power, when there are few or no ties and when the variable to be tested is continuous. For a categorical variable, such as a stratification variable and tied event times, the test may produce undesirable results. The test statistic is obtained as the average of the statistics one would have obtained by breaking the ties in all possible ways. This averaging leads to inflated variance estimates, resulting in a more conservative test. The documentation for PROC LIFETEST suggests that the effect of tied event times on the results of the TEST command is likely to be small: “Unless the proportion of ties is large, it is unlikely that this will be a problem” (SAS, 1048). This effect can be seen directly in the last two columns of the table above. Although this may be true when only one stratification variable is present, the difference can be substantial when two or more strata variables are used. In their book, Klein and Moeschberger (1997, 200) recommend against doing this analysis using the TEST statement.
Unfortunately, SAS’s TEST statement when used with STRATA need not produce a conservative test, as the example above clearly demonstrates. The problem with SAS is that it implicitly uses an algorithm that gives inappropriate weights to the different strata. In the example above, the youngest age group has 15 observations, the oldest only 5. The SAS stratified test, however, gives only one-third the weight to the youngest age group that it gives to the oldest! To a close approximation, the U values obtained by SAS are equal to the U values obtained from Stata divided by N+1, where N is the stratum size, and the V values can be obtained by dividing the Stata variances by (N+1)^{2}. While this factor cancels out when calculating a chi-squared statistics within strata, it has the effect of introducing stratum weights that penalize the information from the largest strata.
We are confident that Stata’s strata() option in the sts test command produces the desired overall test statistic for testing the difference in treatment groups in the presence of stratification.