
re: Re: st: Extremely poor performance in repeated ANOVA


From   David Airey <david.airey@vanderbilt.edu>
To   statalist@hsphsun2.harvard.edu
Subject   re: Re: st: Extremely poor performance in repeated ANOVA
Date   Wed, 4 Feb 2004 10:14:11 -0600

Michael Ingre replied:

Ken Higbee <khigbee@stata.com>:

> I created a dataset based on the information you provided. I ran
> your -anova- on my 2.4 GHz computer running Linux. It finished
> in just under a minute. I do not know what SPSS and StatView are
> doing and so cannot fully explain the differences in timing.

I need to correct my timing a bit. My PowerBook (apparently) did not feel
very well yesterday. I have run it three times this morning, each run taking
3 minutes 29-32 seconds, on an 800 MHz iMac G4. That is still, however,
100 times slower than SPSS.

On my computer, a 1.25 GHz PowerBook, the timing for this problem with Michael Ingre's data set was:

r; t=119.92 9:02:08

Most of this was due to the epsilon correction calculations. The uncorrected ANOVA table was completed in less than 30 seconds (probably ~ 20 s).

Data Desk 6.2 calculated the ANOVA table (using GLM) in less than 3 seconds:

Design:

Source   F/R  max df   EMS          F-Denom
Const    -         1   sbt+Const    sbt
sbt      R        16   sbt          Error
day      F         2   sbt*day+day  sbt*day
sbt*day  M        32   sbt*day      Error
tim      F        19   sbt*tim+tim  sbt*tim
sbt*tim  M       304   sbt*tim      Error
day*tim  F        38   day*tim      Error
Error    R       608
Total           1019

ANOVA:

Source     df        SS        MS        F        P
Const       1   17708.3   17708.3   209.28   0.0001
sbt        16   1353.83   84.6146   69.225   0.0001
day         2   80.6608   40.3304   5.1077   0.0119
sbt*day    32   252.673   7.89602   6.4599   0.0001
tim        19   113.157   5.95562   2.5548   0.0005
sbt*tim   304   708.676   2.33117   1.9072   0.0001
day*tim    38   53.4961   1.40779   1.1517   0.2487
Error     608   743.171   1.22232
Total    1019   3305.67
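
As a cross-check on the two tables above, the mean squares and F ratios can be recomputed from the printed df and SS values, using the F denominators listed in the Design table. The sketch below is in Python rather than Stata or Data Desk, and the variable names are illustrative only. It assumes a fully balanced layout of 17 subjects by 3 days by 20 times, which is what the degrees of freedom imply. Because the printed SS values are rounded, the results agree with the table only to about three or four significant digits.

```python
# Values copied from the Data Desk ANOVA table: source -> (df, SS).
table = {
    "sbt": (16, 1353.83),
    "day": (2, 80.6608),
    "sbt*day": (32, 252.673),
    "tim": (19, 113.157),
    "sbt*tim": (304, 708.676),
    "day*tim": (38, 53.4961),
    "Error": (608, 743.171),
}

# F denominators copied from the F-Denom column of the Design table.
denominator = {
    "sbt": "Error",
    "day": "sbt*day",
    "sbt*day": "Error",
    "tim": "sbt*tim",
    "sbt*tim": "Error",
    "day*tim": "Error",
}

# MS = SS / df, and F = MS(term) / MS(denominator term).
mean_square = {src: ss / df for src, (df, ss) in table.items()}
f_ratio = {src: mean_square[src] / mean_square[den]
           for src, den in denominator.items()}

# Degrees of freedom implied by a balanced design (an assumption here).
subjects, days, times = 17, 3, 20
expected_df = {
    "sbt": subjects - 1,
    "day": days - 1,
    "sbt*day": (subjects - 1) * (days - 1),
    "tim": times - 1,
    "sbt*tim": (subjects - 1) * (times - 1),
    "day*tim": (days - 1) * (times - 1),
    "Error": (subjects - 1) * (days - 1) * (times - 1),
}
total_df = subjects * days * times - 1

for src in denominator:
    print(f"{src:8s} MS = {mean_square[src]:9.5f}  F = {f_ratio[src]:8.4f}")
```

Note that day and tim are tested against their interactions with subject, not against the residual Error term, which is why their F ratios are much smaller than a naive MS/MS(Error) calculation would give.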

Data Desk does not require the data to be balanced for univariate repeated measures ANOVA; a subject is not dropped completely because one repeated observation is missing. On the other hand, Data Desk offers no epsilon corrections. Data Desk can analyze repeated measures designs using MANOVA, but only in a limited way, unlike Stata. According to the manual, Data Desk could not, for example, compute Ingre's problem using MANOVA. Stata can.



> When everything is balanced there may be faster ways of getting
> to the same answer. But, Stata's -anova-, using the sweep
> operator, is able to handle designs that are not balanced
> (including having missing cells) and that may have other
> collinearities (from continuous variables included in the model).
> In those cases, the faster ways of getting to the answer may not
> hold.

Yes. That's it. Thank you, Ken, for making that point. SPSS and StatView only
accept cases with complete data on all measurements. In this area Stata
outperforms the competition.
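
For readers unfamiliar with the sweep operator mentioned in the quoted paragraph, here is a minimal Python sketch of the textbook version applied to ordinary least squares. It is not Stata's implementation, and the function names and the fixed absolute tolerance are assumptions made for illustration. The cross-product matrix of [X y] is swept on each column of X in turn; a pivot that is numerically zero (a collinear column, such as the one an empty cell produces) is simply skipped, which is the property that lets the method cope with unbalanced designs.

```python
def sweep(a, k, tol=1e-12):
    """Sweep square matrix `a` (list of lists) in place on pivot k.

    Returns False and leaves `a` untouched if the pivot is numerically
    zero, as happens for a column that is collinear with earlier ones.
    The absolute tolerance is an illustrative choice.
    """
    d = a[k][k]
    if abs(d) < tol:
        return False
    n = len(a)
    a[k] = [v / d for v in a[k]]          # scale the pivot row
    for i in range(n):
        if i == k:
            continue
        b = a[i][k]
        a[i] = [a[i][j] - b * a[k][j] for j in range(n)]
        a[i][k] = -b / d                  # pivot column entry
    a[k][k] = 1.0 / d
    return True


def least_squares_by_sweep(x, y):
    """Fit y on the columns of x by sweeping the cross-product matrix.

    Returns (coefficients, residual sum of squares, swept column indices).
    Columns whose pivot was skipped get a coefficient of 0.0.
    """
    p = len(x[0])
    cols = [[row[j] for row in x] for j in range(p)] + [list(y)]
    a = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    swept = [k for k in range(p) if sweep(a, k)]
    beta = [a[k][p] if k in swept else 0.0 for k in range(p)]
    return beta, a[p][p], swept


# Intercept plus one predictor.
beta, rss, swept = least_squares_by_sweep([[1, 0], [1, 1], [1, 2]], [1, 3, 4])

# Same model with the predictor duplicated: the third pivot is zero and skipped.
beta_c, rss_c, swept_c = least_squares_by_sweep(
    [[1, 0, 0], [1, 1, 1], [1, 2, 2]], [1, 3, 4])
```

In the second call the duplicated column is dropped without changing the fit, so the residual sum of squares is the same as in the first call.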

The ability to analyze unbalanced designs with missing cells is intriguing
and I can think of many situations where it could be useful. Special care
must be taken, though, when there is a lot of missing data or when the pattern
of missing data is systematic.

Given the enormous speed improvement with (presumably) the alternative way
of calculating ANOVAs, an alternative procedure for anova (for complete-case
data) is high up on my wish list. And I guess also on David Airey's
(did your anova finish at all?) and others who do experimental research.

No, the ANOVA did not finish. Or rather, the epsilon corrections never finished. My conclusion was that I should use a different approach altogether, for two reasons. One is that the ANOVA I discussed online previously was actually a smaller test version of the one I really need to run. It turns out that the design matrix limits are too small in Stata SE. My inadequate understanding is that both Proc Mixed and R LME use alternative ways of representing matrices during internal calculations, and are able to compute problems of the size I am interested in. The second reason is that both Proc Mixed and LME allow different covariance structures to be modeled, which is more realistic for repeated measures situations.



> David Airey <david.airey@vanderbilt.edu> mentioned several
> alternatives for repeated measures data including Stata's
> -manova- command that was introduced in Stata 8. I personally
> like MANOVA over repeated measures ANOVA. (But there are some
> cases where the MANOVA cannot be done -- too many y variables
> compared to the number of observations -- where the repeated
> measures ANOVA can still be computed.)

MANOVA is an interesting alternative in many situations. I will consider it
when appropriate. If I'm not mistaken though, the present analysis would not
run in MANOVA because it would mean 3*20 dependent variables and only 17
subjects. This is also typical for many of our experiments (and some of our
field studies) so ANOVA would still be our main approach.

Ouch. Are there no alternatives in any of the xt- commands in Stata? This is usually where I get frustrated with what I don't know: the (in?)applicability of the xt commands to experimental repeated measures data typically analyzed by ANOVA/MANOVA or mixed modeling.
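
The arithmetic behind the MANOVA objection can be spelled out in a few lines of Python. This is a rough sketch under one stated assumption: a single group of subjects, so the error degrees of freedom equal the number of subjects minus one. The error cross-product matrix then has rank at most that number, so it cannot be inverted when there are more dependent variables than error degrees of freedom. The variable names are illustrative.

```python
subjects, days, times = 17, 3, 20

# One dependent variable per day-by-time combination.
dependent_variables = days * times

# Assumption: one group of subjects, so error df = subjects - 1.
error_df = subjects - 1

# The error cross-product matrix is dependent_variables square but has
# rank at most error_df, so MANOVA needs error_df >= dependent_variables.
manova_feasible = error_df >= dependent_variables

print(dependent_variables, error_df, manova_feasible)
```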



David Airey <david.airey@vanderbilt.edu>:

> As for me, the more I use Stata, the more I like it, but the more I
> mess around with statistics, the more tools I wind up exploring (Data
> Desk, Stata, and R, so far).

Agreed. Stata is really growing on me. And this is of course part of my
problem. I want Stata to be able to do it all ... I don't want to spend time
in too many programs, but I have realized that there are limits even to Stata.
Currently, though, I have my hands full with learning Stata and LISREL. And
soon I will take a course in GLLAMM.

> For biologists using statistics, the main weaknesses of Stata are
> currently a lack of a routine like SAS Mixed or R LME/NLME

Mixed modeling is an area that I'm very interested in. I have no practical
experience with it, but from what I've read it is the answer to many of my
problems. And that's why I will take some time to learn GLLAMM, which, as I
understand it, is the closest to Proc Mixed you can get in Stata.

Yes, but another list member has repeatedly stated that GLLAMM has limited ability for modeling the covariance structure. When you take that course, report back!



> Please send me the data set if it's not private, and I will run on my
> Powerbook to compare times. I'm curious about this. I have a 1.25 GHz
> Powerbook.

Check your mail.

Finally, many thanks to Ken Higbee and David Airey for your time and knowledge.
