
From: "Nick Cox" <[email protected]>
To: <[email protected]>
Subject: st: RE: Processing speed for ttest
Date: Thu, 9 Oct 2003 10:18:33 +0100

Wallace, John
> I've just finished writing my first .do file for a truly enormous data
> processing task. It's now running, and I'm underwhelmed at the pace
> it's going at. I'll describe the dataset, the task, and the .do file;
> please comment on my approach and whether there is a more efficient
> way to run it.
>
> I have a set of ~1100000 records, consisting of 3 supergroups of 6
> replicates. Each replicate has ~61000 analytes. Each analyte is tested
> across a pair of supergroups in an unpaired t-test, with 6 replicates.
>
> Incidentally, if I'd had my way, we'd be using a oneway anova with a
> bonferroni correction for significance, but the person requesting the
> analysis wanted t-tests. I'm not sure that this would improve the
> speed of the processing though (I imagine I'll find out later, since
> I'll eventually get my way with the analysis approach)
>
> I'm using the following variables
> analyte = member of ~61000 records (string)
> numanalyte = -encode-d analyte
> q = counter for the set of supergroups in the t-test
> i = counter for the t-test within the set of supergroups
> p`q' = title of variable in dataset for recording the calculated
>        p-value of the test
> numsgroup = -encode-d supergroup (1, 2, or 3)
> det = float number being tested
>
> .do-file:
>
> set more off
>
> encode(analyte), gen(numanalyte)
> sum numanalyte
> local min = r(min)
> local max = r(max)
>
> forvalues q = 1(1)3 {
>     display "ttest "`q'
>     g p`q' = .
>     forvalues i = `min'(1)`max' {
>         display `i'
>
>         if `q' == 1 {
>             quietly ttest det if numanalyte == `i' & numsgroup !=3, by(numsgroup) unpaired
>         }
>         else if `q' == 2 {
>             quietly ttest det if numanalyte == `i' & numsgroup !=2, by(numsgroup) unpaired
>         }
>         else {
>             quietly ttest det if numanalyte == `i' & numsgroup !=1, by(numsgroup) unpaired
>         }
>
>         capture replace p`q' = r(p) if numanalyte == `i'
>     }
> }
> set more on
> exit
> end
>
> I'm monitoring the progress of the analysis by -display-ing `q' and
> `i'. I'm getting a new `i' displayed about once every 3.6 seconds.
> This leads me to think the entire analysis is going to take a few
> days! I've got a Dell Xeon workstation with dual 1.4GHz processors and
> 0.5GB memory, and more than sufficient hard drive space. I've
> allocated 200M to Stata, and I'm running Stata8, fully updated(9/30).
>
> Incidentally, I pre-sorted the dataset by analyte and supergroup in
> the hope that "making them close together" would speed processing.
>
> 60 mins in, 600 tests done...it seems to be slowing down (uhoh)

David Airey has given several important pointers. The main issue, I
guess, is that you are looping over groups when this can be vectorised.
Also, a wide data structure may be preferable.

It is worth underlining that -if- can be very slow, as Michael Blasnik
has emphasised many times. There is no special logic whereby Stata goes
straight to the observations required and works with them. Rather it
blindly goes through every observation and tests whether the -if-
condition is satisfied. With a million observations looped over
repeatedly, this is not trivial, as you have observed. One remedy is to
recast the problem using -in-, but the solutions pointed out by David
are better in this case.

I add a few extra comments on what makes this slow.

First, the -display- you use to see how fast it's going itself slows
things down.
Second, -summarize-:

    sum numanalyte
    local min = r(min)
    local max = r(max)

If you only want min and max, use the -meanonly- option, which skips
the calculations you do not need. However, this is done only once and
is not the main issue.

Third, there is no gain in setting up an outer loop over `q' as you
throw away the saving by repeatedly testing within it for the value of
`q'. Explicit code should be faster.

    g p1 = .
    g p2 = .
    g p3 = .

    forvalues i = `min'(1)`max' {
        quietly ttest det if numanalyte == `i' & numsgroup !=3, by(numsgroup) unpaired
        capture replace p1 = r(p) if numanalyte == `i'
        quietly ttest det if numanalyte == `i' & numsgroup !=2, by(numsgroup) unpaired
        capture replace p2 = r(p) if numanalyte == `i'
        quietly ttest det if numanalyte == `i' & numsgroup !=1, by(numsgroup) unpaired
        capture replace p3 = r(p) if numanalyte == `i'
    }

I don't see that you need the -capture- there at all.

These savings will probably be much less than the other savings from
avoiding a loop over groups and -ttest- with -if-.

Nick
[email protected]
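
The remark about recasting the problem with -in- can be made concrete with a rough sketch. It is not code from the thread: the toy data, the helper variable grpsize and the while loop are assumptions made for illustration, and rnormal() assumes a more recent Stata than version 8. The idea is to sort once, so that each analyte occupies a contiguous block of observations, and then hand -ttest- only that block. The -unpaired- option is left out because -ttest- with by() is already the two-sample (unpaired) test.

```stata
* Toy data standing in for the real file (hypothetical values):
* 5 analytes x 3 supergroups x 6 replicates = 90 observations.
clear
set seed 2003
set obs 90
generate long numanalyte = ceil(_n/18)
generate byte numsgroup = mod(ceil(_n/6) - 1, 3) + 1
generate double det = rnormal()

* Sort once so each analyte is one contiguous block of observations.
sort numanalyte numsgroup
by numanalyte: generate long grpsize = _N

generate double p1 = .
local i = 1
while `i' <= _N {
    * Last observation of the block that starts at observation `i'.
    local j = `i' + grpsize[`i'] - 1
    * -if- is now evaluated only for observations `i' to `j'.
    quietly ttest det if numsgroup != 3 in `i'/`j', by(numsgroup)
    quietly replace p1 = r(p) in `i'/`j'
    local i = `j' + 1
}
```

Each pass touches only the 18 or so observations of one analyte rather than the whole dataset. The same loop body would be repeated for p2 and p3 with the other two exclusions. If some analytes lack enough data for a test, -ttest- will stop with an error, which is where a -capture- would earn its keep.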
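
One possible way to avoid looping over analytes altogether is sketched below. This is not necessarily what David suggested; it is a guess at what a vectorised, wide-structure approach might look like. The toy data and the names mean_, sd_, n_, v, t and p are assumptions for illustration, and rnormal() again assumes a more recent Stata than version 8. The sketch reduces the data to one row per analyte and then computes the pooled-variance two-sample t statistic directly, which is the test -ttest- with by() carries out by default.

```stata
* Toy data standing in for the real file (hypothetical values):
* 5 analytes x 3 supergroups x 6 replicates = 90 observations.
clear
set seed 2003
set obs 90
generate long numanalyte = ceil(_n/18)
generate byte numsgroup = mod(ceil(_n/6) - 1, 3) + 1
generate double det = rnormal()

* One row per analyte x supergroup: mean, SD and count of det.
collapse (mean) mean_=det (sd) sd_=det (count) n_=det, by(numanalyte numsgroup)

* One row per analyte: mean_1 sd_1 n_1 ... mean_3 sd_3 n_3.
reshape wide mean_ sd_ n_, i(numanalyte) j(numsgroup)

* p1: supergroups 1 vs 2, p2: 1 vs 3, p3: 2 vs 3 (the thread's numbering).
forvalues k = 1/3 {
    local a : word `k' of 1 1 2
    local b : word `k' of 2 3 3
    local df "(n_`a' + n_`b' - 2)"
    * Pooled variance, then the equal-variance t statistic.
    generate double v`k' = ((n_`a'-1)*sd_`a'^2 + (n_`b'-1)*sd_`b'^2) / `df'
    generate double t`k' = (mean_`a' - mean_`b') / sqrt(v`k'*(1/n_`a' + 1/n_`b'))
    * Two-sided p-value from the upper tail of Student's t.
    generate double p`k' = 2*ttail(`df', abs(t`k'))
}
list numanalyte p1 p2 p3
```

Here every p-value comes from a handful of -generate- statements that each pass over the collapsed data once, so there is no -if- inside a loop at all. The result has one row per analyte; it can be merged back on numanalyte if the p-values are wanted alongside the original records. It would be prudent to check a few analytes against -ttest- itself before trusting the whole run.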

