Statalist


[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

st: re: statsby slowness


From   David Airey <david.airey@Vanderbilt.Edu>
To   statalist@hsphsun2.harvard.edu
Subject   st: re: statsby slowness
Date   Sat, 18 Aug 2007 23:24:18 -0500

.

Just to follow up, I had complained about the slowness of statsby when the number of "by groups" is large, as in gene microarray data. In my case the number of by groups was about 22,000 and the rows about 300,000. I wanted to run a spearman correlation, and return the results to a file using statsby. I chose to break the file into smaller files of 500 "by groups" each. I then found statsby was performing at a speed that I could tolerate. That speed was 11 minutes for the main loop of about 22,000 tests, (2000 test every minute or .03 seconds per test). Since statsby is a very convenient command, this seems a reasonable approach to recommend to others.

Cheers,

-Dave


// Break file into subfiles of 500 "by groups" (genes),
// because Stata "statsby" command is slow across 22K by groups,
// then calculate and save out spearman correlation stats
// for iso_VSV and gene expression and append into a file 22K rows long
// and apply FDR

set more off
set rmsg on
clear
set memory 200M

// sabatini_merged is a long format file
// merging expression data and editing data
use "/Users/dairey/Desktop/primate_editing/sabatini_merged.dta"

// use a forvalues loop to make smaller files and do statsby
preserve
local start = 1
forvalues stop = 6000(6000)267396 {
keep if _n >= `start' & _n <= `stop'
save sabatini_`stop'.dta, replace
statsby n=r(N) spearman=r(rho) p=r(p), by(ssrownum): ///
spearman iso_VSV expression
save spearman_`stop'.dta, replace
local start = `stop'+1
restore, preserve
}

// above, the forvalues misses data after row 264000 so
// those are picked up here
clear
use "/Users/dairey/Desktop/primate_editing/sabatini_merged.dta"
keep if _n > 264000 & _n <= 267396
save sabatini_267396.dta, replace
statsby n=r(N) spearman=r(rho) p=r(p), by(ssrownum): ///
spearman iso_VSV expression
save spearman_267396.dta, replace

// append the statsby results into one file
use spearman_6000.dta
forvalues stop = 12000(6000)267396 {
append using spearman_`stop'.dta
erase spearman_`stop'.dta
}
append using spearman_267396.dta
erase spearman_267396.dta
save iso_VSV.dta, replace
erase spearman_6000.dta

// see if there is significance after FDR
multproc, pvalue(p) method(simes) puncor(0.05) rank(rank)

// keep top 100 and bring in gene information against ssrownum key id
sort rank
keep if rank < 101
sort ssrownum
merge ssrownum using "/Users/dairey/Desktop/primate_editing/ sabatini_genes.dta"
keep if _merge == 3

// list gene information
sort rank
list gene_title
set rmsg off
set more on



--
David C. Airey, Ph.D.
Research Assistant Professor

*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/




© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index