Statalist


[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

st: re: statsby slowness


From   David Airey <[email protected]>
To   [email protected]
Subject   st: re: statsby slowness
Date   Mon, 20 Aug 2007 09:13:29 -0500

.

These improvements sped up obtaining my 22,000 spearman correlations to 45 seconds, from 50 something. Very nice! Thank you. I had forgotten to look for all "in" versus "if" substitutions I could make.

-Dave


Interesting. You may get a bit more speed if
you replace this

egen rank_1 = rank(expression), by(ssrownum)
egen rank_2 = rank(iso_VSV), by(ssrownum)
egen corr = corr(rank_1 rank_2), by(ssrownum)

by this:

sort ssrownum
by ssrowsum : egen rank_1 = rank(expression)
by ssrowsum : egen rank_2 = rank(iso_VSV)
by ssrowsum : egen corr = corr(rank_1 rank_2)

The two code segments are equivalent in what
you end with, but not in when they -sort-.

SImilarly

keep if _n >= `start' & _n <= `stop'

should be faster as

keep in `start'/`stop'

and I would always use the built-in -sqrt()-
when it applies, rather than powering to 0.5.

Nick
[email protected]
// Break file into subfiles of 500 "by groups" (genes),
// because Stata "statsby" command is so slow across 22K by groups,
// then calculate and save out spearman correlation stats
// for iso_VSV and gene expression and append into a file 22K rows long
// and apply FDR

// This version comments out statsby and uses egen rank and egen corr
// and consequently is 10 times fast, doing 22K spearman correlations
// in about 50 seconds (along with file creation, etc.). Thanks Nick!

set more off
set rmsg on
clear
set memory 200M

// sabatini_merged is a long format file
// merging expression data and editing data
use "/Users/dairey/Desktop/primate_editing/sabatini_merged.dta"

preserve
local start = 1
forvalues stop = 6000(6000)267396 {
keep in `start'/`stop'
save sabatini_`stop'.dta, replace
sort ssrownum
by ssrownum : egen rank_1 = rank(expression)
by ssrownum : egen rank_2 = rank(iso_VSV)
by ssrownum : egen corr = corr(rank_1 rank_2)
collapse (mean) spearman=corr (count) count=corr, by(ssrownum)
gen t = spearman*sqrt((count-2)/(1-spearman^2))
gen p = 2*ttail(count-2,abs(t))
save spearman_`stop'.dta, replace
local start = `stop'+1
restore, preserve
}

clear
use "/Users/dairey/Desktop/primate_editing/sabatini_merged.dta"
keep if _n > 264000 & _n <= 267396
save sabatini_267396.dta, replace
sort ssrownum
by ssrownum : egen rank_1 = rank(expression)
by ssrownum : egen rank_2 = rank(iso_VSV)
by ssrownum : egen corr = corr(rank_1 rank_2)
collapse (mean) spearman=corr (count) count=corr, by(ssrownum)
gen t = spearman*sqrt((count-2)/(1-spearman^2))
gen p = 2*ttail(count-2,abs(t))
save spearman_267396.dta, replace


// append the statsby results into one file
use spearman_6000.dta
forvalues stop = 12000(6000)267396 {
append using spearman_`stop'.dta
erase spearman_`stop'.dta
}
append using spearman_267396.dta
erase spearman_267396.dta
save iso_VSV.dta, replace
erase spearman_6000.dta

// see if there is significance after FDR
multproc, pvalue(p) method(simes) puncor(0.05) rank(rank)

// keep top 100 and bring in gene information against ssrownum key id
sort rank
keep if rank < 201
sort ssrownum
merge ssrownum using "/Users/dairey/Desktop/primate_editing/ sabatini_genes.dta"
keep if _merge == 3

// list gene information
sort rank
list gene_symbol
set rmsg off
set more on

/*

// use a forvalues loop to make smaller files and do statsby
preserve
local start = 1
forvalues stop = 6000(6000)267396 {
keep if _n >= `start' & _n <= `stop'
save sabatini_`stop'.dta, replace
statsby n=r(N) spearman=r(rho) p=r(p), by(ssrownum): ///
spearman iso_VSV expression
save spearman_`stop'.dta, replace
local start = `stop'+1
restore, preserve
}

// above, the forvalues misses data after row 264000 so
// those are picked up here
clear
use "/Users/dairey/Desktop/primate_editing/sabatini_merged.dta"
keep if _n > 264000 & _n <= 267396
save sabatini_267396.dta, replace
statsby n=r(N) spearman=r(rho) p=r(p), by(ssrownum): ///
spearman iso_VSV expression
save spearman_267396.dta, replace

*/

--
David C. Airey, Ph.D.
Research Assistant Professor

*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/




© Copyright 1996–2024 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index