[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

From |
"Joseph Coveney" <jcoveney@bigplanet.com> |

To |
"Statalist" <statalist@hsphsun2.harvard.edu> |

Subject |
Re: implementation of boschloo's test: very slow execution |

Date |
Fri, 29 Feb 2008 22:17:31 +0900 |

Eva Poen wrote (excerpted): I am currently trying to implement the unconditional test for comparing two binomial proportions by Boschloo. The test is described e.g. here: http://www.west.asu.edu/rlberge1/papers/comp2x2.pdf (page 3). The problem with my implementation is that it takes a long time to execute: for n1=148 and n2=132, it takes roughly 30 minutes on a reasonably modern machine. I am using Stata 9.2 for this. I was wondering if anyone could suggest improvements to make it faster. . . . I can replicate an example I found in a Biometrics(2003) paper by Mehrotra et al., so I believe that the calculations are correct. Is there a way to speed things up inside the nested loop? I'd be greatful for any hints, as I have to run quite a lot of these tests; time does matter here. [excerpted from a follow-up the same day (last Friday)] It seems to be the case that the most time-consuming thing inside the nested loop is this product of binomial probability mass functions. Is there a way outside Mata to speed this bit up? -------------------------------------------------------------------------------- The approach below stays within Stata 9.2 and speeds things up without going to Mata. With it, execution time for Eva's 30-minute example is down to about 40 seconds on my machine. The speed-up results from re-formulating the problem: rather than having 10000-observation dataset in memory and cumulating PH0sum 15000 times over it, the approach below puts the roster of Fisher-test-qualified two-by-two tables in memory and then uses the Newton-Raphson method to locate the function's maximum with arbitrarily high precision. This requires only a relative handful of evaluations of the probability mass functions down a dataset. Michael Blasnik observed that the p-value-versus-theta function should be smooth. It is, and the approach below takes advantage of that. Michael also suggested that Eva avoid -tabi- by using -tabulate- and -post-, and the approach below exploits that tactic, too. (It turns out that with Eva's approach avoiding -tabi- didn't seem to make a noticeable difference--almost all of the time was, as Eva mentioned, spent evaluating PH0sum. In the approach below, however, the bulk of the shortened execution time is spent conducting Fisher-Irwin tests. There could very well be a shortcut around exhaustively evaluating the set of less extreme tables. That would shorten execution time even further.) Eva is using Stata 9.2, and so doesn't have the option of using Stata 10's new Mata optimizing functions. In light of that, the approach below makes unconventional use of -ml- as the maximizer. It returns Boschloo's p-value in the ereturn scalar for the log-likelihood, e(ll). (Maybe that's only fitting, given that Boschloo uses the Fisher-Irwin test's p-value as a test statistic.) I resorted to using -ml-, because -amoeba- and -quasi-, two user-written function-maximizing commands, couldn't reliably find the maximum. Even -ml- can experience trouble avoiding converging to local maxima with fewer than 50, or so, repeated searches in -ml search- beforehand. In order to reduce the risk of -ml- settling onto a local maximum, the default for -ml search-'s repeat option is set to 100. The user has a choice of using -ml search- or -ml plot- to find a starting value for theta. -ml search- with its repeat of 100 takes longer than -ml plot- (44 versus 33 seconds with Eva's example on my desktop PC) despite the latter's loading classes for the graph. Experience might warrant lowering or raising the hard-coded default of 100--a repeat of 50 worked for Eva's example, but for insurance I've doubled that for the default. A repeat of 10 doesn't work reliably: it resulted in -ml- converging on a local maximum the first time (start-up random number seed) and on the global maximum in a second run with the next-in-line random number seed. The -ml search- bounds and -ml plot- bounds are both set for a theta from 0.01 to 0.5 by default, although these are both options for the user. (At least for Eva's example, a plot throughout the entire range of theta is symmetric about theta = 0.5, so that the supremum is present twice in the set. I don't know whether that's true in general, and so the bounds are left as options. Lower bound, upper bound, or both may be set for both -ml search- and -ml plot- options.) There are two ado-files below, -boschloo.ado- and -boschloo_ll.ado-. They must be saved separately where Stata can find them (-help adopath-); the former calls the latter. There is also an example do-file that runs Eva's example* with -ml search- (default) and -ml plot- options, as well as illustrates user-set bounds. -if-, -in-, and -nolog- options are available, but aren't illustrated. None of the -ml- suite's other conventional options are available. I'm not looking forward to finding out what StataCorp thinks of -ml- being abused as a general function maximizer in this fashion, and I've not done any certification of the code below with a variety of test cases, so it won't be going to SSC. Joseph Coveney * Mehrotra, D. V., Chan, I. S. F. and Berger, R. L. (2003). A cautionary note on exact unconditional inference for a difference between two independent binomial proportions. _Biometrics_ 59:441--50. [Available from www.west.asu.edu/rlberge1/papers/biometrics03.pdf ] For some practical perspective, see also www.iancampbell.co.uk/twobytwo/twobytwo.htm and the preprint downloadable from there. ------------------------------boschloo.ado-------------------------------------- program define boschloo, rclass version 9.2 syntax varlist(min = 2 max = 2) [if] [in] [fweight] /// [, Ubound(real 0.01) Lbound(real 0.5) MLPlot noLOG] tempfile tmpfil0 tempname memhold observed_fisher tempvar freq * preserve marksample touse quietly keep if `touse' * foreach var of local varlist { quietly inspect `var' if (r(N_unique) != 2) { display in smcl as error "Number of groups must equal two" exit = 99 } } if ("`weight'" != "") local weight [`weight' `exp'] contract `varlist' `weight', freq(`freq') zero nomiss quietly tabulate `varlist' [fweight = `freq'], exact scalar define `observed_fisher' = r(p_exact) * local groups : word 1 of `varlist' quietly levelsof `groups', local(rows) local i 1 foreach row of local rows { summarize `groups' if `groups' == `row' /// [fweight = `freq'], meanonly local count`i++' = r(N) } * postfile `memhold' long (counti countj) using `tmpfil0' quietly forvalues counti = 0/`count1' { replace `freq' = `counti' in 1 replace `freq' = `count1' - `counti' in 2 forvalues countj = 0/`count2' { replace `freq' = `countj' in 3 replace `freq' = `count2' - `countj' in 4 tabulate `varlist' [fweight = `freq'], exact if (r(p_exact) <= `observed_fisher') post /// `memhold' (`counti') (`countj') } } postclose `memhold' quietly use `tmpfil0', clear assert !missing(counti) // Remove after gaining experience assert !missing(countj) generate long counti1 = counti + 1 generate long countj1 = countj + 1 char define counti[total] `count1' char define countj[total] `count2' * if ("`mlplot'" != "") { ml model lf boschloo_ll (theta:), missing ml plot /theta `lbound' `ubound' ml maximize , nooutput `log' } else ml model lf boschloo_ll (theta:), maximize missing /// search(quietly) repeat(100) bounds(`lbound' `ubound') `log' display in smcl as text _newline(1) "Boschoo's p: " /// as result %06.4f e(ll) return scalar p_boschloo = e(ll) ereturn clear restore end -------------------------------------------------------------------------------- -----------------------------boschloo_ll.ado------------------------------------ program define boschloo_ll args lf theta local totali : char counti[total] local totalj : char countj[total] quietly replace `lf' = (Binomial(`totali', counti, `theta') - /// Binomial(`totali', counti1, `theta')) * /// (Binomial(`totalj', countj, `theta') - /// Binomial(`totalj', countj1, `theta')) end -------------------------------------------------------------------------------- --------------------------------example.do-------------------------------------- clear set more off input byte group byte count1 int total 1 8 148 2 1 132 end generate int count0 = total - count1 drop total quietly reshape long count, i(group) j(rash) * boschloo group rash [fweight = count] boschloo group rash [fweight = count], mlplot lbound(0.1) ubound(0.9) return list exit -------------------------------------------------------------------------------- * * For searches and help try: * http://www.stata.com/support/faqs/res/findit.html * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

- Prev by Date:
**Re: st: Some annoying problems with my stata 10** - Next by Date:
**st: -bs- with the xi: prefix and -gologit2-** - Previous by thread:
**Re: implementation of boschloo's test: very slow execution** - Next by thread:
**Re: implementation of boschloo's test: very slow execution** - Index(es):

© Copyright 1996–2014 StataCorp LP | Terms of use | Privacy | Contact us | What's new | Site index |