[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]
Re: implementation of boschloo's test: very slow execution

From	"Joseph Coveney" <[email protected]>
To	"Statalist" <[email protected]>
Subject	Re: implementation of boschloo's test: very slow execution
Date	Fri, 29 Feb 2008 22:17:31 +0900
Eva Poen wrote (excerpted):

I am currently trying to implement the unconditional test for
comparing two binomial proportions by Boschloo. The test is described
e.g. here: http://www.west.asu.edu/rlberge1/papers/comp2x2.pdf (page
3).

The problem with my implementation is that it takes a long time to
execute: for n1=148 and n2=132, it takes roughly 30 minutes on a
reasonably modern machine. I am using Stata 9.2 for this. I was
wondering if anyone could suggest improvements to make it faster.

. . .

I can replicate an example I found in a Biometrics(2003) paper by
Mehrotra et al., so I believe that the calculations are correct. Is
there a way to speed things up inside the nested loop? I'd be greatful
for any hints, as I have to run quite a lot of these tests; time does
matter here.

[excerpted from a follow-up the same day (last Friday)]

It seems to be the case that the most time-consuming thing inside the
nested loop is this product of binomial probability mass functions. Is
there a way outside Mata to speed this bit up?

--------------------------------------------------------------------------------

The approach below stays within Stata 9.2 and speeds things up without going
to Mata.  With it, execution time for Eva's 30-minute example is down to
about 40 seconds on my machine.

The speed-up results from re-formulating the problem:  rather than having
10000-observation dataset in memory and cumulating PH0sum 15000 times over
it, the approach below puts the roster of Fisher-test-qualified two-by-two
tables in memory and then uses the Newton-Raphson method to locate the
function's maximum with arbitrarily high precision.  This requires only a
relative handful of evaluations of the probability mass functions down a
dataset.

Michael Blasnik observed that the p-value-versus-theta function should be
smooth.  It is, and the approach below takes advantage of that.  Michael
also suggested that Eva avoid -tabi- by using -tabulate- and -post-, and the
approach below exploits that tactic, too.  (It turns out that with Eva's
approach avoiding -tabi- didn't seem to make a noticeable difference--almost
all of the time was, as Eva mentioned, spent evaluating PH0sum.  In the
approach below, however, the bulk of the shortened execution time is spent
conducting Fisher-Irwin tests.  There could very well be a shortcut around
exhaustively evaluating the set of less extreme tables.  That would shorten
execution time even further.)

Eva is using Stata 9.2, and so doesn't have the option of using Stata 10's
new Mata optimizing functions.  In light of that, the approach below makes
unconventional use of -ml- as the maximizer.  It returns Boschloo's p-value
in the ereturn scalar for the log-likelihood, e(ll).  (Maybe that's only
fitting, given that Boschloo uses the Fisher-Irwin test's p-value as a test
statistic.)  I resorted to using -ml-, because -amoeba- and -quasi-, two
user-written function-maximizing commands, couldn't reliably find the
maximum.  Even -ml- can experience trouble avoiding converging to local
maxima with fewer than 50, or so, repeated searches in -ml search-
beforehand.  In order to reduce the risk of -ml- settling onto a local
maximum, the default for -ml search-'s repeat option is set to 100. The user
has a choice of using -ml search- or -ml plot- to find a starting value for
theta.  -ml search- with its repeat of 100 takes longer than -ml plot- (44
versus 33 seconds with Eva's example on my desktop PC) despite the latter's
loading classes for the graph.  Experience might warrant lowering or raising
the hard-coded default of 100--a repeat of 50 worked for Eva's example, but
for insurance I've doubled that for the default.  A repeat of 10 doesn't
work reliably:  it resulted in -ml- converging on a local maximum the first
time (start-up random number seed) and on the global maximum in a second run
with the next-in-line random number seed.  The -ml search- bounds and -ml
plot- bounds are both set for a theta from 0.01 to 0.5 by default, although
these are both options for the user.  (At least for Eva's example, a plot
throughout the entire range of theta is symmetric about theta = 0.5, so that
the supremum is present twice in the set.  I don't know whether that's true
in general, and so the bounds are left as options.  Lower bound, upper
bound, or both may be set for both -ml search- and -ml plot- options.)

There are two ado-files below, -boschloo.ado- and -boschloo_ll.ado-.  They
must be saved separately where Stata can find them (-help adopath-); the
former calls the latter.  There is also an example do-file that runs Eva's
example* with -ml search- (default) and -ml plot- options, as well as
illustrates user-set bounds.  -if-, -in-, and -nolog- options are available,
but aren't illustrated.  None of the -ml- suite's other conventional options
are available.

I'm not looking forward to finding out what StataCorp thinks of -ml- being
abused as a general function maximizer in this fashion, and I've not done
any certification of the code below with a variety of test cases, so it
won't be going to SSC.

Joseph Coveney

* Mehrotra, D. V., Chan, I. S. F. and Berger, R. L. (2003). A cautionary
note on exact unconditional inference for a difference between two
independent binomial proportions. _Biometrics_ 59:441--50.
[Available from www.west.asu.edu/rlberge1/papers/biometrics03.pdf ]

For some practical perspective, see also
www.iancampbell.co.uk/twobytwo/twobytwo.htm and the preprint downloadable
from there.

------------------------------boschloo.ado--------------------------------------
program define boschloo, rclass
   version 9.2
   syntax varlist(min = 2 max = 2) [if] [in] [fweight] ///
     [, Ubound(real 0.01) Lbound(real 0.5) MLPlot noLOG]
   tempfile tmpfil0
   tempname memhold observed_fisher
   tempvar freq
*
   preserve
   marksample touse
   quietly keep if `touse'
*
   foreach var of local varlist {
       quietly inspect `var'
       if (r(N_unique) != 2) {
           display in smcl as error "Number of groups must equal two"
           exit = 99
       }
   }
   if ("`weight'" != "") local weight [`weight' `exp']
   contract `varlist' `weight', freq(`freq') zero nomiss
   quietly tabulate `varlist' [fweight = `freq'], exact
   scalar define `observed_fisher' = r(p_exact)
*
   local groups : word 1 of `varlist'
   quietly levelsof `groups', local(rows)
   local i 1
   foreach row of local rows {
       summarize `groups' if `groups' == `row' ///
         [fweight = `freq'], meanonly
       local count`i++' = r(N)
   }
*
   postfile `memhold' long (counti countj) using `tmpfil0'
   quietly forvalues counti = 0/`count1' {
       replace `freq' = `counti' in 1
       replace `freq' = `count1' - `counti' in 2
       forvalues countj = 0/`count2' {
           replace `freq' = `countj' in 3
           replace `freq' = `count2' - `countj' in 4
           tabulate `varlist' [fweight = `freq'], exact
           if (r(p_exact) <= `observed_fisher') post ///
             `memhold' (`counti') (`countj')
       }
   }
   postclose `memhold'
   quietly use `tmpfil0', clear
   assert !missing(counti)  // Remove after gaining experience
   assert !missing(countj)
   generate long counti1 = counti + 1
   generate long countj1 = countj + 1
   char define counti[total] `count1'
   char define countj[total] `count2'
*
   if ("`mlplot'" != "") {
       ml model lf boschloo_ll (theta:), missing
       ml plot /theta `lbound' `ubound'
       ml maximize , nooutput `log'
   }
   else ml model lf boschloo_ll (theta:), maximize missing ///
     search(quietly) repeat(100) bounds(`lbound' `ubound') `log'
   display in smcl as text _newline(1) "Boschoo's p: " ///
     as result %06.4f e(ll)
   return scalar p_boschloo = e(ll)
   ereturn clear
   restore
end
--------------------------------------------------------------------------------
-----------------------------boschloo_ll.ado------------------------------------
program define boschloo_ll
   args lf theta
   local totali : char counti[total]
   local totalj : char countj[total]

   quietly replace `lf' = (Binomial(`totali', counti, `theta') - ///
     Binomial(`totali', counti1, `theta')) * ///
     (Binomial(`totalj', countj, `theta') - ///
     Binomial(`totalj', countj1, `theta'))
end
--------------------------------------------------------------------------------
--------------------------------example.do--------------------------------------
clear
set more off
input byte group byte count1 int total
1 8 148
2 1 132
end
generate int count0 = total - count1
drop total
quietly reshape long count, i(group) j(rash)
*
boschloo group rash [fweight = count]
boschloo group rash [fweight = count], mlplot lbound(0.1) ubound(0.9)
return list
exit
--------------------------------------------------------------------------------


*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
Prev by Date: Re: st: Some annoying problems with my stata 10
Next by Date: st: -bs- with the xi: prefix and -gologit2-
Previous by thread: Re: implementation of boschloo's test: very slow execution
Next by thread: Re: implementation of boschloo's test: very slow execution
Index(es):
- Date
- Thread