Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From: Rebecca Pope <rebecca.a.pope@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: RE: Stata analog to Mata's -strdup()- or better approach?
Date: Sun, 13 Mar 2011 14:45:39 -0500

A quick question about optimizing processing speed for this routine:
Should processing slow considerably when temporary variables are used?
Because it is my habit to use temporary variables when I do not intend
to keep them, I changed Robert's code to use -tempvar- instead of
creating the "isX" and "currspan" variables and then dropping them. The
processing time increased from 21 to 72 seconds.

Note: "maxspan" is renamed "contelig" in my code to be consistent with
the rest of my program.

*** Robert's Original Code ***
timer on 5
gen maxspan = 0
gen currspan = 0
gen isX = 0
qui forvalue i = 1/`len' {
    replace isX = substr(estring,`i',1) == "X"
    replace currspan = currspan + 1 if isX
    replace maxspan = currspan if !isX & ///
        currspan > maxspan
    replace currspan = 0 if !isX
}
replace maxspan = currspan if currspan > maxspan
drop currspan isX
timer off 5

*** My modified code ***
gen int contelig = 0
label var contelig "Longest Period of Continuous Enrollment"
note contelig: Number of months in longest set of Xs from 'estring'
tempvar isX currelig n_longest
timer on 1
* `currelig' is the running span; it plays the role of "currspan" above
gen int `currelig' = 0
gen byte `isX' = 0
qui forvalues i = 1/`len' {
    replace `isX' = substr(estring,`i',1) == "X"
    replace `currelig' = `currelig' + 1 if `isX'
    replace contelig = `currelig' if !`isX' & ///
        `currelig' > contelig
    replace `currelig' = 0 if !`isX'
}
replace contelig = `currelig' if `currelig' > contelig
timer off 1
*-------end of code snippets

timer list
   1:     71.94 /        1 =     71.9390
   5:     21.01 /        1 =     21.0060

Best,
Rebecca

On Sun, Mar 13, 2011 at 10:40 AM, Rebecca Pope <rebecca.a.pope@gmail.com> wrote:
> On Sun, Mar 13, 2011 at 4:33 AM, Nick Cox <njcoxstata@gmail.com> wrote:
>
>> 3. My original code could be speeded up a bit by not using a variable
>> X but my guess would be that Robert's is still definitely faster.
>>
> I should have specified that I altered your code to use the macro you
> posted later for the time listed in my previous post.
> That one change makes a substantial difference in the speed--just less
> than half the time it takes to run with the variable. Even better, it
> means that I don't need to drop the other variables in my dataset to
> complete the search over all 6 million observations. If you count time
> to merge the findings back in the difference is even greater.
>
> On Sun, Mar 13, 2011 at 7:25 AM, Robert Picard <picard@netbox.com> wrote:
>> Turns out that finding the longest span can be done faster without
>> string manipulations. Here's a new version:
>>
>> * -------------------------- begin example ----------------
>>
>> clear all
>> input patid str12 estring
>> 1 XXXXX-------
>> 2 --XXX---XXXX
>> 3 -XXXXXX-----
>> 4 -XXX-XXX-XXX
>> 5 XXXX-XX-XXXX
>> 6 X-XX-XX-XXXX
>> 7 X-XXXXX-XXXX
>> 8 X-XXX---XXX-
>> 9 XXXXXXXXXXXX
>> 10 ------------
>> end
>>
>> local len = 12
>>
>> * Find the longest period of continuous eligibility.
>> gen maxspan = 0
>> gen currspan = 0
>> gen isX = 0
>> qui forvalue i = 1/`len' {
>>     replace isX = substr(estring,`i',1) == "X"
>>     replace currspan = currspan + 1 if isX
>>     replace maxspan = currspan if !isX & ///
>>         currspan > maxspan
>>     replace currspan = 0 if !isX
>> }
>> replace maxspan = currspan if currspan > maxspan
>> drop currspan isX
>>
>> * Identify the start of each span
>> gen spanX = substr("`: di _dup(`len') "X"'",1,maxspan)
>> gen blanks = subinstr(spanX,"X"," ",.)
>> gen es = estring
>> local i 0
>> local more 1
>> qui while `more' {
>>     local i = `i' + 1
>>     gen where`i' = strpos(es,spanX)
>>     replace where`i' = . if where`i' == 0
>>     replace es = subinstr(es,spanX,blanks,1)
>>     count if where`i' != .
>>     local more = r(N)
>> }
>> drop where`i'
>> replace where1 = . if maxspan == 0
>> egen nmaxspan = rownonmiss(where*)
>> drop es blanks spanX
>>
>> * -------------------------- end example ------------------
>
> Yup. It reduces total run time by about 3.5 seconds in the 10% sample.
>
> Splitting the code into two functions, (1) finding the longest span of
> continuous eligibility and (2) determining where those spans occur
> within the 15-year period covered by the data, I get the best
> performance by using Robert's method for (1) and Nick's method for
> (2). The whole process takes just less than 29 seconds.
>
> Thanks again very much to both of you. I'd still be muddling through
> with trial and error without you. I've also learned a lot by looking
> at your code. I really appreciate all the help.
>
> Best,
> Rebecca
>
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
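
For anyone wanting to repeat the timing comparison, here is a minimal
sketch (not taken from the thread) that runs the permanent-variable and
-tempvar- versions of the loop back to back on Robert's ten-observation
example and checks that they agree. The timer numbers 1 and 2 are
arbitrary choices, and with so few observations the reported times will
be negligible; substitute a large dataset to see any real difference.
It makes no claim about why one version might be slower.

* -------------------------- begin example ----------------

clear all
input patid str12 estring
1 XXXXX-------
2 --XXX---XXXX
3 -XXXXXX-----
4 -XXX-XXX-XXX
5 XXXX-XX-XXXX
6 X-XX-XX-XXXX
7 X-XXXXX-XXXX
8 X-XXX---XXX-
9 XXXXXXXXXXXX
10 ------------
end

local len = 12
timer clear

* Version A: permanent work variables, dropped afterwards
timer on 1
gen int maxspan = 0
gen int currspan = 0
gen byte isX = 0
quietly forvalues i = 1/`len' {
    replace isX = substr(estring,`i',1) == "X"
    replace currspan = currspan + 1 if isX
    replace maxspan = currspan if !isX & currspan > maxspan
    replace currspan = 0 if !isX
}
quietly replace maxspan = currspan if currspan > maxspan
drop currspan isX
timer off 1

* Version B: the same loop with temporary work variables
timer on 2
gen int contelig = 0
tempvar isX currelig
gen int `currelig' = 0
gen byte `isX' = 0
quietly forvalues i = 1/`len' {
    replace `isX' = substr(estring,`i',1) == "X"
    replace `currelig' = `currelig' + 1 if `isX'
    replace contelig = `currelig' if !`isX' & `currelig' > contelig
    replace `currelig' = 0 if !`isX'
}
quietly replace contelig = `currelig' if `currelig' > contelig
timer off 2

* Both versions should give the same longest span for every patient
assert contelig == maxspan
timer list

* -------------------------- end example ------------------

Both versions store the work variables with the same types (int and
byte), so any difference that -timer list- reports should come from the
use of -tempvar- and not from storage types.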