Re: st: RE: Stata analog to Mata's -strdup()- or better approach?


From   Nick Cox <[email protected]>
To   [email protected]
Subject   Re: st: RE: Stata analog to Mata's -strdup()- or better approach?
Date   Mon, 14 Mar 2011 00:40:00 +0000

You refer to a temporary variable -currelig-. Where do you define it?

Nick
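
For illustration only, a minimal sketch of the quoted loop with the temporary variable declared and referenced under a single name. Stata expands a reference to an undefined local macro to an empty string, so a name that is used but never passed to -tempvar- does not refer to any variable. The sketch assumes the data layout from Robert's example (a str12 variable -estring- and `len' set to 12) and reuses three of his sample rows; it is meant to be run as a do-file and says nothing about the timing difference.

```stata
* Sketch only: same loop as the quoted code, with every temporary
* variable named in -tempvar- before it is used.
clear all
input patid str12 estring
1          XXXXX-------
2          --XXX---XXXX
10         ------------
end

local len = 12

gen int contelig = 0
label var contelig "Longest Period of Continuous Enrollment"

* Declare both temporary names here; the loop uses only these two.
tempvar isX currspan
gen int `currspan' = 0
gen byte `isX' = 0

qui forvalues i = 1/`len' {
    * 1 if the i-th character is "X", 0 otherwise
    replace `isX' = substr(estring, `i', 1) == "X"
    * extend the current run of Xs
    replace `currspan' = `currspan' + 1 if `isX'
    * a run just ended: keep it if it is the longest so far
    replace contelig = `currspan' if !`isX' & `currspan' > contelig
    replace `currspan' = 0 if !`isX'
}
* pick up a run that continues to the last character
replace contelig = `currspan' if `currspan' > contelig

list patid estring contelig
```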

On Sun, Mar 13, 2011 at 7:45 PM, Rebecca Pope <[email protected]> wrote:
> A quick question about optimizing processing speed for this routine:
> Should the speed slow considerably with temporary variables? Because
> it is my habit to have temporary variables when I do not intend to
> keep them, I changed Robert's code to use -tempvar- instead of
> creating the "isX" and "currspan" variables and then dropping them.
> The processing time increased from 21 to 72 seconds. Note: "maxspan"
> renamed "contelig" in my code to be consistent with the rest of my
> program.
>
> *** Robert's Original Code ***
> timer on 5
> gen maxspan = 0
> gen currspan = 0
> gen isX = 0
> qui forvalue i = 1/`len' {
>       replace isX = substr(estring,`i',1) == "X"
>       replace currspan = currspan + 1 if isX
>       replace maxspan = currspan if !isX & ///
>               currspan > maxspan
>       replace currspan = 0 if !isX
> }
> replace maxspan = currspan if currspan > maxspan
> drop currspan isX
> timer off 5
>
> *** My modified code ***
> gen int contelig = 0
> label var contelig "Longest Period of Continuous Enrollment"
>        note contelig: Number of months in longest set of Xs from 'estring'
>
> tempvar isX currelig n_longest
> timer on 1
> gen int `currspan' = 0
> gen byte `isX' = 0
>
> qui forvalues i = 1/`len' {
>       replace `isX' = substr(estring,`i',1) == "X"
>       replace `currspan' = `currspan' + 1 if `isX'
>       replace contelig = `currspan' if !`isX' & ///
>               `currspan' > contelig
>       replace `currspan' = 0 if !`isX'
> }
> replace contelig = `currelig' if `currelig' > contelig
> timer off 1
>
> *-------end of code snippets
>
>  timer list
>   1:     71.94 /        1 =      71.9390
>   5:     21.01 /        1 =      21.0060
>
> Best,
> Rebecca
>
> On Sun, Mar 13, 2011 at 10:40 AM, Rebecca Pope <[email protected]> wrote:
>> On Sun, Mar 13, 2011 at 4:33 AM, Nick Cox <[email protected]> wrote:
>>
>>> 3. My original code could be speeded up a bit by not using a variable
>>> X but my guess would be that Robert's is still definitely faster.
>>>
>> I should have specified that I altered your code to use the macro you
>> posted later for the time listed in my previous post. That one change
>> makes a substantial difference in the speed--just less than half the
>> time it takes to run with the variable. Even better, it means that I
>> don't need to drop the other variables in my dataset to complete the
>> search over all 6 million observations. If you count time to merge the
>> findings back in the difference is even greater.
>>
>> On Sun, Mar 13, 2011 at 7:25 AM, Robert Picard <[email protected]> wrote:
>>> Turns out that finding the longest span can be done faster without
>>> string manipulations. Here's a new version:
>>>
>>> * -------------------------- begin example ----------------
>>>
>>> clear all
>>> input patid str12 estring
>>> 1          XXXXX-------
>>> 2          --XXX---XXXX
>>> 3          -XXXXXX-----
>>> 4          -XXX-XXX-XXX
>>> 5          XXXX-XX-XXXX
>>> 6          X-XX-XX-XXXX
>>> 7          X-XXXXX-XXXX
>>> 8          X-XXX---XXX-
>>> 9          XXXXXXXXXXXX
>>> 10         ------------
>>> end
>>>
>>> local len = 12
>>>
>>> * Find the longest period of continuous eligibility.
>>> gen maxspan = 0
>>> gen currspan = 0
>>> gen isX = 0
>>> qui forvalue i = 1/`len' {
>>>        replace isX = substr(estring,`i',1) == "X"
>>>        replace currspan = currspan + 1 if isX
>>>        replace maxspan = currspan if !isX & ///
>>>                currspan > maxspan
>>>        replace currspan = 0 if !isX
>>> }
>>> replace maxspan = currspan if currspan > maxspan
>>> drop currspan isX
>>>
>>> * Identify the start of each span
>>> gen spanX = substr("`: di _dup(`len') "X"'",1,maxspan)
>>> gen blanks = subinstr(spanX,"X"," ",.)
>>> gen es = estring
>>> local i 0
>>> local more 1
>>> qui while `more' {
>>>        local i = `i' + 1
>>>        gen where`i' = strpos(es,spanX)
>>>        replace where`i' = . if where`i' == 0
>>>        replace es = subinstr(es,spanX,blanks,1)
>>>        count if where`i' != .
>>>        local more = r(N)
>>> }
>>> drop where`i'
>>> replace where1 = . if maxspan == 0
>>> egen nmaxspan = rownonmiss(where*)
>>> drop es blanks spanX
>>>
>>> * -------------------------- end example ------------------
>>
>> Yup. It reduces total run time by about 3.5 seconds in the 10% sample.
>>
>> Splitting the code into two functions, (1) finding the longest span of
>> continuous eligibility and (2) determining where those spans occur
>> within the 15-year period covered by the data, I get the best
>> performance by using Robert's method for (1) and Nick's method for
>> (2). The whole process takes just less than 29 seconds.
>>
>> Thanks again very much to both of you. I'd still be muddling through
>> with trial and error without you. I've also learned a lot by looking
>> at your code. I really appreciate all the help.
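
A sketch of how the two variants quoted above could be timed like for like, should anyone want to repeat the comparison. In the posted snippets the permanent-variable version uses the default storage type while the -tempvar- version uses -int- and -byte-, so more than one thing differs between the two runs. Here both versions use the same storage types and the same data. The names maxspanA and maxspanB and the -expand- factor are assumptions made for the illustration, not part of anyone's posted code, and no particular timings are implied. Run it as a do-file.

```stata
* Sketch only: time permanent vs. temporary work variables on identical data.
clear all
input patid str12 estring
1          XXXXX-------
2          --XXX---XXXX
3          -XXXXXX-----
4          -XXX-XXX-XXX
5          XXXX-XX-XXXX
6          X-XX-XX-XXXX
7          X-XXXXX-XXXX
8          X-XXX---XXX-
9          XXXXXXXXXXXX
10         ------------
end

* Hypothetical scale factor so the timers have something to measure
expand 1000
local len = 12
timer clear

* Version A: permanent work variables, dropped afterwards
timer on 1
gen int maxspanA = 0
gen int currspan = 0
gen byte isX = 0
qui forvalues i = 1/`len' {
    replace isX = substr(estring, `i', 1) == "X"
    replace currspan = currspan + 1 if isX
    replace maxspanA = currspan if !isX & currspan > maxspanA
    replace currspan = 0 if !isX
}
qui replace maxspanA = currspan if currspan > maxspanA
drop currspan isX
timer off 1

* Version B: temporary work variables, same storage types
timer on 2
gen int maxspanB = 0
tempvar currspan isX
gen int `currspan' = 0
gen byte `isX' = 0
qui forvalues i = 1/`len' {
    replace `isX' = substr(estring, `i', 1) == "X"
    replace `currspan' = `currspan' + 1 if `isX'
    replace maxspanB = `currspan' if !`isX' & `currspan' > maxspanB
    replace `currspan' = 0 if !`isX'
}
qui replace maxspanB = `currspan' if `currspan' > maxspanB
timer off 2

* Both versions should give the same answer; then compare the timers
assert maxspanA == maxspanB
timer list
```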
