Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: RE: Stata analog to Mata's -strdup()- or better approach?


From   Rebecca Pope <[email protected]>
To   [email protected]
Subject   Re: st: RE: Stata analog to Mata's -strdup()- or better approach?
Date   Sun, 13 Mar 2011 20:15:53 -0500

Sorry. I renamed Robert's "currspan" to "currelig" (using find &
replace) just to have the terminology consistent. When I copied my
code over here, I did another find & replace so that it would be
consistent with Robert's just above, but I apparently didn't
highlight far enough down.
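
For reference, below is a minimal sketch of the modified snippet with
the temporary-variable name made consistent (-currelig- throughout).
It is not a copy of the exact code that was timed: the data block and
`len' are simply Robert's example repeated from further down the
thread so that the snippet runs on its own, and the unused
`n_longest' tempvar is left out.

```stata
* Sketch only: Robert's example data, repeated so this runs on its own
clear all
input patid str12 estring
1          XXXXX-------
2          --XXX---XXXX
3          -XXXXXX-----
4          -XXX-XXX-XXX
5          XXXX-XX-XXXX
6          X-XX-XX-XXXX
7          X-XXXXX-XXXX
8          X-XXX---XXX-
9          XXXXXXXXXXXX
10         ------------
end

local len = 12

gen int contelig = 0
label var contelig "Longest Period of Continuous Enrollment"

* Both names declared here are the ones used below
tempvar isX currelig
gen int `currelig' = 0
gen byte `isX' = 0

quietly forvalues i = 1/`len' {
    * Is position `i' an eligible month?
    replace `isX' = substr(estring, `i', 1) == "X"
    * Extend the current run of Xs
    replace `currelig' = `currelig' + 1 if `isX'
    * A run just ended: keep it if it is the longest so far
    replace contelig = `currelig' if !`isX' & `currelig' > contelig
    * Reset the run counter at a non-X position
    replace `currelig' = 0 if !`isX'
}
* Pick up a run that extends to the end of the string
replace contelig = `currelig' if `currelig' > contelig

list patid estring contelig
```

With those ten example strings, contelig should come out as 5, 4, 6,
3, 4, 4, 5, 3, 12 and 0 for patid 1 through 10.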

Rebecca



         __o                __o
      _`\ <,_            _`\ <,_
     (_)/   (_)          (_)/   (_)
=========================



On Sun, Mar 13, 2011 at 7:40 PM, Nick Cox <[email protected]> wrote:
> You refer to a temporary variable -currelig-. Where do you define it?
>
> Nick
>
> On Sun, Mar 13, 2011 at 7:45 PM, Rebecca Pope <[email protected]> wrote:
>> A quick question about optimizing processing speed for this routine:
>> Should the speed slow considerably with temporary variables? Because
>> it is my habit to have temporary variables when I do not intend to
>> keep them, I changed Robert's code to use -tempvar- instead of
>> creating the "isX" and "currspan" variables and then dropping them.
>> The processing time increased from 21 to 72 seconds. Note: "maxspan"
>> renamed "contelig" in my code to be consistent with the rest of my
>> program.
>>
>> *** Robert's Original Code ***
>> timer on 5
>> gen maxspan = 0
>> gen currspan = 0
>> gen isX = 0
>> qui forvalue i = 1/`len' {
>>       replace isX = substr(estring,`i',1) == "X"
>>       replace currspan = currspan + 1 if isX
>>       replace maxspan = currspan if !isX & ///
>>               currspan > maxspan
>>       replace currspan = 0 if !isX
>> }
>> replace maxspan = currspan if currspan > maxspan
>> drop currspan isX
>> timer off 5
>>
>> *** My modified code ***
>> gen int contelig = 0
>> label var contelig "Longest Period of Continuous Enrollment"
>>        note contelig: Number of months in longest set of Xs from 'estring'
>>
>> tempvar isX currelig n_longest
>> timer on 1
>> gen int `currspan' = 0
>> gen byte `isX' = 0
>>
>> qui forvalues i = 1/`len' {
>>       replace `isX' = substr(estring,`i',1) == "X"
>>       replace `currspan' = `currspan' + 1 if `isX'
>>       replace contelig = `currspan' if !`isX' & ///
>>               `currspan' > contelig
>>       replace `currspan' = 0 if !`isX'
>> }
>> replace contelig = `currelig' if `currelig' > contelig
>> timer off 1
>>
>> *-------end of code snippets
>>
>>  timer list
>>   1:     71.94 /        1 =      71.9390
>>   5:     21.01 /        1 =      21.0060
>>
>> Best,
>> Rebecca
>>
>> On Sun, Mar 13, 2011 at 10:40 AM, Rebecca Pope <[email protected]> wrote:
>>> On Sun, Mar 13, 2011 at 4:33 AM, Nick Cox <[email protected]> wrote:
>>>
>>>> 3. My original code could be speeded up a bit by not using a variable
>>>> X but my guess would be that Robert's is still definitely faster.
>>>>
>>> I should have specified that I altered your code to use the macro you
>>> posted later for the time listed in my previous post. That one change
>>> makes a substantial difference in the speed--just less than half the
>>> time it takes to run with the variable. Even better, it means that I
>>> don't need to drop the other variables in my dataset to complete the
>>> search over all 6 million observations. If you count time to merge the
>>> findings back in the difference is even greater.
>>>
>>> On Sun, Mar 13, 2011 at 7:25 AM, Robert Picard <[email protected]> wrote:
>>>> Turns out that finding the longest span can be done faster without
>>>> string manipulations. Here's a new version:
>>>>
>>>> * -------------------------- begin example ----------------
>>>>
>>>> clear all
>>>> input patid str12 estring
>>>> 1          XXXXX-------
>>>> 2          --XXX---XXXX
>>>> 3          -XXXXXX-----
>>>> 4          -XXX-XXX-XXX
>>>> 5          XXXX-XX-XXXX
>>>> 6          X-XX-XX-XXXX
>>>> 7          X-XXXXX-XXXX
>>>> 8          X-XXX---XXX-
>>>> 9          XXXXXXXXXXXX
>>>> 10         ------------
>>>> end
>>>>
>>>> local len = 12
>>>>
>>>> * Find the longest period of continuous eligibility.
>>>> gen maxspan = 0
>>>> gen currspan = 0
>>>> gen isX = 0
>>>> qui forvalue i = 1/`len' {
>>>>        replace isX = substr(estring,`i',1) == "X"
>>>>        replace currspan = currspan + 1 if isX
>>>>        replace maxspan = currspan if !isX & ///
>>>>                currspan > maxspan
>>>>        replace currspan = 0 if !isX
>>>> }
>>>> replace maxspan = currspan if currspan > maxspan
>>>> drop currspan isX
>>>>
>>>> * Identify the start of each span
>>>> gen spanX = substr("`: di _dup(`len') "X"'",1,maxspan)
>>>> gen blanks = subinstr(spanX,"X"," ",.)
>>>> gen es = estring
>>>> local i 0
>>>> local more 1
>>>> qui while `more' {
>>>>        local i = `i' + 1
>>>>        gen where`i' = strpos(es,spanX)
>>>>        replace where`i' = . if where`i' == 0
>>>>        replace es = subinstr(es,spanX,blanks,1)
>>>>        count if where`i' != .
>>>>        local more = r(N)
>>>> }
>>>> drop where`i'
>>>> replace where1 = . if maxspan == 0
>>>> egen nmaxspan = rownonmiss(where*)
>>>> drop es blanks spanX
>>>>
>>>> * -------------------------- end example ------------------
>>>
>>> Yup. It reduces total run time by about 3.5 seconds in the 10% sample.
>>>
>>> Splitting the code into two functions, (1) finding the longest span of
>>> continuous eligibility and (2) determining where those spans occur
>>> within the 15-year period covered by the data, I get the best
>>> performance by using Robert's method for (1) and Nick's method for
>>> (2). The whole process takes just less than 29 seconds.
>>>
>>> Thanks again very much to both of you. I'd still be muddling through
>>> with trial and error without you. I've also learned a lot by looking
>>> at your code. I really appreciate all the help.
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>


