
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: RE: Stata analog to Mata's -strdup()- or better approach?


From   Rebecca Pope <[email protected]>
To   [email protected]
Subject   Re: st: RE: Stata analog to Mata's -strdup()- or better approach?
Date   Sun, 13 Mar 2011 20:30:34 -0500

I should add that since I posted the question about -tempvar- earlier,
I've stepped through each piece of the code, isolating the changes.

Here is an inventory of all the differences in code that I could find:
1. I use a different variable name for the permanent variable
("contelig" instead of "maxspan")
2. "contelig" had a variable label and notes attached to it while
"maxspan" did not
3. I define the variable type when I create the variables while
Robert's code uses the system default
4. I use temporary variables

I closed down Stata, reopened it, and copied Robert's original code
into a new do-file editor. I changed one piece at a time, starting by
using find & replace to change "maxspan" to "contelig" (paranoid, I
know). Then I ran the code with -timer- again. No change in time. I
went down the list above and lost a few tenths of a second when I added
the label and notes (2). No change for (3). And then a big hit on (4),
the temporary variables. The difference was not quite as extreme as
what I posted earlier, but it was still there.
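
For reference, here is a minimal sketch of the -tempvar- variant with the
macro names made consistent, run on Robert's 10-row example data. It
assumes the temporary counter is named `currelig' throughout and that
`len' is 12, as in that example. On so few rows the timing will not show
the slowdown described above; it only shows the structure being timed.

```stata
* Sketch only: toy data from Robert's example, not the 6 million row file
clear all
input patid str12 estring
1          XXXXX-------
2          --XXX---XXXX
3          -XXXXXX-----
4          -XXX-XXX-XXX
5          XXXX-XX-XXXX
6          X-XX-XX-XXXX
7          X-XXXXX-XXXX
8          X-XXX---XXX-
9          XXXXXXXXXXXX
10         ------------
end

local len = 12

* Permanent result variable
gen int contelig = 0

* Temporary work variables; the same macro names are used everywhere below
tempvar isX currelig
gen int `currelig' = 0
gen byte `isX' = 0

timer clear 1
timer on 1
quietly forvalues i = 1/`len' {
        * Flag whether position i is an "X"
        replace `isX' = substr(estring,`i',1) == "X"
        * Extend the current run of Xs
        replace `currelig' = `currelig' + 1 if `isX'
        * A run has just ended: keep it if it is the longest so far
        replace contelig = `currelig' if !`isX' & ///
                `currelig' > contelig
        * Reset the run counter at a non-X position
        replace `currelig' = 0 if !`isX'
}
* Pick up a run that extends to the last position
quietly replace contelig = `currelig' if `currelig' > contelig
timer off 1

timer list 1
list patid estring contelig
```

To compare against permanent variables, the same block can be rerun with
-gen- and -drop- in place of -tempvar- and a different timer number.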

Thanks,
Rebecca



         __o                __o
      _`\ <,_            _`\ <,_
     (_)/   (_)          (_)/   (_)
=========================



On Sun, Mar 13, 2011 at 8:15 PM, Rebecca Pope <[email protected]> wrote:
> Sorry. I renamed Robert's "currspan" to "currelig" (using find &
> replace) just to have the terminology consistent. When I copied my
> code over here, did another F&R, so that it would be consistent with
> Robert's just above, but I apparently didn't highlight far enough
> down.
>
> Rebecca
>
>
>
>          __o                __o
>       _`\ <,_            _`\ <,_
>      (_)/   (_)          (_)/   (_)
> =========================
>
>
>
> On Sun, Mar 13, 2011 at 7:40 PM, Nick Cox <[email protected]> wrote:
>> You refer to a temporary variable -currelig-. Where do you define it?
>>
>> Nick
>>
>> On Sun, Mar 13, 2011 at 7:45 PM, Rebecca Pope <[email protected]> wrote:
>>> A quick question about optimizing processing speed for this routine:
>>> Should the speed slow considerably with temporary variables? Because
>>> it is my habit to have temporary variables when I do not intend to
>>> keep them, I changed Robert's code to use -tempvar- instead of
>>> creating the "isX" and "currspan" variables and then dropping them.
>>> The processing time increased from 21 to 72 seconds. Note: "maxspan"
>>> renamed "contelig" in my code to be consistent with the rest of my
>>> program.
>>>
>>> *** Robert's Original Code ***
>>> timer on 5
>>> gen maxspan = 0
>>> gen currspan = 0
>>> gen isX = 0
>>> qui forvalue i = 1/`len' {
>>>       replace isX = substr(estring,`i',1) == "X"
>>>       replace currspan = currspan + 1 if isX
>>>       replace maxspan = currspan if !isX & ///
>>>               currspan > maxspan
>>>       replace currspan = 0 if !isX
>>> }
>>> replace maxspan = currspan if currspan > maxspan
>>> drop currspan isX
>>> timer off 5
>>>
>>> *** My modified code ***
>>> gen int contelig = 0
>>> label var contelig "Longest Period of Continuous Enrollment"
>>> note contelig: Number of months in longest set of Xs from 'estring'
>>>
>>> tempvar isX currelig n_longest
>>> timer on 1
>>> gen int `currspan' = 0
>>> gen byte `isX' = 0
>>>
>>> qui forvalues i = 1/`len' {
>>>       replace `isX' = substr(estring,`i',1) == "X"
>>>       replace `currspan' = `currspan' + 1 if `isX'
>>>       replace contelig = `currspan' if !`isX' & ///
>>>               `currspan' > contelig
>>>       replace `currspan' = 0 if !`isX'
>>> }
>>> replace contelig = `currelig' if `currelig' > contelig
>>> timer off 1
>>>
>>> *-------end of code snippets
>>>
>>>  timer list
>>>   1:     71.94 /        1 =      71.9390
>>>   5:     21.01 /        1 =      21.0060
>>>
>>> Best,
>>> Rebecca
>>>
>>> On Sun, Mar 13, 2011 at 10:40 AM, Rebecca Pope <[email protected]> wrote:
>>>> On Sun, Mar 13, 2011 at 4:33 AM, Nick Cox <[email protected]> wrote:
>>>>
>>>>> 3. My original code could be speeded up a bit by not using a variable
>>>>> X but my guess would be that Robert's is still definitely faster.
>>>>>
>>>> I should have specified that I altered your code to use the macro you
>>>> posted later for the time listed in my previous post. That one change
>>>> makes a substantial difference in the speed--just less than half the
>>>> time it takes to run with the variable. Even better, it means that I
>>>> don't need to drop the other variables in my dataset to complete the
>>>> search over all 6 million observations. If you count time to merge the
>>>> findings back in the difference is even greater.
>>>>
>>>> On Sun, Mar 13, 2011 at 7:25 AM, Robert Picard <[email protected]> wrote:
>>>>> Turns out that finding the longest span can be done faster without
>>>>> string manipulations. Here's a new version:
>>>>>
>>>>> * -------------------------- begin example ----------------
>>>>>
>>>>> clear all
>>>>> input patid str12 estring
>>>>> 1          XXXXX-------
>>>>> 2          --XXX---XXXX
>>>>> 3          -XXXXXX-----
>>>>> 4          -XXX-XXX-XXX
>>>>> 5          XXXX-XX-XXXX
>>>>> 6          X-XX-XX-XXXX
>>>>> 7          X-XXXXX-XXXX
>>>>> 8          X-XXX---XXX-
>>>>> 9          XXXXXXXXXXXX
>>>>> 10         ------------
>>>>> end
>>>>>
>>>>> local len = 12
>>>>>
>>>>> * Find the longest period of continuous eligibility.
>>>>> gen maxspan = 0
>>>>> gen currspan = 0
>>>>> gen isX = 0
>>>>> qui forvalue i = 1/`len' {
>>>>>        replace isX = substr(estring,`i',1) == "X"
>>>>>        replace currspan = currspan + 1 if isX
>>>>>        replace maxspan = currspan if !isX & ///
>>>>>                currspan > maxspan
>>>>>        replace currspan = 0 if !isX
>>>>> }
>>>>> replace maxspan = currspan if currspan > maxspan
>>>>> drop currspan isX
>>>>>
>>>>> * Identify the start of each span
>>>>> gen spanX = substr("`: di _dup(`len') "X"'",1,maxspan)
>>>>> gen blanks = subinstr(spanX,"X"," ",.)
>>>>> gen es = estring
>>>>> local i 0
>>>>> local more 1
>>>>> qui while `more' {
>>>>>        local i = `i' + 1
>>>>>        gen where`i' = strpos(es,spanX)
>>>>>        replace where`i' = . if where`i' == 0
>>>>>        replace es = subinstr(es,spanX,blanks,1)
>>>>>        count if where`i' != .
>>>>>        local more = r(N)
>>>>> }
>>>>> drop where`i'
>>>>> replace where1 = . if maxspan == 0
>>>>> egen nmaxspan = rownonmiss(where*)
>>>>> drop es blanks spanX
>>>>>
>>>>> * -------------------------- end example ------------------
>>>>
>>>> Yup. It reduces total run time by about 3.5 seconds in the 10% sample.
>>>>
>>>> Splitting the code into two functions, (1) finding the longest span of
>>>> continuous eligibility and (2) determining where those spans occur
>>>> within the 15-year period covered by the data, I get the best
>>>> performance by using Robert's method for (1) and Nick's method for
>>>> (2). The whole process takes just less than 29 seconds.
>>>>
>>>> Thanks again very much to both of you. I'd still be muddling through
>>>> with trial and error without you. I've also learned a lot by looking
>>>> at your code. I really appreciate all the help.
>>
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/statalist/faq
>> *   http://www.ats.ucla.edu/stat/stata/
>>
>


