Re: st: RE: Stata analog to Mata's -strdup()- or better approach?

From   David Elliott <[email protected]>
To   [email protected]
Subject   Re: st: RE: Stata analog to Mata's -strdup()- or better approach?
Date   Mon, 14 Mar 2011 08:03:02 -0300

Has anyone thought of splitting the string on the "-" characters and
changing each run of Xs (XX...) to 1s (11...)? There would then be no
need to count Xs: each run becomes a number, and the numbers can be
compared directly to find the largest, which marks the longest run.

I provide no code and am just suggesting this as a conceptual approach.
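
A minimal sketch of how this idea might look, not tested on the data in
this thread. It assumes estring holds only "X" and "-" and is at most 15
characters long, so that every run of 1s is stored exactly as a double;
longer strings are unchecked. The names ones, run*, longest and maxspan2
are hypothetical, chosen so as not to clash with the variables quoted
below.

* -------------------------- begin sketch -----------------
* Turn each X into a 1, so "--XXX---XXXX" becomes "--111---1111"
gen ones = subinstr(estring, "X", "1", .)
* Split on "-" and convert the pieces to numbers; empty pieces
* become missing values
split ones, parse("-") generate(run) destring
* The largest number in each row is the longest run of 1s
egen longest = rowmax(run*)
* Its number of digits is the length of that run (0 if no X at all)
gen maxspan2 = cond(missing(longest), 0, floor(log10(longest)) + 1)
* Note: run* would also match any existing variable starting with "run"
drop ones run* longest
* -------------------------- end sketch -------------------

On the 12-character example quoted below, maxspan2 would be expected to
agree with maxspan. Whether this is faster than the loop is an open
question, since -split- creates one new variable per piece.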

DC Elliott

On 14 March 2011 05:06, Robert Picard <[email protected]> wrote:
> I'm traveling so I don't have time to look into this right now but I
> suspect that the timing differences are due to the use of a more
> compact data type, in particular for your temporary `isX'. On 32- and
> 64-bit computers, fetching a byte requires more work than a real or a
> long.
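>
> One way to check this suspicion would be to time the same loop with isX
> stored as byte and then as float. This is only a sketch, not something
> run on the data in this thread; it assumes estring and the local macro
> len exist as in the code quoted below, and that timer 1 is free to use.
>
> * -------------------------- begin sketch -----------------
> foreach t in byte float {
>        timer clear 1
>        gen int currspan = 0
>        gen `t' isX = 0
>        timer on 1
>        qui forvalues i = 1/`len' {
>                replace isX = substr(estring,`i',1) == "X"
>                replace currspan = currspan + 1 if isX
>                replace currspan = 0 if !isX
>        }
>        timer off 1
>        * -timer list- leaves the elapsed seconds in r(t1)
>        qui timer list 1
>        di "`t': " r(t1) " seconds"
>        drop currspan isX
> }
> * -------------------------- end sketch -------------------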
> On Mon, Mar 14, 2011 at 2:30 AM, Rebecca Pope <[email protected]> wrote:
>> I should add that since I posted the question about -tempvar- earlier,
>> I've stepped through each piece of the code, isolating the changes.
>> Here is an inventory of all the differences in code that I could find:
>> 1. I use a different variable name for the permanent variable
>> ("contelig" instead of "maxspan")
>> 2. "contelig" had a variable label and notes attached to it while
>> "maxspan" did not
>> 3. I define the variable type when I create them while Robert's code
>> uses the system default
>> 4. I use temporary variables
>> I closed down Stata, reopened it, and copied Robert's original code
>> into a new do-file editor. I changed each piece at a time, starting by
>> using find & replace to change "maxspan" to "contelig" (paranoid, I
>> know). Then I ran the code w/ timer again. No change in time... I went
>> down the list above & lost a few tenths of a second when I added the
>> label & notes. No change for (3). And then a big hit on (4). The
>> difference was not quite as extreme as what I posted earlier, but
>> still there.
>> Thanks,
>> Rebecca
>>          __o                __o
>>       _`\ <,_            _`\ <,_
>>      (_)/   (_)          (_)/   (_)
>> =========================
>> On Sun, Mar 13, 2011 at 8:15 PM, Rebecca Pope <[email protected]> wrote:
>>> Sorry. I renamed Robert's "currspan" to "currelig" (using find &
>>> replace) just to have the terminology consistent. When I copied my
>>> code over here, did another F&R, so that it would be consistent with
>>> Robert's just above, but I apparently didn't highlight far enough
>>> down.
>>> Rebecca
>>> On Sun, Mar 13, 2011 at 7:40 PM, Nick Cox <[email protected]> wrote:
>>>> You refer to a temporary variable -currelig-. Where do you define it?
>>>> Nick
>>>> On Sun, Mar 13, 2011 at 7:45 PM, Rebecca Pope <[email protected]> wrote:
>>>>> A quick question about optimizing processing speed for this routine:
>>>>> Should the speed slow considerably with temporary variables? Because
>>>>> it is my habit to have temporary variables when I do not intend to
>>>>> keep them, I changed Robert's code to use -tempvar- instead of
>>>>> creating the "isX" and "currspan" variables and then dropping them.
>>>>> The processing time increased from 21 to 72 seconds. Note: "maxspan"
>>>>> renamed "contelig" in my code to be consistent with the rest of my
>>>>> program.
>>>>> *** Robert's Original Code ***
>>>>> timer on 5
>>>>> gen maxspan = 0
>>>>> gen currspan = 0
>>>>> gen isX = 0
>>>>> qui forvalue i = 1/`len' {
>>>>>       replace isX = substr(estring,`i',1) == "X"
>>>>>       replace currspan = currspan + 1 if isX
>>>>>       replace maxspan = currspan if !isX & ///
>>>>>               currspan > maxspan
>>>>>       replace currspan = 0 if !isX
>>>>> }
>>>>> replace maxspan = currspan if currspan > maxspan
>>>>> drop currspan isX
>>>>> timer off 5
>>>>> *** My modified code ***
>>>>> gen int contelig = 0
>>>>> label var contelig "Longest Period of Continuous Enrollment"
>>>>> note contelig: Number of months in longest set of Xs from 'estring'
>>>>> tempvar isX currelig n_longest
>>>>> timer on 1
>>>>> gen int `currspan' = 0
>>>>> gen byte `isX' = 0
>>>>> qui forvalues i = 1/`len' {
>>>>>       replace `isX' = substr(estring,`i',1) == "X"
>>>>>       replace `currspan' = `currspan' + 1 if `isX'
>>>>>       replace contelig = `currspan' if !`isX' & ///
>>>>>               `currspan' > contelig
>>>>>       replace `currspan' = 0 if !`isX'
>>>>> }
>>>>> replace contelig = `currelig' if `currelig' > contelig
>>>>> timer off 1
>>>>> *-------end of code snippets
>>>>> timer list
>>>>>   1:     71.94 /        1 =      71.9390
>>>>>   5:     21.01 /        1 =      21.0060
>>>>> Best,
>>>>> Rebecca
>>>>> On Sun, Mar 13, 2011 at 10:40 AM, Rebecca Pope <[email protected]> wrote:
>>>>>> On Sun, Mar 13, 2011 at 4:33 AM, Nick Cox <[email protected]> wrote:
>>>>>>> 3. My original code could be speeded up a bit by not using a variable
>>>>>>> X but my guess would be that Robert's is still definitely faster.
>>>>>> I should have specified that I altered your code to use the macro you
>>>>>> posted later for the time listed in my previous post. That one change
>>>>>> makes a substantial difference in the speed--just less than half the
>>>>>> time it takes to run with the variable. Even better, it means that I
>>>>>> don't need to drop the other variables in my dataset to complete the
>>>>>> search over all 6 million observations. If you count time to merge the
>>>>>> findings back in the difference is even greater.
>>>>>> On Sun, Mar 13, 2011 at 7:25 AM, Robert Picard <[email protected]> wrote:
>>>>>>> Turns out that finding the longest span can be done faster without
>>>>>>> string manipulations. Here's a new version:
>>>>>>> * -------------------------- begin example ----------------
>>>>>>> clear all
>>>>>>> input patid str12 estring
>>>>>>> 1          XXXXX-------
>>>>>>> 2          --XXX---XXXX
>>>>>>> 3          -XXXXXX-----
>>>>>>> 4          -XXX-XXX-XXX
>>>>>>> 5          XXXX-XX-XXXX
>>>>>>> 6          X-XX-XX-XXXX
>>>>>>> 7          X-XXXXX-XXXX
>>>>>>> 8          X-XXX---XXX-
>>>>>>> 9          XXXXXXXXXXXX
>>>>>>> 10         ------------
>>>>>>> end
>>>>>>> local len = 12
>>>>>>> * Find the longest period of continuous eligibility.
>>>>>>> gen maxspan = 0
>>>>>>> gen currspan = 0
>>>>>>> gen isX = 0
>>>>>>> qui forvalue i = 1/`len' {
>>>>>>>        replace isX = substr(estring,`i',1) == "X"
>>>>>>>        replace currspan = currspan + 1 if isX
>>>>>>>        replace maxspan = currspan if !isX & ///
>>>>>>>                currspan > maxspan
>>>>>>>        replace currspan = 0 if !isX
>>>>>>> }
>>>>>>> replace maxspan = currspan if currspan > maxspan
>>>>>>> drop currspan isX
>>>>>>> * Identify the start of each span
>>>>>>> gen spanX = substr("`: di _dup(`len') "X"'",1,maxspan)
>>>>>>> gen blanks = subinstr(spanX,"X"," ",.)
>>>>>>> gen es = estring
>>>>>>> local i 0
>>>>>>> local more 1
>>>>>>> qui while `more' {
>>>>>>>        local i = `i' + 1
>>>>>>>        gen where`i' = strpos(es,spanX)
>>>>>>>        replace where`i' = . if where`i' == 0
>>>>>>>        replace es = subinstr(es,spanX,blanks,1)
>>>>>>>        count if where`i' != .
>>>>>>>        local more = r(N)
>>>>>>> }
>>>>>>> drop where`i'
>>>>>>> replace where1 = . if maxspan == 0
>>>>>>> egen nmaxspan = rownonmiss(where*)
>>>>>>> drop es blanks spanX
>>>>>>> * -------------------------- end example ------------------
>>>>>> Yup. It reduces total run time by about 3.5 seconds in the 10% sample.
>>>>>> Splitting the code into two functions, (1) finding the longest span of
>>>>>> continuous eligibility and (2) determining where those spans occur
>>>>>> within the 15-year period covered by the data, I get the best
>>>>>> performance by using Robert's method for (1) and Nick's method for
>>>>>> (2). The whole process takes just less than 29 seconds.
>>>>>> Thanks again very much to both of you. I'd still be muddling through
>>>>>> with trial and error without you. I've also learned a lot by looking
>>>>>> at your code. I really appreciate all the help.

