Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: RE: Stata analog to Mata's -strdup()- or better approach?
From
Robert Picard <[email protected]>
To
[email protected]
Subject
Re: st: RE: Stata analog to Mata's -strdup()- or better approach?
Date
Mon, 14 Mar 2011 14:19:27 +0100
Turns out that the time penalty you noticed is due to Stata's use of
doubles to perform all calculations. The extra execution time is
caused by the conversion of integer datatypes to double. There's even
a cost to convert from float to double. For the fastest code, use
double for all variables.
Robert
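
One way to check the storage-type claim above is to time the same loop once per type. This is only a minimal sketch under stated assumptions: the synthetic dataset, its size (100,000 observations), the single repeated test string, and the mapping of timers 1 to 4 onto byte, int, float, and double are all illustrative choices, not part of the original posts. It makes no claim about which type will come out fastest on a given machine or Stata version.

```stata
* Hypothetical timing sketch: dataset, sample size, and timer numbering
* are assumptions made for illustration only.
clear all
set obs 100000
gen str12 estring = "XXXX-XX-XXXX"
local len = 12

local k = 0
foreach t in byte int float double {
    local ++k
    timer on `k'
    * Create all three working variables with the storage type under test.
    gen `t' maxspan = 0
    gen `t' currspan = 0
    gen `t' isX = 0
    qui forvalues i = 1/`len' {
        replace isX = substr(estring,`i',1) == "X"
        replace currspan = currspan + 1 if isX
        replace maxspan = currspan if !isX & currspan > maxspan
        replace currspan = 0 if !isX
    }
    * Pick up a run of Xs that extends to the end of the string.
    qui replace maxspan = currspan if currspan > maxspan
    timer off `k'
    * The longest run of Xs in the test string is 4.
    assert maxspan == 4
    drop maxspan currspan isX
}
timer list
```

In the -timer list- output, rows 1 through 4 correspond to byte, int, float, and double in that order. Running the sketch several times, or on a larger sample, should give steadier timings.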
On Mon, Mar 14, 2011 at 2:30 AM, Rebecca Pope <[email protected]> wrote:
> I should add that since I posted the question about -tempvar- earlier,
> I've stepped through each piece of the code, isolating the changes.
>
> Here is an inventory of all the differences in code that I could find:
> 1. I use a different variable name for the permanent variable
> ("contelig" instead of "maxspan")
> 2. "contelig" had a variable label and notes attached to it while
> "maxspan" did not
> 3. I define the variable type when I create them while Robert's code
> uses the system default
> 4. I use temporary variables
>
> I closed down Stata, reopened it, and copied Robert's original code
> into a new do-file editor. I changed each piece at a time, starting by
> using find & replace to change "maxspan" to "contelig" (paranoid, I
> know). Then I ran the code w/ timer again. No change in time... I went
> down the list above & lost a few tenths of a second when I added the
> label & notes. No change for (3). And then a big hit on (4). The
> difference was not quite as extreme as what I posted earlier, but
> still there.
>
> Thanks,
> Rebecca
>
>
>
> __o __o
> _`\ <,_ _`\ <,_
> (_)/ (_) (_)/ (_)
> =========================
>
>
>
> On Sun, Mar 13, 2011 at 8:15 PM, Rebecca Pope <[email protected]> wrote:
>> Sorry. I renamed Robert's "currspan" to "currelig" (using find &
>> replace) just to have the terminology consistent. When I copied my
>> code over here, I did another F&R so that it would be consistent with
>> Robert's just above, but I apparently didn't highlight far enough
>> down.
>>
>> Rebecca
>>
>>
>>
>> __o __o
>> _`\ <,_ _`\ <,_
>> (_)/ (_) (_)/ (_)
>> =========================
>>
>>
>>
>> On Sun, Mar 13, 2011 at 7:40 PM, Nick Cox <[email protected]> wrote:
>>> You refer to a temporary variable -currelig-. Where do you define it?
>>>
>>> Nick
>>>
>>> On Sun, Mar 13, 2011 at 7:45 PM, Rebecca Pope <[email protected]> wrote:
>>>> A quick question about optimizing processing speed for this routine:
>>>> Should the speed slow considerably with temporary variables? Because
>>>> it is my habit to have temporary variables when I do not intend to
>>>> keep them, I changed Robert's code to use -tempvar- instead of
>>>> creating the "isX" and "currspan" variables and then dropping them.
>>>> The processing time increased from 21 to 72 seconds. Note: "maxspan"
>>>> is renamed "contelig" in my code to be consistent with the rest of my
>>>> program.
>>>>
>>>> *** Robert's Original Code ***
>>>> timer on 5
>>>> gen maxspan = 0
>>>> gen currspan = 0
>>>> gen isX = 0
>>>> qui forvalue i = 1/`len' {
>>>>     replace isX = substr(estring,`i',1) == "X"
>>>>     replace currspan = currspan + 1 if isX
>>>>     replace maxspan = currspan if !isX & ///
>>>>         currspan > maxspan
>>>>     replace currspan = 0 if !isX
>>>> }
>>>> replace maxspan = currspan if currspan > maxspan
>>>> drop currspan isX
>>>> timer off 5
>>>>
>>>> *** My modified code ***
>>>> gen int contelig = 0
>>>> label var contelig "Longest Period of Continuous Enrollment"
>>>> note contelig: Number of months in longest set of Xs from 'estring'
>>>>
>>>> tempvar isX currelig n_longest
>>>> timer on 1
>>>> gen int `currspan' = 0
>>>> gen byte `isX' = 0
>>>>
>>>> qui forvalues i = 1/`len' {
>>>>     replace `isX' = substr(estring,`i',1) == "X"
>>>>     replace `currspan' = `currspan' + 1 if `isX'
>>>>     replace contelig = `currspan' if !`isX' & ///
>>>>         `currspan' > contelig
>>>>     replace `currspan' = 0 if !`isX'
>>>> }
>>>> replace contelig = `currelig' if `currelig' > contelig
>>>> timer off 1
>>>>
>>>> *-------end of code snippets
>>>>
>>>> timer list
>>>> 1: 71.94 / 1 = 71.9390
>>>> 5: 21.01 / 1 = 21.0060
>>>>
>>>> Best,
>>>> Rebecca
>>>>
>>>> On Sun, Mar 13, 2011 at 10:40 AM, Rebecca Pope <[email protected]> wrote:
>>>>> On Sun, Mar 13, 2011 at 4:33 AM, Nick Cox <[email protected]> wrote:
>>>>>
>>>>>> 3. My original code could be speeded up a bit by not using a variable
>>>>>> X but my guess would be that Robert's is still definitely faster.
>>>>>>
>>>>> I should have specified that I altered your code to use the macro you
>>>>> posted later for the time listed in my previous post. That one change
>>>>> makes a substantial difference in the speed--just less than half the
>>>>> time it takes to run with the variable. Even better, it means that I
>>>>> don't need to drop the other variables in my dataset to complete the
>>>>> search over all 6 million observations. If you count time to merge the
>>>>> findings back in the difference is even greater.
>>>>>
>>>>> On Sun, Mar 13, 2011 at 7:25 AM, Robert Picard <[email protected]> wrote:
>>>>>> Turns out that finding the longest span can be done faster without
>>>>>> string manipulations. Here's a new version:
>>>>>>
>>>>>> * -------------------------- begin example ----------------
>>>>>>
>>>>>> clear all
>>>>>> input patid str12 estring
>>>>>> 1 XXXXX-------
>>>>>> 2 --XXX---XXXX
>>>>>> 3 -XXXXXX-----
>>>>>> 4 -XXX-XXX-XXX
>>>>>> 5 XXXX-XX-XXXX
>>>>>> 6 X-XX-XX-XXXX
>>>>>> 7 X-XXXXX-XXXX
>>>>>> 8 X-XXX---XXX-
>>>>>> 9 XXXXXXXXXXXX
>>>>>> 10 ------------
>>>>>> end
>>>>>>
>>>>>> local len = 12
>>>>>>
>>>>>> * Find the longest period of continuous eligibility.
>>>>>> gen maxspan = 0
>>>>>> gen currspan = 0
>>>>>> gen isX = 0
>>>>>> qui forvalue i = 1/`len' {
>>>>>>     replace isX = substr(estring,`i',1) == "X"
>>>>>>     replace currspan = currspan + 1 if isX
>>>>>>     replace maxspan = currspan if !isX & ///
>>>>>>         currspan > maxspan
>>>>>>     replace currspan = 0 if !isX
>>>>>> }
>>>>>> replace maxspan = currspan if currspan > maxspan
>>>>>> drop currspan isX
>>>>>>
>>>>>> * Identify the start of each span
>>>>>> gen spanX = substr("`: di _dup(`len') "X"'",1,maxspan)
>>>>>> gen blanks = subinstr(spanX,"X"," ",.)
>>>>>> gen es = estring
>>>>>> local i 0
>>>>>> local more 1
>>>>>> qui while `more' {
>>>>>>     local i = `i' + 1
>>>>>>     gen where`i' = strpos(es,spanX)
>>>>>>     replace where`i' = . if where`i' == 0
>>>>>>     replace es = subinstr(es,spanX,blanks,1)
>>>>>>     count if where`i' != .
>>>>>>     local more = r(N)
>>>>>> }
>>>>>> drop where`i'
>>>>>> replace where1 = . if maxspan == 0
>>>>>> egen nmaxspan = rownonmiss(where*)
>>>>>> drop es blanks spanX
>>>>>>
>>>>>> * -------------------------- end example ------------------
>>>>>
>>>>> Yup. It reduces total run time by about 3.5 seconds in the 10% sample.
>>>>>
>>>>> Splitting the code into two functions, (1) finding the longest span of
>>>>> continuous eligibility and (2) determining where those spans occur
>>>>> within the 15-year period covered by the data, I get the best
>>>>> performance by using Robert's method for (1) and Nick's method for
>>>>> (2). The whole process takes just less than 29 seconds.
>>>>>
>>>>> Thanks again very much to both of you. I'd still be muddling through
>>>>> with trial and error without you. I've also learned a lot by looking
>>>>> at your code. I really appreciate all the help.
>>>
>>> *
>>> * For searches and help try:
>>> * http://www.stata.com/help.cgi?search
>>> * http://www.stata.com/support/statalist/faq
>>> * http://www.ats.ucla.edu/stat/stata/
>>>
>>
>