
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: RE: Stata analog to Mata's -strdup()- or better approach?


From   Nick Cox <[email protected]>
To   [email protected]
Subject   Re: st: RE: Stata analog to Mata's -strdup()- or better approach?
Date   Sat, 12 Mar 2011 14:05:54 +0000

Interesting point. Clearly fast beats slow whenever nothing else is at issue.

In this case, the data are patients' eligibility for health insurance
benefits over a period of 15 years. I've never worked with such data
but it does seem possible that at least some people among 6 million
might have been eligible for the entire time.
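
If anyone was eligible throughout, the longest run equals the full string length, so the upper limit of the -forval- loop cannot safely be set any lower. A minimal sketch, assuming one character per month as in the example data: take the limit from the data and leave the loop as soon as no observation contains a run that long. The variable len and the local macro passes are illustrative names, not part of the code posted in this thread.

```stata
clear
input patid str12 estring
1          XXXXX-------
2          --XXX---XXXX
3          -XXXXXX-----
4          -XXX-XXX-XXX
5          XXXX-XX-XXXX
end

gen X = ""
gen l_longest = 0

* upper limit taken from the data rather than hard-coded as 12
gen len = length(estring)
su len, meanonly
local maxlen = r(max)

local passes 0
qui forval i = 1/`maxlen' {
    replace X = X + "X"
    * stop once no observation contains a run of `i' X's
    count if strpos(estring, X)
    if r(N) == 0 continue, break
    replace l_longest = `i' if strpos(estring, X)
    local ++passes
}
drop X len
di "passes: `passes'"
```

With the five example strings the loop stops after 6 passes, the length of the longest run (patid 3). If even one of the 6 million patients was eligible for all 180 months, the early exit saves nothing.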

Nick

On Sat, Mar 12, 2011 at 12:43 PM, Robert Picard <[email protected]> wrote:
> Note that with 6 million observations and 12*15 months, this is going
> to take quite a long time to compute. Nick's approach would require
> looping 180 times to identify the longest span. My suggested approach
> loops as many times as there are spans, which should be significantly
> less than 180.
>
> Cheers, Robert
>
> On Sat, Mar 12, 2011 at 1:02 PM, Nick Cox <[email protected]> wrote:
>> You're correct. The code below fixes the error I found on closer examination.
>>
>> The incorrect line used a replacement mask that was n_longest long; it
>> should have been l_longest long.
>>
>> Thanks for checking.
>>
>> clear
>> input patid str12 estring
>> 1          XXXXX-------
>> 2          --XXX---XXXX
>> 3          -XXXXXX-----
>> 4          -XXX-XXX-XXX
>> 5          XXXX-XX-XXXX
>> end
>>
>> gen X = ""
>> gen l_longest = 0
>> gen s_longest = ""
>> gen where1 = 0
>>
>> qui forval i = 1/12 {
>>       replace X = X + "X"
>>       replace s_longest = X if strpos(estring, X)
>>       replace l_longest = `i' if strpos(estring, X)
>>       replace where1 = strpos(estring, X) if strpos(estring, X)
>> }
>>
>> drop X
>> gen n_longest = ///
>> (length(estring) - length(subinstr(estring, s_longest, "", .))) / ///
>> length(s_longest)
>>
>> clonevar copy = estring
>> local mask : di _dup(12) "&"
>> local rtext subinstr(copy, s_longest, substr("`mask'", 1, l_longest), 1)
>> replace copy = `rtext'
>>
>> su n_longest, meanonly
>> forval j = 2/`r(max)' {
>>        gen where`j' = strpos(copy, s_longest)
>>        replace copy = `rtext'
>> }
>>
>>
>>
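
The effect of the correction can be checked on the single string that exposed the problem. This is only an illustrative sketch using local macros (bad and good are made-up names): for "XXXX-XX-XXXX" the longest run is "XXXX", so l_longest is 4 and n_longest is 2.

```stata
local s "XXXX-XX-XXXX"
local mask : di _dup(12) "&"

* mask cut to n_longest (2): the copy shrinks, so later positions shift
local bad = subinstr("`s'", "XXXX", substr("`mask'", 1, 2), 1)

* mask cut to l_longest (4): the copy keeps its length
local good = subinstr("`s'", "XXXX", substr("`mask'", 1, 4), 1)

di "`bad'"  _col(20) strpos("`bad'", "XXXX")
di "`good'" _col(20) strpos("`good'", "XXXX")
```

The first copy becomes "&&-XX-XXXX" and reports the second run at position 7; the second becomes "&&&&-XX-XXXX" and reports it at position 9, which is where it sits in the original string.
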
>> On Sat, Mar 12, 2011 at 11:47 AM, Robert Picard <[email protected]> wrote:
>>> Nick, I think that there's a problem with your code: it does not work
>>> with a string like:
>>>
>>>  "XXXX-XX-XXXX"
>>>
>>> Here's how I would do it:
>>>
>>> * -------------------------- begin example ----------------
>>>
>>> clear
>>> input patid str12 estring
>>> 1          XXXXX-------
>>> 2          --XXX---XXXX
>>> 3          -XXXXXX-----
>>> 4          -XXX-XXX-XXX
>>> 5          XXXX-XX-XXXX
>>> 6          X-XX-XX-XXXX
>>> 7          X-XXXXX-XXXX
>>> end
>>>
>>> * Find the longest period of continuous eligibility
>>> clonevar es = estring
>>> gen maxspan = ""
>>> local more 1
>>> while `more' {
>>>        gen s = regexs(1) if regexm(es,"(X+)")
>>>        replace maxspan = s if length(s) > length(maxspan)
>>>        replace es = subinstr(es,s,"",1)
>>>        count if s != ""
>>>        local more = r(N)
>>>        drop s
>>> }
>>>
>>>
>>> * Identify the start of each span
>>> gen smask = subinstr(maxspan,"X","_",.)
>>> replace es = estring
>>> local i 0
>>> local more 1
>>> while `more' {
>>>        local i = `i' + 1
>>>        gen where`i' = strpos(es,maxspan)
>>>        replace where`i' = . if where`i' == 0
>>>        replace es = subinstr(es,maxspan,smask,1)
>>>        count if where`i' != .
>>>        local more = r(N)
>>> }
>>> drop where`i'
>>> egen nmaxspan = rownonmiss(where*)
>>> drop es smask
>>>
>>> * -------------------------- end example ------------------
>>>
>>>
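
To see how many passes the first -while- loop in the example above makes, a counter can be added. This is a sketch only: the local macro passes is an illustrative addition, the last patid is assumed to be 7, and the loop body is otherwise the one quoted above.

```stata
clear
input patid str12 estring
1          XXXXX-------
2          --XXX---XXXX
3          -XXXXXX-----
4          -XXX-XXX-XXX
5          XXXX-XX-XXXX
6          X-XX-XX-XXXX
7          X-XXXXX-XXXX
end

clonevar es = estring
gen maxspan = ""
local passes 0
local more 1
while `more' {
    quietly {
        * leftmost run of X's still present in the working copy
        gen s = regexs(1) if regexm(es, "(X+)")
        replace maxspan = s if length(s) > length(maxspan)
        * remove that run so the next pass finds the following one
        replace es = subinstr(es, s, "", 1)
        count if s != ""
        local more = r(N)
        drop s
    }
    local ++passes
}
drop es
di "passes: `passes'"
```

With these seven strings the body runs 5 times: 4 passes for the string with the most spans (patid 6), plus a final pass that finds nothing and ends the loop.
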
>>>
>>> On Sat, Mar 12, 2011 at 11:23 AM, Nick Cox <[email protected]> wrote:
>>>> First, let me give a more complete example of how I would approach
>>>> your problem.
>>>>
>>>> 1. Your example data.
>>>>
>>>> clear
>>>> input patid str12 estring
>>>> 1          XXXXX-------
>>>> 2          --XXX---XXXX
>>>> 3          -XXXXXX-----
>>>> 4          -XXX-XXX-XXX
>>>> end
>>>>
>>>> 2. Sample script starts with initialisations. Clearly, 12 is specific
>>>> to the example.
>>>>
>>>> gen X = ""
>>>> gen l_longest = 0
>>>> gen s_longest = ""
>>>> gen where1 = 0
>>>>
>>>> 3. The main loop just tries out longer multiples of "X" until it finds
>>>> the longest.
>>>>
>>>> qui forval i = 1/12 {
>>>>       replace X = X + "X"
>>>>       replace s_longest = X if strpos(estring, X)
>>>>       replace l_longest = `i' if strpos(estring, X)
>>>>       replace where1 = strpos(estring, X) if strpos(estring, X)
>>>> }
>>>>
>>>> drop X
>>>>
>>>> 4. The number of times the longest substring occurs is calculated from
>>>> a comparison of length before and after (notionally) blanking it out.
>>>> There is more on this trick at Mitch Abdon's blog
>>>>
>>>> <http://statadaily.wordpress.com/2011/01/20/counting-occurrence-of-strings-within-strings/>
>>>>
>>>> and in my Speaking Stata column in SJ 11(1) 2011.
>>>>
>>>> gen n_longest = ///
>>>> (length(estring) - length(subinstr(estring, s_longest, "", .))) / ///
>>>> length(s_longest)
>>>>
>>>> 5. Now to find the separate occurrences of the longest substring we
>>>> look for each one in a copy, and every time we find one we
>>>> replace it with a mask of the same length. "&" is arbitrary here.
>>>>
>>>> clonevar copy = estring
>>>> local mask : di _dup(12) "&"
>>>> local rtext subinstr(copy, s_longest, substr("`mask'", 1, n_longest), 1)
>>>> replace copy = `rtext'
>>>>
>>>> su n_longest, meanonly
>>>> forval j = 2/`r(max)' {
>>>>        gen where`j' = strpos(copy, s_longest)
>>>>        replace copy = `rtext'
>>>> }
>>>>
>>>> Part of the Stata magic is that what the longest substring is, how
>>>> many times it occurs, and its length can easily vary from observation
>>>> to observation.
>>>>
>>>> Here is all the code as one segment
>>>>
>>>> clear
>>>> input patid str12 estring
>>>> 1          XXXXX-------
>>>> 2          --XXX---XXXX
>>>> 3          -XXXXXX-----
>>>> 4          -XXX-XXX-XXX
>>>> end
>>>>
>>>> gen X = ""
>>>> gen l_longest = 0
>>>> gen s_longest = ""
>>>> gen where1 = 0
>>>>
>>>> qui forval i = 1/12 {
>>>>       replace X = X + "X"
>>>>       replace s_longest = X if strpos(estring, X)
>>>>       replace l_longest = `i' if strpos(estring, X)
>>>>       replace where1 = strpos(estring, X) if strpos(estring, X)
>>>> }
>>>>
>>>> drop X
>>>> gen n_longest = ///
>>>> (length(estring) - length(subinstr(estring, s_longest, "", .))) / ///
>>>> length(s_longest)
>>>>
>>>> clonevar copy = estring
>>>> local mask : di _dup(12) "&"
>>>> local rtext subinstr(copy, s_longest, substr("`mask'", 1, n_longest), 1)
>>>> replace copy = `rtext'
>>>>
>>>> su n_longest, meanonly
>>>> forval j = 2/`r(max)' {
>>>>        gen where`j' = strpos(copy, s_longest)
>>>>        replace copy = `rtext'
>>>> }
>>>>
>>>> Now: commenting on -split-. The approach above seems closer to what
>>>> you want than using -split-.
>>>>
>>>> -split- treats multiple spaces as one, but otherwise does not treat
>>>> multiple occurrences of other delimiters as equivalent to one
>>>> occurrence. That is why I wrote
>>>>
>>>> replace fstring = subinstr(fstring, "-", " ", .)
>>>>
>>>> You will find that
>>>>
>>>> split estring, parse(-)
>>>>
>>>> creates rather too many variables to be useful.
>>>>
>>>> Nick
>>>>
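
The length-difference trick in step 4 of the message above can be worked through on one string. A minimal sketch with local macros (s, sub and n are illustrative names): remove every copy of the substring, see how much shorter the string became, and divide by the substring's length.

```stata
local s   "-XXX-XXX-XXX"
local sub "XXX"

* 12 characters before, 3 after removing every "XXX", so (12 - 3) / 3
local n = (length("`s'") - length(subinstr("`s'", "`sub'", "", .))) / length("`sub'")
di `n'
```

This displays 3. The count is of non-overlapping occurrences, which is adequate here because s_longest is the longest run in the string and so cannot lie inside a longer run of X's.
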
>>>> On Sat, Mar 12, 2011 at 2:51 AM, Rebecca Pope <[email protected]> wrote:
>>>>> Nick,
>>>>> I had to read what you wrote a couple of times before the "Duh" kicked
>>>>> in. In one of my many attempts, I did (nearly) exactly what you wrote
>>>>> below. The real difference, which I didn't catch at first, is that you
>>>>> don't condense the spaces into a single space like I did. -split- will
>>>>> create a new variable for each " ", thereby preserving where the
>>>>> string started. For subsequent instances of variables including Xs,
>>>>> the index on the variable generated by -split- will be off, but I
>>>>> could just add the length of the preceding variables. Brilliant! (you,
>>>>> not me)
>>>>>
>>>>> In the interest of full disclosure, I'm rather ashamed to admit that I
>>>>> initially used -split- exactly as you do and cursed at it for not
>>>>> recognizing multiple delimiters as one, went back and condensed the
>>>>> multiple spaces to a single space, and then -split- the variable
>>>>> again. In fact, my initial reaction to your e-mail was "Did that;
>>>>> doesn't work." I suppose "obtuse" does apply. Sorry for the trouble.
>>>>>
>>>>> Unless I'm missing something else, I could just use -split estring,
>>>>> parse(-)-, correct?
>>>>>
>>>>> Thanks again for all the help,
>>>>> Rebecca
>>>>>
>>>>> On Fri, Mar 11, 2011 at 6:32 PM, Nick Cox <[email protected]> wrote:
>>>>>>
>>>>>> Have you thought of something like
>>>>>>
>>>>>> clonevar fstring = estring
>>>>>> replace fstring = subinstr(fstring, "-", " ", .)
>>>>>> split fstring
>>>>>>
>>>>> <truncated>
>>
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/statalist/faq
>> *   http://www.ats.ucla.edu/stat/stata/
>>


© Copyright 1996–2018 StataCorp LLC