Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: RE: Stata analog to Mata's -strdup()- or better approach?

From	Robert Picard <[email protected]>
To	[email protected]
Subject	Re: st: RE: Stata analog to Mata's -strdup()- or better approach?
Date	Sat, 12 Mar 2011 13:43:12 +0100

Note that with 6 million observations and 12*15 = 180 months, this is
going to take quite a long time to compute. Nick's approach requires
looping 180 times to identify the longest span. My suggested approach
loops only as many times as there are spans, which should be
significantly fewer than 180.
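
To get a feel for how many passes a span-by-span loop would need, the number of spans per observation can be counted up front. The following is only a sketch on the small example data from this thread; the variable names `t` and `n_spans` are made up for illustration. It uses the same before-and-after length comparison discussed further down: every span starts at a "-X" boundary once a "-" is prepended to the string.

```stata
clear
input patid str12 estring
1          XXXXX-------
2          --XXX---XXXX
3          -XXXXXX-----
4          -XXX-XXX-XXX
5          XXXX-XX-XXXX
end

* Prepend "-" so that a span at position 1 also begins with "-X"
gen t = "-" + estring

* Each "-X" marks the start of one span; "-X" has length 2
gen n_spans = (length(t) - length(subinstr(t, "-X", "", .))) / 2
drop t

* The loop needs one pass per span, plus a final pass that finds nothing
su n_spans, meanonly
di "passes needed: " r(max) + 1
```

On real data the largest value of `n_spans` gives a rough idea of how the loop count compares with the 180 passes of the fixed-length approach.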

Cheers, Robert

On Sat, Mar 12, 2011 at 1:02 PM, Nick Cox <[email protected]> wrote:
> You're correct. The code below fixes the error I found on closer examination.
>
> The incorrect line used a replacement mask that was n_longest
> characters long; it should have been l_longest characters long.
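
A small sketch of why the mask length matters, using the problem string "XXXX-XX-XXXX". The values of s_longest, l_longest and n_longest are typed in by hand here rather than computed, and the `masked_*` and `where2_*` names are made up for illustration. A mask shorter than the substring it replaces shifts everything after it to the left, so the second position comes out wrong.

```stata
clear
input str12 estring
XXXX-XX-XXXX
end

local mask : di _dup(12) "&"

* Hand-entered values for this one string: the longest run is "XXXX",
* it is 4 characters long, and it occurs twice
gen s_longest = "XXXX"
gen l_longest = 4
gen n_longest = 2

* Mask cut to n_longest (2) characters: the string shrinks by 2
gen masked_n = subinstr(estring, s_longest, substr("`mask'", 1, n_longest), 1)

* Mask cut to l_longest (4) characters: the string keeps its length
gen masked_l = subinstr(estring, s_longest, substr("`mask'", 1, l_longest), 1)

* Position of the second occurrence under each mask
gen where2_n = strpos(masked_n, s_longest)
gen where2_l = strpos(masked_l, s_longest)

list masked_n where2_n masked_l where2_l
```

With the shorter mask the second run is reported at position 7; with the full-length mask it is reported at position 9, which is where it sits in the original string.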
>
> Thanks for checking.
>
> clear
> input patid str12 estring
> 1          XXXXX-------
> 2          --XXX---XXXX
> 3          -XXXXXX-----
> 4          -XXX-XXX-XXX
> 5          XXXX-XX-XXXX
> end
>
> gen X = ""
> gen l_longest = 0
> gen s_longest = ""
> gen where1 = 0
>
> qui forval i = 1/12 {
>       replace X = X + "X"
>       replace s_longest = X if strpos(estring, X)
>       replace l_longest = `i' if strpos(estring, X)
>       replace where1 = strpos(estring, X) if strpos(estring, X)
> }
>
> drop X
> gen n_longest = ///
> (length(estring) - length(subinstr(estring, s_longest, "", .))) / ///
> length(s_longest)
>
> clonevar copy = estring
> local mask : di _dup(12) "&"
> local rtext subinstr(copy, s_longest, substr("`mask'", 1, l_longest), 1)
> replace copy = `rtext'
>
> su n_longest, meanonly
> forval j = 2/`r(max)' {
>        gen where`j' = strpos(copy, s_longest)
>        replace copy = `rtext'
> }
>
>
>
> On Sat, Mar 12, 2011 at 11:47 AM, Robert Picard <[email protected]> wrote:
>> Nick, I think that there's a problem with your code, it does not work
>> with a string like:
>>
>>  "XXXX-XX-XXXX"
>>
>> Here's how I would do it:
>>
>> * -------------------------- begin example ----------------
>>
>> clear
>> input patid str12 estring
>> 1          XXXXX-------
>> 2          --XXX---XXXX
>> 3          -XXXXXX-----
>> 4          -XXX-XXX-XXX
>> 5          XXXX-XX-XXXX
>> 6          X-XX-XX-XXXX
>> 7          X-XXXXX-XXXX
>> end
>>
>> * Find the longest period of continuous eligibility
>> clonevar es = estring
>> gen maxspan = ""
>> local more 1
>> while `more' {
>>        gen s = regexs(1) if regexm(es,"(X+)")
>>        replace maxspan = s if length(s) > length(maxspan)
>>        replace es = subinstr(es,s,"",1)
>>        count if s != ""
>>        local more = r(N)
>>        drop s
>> }
>>
>>
>> * Identify the start of each span
>> gen smask = subinstr(maxspan,"X","_",.)
>> replace es = estring
>> local i 0
>> local more 1
>> while `more' {
>>        local i = `i' + 1
>>        gen where`i' = strpos(es,maxspan)
>>        replace where`i' = . if where`i' == 0
>>        replace es = subinstr(es,maxspan,smask,1)
>>        count if where`i' != .
>>        local more = r(N)
>> }
>> drop where`i'
>> egen nmaxspan = rownonmiss(where*)
>> drop es smask
>>
>> * -------------------------- end example ------------------
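
One way to sanity-check the first loop of the example above is to append two assertions: the reported span must occur in the original string, and no run one "X" longer may occur anywhere. This is a sketch on three of the example rows; the assertions are an addition for illustration and are not part of the posted code.

```stata
clear
input patid str12 estring
4          -XXX-XXX-XXX
5          XXXX-XX-XXXX
6          X-XX-XX-XXXX
end

* Same loop as in the example: peel off one span per pass
clonevar es = estring
gen maxspan = ""
local more 1
while `more' {
    gen s = regexs(1) if regexm(es, "(X+)")
    replace maxspan = s if length(s) > length(maxspan)
    replace es = subinstr(es, s, "", 1)
    count if s != ""
    local more = r(N)
    drop s
}
drop es

* The longest span must be present in the original string ...
assert strpos(estring, maxspan) > 0 if maxspan != ""
* ... and nothing one character longer may be present
assert strpos(estring, maxspan + "X") == 0

list patid estring maxspan
```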
>>
>>
>>
>> On Sat, Mar 12, 2011 at 11:23 AM, Nick Cox <[email protected]> wrote:
>>> First, let me give a more complete example of how I would approach
>>> your problem.
>>>
>>> 1. Your example data.
>>>
>>> clear
>>> input patid str12 estring
>>> 1          XXXXX-------
>>> 2          --XXX---XXXX
>>> 3          -XXXXXX-----
>>> 4          -XXX-XXX-XXX
>>> end
>>>
>>> 2. Sample script starts with initialisations. Clearly, 12 is specific
>>> to the example.
>>>
>>> gen X = ""
>>> gen l_longest = 0
>>> gen s_longest = ""
>>> gen where1 = 0
>>>
>>> 3. The main loop just tries out longer multiples of "X" until it finds
>>> the longest.
>>>
>>> qui forval i = 1/12 {
>>>       replace X = X + "X"
>>>       replace s_longest = X if strpos(estring, X)
>>>       replace l_longest = `i' if strpos(estring, X)
>>>       replace where1 = strpos(estring, X) if strpos(estring, X)
>>> }
>>>
>>> drop X
>>>
>>> 4. The number of times the longest substring occurs is calculated from
>>> a comparison of length before and after (notionally) blanking it out.
>>> There is more on this trick at Mitch Abdon's blog
>>>
>>> <http://statadaily.wordpress.com/2011/01/20/counting-occurrence-of-strings-within-strings/>
>>>
>>> and in my Speaking Stata column in SJ 11(1) 2011.
>>>
>>> gen n_longest = ///
>>> (length(estring) - length(subinstr(estring, s_longest, "", .))) / ///
>>> length(s_longest)
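
The arithmetic behind the counting trick can be worked through on a single string. This is a sketch only; the local macro names `s` and `sub` and the scalar `n_occ` are made up for illustration. Removing every copy of "XXX" from "-XXX-XXX-XXX" shortens it from 12 characters to 3, and dividing the 9 characters lost by the length of "XXX" gives the number of occurrences.

```stata
local s   "-XXX-XXX-XXX"
local sub "XXX"

* length before minus length after removal, divided by substring length
scalar n_occ = (length("`s'") - length(subinstr("`s'", "`sub'", "", .))) ///
    / length("`sub'")

di "length before: " length("`s'")
di "length after:  " length(subinstr("`s'", "`sub'", "", .))
di "occurrences:   " n_occ
```

The count is of non-overlapping occurrences, which is what is wanted here because the substring being counted is the longest run in the string.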
>>>
>>> 5. Now to find the separate occurrences of the longest substring we
>>> look for each one in a copy, and every time we find one we
>>> replace it with a mask of the same length. "&" is arbitrary here.
>>>
>>> clonevar copy = estring
>>> local mask : di _dup(12) "&"
>>> local rtext subinstr(copy, s_longest, substr("`mask'", 1, n_longest), 1)
>>> replace copy = `rtext'
>>>
>>> su n_longest, meanonly
>>> forval j = 2/`r(max)' {
>>>        gen where`j' = strpos(copy, s_longest)
>>>        replace copy = `rtext'
>>> }
>>>
>>> Part of the Stata magic is that what the longest substring is, how
>>> many times it occurs, and its length can easily vary from observation
>>> to observation.
>>>
>>> Here is all the code as one segment
>>>
>>> clear
>>> input patid str12 estring
>>> 1          XXXXX-------
>>> 2          --XXX---XXXX
>>> 3          -XXXXXX-----
>>> 4          -XXX-XXX-XXX
>>> end
>>>
>>> gen X = ""
>>> gen l_longest = 0
>>> gen s_longest = ""
>>> gen where1 = 0
>>>
>>> qui forval i = 1/12 {
>>>       replace X = X + "X"
>>>       replace s_longest = X if strpos(estring, X)
>>>       replace l_longest = `i' if strpos(estring, X)
>>>       replace where1 = strpos(estring, X) if strpos(estring, X)
>>> }
>>>
>>> drop X
>>> gen n_longest = ///
>>> (length(estring) - length(subinstr(estring, s_longest, "", .))) / ///
>>> length(s_longest)
>>>
>>> clonevar copy = estring
>>> local mask : di _dup(12) "&"
>>> local rtext subinstr(copy, s_longest, substr("`mask'", 1, n_longest), 1)
>>> replace copy = `rtext'
>>>
>>> su n_longest, meanonly
>>> forval j = 2/`r(max)' {
>>>        gen where`j' = strpos(copy, s_longest)
>>>        replace copy = `rtext'
>>> }
>>>
>>> Now: commenting on -split-. The approach above seems closer to what
>>> you want than using -split-.
>>>
>>> -split- treats multiple spaces as one, but otherwise does not treat
>>> multiple occurrences of other delimiters as equivalent to one
>>> occurrence. That is why I wrote
>>>
>>> replace fstring = subinstr(fstring, "-", " ", .)
>>>
>>> You will find that
>>>
>>> split estring, parse(-)
>>>
>>> creates rather too many variables to be useful.
>>>
>>> Nick
>>>
>>> On Sat, Mar 12, 2011 at 2:51 AM, Rebecca Pope <[email protected]> wrote:
>>>> Nick,
>>>> I had to read what you wrote a couple of times before the "Duh" kicked
>>>> in. In one of my many attempts, I did (nearly) exactly what you wrote
>>>> below. The real difference, which I didn't catch at first, is that you
>>>> don't condense the spaces into a single space like I did. -split- will
>>>> create a new variable for each " ", thereby preserving where the
>>>> string started. For subsequent instances of variables including Xs,
>>>> the index on the variable generated by -split- will be off, but I
>>>> could just add the length of the preceding variables. Brilliant! (you,
>>>> not me)
>>>>
>>>> In the interest of full disclosure, I'm rather ashamed to admit that I
>>>> initially used -split- exactly as you do and cursed at it for not
>>>> recognizing multiple delimiters as one, went back and condensed the
>>>> multiple spaces to a single space, and then -split- the variable
>>>> again. In fact, my initial reaction to your e-mail was "Did that;
>>>> doesn't work." I suppose "obtuse" does apply. Sorry for the trouble.
>>>>
>>>> Unless I'm missing something else, I could just use a - split estring,
>>>> parse(-) -, correct?
>>>>
>>>> Thanks again for all the help,
>>>> Rebecca
>>>>
>>>> On Fri, Mar 11, 2011 at 6:32 PM, Nick Cox <[email protected]> wrote:
>>>>>
>>>>> Have you thought of something like
>>>>>
>>>>> clonevar fstring = estring
>>>>> replace fstring = subinstr(fstring, "-", " ", .)
>>>>> split fstring
>>>>>
>>>> <truncated>
>>> *
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>
