



Re: st: RE: Stata analog to Mata's -strdup()- or better approach?


From   Nick Cox <[email protected]>
To   [email protected]
Subject   Re: st: RE: Stata analog to Mata's -strdup()- or better approach?
Date   Sat, 12 Mar 2011 10:44:13 +0000

Using a variable X isn't needed here. You could also just go

local X "`X'X"

each time around the first loop and look for "`X'". That might help
if you are tight on memory.
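
A minimal sketch of that variation, assuming the example data and the other initialisations from the script quoted below. Only the handling of X changes; treat it as an illustration rather than a tested replacement for the quoted code.

```stata
* Example data, as in the quoted message
clear
input patid str12 estring
1          XXXXX-------
2          --XXX---XXXX
3          -XXXXXX-----
4          -XXX-XXX-XXX
end

* Same initialisations as the quoted script, but no variable X
gen l_longest = 0
gen s_longest = ""
gen where1 = 0

* Start from an empty local macro and append one "X" per pass
local X
qui forval i = 1/12 {
    local X "`X'X"
    replace s_longest = "`X'" if strpos(estring, "`X'")
    replace l_longest = `i' if strpos(estring, "`X'")
    replace where1 = strpos(estring, "`X'") if strpos(estring, "`X'")
}

list patid estring s_longest l_longest where1
```

On the example data this should leave l_longest as 5, 4, 6 and 3 and where1 as 1, 9, 2 and 2, the same as with the variable-based loop, and there is no X to drop afterwards.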


On Sat, Mar 12, 2011 at 10:23 AM, Nick Cox <[email protected]> wrote:
> First, let me give a more complete example of how I would approach
> your problem.
>
> 1. Your example data.
>
> clear
> input patid str12 estring
> 1          XXXXX-------
> 2          --XXX---XXXX
> 3          -XXXXXX-----
> 4          -XXX-XXX-XXX
> end
>
> 2. Sample script starts with initialisations. Clearly, 12 is specific
> to the example.
>
> gen X = ""
> gen l_longest = 0
> gen s_longest = ""
> gen where1 = 0
>
> 3. The main loop just tries out longer multiples of "X" until it finds
> the longest.
>
> qui forval i = 1/12 {
>         replace X = X + "X"
>         replace s_longest = X if strpos(estring, X)
>         replace l_longest = `i' if strpos(estring, X)
>         replace where1 = strpos(estring, X) if strpos(estring, X)
> }
>
> drop X
>
> 4. The number of times the longest substring occurs is calculated from
> a comparison of length before and after (notionally) blanking it out.
> There is more on this trick at Mitch Abdon's blog
>
> <http://statadaily.wordpress.com/2011/01/20/counting-occurrence-of-strings-within-strings/>
>
> and in my Speaking Stata column in SJ 11(1) 2011.
>
> gen n_longest = ///
> (length(estring) - length(subinstr(estring, s_longest, "", .))) / ///
> length(s_longest)
>
> 5. Now to find the separate occurrences of the longest substring we
> look for each one in a copy, and every time we find one we replace
> it with a mask of the same length. "&" is arbitrary here.
>
> clonevar copy = estring
> local mask : di _dup(12) "&"
> local rtext subinstr(copy, s_longest, substr("`mask'", 1, l_longest), 1)
> replace copy = `rtext'
>
> su n_longest, meanonly
> forval j = 2/`r(max)' {
>        gen where`j' = strpos(copy, s_longest)
>        replace copy = `rtext'
> }
>
> Part of the Stata magic is that what the longest substring is, how
> many times it occurs, and its length can easily vary from observation
> to observation.
>
> Here is all the code as one segment
>
> clear
> input patid str12 estring
> 1          XXXXX-------
> 2          --XXX---XXXX
> 3          -XXXXXX-----
> 4          -XXX-XXX-XXX
> end
>
> gen X = ""
> gen l_longest = 0
> gen s_longest = ""
> gen where1 = 0
>
> qui forval i = 1/12 {
>         replace X = X + "X"
>         replace s_longest = X if strpos(estring, X)
>         replace l_longest = `i' if strpos(estring, X)
>         replace where1 = strpos(estring, X) if strpos(estring, X)
> }
>
> drop X
> gen n_longest = ///
> (length(estring) - length(subinstr(estring, s_longest, "", .))) / ///
> length(s_longest)
>
> clonevar copy = estring
> local mask : di _dup(12) "&"
> local rtext subinstr(copy, s_longest, substr("`mask'", 1, l_longest), 1)
> replace copy = `rtext'
>
> su n_longest, meanonly
> forval j = 2/`r(max)' {
>        gen where`j' = strpos(copy, s_longest)
>        replace copy = `rtext'
> }
>
> Now: commenting on -split-. The approach above seems closer to what
> you want than using -split-.
>
> -split- treats multiple spaces as one, but otherwise does not treat
> multiple occurrences of other delimiters as equivalent to one
> occurrence. That is why I wrote
>
> replace fstring = subinstr(fstring, "-", " ", .)
>
> You will find that
>
> split estring, parse(-)
>
> creates rather too many variables to be useful.
>
> Nick
>
> On Sat, Mar 12, 2011 at 2:51 AM, Rebecca Pope <[email protected]> wrote:
>> Nick,
>> I had to read what you wrote a couple of times before the "Duh" kicked
>> in. In one of my many attempts, I did (nearly) exactly what you wrote
>> below. The real difference, which I didn't catch at first, is that you
>> don't condense the spaces into a single space like I did. -split- will
>> create a new variable for each " ", thereby preserving where the
>> string started. For subsequent instances of variables including Xs,
>> the index on the variable generated by -split- will be off, but I
>> could just add the length of the preceding variables. Brilliant! (you,
>> not me)
>>
>> In the interest of full disclosure, I'm rather ashamed to admit that I
>> initially used -split- exactly as you do and cursed at it for not
>> recognizing multiple delimiters as one, went back and condensed the
>> multiple spaces to a single space, and then -split- the variable
>> again. In fact, my initial reaction to your e-mail was "Did that;
>> doesn't work." I suppose "obtuse" does apply. Sorry for the trouble.
>>
>> Unless I'm missing something else, I could just use a - split estring,
>> parse(-) -, correct?
>>
>> Thanks again for all the help,
>> Rebecca
>>
>> On Fri, Mar 11, 2011 at 6:32 PM, Nick Cox <[email protected]> wrote:
>>>
>>> Have you thought of something like
>>>
>>> clonevar fstring = estring
>>> replace fstring = subinstr(fstring, "-", " ", .)
>>> split fstring
>>>
>> <truncated>
>
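
For anyone who wants to check by hand the length-comparison count in step 4 of the quoted message, here is a small sketch on one of the example strings. The local macro names s and sub are only for illustration and are not part of the quoted script.

```stata
* One example string and its longest run of X
local s "-XXX-XXX-XXX"
local sub "XXX"

* Length before blanking: 12
display length("`s'")

* Removing every "XXX" leaves "---", so the length afterwards is 3
display length(subinstr("`s'", "`sub'", "", .))

* Occurrences = characters removed / length of the substring
display (length("`s'") - length(subinstr("`s'", "`sub'", "", .))) / length("`sub'")
```

The last line displays 3. subinstr() removes non-overlapping matches, which is all that is needed here: two occurrences of the longest run cannot overlap or touch, or together they would form a longer run.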
