Re: st: RE: Stata analog to Mata's -strdup()- or better approach?

From	Robert Picard
Subject	Re: st: RE: Stata analog to Mata's -strdup()- or better approach?
Date	Sat, 12 Mar 2011 12:47:26 +0100

Nick, I think there's a problem with your code. It does not work
with a string like:

 "XXXX-XX-XXXX"

Here's how I would do it:

* -------------------------- begin example ----------------

clear
input patid str12 estring
1          XXXXX-------
2          --XXX---XXXX
3          -XXXXXX-----
4          -XXX-XXX-XXX
5          XXXX-XX-XXXX
6          X-XX-XX-XXXX
7          X-XXXXX-XXXX
end

* Find the longest period of continuous eligibility
clonevar es = estring
gen maxspan = ""
local more 1
while `more' {
	gen s = regexs(1) if regexm(es,"(X+)")
	replace maxspan = s if length(s) > length(maxspan)
	replace es = subinstr(es,s,"",1)
	count if s != ""
	local more = r(N)
	drop s
}


* Identify the start of each occurrence of the longest span
gen smask = subinstr(maxspan,"X","_",.)
replace es = estring
local i 0
local more 1
while `more' {
	local i = `i' + 1
	gen where`i' = strpos(es,maxspan)
	replace where`i' = . if where`i' == 0
	replace es = subinstr(es,maxspan,smask,1)
	count if where`i' != .
	local more = r(N)
}
drop where`i'
egen nmaxspan = rownonmiss(where*)
drop es smask
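
* Optional cross-check (a sketch, not part of the method above): the
* length-difference trick from the quoted message should give the same
* count. n_check is a made-up name; rows with no X at all are skipped
* because length(maxspan) would be zero there.
gen n_check = (length(estring) - length(subinstr(estring, maxspan, "", .))) / length(maxspan)
assert n_check == nmaxspan if maxspan != ""
drop n_check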

* -------------------------- end example ------------------
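
For what it's worth, here is a minimal sketch of where the quoted code
seems to go wrong on "XXXX-XX-XXXX". It assumes the culprit is the mask
in step 5, which is cut to n_longest characters (the number of
occurrences) rather than l_longest (the length of the span). That
reading is an inference from the quoted code, not something stated
above. The names copy_n, copy_l, where2_n and where2_l are made up for
this illustration.

```stata
clear
input str12 estring
"XXXX-XX-XXXX"
end

* Longest run of X and its length, found as in the quoted loop
gen s_longest = ""
gen l_longest = 0
local X
quietly forval i = 1/12 {
    local X `X'X
    replace s_longest = "`X'" if strpos(estring, "`X'")
    replace l_longest = `i' if strpos(estring, "`X'")
}

* Number of occurrences, via the length-difference trick
gen n_longest = (length(estring) - length(subinstr(estring, s_longest, "", .))) / length(s_longest)

local mask : di _dup(12) "&"

* Mask cut to n_longest characters, as in the quoted code
gen copy_n = subinstr(estring, s_longest, substr("`mask'", 1, n_longest), 1)
gen where2_n = strpos(copy_n, s_longest)

* Mask cut to l_longest characters, so the string keeps its length
gen copy_l = subinstr(estring, s_longest, substr("`mask'", 1, l_longest), 1)
gen where2_l = strpos(copy_l, s_longest)

list copy_n where2_n copy_l where2_l
```

With the shorter mask the string shrinks from 12 to 10 characters, so
the second span is reported at position 7 instead of 9. The two masks
happen to agree when n_longest equals l_longest, which is the case for
-XXX-XXX-XXX in the original example data.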



On Sat, Mar 12, 2011 at 11:23 AM, Nick Cox wrote:
> First, let me give a more complete example of how I would approach
> your problem.
>
> 1. Your example data.
>
> clear
> input patid str12 estring
> 1          XXXXX-------
> 2          --XXX---XXXX
> 3          -XXXXXX-----
> 4          -XXX-XXX-XXX
> end
>
> 2. Sample script starts with initialisations. Clearly, 12 is specific
> to the example.
>
> gen X = ""
> gen l_longest = 0
> gen s_longest = ""
> gen where1 = 0
>
> 3. The main loop just tries out longer multiples of "X" until it finds
> the longest.
>
> qui forval i = 1/12 {
>       replace X = X + "X"
>       replace s_longest = X if strpos(estring, X)
>       replace l_longest = `i' if strpos(estring, X)
>       replace where1 = strpos(estring, X) if strpos(estring, X)
> }
>
> drop X
>
> 4. The number of times the longest substring occurs is calculated from
> a comparison of length before and after (notionally) blanking it out.
> There is more on this trick at Mitch Abdon's blog
>
> <http://statadaily.wordpress.com/2011/01/20/counting-occurrence-of-strings-within-strings/>
>
> and in my Speaking Stata column in SJ 11(1) 2011.
>
> gen n_longest = ///
> (length(estring) - length(subinstr(estring, s_longest, "", .))) / ///
> length(s_longest)
>
> 5. Now to find the separate occurrences of the longest substring we
> look for each one in a copy, and every time we find one we
> replace it with a mask of the same length. "&" is arbitrary here.
>
> clonevar copy = estring
> local mask : di _dup(12) "&"
> local rtext subinstr(copy, s_longest, substr("`mask'", 1, n_longest), 1)
> replace copy = `rtext'
>
> su n_longest, meanonly
> forval j = 2/`r(max)' {
>        gen where`j' = strpos(copy, s_longest)
>        replace copy = `rtext'
> }
>
> Part of the Stata magic is that what the longest substring is, how
> many times it occurs, and its length can easily vary from observation
> to observation.
>
> Here is all the code as one segment
>
> clear
> input patid str12 estring
> 1          XXXXX-------
> 2          --XXX---XXXX
> 3          -XXXXXX-----
> 4          -XXX-XXX-XXX
> end
>
> gen X = ""
> gen l_longest = 0
> gen s_longest = ""
> gen where1 = 0
>
> qui forval i = 1/12 {
>       replace X = X + "X"
>       replace s_longest = X if strpos(estring, X)
>       replace l_longest = `i' if strpos(estring, X)
>       replace where1 = strpos(estring, X) if strpos(estring, X)
> }
>
> drop X
> gen n_longest = ///
> (length(estring) - length(subinstr(estring, s_longest, "", .))) / ///
> length(s_longest)
>
> clonevar copy = estring
> local mask : di _dup(12) "&"
> local rtext subinstr(copy, s_longest, substr("`mask'", 1, n_longest), 1)
> replace copy = `rtext'
>
> su n_longest, meanonly
> forval j = 2/`r(max)' {
>        gen where`j' = strpos(copy, s_longest)
>        replace copy = `rtext'
> }
>
> Now: commenting on -split-. The approach above seems closer to what
> you want than using -split-.
>
> -split- treats multiple spaces as one, but otherwise does not treat
> multiple occurrences of other delimiters as equivalent to one
> occurrence. That is why I wrote
>
> replace fstring = subinstr(fstring, "-", " ", .)
>
> You will find that
>
> split estring, parse(-)
>
> creates rather too many variables to be useful.
>
> Nick
>
> On Sat, Mar 12, 2011 at 2:51 AM, Rebecca Pope wrote:
>> Nick,
>> I had to read what you wrote a couple of times before the "Duh" kicked
>> in. In one of my many attempts, I did (nearly) exactly what you wrote
>> below. The real difference, which I didn't catch at first, is that you
>> don't condense the spaces into a single space like I did. -split- will
>> create a new variable for each " ", thereby preserving where the
>> string started. For subsequent instances of variables including Xs,
>> the index on the variable generated by -split- will be off, but I
>> could just add the length of the preceding variables. Brilliant! (you,
>> not me)
>>
>> In the interest of full disclosure, I'm rather ashamed to admit that I
>> initially used -split- exactly as you do and cursed at it for not
>> recognizing multiple delimiters as one, went back and condensed the
>> multiple spaces to a single space, and then -split- the variable
>> again. In fact, my initial reaction to your e-mail was "Did that;
>> doesn't work." I suppose "obtuse" does apply. Sorry for the trouble.
>>
>> Unless I'm missing something else, I could just use a - split estring,
>> parse(-) -, correct?
>>
>> Thanks again for all the help,
>> Rebecca
>>
>> On Fri, Mar 11, 2011 at 6:32 PM, Nick Cox wrote:
>>>
>>> Have you thought of something like
>>>
>>> clonevar fstring = estring
>>> replace fstring = subinstr(fstring, "-", " ", .)
>>> split fstring
>>>
>> <truncated>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>