Re: st: RE: Stata analog to Mata's -strdup()- or better approach?

From	Nick Cox <[email protected]>
To	[email protected]
Subject	Re: st: RE: Stata analog to Mata's -strdup()- or better approach?
Date	Sat, 12 Mar 2011 10:23:32 +0000

First, let me give a more complete example of how I would approach
your problem.

1. Your example data.

clear
input patid str12 estring
1          XXXXX-------
2          --XXX---XXXX
3          -XXXXXX-----
4          -XXX-XXX-XXX
end

2. The sample script starts with initialisations. The 12 used in the
loop below is the string length, which is specific to this example.

gen X = ""
gen l_longest = 0
gen s_longest = ""
gen where1 = 0

3. The main loop just tries out longer multiples of "X" until it finds
the longest.

qui forval i = 1/12 {
	replace X = X + "X"
	replace s_longest = X if strpos(estring, X)
	replace l_longest = `i' if strpos(estring, X)
	replace where1 = strpos(estring, X) if strpos(estring, X)
}

drop X
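
A quick check, which is not part of the script itself: the values below
are worked out by hand from the four example observations, so they apply
only to this toy dataset. If the loop has run as intended, none of the
-assert- statements should fail.

```stata
* Sanity check on the example data after the main loop (values worked by hand)
assert s_longest == "XXXXX"  & l_longest == 5 & where1 == 1 if patid == 1
assert s_longest == "XXXX"   & l_longest == 4 & where1 == 9 if patid == 2
assert s_longest == "XXXXXX" & l_longest == 6 & where1 == 2 if patid == 3
assert s_longest == "XXX"    & l_longest == 3 & where1 == 2 if patid == 4
list patid estring s_longest l_longest where1
```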

4. The number of times the longest substring occurs is calculated from
a comparison of length before and after (notionally) blanking it out.
There is more on this trick at Mitch Abdon's blog

<http://statadaily.wordpress.com/2011/01/20/counting-occurrence-of-strings-within-strings/>

and in my Speaking Stata column in SJ 11(1) 2011.

gen n_longest = ///
(length(estring) - length(subinstr(estring, s_longest, "", .))) / ///
length(s_longest)
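
To see the arithmetic on a single literal string, here is a small sketch
using the fourth example value. The local macro names s and sub are just
for illustration. Removing every "XXX" from a 12-character string leaves
"---", so the result is (12 - 3) / 3.

```stata
* Worked example of the length-difference count on a literal string
local s   "-XXX-XXX-XXX"
local sub "XXX"
display (length("`s'") - length(subinstr("`s'", "`sub'", "", .))) / length("`sub'")
* displays 3
```

Because s_longest is the longest run, its occurrences cannot overlap, so
the count is exact here.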

5. Now to find the separate occurrences of the longest substring we
look for each one in a copy, and every time we find one we replace it
with a mask of the same length. "&" is arbitrary here.

clonevar copy = estring
local mask : di _dup(12) "&"
local rtext subinstr(copy, s_longest, substr("`mask'", 1, l_longest), 1)
replace copy = `rtext'

su n_longest, meanonly
forval j = 2/`r(max)' {
	gen where`j' = strpos(copy, s_longest)
	replace copy = `rtext'
} 	
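
The masking idea can be traced by hand on one literal string. This is
only an illustrative sketch for the fourth example value, with the local
macro copy standing in for the variable. The mask has the same length as
the substring it replaces, so later positions still refer to the
original string.

```stata
* Trace of the masking step for patid 4 (longest run "XXX", length 3)
local copy "-XXX-XXX-XXX"
local copy = subinstr("`copy'", "XXX", "&&&", 1)
display strpos("`copy'", "XXX")    // second occurrence: displays 6
local copy = subinstr("`copy'", "XXX", "&&&", 1)
display strpos("`copy'", "XXX")    // third occurrence: displays 10
```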

Part of the Stata magic is that the longest substring, its length, and
the number of times it occurs can all vary from observation to
observation.

Here is all the code as one segment

clear
input patid str12 estring
1          XXXXX-------
2          --XXX---XXXX
3          -XXXXXX-----
4          -XXX-XXX-XXX
end

gen X = ""
gen l_longest = 0
gen s_longest = ""
gen where1 = 0

qui forval i = 1/12 {
	replace X = X + "X"
	replace s_longest = X if strpos(estring, X)
	replace l_longest = `i' if strpos(estring, X)
	replace where1 = strpos(estring, X) if strpos(estring, X)
}

drop X
gen n_longest = ///
(length(estring) - length(subinstr(estring, s_longest, "", .))) / ///
length(s_longest)

clonevar copy = estring
local mask : di _dup(12) "&"
local rtext subinstr(copy, s_longest, substr("`mask'", 1, l_longest), 1)
replace copy = `rtext'

su n_longest, meanonly
forval j = 2/`r(max)' {
	gen where`j' = strpos(copy, s_longest)
	replace copy = `rtext'
} 	

Now: commenting on -split-. The approach above seems closer to what
you want than using -split-.

-split- treats multiple spaces as one, but otherwise does not treat
multiple occurrences of other delimiters as equivalent to one
occurrence. That is why I wrote

replace fstring = subinstr(fstring, "-", " ", .)

You will find that

split estring, parse(-)

creates rather too many variables to be useful.
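
If you want to see the difference for yourself, something like the
following sketch puts the two approaches side by side. It assumes the
example data are in memory and that no variable named fstring exists
yet.

```stata
* Compare the two -split- approaches on the example data
clonevar fstring = estring
replace fstring = subinstr(fstring, "-", " ", .)
split fstring               // splits on spaces: fstring1, fstring2, ...
split estring, parse(-)     // splits on every "-": estring1, estring2, ...
describe fstring* estring*
```

Comparing the two sets of new variables in the -describe- output shows
how many each approach creates.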

Nick

On Sat, Mar 12, 2011 at 2:51 AM, Rebecca Pope <[email protected]> wrote:
> Nick,
> I had to read what you wrote a couple of times before the "Duh" kicked
> in. In one of my many attempts, I did (nearly) exactly what you wrote
> below. The real difference, which I didn't catch at first, is that you
> don't condense the spaces into a single space like I did. -split- will
> create a new variable for each " ", thereby preserving where the
> string started. For subsequent instances of variables including Xs,
> the index on the variable generated by -split- will be off, but I
> could just add the length of the preceding variables. Brilliant! (you,
> not me)
>
> In the interest of full disclosure, I'm rather ashamed to admit that I
> initially used -split- exactly as you do and cursed at it for not
> recognizing multiple delimiters as one, went back and condensed the
> multiple spaces to a single space, and then -split- the variable
> again. In fact, my initial reaction to your e-mail was "Did that;
> doesn't work." I suppose "obtuse" does apply. Sorry for the trouble.
>
> Unless I'm missing something else, I could just use a - split estring,
> parse(-) -, correct?
>
> Thanks again for all the help,
> Rebecca
>
> On Fri, Mar 11, 2011 at 6:32 PM, Nick Cox <[email protected]> wrote:
>>
>> Have you thought of something like
>>
>> clonevar fstring = estring
>> replace fstring = subinstr(fstring, "-", " ", .)
>> split fstring
>>
> <truncated>