Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Nick Cox <njcoxstata@gmail.com> |
To | statalist@hsphsun2.harvard.edu |
Subject | Re: st: RE: Stata analog to Mata's -strdup()- or better approach? |
Date | Sat, 12 Mar 2011 14:05:54 +0000 |
Interesting point. Clearly fast beats slow whenever nothing else is at issue.

In this case, the data are patients' eligibility for health insurance
benefits over a period of 15 years. I've never worked with such data
but it does seem possible that at least some people among 6 million
might have been eligible for the entire time.

Nick

On Sat, Mar 12, 2011 at 12:43 PM, Robert Picard <picard@netbox.com> wrote:
> Note that with 6 million observations and 12*15 months, this is going
> to take quite a long time to compute. Nick's approach would require
> looping 180 times to identify the longest span. My suggested approach
> loops as many times as there are spans, which should be significantly
> less than 180.
>
> Cheers, Robert
>
> On Sat, Mar 12, 2011 at 1:02 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>> You're correct. The code below fixes the error I found on closer examination.
>>
>> The incorrect line used a replacement mask that was n_longest long; it
>> should have been l_longest.
>>
>> Thanks for checking.
>>
>> clear
>> input patid str12 estring
>> 1 XXXXX-------
>> 2 --XXX---XXXX
>> 3 -XXXXXX-----
>> 4 -XXX-XXX-XXX
>> 5 XXXX-XX-XXXX
>> end
>>
>> gen X = ""
>> gen l_longest = 0
>> gen s_longest = ""
>> gen where1 = 0
>>
>> qui forval i = 1/12 {
>>     replace X = X + "X"
>>     replace s_longest = X if strpos(estring, X)
>>     replace l_longest = `i' if strpos(estring, X)
>>     replace where1 = strpos(estring, X) if strpos(estring, X)
>> }
>>
>> drop X
>> gen n_longest = ///
>>     (length(estring) - length(subinstr(estring, s_longest, "", .))) / ///
>>     length(s_longest)
>>
>> clonevar copy = estring
>> local mask : di _dup(12) "&"
>> local rtext subinstr(copy, s_longest, substr("`mask'", 1, l_longest), 1)
>> replace copy = `rtext'
>>
>> su n_longest, meanonly
>> forval j = 2/`r(max)' {
>>     gen where`j' = strpos(copy, s_longest)
>>     replace copy = `rtext'
>> }
>>
>>
>>
>> On Sat, Mar 12, 2011 at 11:47 AM, Robert Picard <picard@netbox.com> wrote:
>>> Nick, I think that there's a problem with your code: it does not work
>>> with a string like
>>>
>>> "XXXX-XX-XXXX"
>>>
>>> Here's how I would do it:
>>>
>>> * -------------------------- begin example ----------------
>>>
>>> clear
>>> input patid str12 estring
>>> 1 XXXXX-------
>>> 2 --XXX---XXXX
>>> 3 -XXXXXX-----
>>> 4 -XXX-XXX-XXX
>>> 5 XXXX-XX-XXXX
>>> 6 X-XX-XX-XXXX
>>> 6 X-XXXXX-XXXX
>>> end
>>>
>>> * Find the longest period of continuous eligibility
>>> clonevar es = estring
>>> gen maxspan = ""
>>> local more 1
>>> while `more' {
>>>     gen s = regexs(1) if regexm(es,"(X+)")
>>>     replace maxspan = s if length(s) > length(maxspan)
>>>     replace es = subinstr(es,s,"",1)
>>>     count if s != ""
>>>     local more = r(N)
>>>     drop s
>>> }
>>>
>>>
>>> * Identify the start of each span
>>> gen smask = subinstr(maxspan,"X","_",.)
>>> replace es = estring
>>> local i 0
>>> local more 1
>>> while `more' {
>>>     local i = `i' + 1
>>>     gen where`i' = strpos(es,maxspan)
>>>     replace where`i' = . if where`i' == 0
>>>     replace es = subinstr(es,maxspan,smask,1)
>>>     count if where`i' != .
>>>     local more = r(N)
>>> }
>>> drop where`i'
>>> egen nmaxspan = rownonmiss(where*)
>>> drop es smask
>>>
>>> * -------------------------- end example ------------------
>>>
>>>
>>>
>>> On Sat, Mar 12, 2011 at 11:23 AM, Nick Cox <njcoxstata@gmail.com> wrote:
>>>> First, let me give a more complete example of how I would approach
>>>> your problem.
>>>>
>>>> 1. Your example data.
>>>>
>>>> clear
>>>> input patid str12 estring
>>>> 1 XXXXX-------
>>>> 2 --XXX---XXXX
>>>> 3 -XXXXXX-----
>>>> 4 -XXX-XXX-XXX
>>>> end
>>>>
>>>> 2. Sample script starts with initialisations. Clearly, 12 is specific
>>>> to the example.
>>>>
>>>> gen X = ""
>>>> gen l_longest = 0
>>>> gen s_longest = ""
>>>> gen where1 = 0
>>>>
>>>> 3. The main loop just tries out longer multiples of "X" until it finds
>>>> the longest.
>>>>
>>>> qui forval i = 1/12 {
>>>>     replace X = X + "X"
>>>>     replace s_longest = X if strpos(estring, X)
>>>>     replace l_longest = `i' if strpos(estring, X)
>>>>     replace where1 = strpos(estring, X) if strpos(estring, X)
>>>> }
>>>>
>>>> drop X
>>>>
>>>> 4. The number of times the longest substring occurs is calculated from
>>>> a comparison of length before and after (notionally) blanking it out.
>>>> There is more on this trick at Mitch Abdon's blog
>>>>
>>>> <http://statadaily.wordpress.com/2011/01/20/counting-occurrence-of-strings-within-strings/>
>>>>
>>>> and in my Speaking Stata column in SJ 11(1) 2011.
>>>>
>>>> gen n_longest = ///
>>>>     (length(estring) - length(subinstr(estring, s_longest, "", .))) / ///
>>>>     length(s_longest)
>>>>
>>>> 5. Now to find the separate occurrences of the longest substring we
>>>> look for each one in a copy, and every time we do find one we
>>>> replace it with a mask of the same length. "&" is arbitrary here.
>>>>
>>>> clonevar copy = estring
>>>> local mask : di _dup(12) "&"
>>>> local rtext subinstr(copy, s_longest, substr("`mask'", 1, n_longest), 1)
>>>> replace copy = `rtext'
>>>>
>>>> su n_longest, meanonly
>>>> forval j = 2/`r(max)' {
>>>>     gen where`j' = strpos(copy, s_longest)
>>>>     replace copy = `rtext'
>>>> }
>>>>
>>>> Part of the Stata magic is that what the longest substring is, how
>>>> many times it occurs, and its length can easily vary from observation
>>>> to observation.
>>>>
>>>> Here is all the code as one segment
>>>>
>>>> clear
>>>> input patid str12 estring
>>>> 1 XXXXX-------
>>>> 2 --XXX---XXXX
>>>> 3 -XXXXXX-----
>>>> 4 -XXX-XXX-XXX
>>>> end
>>>>
>>>> gen X = ""
>>>> gen l_longest = 0
>>>> gen s_longest = ""
>>>> gen where1 = 0
>>>>
>>>> qui forval i = 1/12 {
>>>>     replace X = X + "X"
>>>>     replace s_longest = X if strpos(estring, X)
>>>>     replace l_longest = `i' if strpos(estring, X)
>>>>     replace where1 = strpos(estring, X) if strpos(estring, X)
>>>> }
>>>>
>>>> drop X
>>>> gen n_longest = ///
>>>>     (length(estring) - length(subinstr(estring, s_longest, "", .))) / ///
>>>>     length(s_longest)
>>>>
>>>> clonevar copy = estring
>>>> local mask : di _dup(12) "&"
>>>> local rtext subinstr(copy, s_longest, substr("`mask'", 1, n_longest), 1)
>>>> replace copy = `rtext'
>>>>
>>>> su n_longest, meanonly
>>>> forval j = 2/`r(max)' {
>>>>     gen where`j' = strpos(copy, s_longest)
>>>>     replace copy = `rtext'
>>>> }
>>>>
>>>> Now: commenting on -split-. The approach above seems closer to what
>>>> you want than using -split-.
>>>>
>>>> -split- treats multiple spaces as one, but otherwise does not treat
>>>> multiple occurrences of other delimiters as equivalent to one
>>>> occurrence. That is why I wrote
>>>>
>>>> replace fstring = subinstr(fstring, "-", " ", .)
>>>>
>>>> You will find that
>>>>
>>>> split estring, parse(-)
>>>>
>>>> creates rather too many variables to be useful.
>>>>
>>>> Nick
>>>>
>>>> On Sat, Mar 12, 2011 at 2:51 AM, Rebecca Pope <rebecca.a.pope@gmail.com> wrote:
>>>>> Nick,
>>>>> I had to read what you wrote a couple of times before the "Duh" kicked
>>>>> in. In one of my many attempts, I did (nearly) exactly what you wrote
>>>>> below. The real difference, which I didn't catch at first, is that you
>>>>> don't condense the spaces into a single space like I did. -split- will
>>>>> create a new variable for each " ", thereby preserving where the
>>>>> string started. For subsequent instances of variables including Xs,
>>>>> the index on the variable generated by -split- will be off, but I
>>>>> could just add the length of the preceding variables. Brilliant! (you,
>>>>> not me)
>>>>>
>>>>> In the interest of full disclosure, I'm rather ashamed to admit that I
>>>>> initially used -split- exactly as you do and cursed at it for not
>>>>> recognizing multiple delimiters as one, went back and condensed the
>>>>> multiple spaces to a single space, and then -split- the variable
>>>>> again. In fact, my initial reaction to your e-mail was "Did that;
>>>>> doesn't work." I suppose "obtuse" does apply. Sorry for the trouble.
>>>>>
>>>>> Unless I'm missing something else, I could just use a - split estring,
>>>>> parse(-) -, correct?
>>>>>
>>>>> Thanks again for all the help,
>>>>> Rebecca
>>>>>
>>>>> On Fri, Mar 11, 2011 at 6:32 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>>>>>>
>>>>>> Have you thought of something like
>>>>>>
>>>>>> clonevar fstring = estring
>>>>>> replace fstring = subinstr(fstring, "-", " ", .)
>>>>>> split fstring
>>>>>>
>>>>> <truncated>

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
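
The two string tricks discussed in this thread can be checked on a single value without building a dataset. What follows is a minimal sketch, not code from the thread: it takes Robert's test string "XXXX-XX-XXXX", and the local macro names s, longest, n, wrong and right are illustrative assumptions. It first counts the non-overlapping occurrences of the longest run by the length-difference trick, then suggests why the mask should be l_longest (here 4) characters long and not n_longest (here 2): a shorter mask shortens the string, so the next position reported by strpos() comes out too early.

```stata
* Sketch only: the macro names are illustrative, not taken from the thread
local s "XXXX-XX-XXXX"
local longest "XXXX"

* Count occurrences: characters removed, divided by the substring length
local n = (length("`s'") - length(subinstr("`s'", "`longest'", "", .))) / length("`longest'")
* Expected to display 2
display `n'

* A 12-character mask, built the same way as in Nick's code
local mask : di _dup(12) "&"

* Masking the first run with only 2 characters shortens the string
local wrong = subinstr("`s'", "`longest'", substr("`mask'", 1, 2), 1)
* Masking it with 4 characters keeps every later position unchanged
local right = subinstr("`s'", "`longest'", substr("`mask'", 1, 4), 1)

* Expected to display 7, two positions too early
display strpos("`wrong'", "`longest'")
* Expected to display 9, the true start of the second run
display strpos("`right'", "`longest'")
```

Run as a do-file, this should show that only the full-length mask reports the second run at its position in the original string.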