Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: RE: Stata analog to Mata's -strdup()- or better approach?


From   Nick Cox <[email protected]>
To   [email protected]
Subject   Re: st: RE: Stata analog to Mata's -strdup()- or better approach?
Date   Sun, 13 Mar 2011 09:33:08 +0000

Thanks for the very detailed report. A few small footnotes:

1. My comment

it does seem possible that at least some people among 6 million might
have been eligible for the entire time

was definitely not a suggestion that Robert's code assumed otherwise.
I could see that it was elegantly general enough to cope fine. It was
a comment that going all the way as my code did might be needed,
although no doubt my code still tries patterns never found in the
data, to no good purpose.

2. It is striking that, to put it directly, an empty string has
position 1 within another string

. di strpos("--------", "")
1

That could certainly bite if one searches for an empty string by
accident.
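
A minimal sketch of one guard, assuming string variables es and
maxspan as in Robert's code (the variable name pos is made up for
illustration): the -if- qualifier leaves pos missing wherever maxspan
is empty, rather than letting -strpos()- return 1.

* -------------------------- begin example ----------------
* search only where there is something to search for
gen pos = strpos(es, maxspan) if maxspan != ""
* -------------------------- end example ------------------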

3. My original code could be speeded up a bit by not using a variable
X but my guess would be that Robert's is still definitely faster.

Nick

On Sat, Mar 12, 2011 at 7:00 PM, Rebecca Pope <[email protected]> wrote:
> Many thanks Nick and Robert. I'm using a combination of your
> approaches. Nick is absolutely right; some persons could be eligible
> for the full 15 years, but Robert's code handles this situation fine.
> It will continue to loop for the maximum number of sets in the data,
> even if someone is eligible for all 15 years.
>
> The true problem is when the person was never eligible. In that
> situation, Robert's code (pasted in at the end of this reply without
> his example data) always assigns a value of 1 in "where1". This has to
> do, I think, with how Stata matches missing values when implementing
> strpos(). If maxspan=="" then that gets treated as a match at position 1 of
> "es" when using strpos(). You can test it with:
>
> clear
> input patid str12 estring
> 1          XXXXX-------
> 2          --XXX---XXXX
> 3          -XXXXXX-----
> 4          -XXX-XXX-XXX
> 5          XXXX-XX-XXXX
> 6          XXXXXXXXXXXX
> 7          ------------
> end
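>
> As a rough sketch of that check, assuming the test data above are in
> memory and the first loop of Robert's code has been run to create
> maxspan (the variable name chk is made up for illustration):
>
> * -------------------------- begin example ----------------
> * position of maxspan within the original string
> gen chk = strpos(estring, maxspan)
> * patient 7 was never eligible, so maxspan is empty there
> list patid estring maxspan chk if maxspan == ""
> * -------------------------- end example ------------------
>
> Patient 7 is the only row listed, and chk should be 1 there even
> though nothing was actually found.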
>
> After a couple of slight modifications so that Robert's code will only
> produce 1s when the first match of a set of Xs occurs in position 1, I
> took a 10% sample of the data and ran both sets of code (Nick's and
> Robert's). Robert's code does run substantially faster. I changed the
> -replace where`i'- line in Robert's code so that the results preserve
> the 0s in where1 & are thus directly comparable to Nick's so everyone
> can see how the results compare.
>
> 1 = Nick, 2 = Robert (order received, nothing else should be inferred)
>
> <omitted output>
> . tabstat where*, statistics( count min max ) columns(statistics)
>
>    variable |         N       min       max
> -------------+------------------------------
>      where1 |    502964         0       180
>      where2 |      1572        29       178
>      where3 |        57        74       178
>      where4 |        12       142       169
>      where5 |         3       152       175
>      where6 |         1       173       173
> --------------------------------------------
>
> <omitted output>
>    variable |         N       min       max
> -------------+------------------------------
>      where1 |    502964         0       180
>      where2 |      1572        29       178
>      where3 |        57        74       178
>      where4 |        12       142       169
>      where5 |         3       152       175
>      where6 |         1       173       173
> --------------------------------------------
>
> . timer list
>   1:    184.85 /        1 =     184.8470
>   2:     36.50 /        1 =      36.5020
>
> I also merged the sets on pat_id and the where`i' variables to make
> sure the values were the same & not just the counts & ranges. The
> results are identical.
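>
> Roughly, and assuming the two sets of results had been saved as
> nick.dta and robert.dta (hypothetical file names), that check can be
> sketched as:
>
> * -------------------------- begin example ----------------
> use nick, clear
> * match on the identifier and on every where`i' variable at once
> merge 1:1 pat_id where* using robert
> * any observation left unmatched means the values differ
> assert _merge == 3
> * -------------------------- end example ------------------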
>
> For those like Nick who haven't worked with eligibility data like this
> & in case someone who has wonders why I'm counting Xs instead of
> something "logical" like subtracting the start date from the end date:
> This data only has the earliest start date and the last end date. I
> would expect full 15-year insurance coverage only _very_ rarely. It
> doesn't happen at all in the sample I used for testing. Here in the
> US, people tend to gain and lose insurance with their jobs.
> Compounding the issue, the employer could change the company they
> contract with several times over the years. If one company isn't
> covered by my data, that will cause apparent "gaps" as well. Our
> public insurance for the poor has the same problem of "gaps"--people
> constantly go in and out of the program with marginal changes in
> financial situation or moving across state lines.
>
> Best,
> Rebecca
>
> On Sat, Mar 12, 2011 at 8:05 AM, Nick Cox <[email protected]> wrote:
>> Interesting point. Clearly fast beats slow whenever nothing else is at issue.
>>
>> In this case, the data are patients' eligibility for health insurance
>> benefits over a period of 15 years. I've never worked with such data
>> but it does seem possible that at least some people among 6 million
>> might have been eligible for the entire time.
>>
>> Nick
>
> <omitted text here>
>
>> On Sat, Mar 12, 2011 at 11:47 AM, Robert Picard <[email protected]> wrote:
>>> * -------------------------- begin example ----------------
>>> * Find the longest period of continuous eligibility
>>> clonevar es = estring
>>> gen maxspan = ""
>>> local more 1
>>> while `more' {
>>>        gen s = regexs(1) if regexm(es,"(X+)")
>>>        replace maxspan = s if length(s) > length(maxspan)
>>>        replace es = subinstr(es,s,"",1)
>>>        count if s != ""
>>>        local more = r(N)
>>>        drop s
>>> }
>>>
>>>
>>> * Identify the start of each span
>>> gen smask = subinstr(maxspan,"X","_",.)
>>> replace es = estring
>>> local i 0
>>> local more 1
>>> while `more' {
>>>        local i = `i' + 1
>>>        gen where`i' = strpos(es,maxspan)
>>>        replace where`i' = . if where`i' == 0
>>>        replace es = subinstr(es,maxspan,smask,1)
>>>        count if where`i' != .
>>>        local more = r(N)
>>> }
>>> drop where`i'
>>> egen nmaxspan = rownonmiss(where*)
>>> drop es smask
>>>
>>> * -------------------------- end example ------------------