Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: RE: Stata analog to Mata's -strdup()- or better approach?


From   Rebecca Pope <[email protected]>
To   [email protected]
Subject   Re: st: RE: Stata analog to Mata's -strdup()- or better approach?
Date   Sat, 12 Mar 2011 13:00:28 -0600

Many thanks Nick and Robert. I'm using a combination of your
approaches. Nick is absolutely right; some persons could be eligible
for the full 15 years, but Robert's code handles this situation fine.
It will continue to loop for the maximum number of sets in the data,
even if someone is eligible for all 15 years.

The true problem is when the person was never eligible. In that
situation, Robert's code (pasted in at the end of this reply without
his example data) always assigns a value of 1 in "where1". This has to
do, I think, with how Stata matches missing (empty) strings when
implementing strpos(). If maxspan == "", the empty string gets treated
as a match at position 1 of "es". You can test it with:

clear
input patid str12 estring
1          XXXXX-------
2          --XXX---XXXX
3          -XXXXXX-----
4          -XXX-XXX-XXX
5          XXXX-XX-XXXX
6          XXXXXXXXXXXX
7          ------------
end
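
A quick way to check the empty-string behaviour on the example data above is sketched below. The variable name "never" is made up for illustration. No expected result is given for the first line because it may depend on your Stata version; this thread reports 1.

```stata
* Assumes the example data above (patid, estring) is in memory.

* What does strpos() return when the string searched for is empty?
display strpos("------------", "")

* Flag persons who were never eligible: no "X" anywhere in estring.
* "never" is a hypothetical variable name.
generate byte never = strpos(estring, "X") == 0
list patid estring if never
```

In the example data only patid 7 has no X, so that is the only row the -list- should show.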

After a couple of slight modifications so that Robert's code will only
produce 1s when the first match of a set of Xs occurs in position 1, I
took a 10% sample of the data and ran both sets of code (Nick's and
Robert's). Robert's code does run substantially faster. I changed the
-replace where`i'- line in Robert's code so that the results preserve
the 0s in where1 & are thus directly comparable to Nick's so everyone
can see how the results compare.
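
The exact modifications are not shown in this post. The sketch below is one way such a guard could look, written as a replacement for the "Identify the start of each span" block in Robert's code at the end of this reply. It assumes es, estring and maxspan already exist as his first block leaves them; the cond() guard and the changed -count- condition are my assumptions, not the code that was actually run.

```stata
* Sketch only: hypothetical variant of Robert's second block.
gen smask = subinstr(maxspan, "X", "_", .)
replace es = estring
local i 0
local more 1
while `more' {
    local i = `i' + 1
    * Never-eligible persons have maxspan == "": assign 0 directly
    * instead of letting strpos() search for an empty string.
    gen where`i' = cond(maxspan == "", 0, strpos(es, maxspan))
    * Keep the 0s in where1; later passes use missing for "no further span".
    replace where`i' = . if where`i' == 0 & `i' > 1
    replace es = subinstr(es, maxspan, smask, 1)
    * Continue only while some person still had a match on this pass.
    count if where`i' > 0 & !missing(where`i')
    local more = r(N)
}
* The last pass found nothing; drop it unless it is where1.
if `i' > 1 {
    drop where`i'
}
drop es smask
```

With this variant a never-eligible person gets where1 == 0 and missing in every later where variable.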

1 = Nick, 2 = Robert (order received, nothing else should be inferred)

<omitted output>
. tabstat where*, statistics( count min max ) columns(statistics)

    variable |         N       min       max
-------------+------------------------------
      where1 |    502964         0       180
      where2 |      1572        29       178
      where3 |        57        74       178
      where4 |        12       142       169
      where5 |         3       152       175
      where6 |         1       173       173
--------------------------------------------

<omitted output>
    variable |         N       min       max
-------------+------------------------------
      where1 |    502964         0       180
      where2 |      1572        29       178
      where3 |        57        74       178
      where4 |        12       142       169
      where5 |         3       152       175
      where6 |         1       173       173
--------------------------------------------

. timer list
   1:    184.85 /        1 =     184.8470
   2:     36.50 /        1 =      36.5020

I also merged the sets on pat_id and the where`i' variables to make
sure the values were the same & not just the counts & ranges. The
results are identical.
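
A minimal sketch of that kind of check, assuming the two sets of results were saved with one row per pat_id and variables where1-where6. The file names nick_results and robert_results are hypothetical.

```stata
* Hypothetical file names; each file holds pat_id and where1-where6.
use nick_results, clear

* Match on the id and on every where variable, so a row pairs up only
* if all the values agree, not just the counts and ranges.
merge 1:1 pat_id where1 where2 where3 where4 where5 where6 using robert_results

tabulate _merge
* Stops with an error if any row failed to find an identical partner.
assert _merge == 3
```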

For those like Nick who haven't worked with eligibility data like this
& in case someone who has wonders why I'm counting Xs instead of
something "logical" like subtracting the start date from the end date:
This data only has the earliest start date and the last end date. I
would expect a full 15-year insurance coverage only _very_ rarely. It
doesn't happen at all in the sample I used for testing. Here in the
US, people tend to gain and lose insurance with their jobs.
Compounding the issue, the employer could change the company they
contract with several times over the years. If one company isn't
covered by my data, that will cause apparent "gaps" as well. Our
public insurance for the poor has the same problem of "gaps"--people
constantly go in and out of the program with marginal changes in
financial situation or moving across state lines.

Best,
Rebecca

On Sat, Mar 12, 2011 at 8:05 AM, Nick Cox <[email protected]> wrote:
> Interesting point. Clearly fast beats slow whenever nothing else is at issue.
>
> In this case, the data are patients' eligibility for health insurance
> benefits over a period of 15 years. I've never worked with such data
> but it does seem possible that at least some people among 6 million
> might have been eligible for the entire time.
>
> Nick

<omitted text here>

> On Sat, Mar 12, 2011 at 11:47 AM, Robert Picard <[email protected]> wrote:
>> * -------------------------- begin example ----------------
>> * Find the longest period of continuous eligibility
>> clonevar es = estring
>> gen maxspan = ""
>> local more 1
>> while `more' {
>>        gen s = regexs(1) if regexm(es,"(X+)")
>>        replace maxspan = s if length(s) > length(maxspan)
>>        replace es = subinstr(es,s,"",1)
>>        count if s != ""
>>        local more = r(N)
>>        drop s
>> }
>>
>>
>> * Identify the start of each span
>> gen smask = subinstr(maxspan,"X","_",.)
>> replace es = estring
>> local i 0
>> local more 1
>> while `more' {
>>        local i = `i' + 1
>>        gen where`i' = strpos(es,maxspan)
>>        replace where`i' = . if where`i' == 0
>>        replace es = subinstr(es,maxspan,smask,1)
>>        count if where`i' != .
>>        local more = r(N)
>> }
>> drop where`i'
>> egen nmaxspan = rownonmiss(where*)
>> drop es smask
>>
>> * -------------------------- end example ------------------