
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



st: RE: Stata analog to Mata's -strdup()- or better approach?


From   Nick Cox <[email protected]>
To   "'[email protected]'" <[email protected]>
Subject   st: RE: Stata analog to Mata's -strdup()- or better approach?
Date   Fri, 11 Mar 2011 18:51:29 +0000

There is at least one analogue, but I don't think you need it for this. 

gen X50 = `"`: di _dup(50) "X" '"'

See -help extended_fcn- and look for "display directive". 
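
The same extended function can fill a local macro, and -substr()- can then trim that macro to a length that varies by observation, which is roughly what a repeat() function would do. A minimal sketch, assuming toy 12-character strings; -runlen- and -pad- are hypothetical names, with -runlen- standing in for length(eligstr):

```stata
* Sketch only: row-varying runs of dashes without a repeat() function.
clear
input patid str12 estring
1 "XXXXX-------"
4 "-XXX-XXX-XXX"
end

* One long run of dashes, built once with the display directive
* (12 because the toy strings are 12 characters long).
local dashes : display _dup(12) "-"

* Hypothetical run lengths, standing in for length(eligstr).
generate byte runlen = cond(patid == 1, 5, 3)

* substr() cuts the long run down to the length needed in each row.
generate str12 pad = substr("`dashes'", 1, runlen)
list patid estring runlen pad
```

For the full 15 years the macro would hold 180 dashes instead of 12.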

My -egen- function (not option) -repeat()- isn't aimed at this problem. 

More interestingly, you could try something like this: 

gen X = "" 
gen longest = 0 
gen where = 0 

qui forval i = 1/180 { 
	replace X = X + "X" 
	replace longest = length(X) if strpos(estring, X) 
	replace where = strpos(estring, X) if strpos(estring, X) 
} 

-if strpos()- is just a contraction of -if strpos() > 0-. 

So, the pattern you search for is just so many "X"s. If you find a string of "X"s longer than you found previously, you update. 

Warning: Code not tested. 
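
To see what the loop does, here is the same code run on the four example patients from the question. This is a sketch on toy data, so the loop stops at 12 (the length of the truncated strings) rather than 180:

```stata
* Toy data copied from the question (one year only).
clear
input patid str12 estring
1 "XXXXX-------"
2 "--XXX---XXXX"
3 "-XXXXXX-----"
4 "-XXX-XXX-XXX"
end

gen X = ""
gen longest = 0
gen where = 0

* Grow the pattern by one "X" per pass; rows that still contain
* the pattern record its length and its first position.
quietly forvalues i = 1/12 {
    replace X = X + "X"
    replace longest = length(X) if strpos(estring, X)
    replace where = strpos(estring, X) if strpos(estring, X)
}

list patid estring longest where
```

Patient 2 should end with longest 4 and where 9: the three-X run at position 3 is overwritten once the four-X pattern is found at position 9. Patient 4 ends with longest 3 and where 2, the first of its three equal runs.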

I don't think this is all of your problem, but I don't think you need much, if any, machinery beyond this. 
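
For the remaining part (estart2 onward), one possible sketch combines the loop above with the masking idea from the question: find the first longest run, overwrite it with dashes of the same length, and search again. The names -target- and -work- and the cap of 3 runs are assumptions chosen for the toy data, not part of the original code:

```stata
* Sketch only: start positions of every run that ties for longest.
clear
input patid str12 estring
1 "XXXXX-------"
2 "--XXX---XXXX"
3 "-XXXXXX-----"
4 "-XXX-XXX-XXX"
end

* Length of the longest run, found as in the loop above.
gen X = ""
gen longest = 0
quietly forvalues i = 1/12 {
    replace X = X + "X"
    replace longest = length(X) if strpos(estring, X)
}

* Long runs of each character, trimmed per row with substr().
local xs : display _dup(12) "X"
local dashes : display _dup(12) "-"
gen target = substr("`xs'", 1, longest)
gen work = estring

* Record a start, mask that run with dashes, then look again.
* The upper limit 3 is an assumption that suits the toy data.
quietly forvalues k = 1/3 {
    gen int estart`k' = strpos(work, target) if longest > 0
    replace estart`k' = . if estart`k' == 0
    replace work = subinstr(work, target, substr("`dashes'", 1, longest), 1) if longest > 0
}

list patid estring longest estart1-estart3
```

Because -target- is the longest run in its row, every match is a complete run, never a piece of a longer one. On the toy data patient 4 should get starts 2, 6 and 10, and the other patients a single start each.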

Nick 
[email protected] 

Rebecca Pope

Does anyone know if there is a Stata analog to Mata's -strdup()-? I'm
not committed to the approach below, so if anyone knows of a better
way to accomplish what I need I'm open to all suggestions. I apologize
in advance for the length of this e-mail, but I've tried to ensure
sufficient detail.

By way of background, I have data on patients' eligibility for health
insurance benefits over a period of 15 years. The data is stored such
that a "-" is in a position of the string for a month that the patient
was not eligible and an "X" if they were. If a patient was eligible in
Jan of 1995, they have an "X" in position one. Position 13 corresponds
to Jan 1996, etc. Therefore, the data looks something like the
following for a period of 1 year. Note, all 15 years are stored in the
same variable (estring), but I've truncated it for illustration
purposes.

patid     estring
1          XXXXX-------
2          --XXX---XXXX
3          -XXXXXX-----
4          -XXX-XXX-XXX

I need to find first the longest period of continuous eligibility
(i.e. the longest set of Xs) and when that period occurred.

I've found the longest period of continuous eligibility by the following:
/* begin code */
tempvar wc elig

generate `elig' = trim(itrim(subinstr(estring,"-"," ",.)))
generate int `wc' = wordcount(`elig')
quietly summarize `wc'
local wmax = r(max)
di `wmax'

generate eligstr = word(`elig',1)
compress

forvalues i = 2/`wmax' {
       replace eligstr = word(`elig',`i') ///
               if length(word(`elig',`i')) > length(eligstr)
}

/* end code */

I then go back and find when that occurs by the following:
- generate int estart1 = strpos(estring,eligstr) -

In general, this is sufficient, however for patients like patid==4
above, I wouldn't know about other instances of the same eligibility
length. I would like to generate additional variables estart2 through
estart`wmax' that contain the starting positions of all other sets of
Xs that match eligstr.

I thought about replacing the first set of Xs with some non-X character using
- subinstr() - but the problem is that I need to preserve the position
and the number of Xs can vary, so I couldn't code something like
- subinstr(estring,eligstr,"---",1) -.
In my mind, the solution to this would be something like the following:
- subinstr(estring,eligstr,repeat("-",length(eligstr)),1) -
such that Stata would generate the appropriate number of dashes to
replace the Xs, thereby maintaining the position of the next set of Xs.
However, -repeat- as used above is not a Stata function as far as I
can tell. There is a -repeat- option in Nick Cox's -egenmore- package,
but as near as I can tell it won't work for my purposes. The closest
thing I've found is a Mata function -strdup()- or more precisely the
ability to code "-"*n where n would have to be defined previously as
the length of eligstr.

I'm willing to work out how to write the Mata code, but I thought that
first I'd check with the List to see if there was a relatively simple
solution like some sort of repeat function.

I am using Stata 11/MP.

