


Re: st: RE: Stata analog to Mata's -strdup()- or better approach?

From   Rebecca Pope
Subject   Re: st: RE: Stata analog to Mata's -strdup()- or better approach?
Date   Mon, 14 Mar 2011 13:31:48 -0500

First, two corrections:
1 - It is the data management guide, accessed through -help datatype-
that references Stata's method of calculation.
2 - Apparently I only wrote down that I would change storage type. In
fact, I changed it simultaneously with the temporary variable switch.
I must have been pretty tired yesterday*. I noticed just now when I
looked at the log again. Sorry for any confusion. I promise to pay
closer attention in the future.

Second, here is a summary of the discussion in case anyone else ever
encounters a similar situation (original question below so you don't
have to search through the whole thread):
- The fastest way to find the longest period of continuous eligibility
(here a series of Xs) is to check each place in the string, updating a
numeric count while progressing through the string. See

- The fastest way to find the starting places of the identified
pattern is to find the first instance using -strpos()-, replace that
with some non-X character using -subinstr()-, then repeat. See

- With a large string like mine (180 characters), -split- produces too
many variables, most of which aren't needed. I tried this initially
parsing on "-" and got about 90 variables before getting an error for
not enough memory. I replaced all the "-"s with " "s and used
-itrim()- because prior to Nick's post about using -split- I didn't
know, or had at least forgotten, that -split- reads all spaces as
one (see link for msg00806 above). This creates 87 variables in my
data and was abandoned. It might work fine for someone with shorter

- Robert is absolutely right, the memory hit occurs when the storage
type is changed. Temporary variables _do not_ make a noticeable

- The combined code Nick and Robert wrote runs in just over 5 minutes
on all 6 million observations if I make all numeric variables included
in the calculations double. It takes more than 3 times as long if
Stata has to convert the variables from a smaller storage type.
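
A minimal sketch of the character-by-character scan described in the first bullet above. This is not the code posted in the thread; it only illustrates the idea on the four example patients from the original question. The variable names len, run, and longest are hypothetical. The counters are created as double, in line with the storage-type observation above. Run it from a do-file, since it uses // comments.

```stata
* Illustration only: example data from the original question
clear
input byte patid str12 estring
1 "XXXXX-------"
2 "--XXX---XXXX"
3 "-XXXXXX-----"
4 "-XXX-XXX-XXX"
end

* longest string in the data sets the number of positions to scan
generate double len = length(estring)
summarize len, meanonly
local maxlen = r(max)

generate double run = 0        // length of the run of Xs ending at position i
generate double longest = 0    // longest run seen so far

forvalues i = 1/`maxlen' {
    * extend the current run on "X", reset it to zero on anything else
    quietly replace run = cond(substr(estring, `i', 1) == "X", run + 1, 0)
    quietly replace longest = max(longest, run)
}

list patid estring longest
```

For the four example rows, longest should come out as 5, 4, 6, and 3. The loop runs once per character position rather than once per observation, so its cost grows with string length (180 here) and each pass is vectorized across all observations.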

Finding longest span (Robert)
   1:    206.64 /        1 =     206.6360
Finding starting months of each span (Nick)...
...First instance
   2:      7.27 /        1 =       7.2670
...All subsequent instances
   3:    103.38 /        1 =     103.3780
Recoding starting months to Stata date standards & applying %tm format
   4:     12.85 /        1 =      12.8540
Finding total months of eligibility, regardless of length & continuity
   5:     36.01 /        1 =      36.0050
Whole program (including some parts not itemized)
  10:    366.14 /        1 =     366.1400

For context, times were calculated using 4GB of memory with Stata
11/MP-4 running on 64-bit Windows 7 Enterprise. Hardware: 8GB memory,
3GHz quad-core processor.
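
The timing lines above have the layout produced by Stata's -timer list-. A minimal sketch of how such per-section timings can be collected, assuming that is how they were generated; the data and the timed commands here are placeholders, not the program from the thread.

```stata
* Placeholder data so the example is self-contained
clear
set obs 1000
generate double x = runiform()

timer clear          // reset all timers
timer on 10          // timer 10 wraps the whole program
timer on 1           // timer 1 wraps one section
sort x
timer off 1
summarize x, meanonly
timer off 10
timer list           // one line per timer: total / number of runs = average
```

Timers are numbered 1 to 100, so one timer can wrap the whole program while lower-numbered timers wrap individual sections.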

Thanks to Brendan, Nick, Robert, and David for replying to my post. I
really appreciate you taking time to help.


*Nick, the Statalist FAQ credits you as the author. You might consider
adding, under "Before you post", Graduate students should have a full
night's sleep.

< previous posts in this thread omitted >

Initial question:
Does anyone know if there is a Stata analog to Mata's -strdup()-? I'm
not committed to the approach below, so if anyone knows of a better
way to accomplish what I need I'm open to all suggestions. I apologize
in advance for the length of this e-mail, but I've tried to ensure
sufficient detail.

By way of background, I have data on patients' eligibility for health
insurance benefits over a period of 15 years. The data is stored such
that a "-" is in a position of the string for a month that the patient
was not eligible and an "X" if they were. If a patient was eligible in
Jan of 1995, they have an "X" in position one. Position 13 corresponds
to Jan 1996, etc. Therefore, the data looks something like the
following for a period of 1 year. Note, all 15 years are stored in the
same variable (estring), but I've truncated it for illustration.

patid     estring
1          XXXXX-------
2          --XXX---XXXX
3          -XXXXXX-----
4          -XXX-XXX-XXX

I need to find first the longest period of continuous eligibility
(i.e. the longest set of Xs) and when that period occurred.

I've found the longest period of continuous eligibility by the following:
/* begin code */
tempvar wc elig

* turn each "-" into a space so every run of Xs becomes one "word"
generate `elig' = trim(itrim(subinstr(estring,"-"," ",.)))
generate int `wc' = wordcount(`elig')
quietly summarize `wc'
local wmax = r(max)
di `wmax'

* start from the first run, then keep whichever later run is longer
generate eligstr = word(`elig',1)

forvalues i = 2/`wmax' {
      replace eligstr = word(`elig',`i') ///
              if length(word(`elig',`i')) > length(eligstr)
}
/* end code */

I then go back and find when that occurs by the following:
- generate int estart1 = strpos(estring,eligstr) -

In general, this is sufficient, however for patients like patid==4
above, I wouldn't know about other instances of the same eligibility
length. I would like to generate additional variables estart2 through
estart`wmax' that contain the starting positions of all other sets of
Xs that match eligstr.

I thought about replacing the first set of Xs with some non-X character using
- subinstr() - but the problem is that I need to preserve the position
and the number of Xs can vary, so I couldn't code something like
- subinstr(estring,eligstr,"---",1) -.
In my mind, the solution to this would be something like the following:
- subinstr(estring,eligstr,repeat("-",length(eligstr)),1) -
such that Stata would generate the appropriate number of "-"s to
replace the Xs, thereby maintaining the position of the next set of Xs.
However, -repeat- as used above is not a Stata function as far as I
can tell. There is a -repeat- option in Nick Cox's -egenmore- package,
but as near as I can tell it won't work for my purposes. The closest
thing I've found is a Mata function -strdup()- or more precisely the
ability to code "-"*n where n would have to be defined previously as
the length of eligstr.

I'm willing to work out how to write the Mata code, but I thought that
first I'd check with the List to see if there was a relatively simple
solution like some sort of repeat function.
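
One possible Stata-only workaround, offered as a sketch rather than as the solution given in the thread: because eligstr consists only of Xs, subinstr(eligstr,"X","-",.) returns a string of "-"s of the same length, which can stand in for the missing repeat function. The example assumes eligstr has already been computed and that every patient has at least one X. The names mask and work are hypothetical. Run it from a do-file.

```stata
* Illustration only: example data, with eligstr entered by hand
clear
input byte patid str12 estring str12 eligstr
1 "XXXXX-------" "XXXXX"
2 "--XXX---XXXX" "XXXX"
3 "-XXXXXX-----" "XXXXXX"
4 "-XXX-XXX-XXX" "XXX"
end

* eligstr holds only Xs, so swapping each X for "-" gives same-length padding
generate mask = subinstr(eligstr, "X", "-", .)
generate work = estring        // working copy; estring itself stays intact

local k = 0
local more = 1
while `more' {
    local ++k
    * position of the next remaining run that matches eligstr
    generate int estart`k' = strpos(work, eligstr)
    quietly replace estart`k' = . if estart`k' == 0
    * blank out the run just found without shifting later positions
    quietly replace work = subinstr(work, eligstr, mask, 1)
    * stop once no observation has another matching run
    quietly count if strpos(work, eligstr) > 0
    local more = r(N) > 0
}

list patid estring estart*
```

With the example data this creates estart1 through estart3. Patient 4 gets 2, 6, and 10, while patient 2 gets 9 in estart1 and missing values afterwards. Because eligstr is the longest run for each patient, it cannot match inside a longer run of Xs.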

I am using Stata 11/MP.