Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


From: Rebecca Pope <rebecca.a.pope@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: RE: Stata analog to Mata's -strdup()- or better approach?
Date: Mon, 14 Mar 2011 13:31:48 -0500

First, two corrections:

1 - It is the data management guide, accessed through -help datatype-, that references Stata's method of calculation.
2 - Apparently I only wrote down that I would change storage type. In fact, I changed it simultaneously with the temporary variable switch. I must have been pretty tired yesterday*. I noticed just now when I looked at the log again.

Sorry for any confusion. I promise to pay closer attention in the future.

Second, here is a summary of the discussion in case anyone else ever encounters a similar situation (original question below so you don't have to search through the whole thread):

- The fastest way to find the longest period of continuous eligibility (here a series of Xs) is to check each place in the string, updating a numeric count while progressing through the string. See http://www.stata.com/statalist/archive/2011-03/msg00850.html
- The fastest way to find the starting places of the identified pattern is to find the first instance using -strpos()-, replace that with some non-X character using -subinstr()-, then repeat. See http://www.stata.com/statalist/archive/2011-03/msg00806.html
- With a large string like mine (180 characters), -split- produces too many variables, most of which aren't needed. I tried this initially parsing on "-" and got about 90 variables before getting an error for not enough memory. I then replaced all the "-"s with " "s and used -itrim()- because, prior to Nick's post about using -split-, I didn't know, or had at least forgotten, that -split- reads all spaces as one (see link for msg00806 above). This creates 87 variables in my data and was abandoned. It might work fine for someone with shorter strings.
- Robert is absolutely right: the memory hit occurs when the storage type is changed. Temporary variables _do not_ make a noticeable difference.
- The combined code Nick and Robert wrote runs in just over 6 minutes on all 6 million observations if I make all numeric variables included in the calculations double. It takes more than 3 times as long if Stata has to convert the variables from a smaller storage type.

    Finding longest span (Robert)
       1:    206.64 /        1 =     206.6360
    Finding starting months of each span (Nick)...
    ...First instance
       2:      7.27 /        1 =       7.2670
    ...All subsequent instances
       3:    103.38 /        1 =     103.3780
    Recoding starting months to Stata date standards & applying %tm format
       4:     12.85 /        1 =      12.8540
    Finding total months of eligibility, regardless of length & continuity
       5:     36.01 /        1 =      36.0050
    Whole program (including some parts not itemized)
      10:    366.14 /        1 =     366.1400

For context, times were calculated using 4GB of memory with Stata 11/MP-4 running on 64-bit Windows 7 Enterprise. Hardware: 8GB memory, 3GHz quad-core processor.

Thanks to Brendan, Nick, Robert, and David for replying to my post. I really appreciate you taking time to help.

Best,
Rebecca

*Nick, the Statalist FAQ credits you as the author. You might consider adding, under "Before you post", "Graduate students should have a full night's sleep."

< previous posts in this thread omitted >

Initial question:

Does anyone know if there is a Stata analog to Mata's -strdup()-? I'm not committed to the approach below, so if anyone knows of a better way to accomplish what I need I'm open to all suggestions. I apologize in advance for the length of this e-mail, but I've tried to ensure sufficient detail.

By way of background, I have data on patients' eligibility for health insurance benefits over a period of 15 years. The data is stored such that a "-" is in a position of the string for a month that the patient was not eligible and an "X" if they were. If a patient was eligible in Jan of 1995, they have an "X" in position one. Position 13 corresponds to Jan 1996, etc. Therefore, the data looks something like the following for a period of 1 year.
Note, all 15 years are stored in the same variable (estring), but I've truncated it for illustration purposes.

    patid   estring
    1       XXXXX-------
    2       --XXX---XXXX
    3       -XXXXXX-----
    4       -XXX-XXX-XXX

I first need to find the longest period of continuous eligibility (i.e. the longest set of Xs) and then when that period occurred. I've found the longest period of continuous eligibility by the following:

    /* begin code */
    tempvar wc elig
    * turn each run of "-" into a single space so every run of Xs becomes a "word"
    generate `elig' = trim(itrim(subinstr(estring,"-"," ",.)))
    generate int `wc' = wordcount(`elig')
    * largest number of separate runs found in any observation
    quietly summarize `wc'
    local wmax = r(max)
    di `wmax'
    * start with the first run, then keep whichever later run is longer
    generate eligstr = word(`elig',1)
    compress
    forvalues i = 2/`wmax' {
        replace eligstr = word(`elig',`i') ///
            if length(word(`elig',`i')) > length(eligstr)
    }
    /* end code */

I then go back and find when that occurs by the following:

    generate int estart1 = strpos(estring,eligstr)

In general, this is sufficient. However, for patients like patid==4 above, I wouldn't know about other instances of the same eligibility length. I would like to generate additional variables estart2 through estart`wmax' that contain the starting positions of all other sets of Xs that match eligstr.

I thought about replacing the first set of Xs with some non-X character using -subinstr()-, but the problem is that I need to preserve the position and the number of Xs can vary, so I couldn't code something like -subinstr(estring,eligstr,"---",1)-. In my mind, the solution to this would be something like the following:

    subinstr(estring,eligstr,repeat("-",length(eligstr)),1)

such that Stata would generate the appropriate number of "-"s to replace the Xs, thereby maintaining the position of the next set of Xs. However, -repeat- as used above is not a Stata function as far as I can tell. There is a -repeat- option in Nick Cox's -egenmore- package, but as near as I can tell it won't work for my purposes.
The closest thing I've found is a Mata function -strdup()- or, more precisely, the ability to code "-"*n, where n would have to be defined previously as the length of eligstr. I'm willing to work out how to write the Mata code, but I thought that first I'd check with the List to see if there was a relatively simple solution like some sort of repeat function.

I am using Stata 11/MP.

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
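The two approaches summarized in this message (scan each position while keeping a running count, then locate every occurrence of the longest run with -strpos()- and mask it with -subinstr()-) can be sketched in Stata roughly as follows. This is a minimal illustration on the four example patients, not the code Robert and Nick posted. The variable names len, run, maxrun, work, and mask are assumptions made up for the example, and it assumes estring contains only "X" and "-". It sidesteps the missing repeat function by building the replacement with subinstr(eligstr, "X", "-", .), which has the same length as eligstr because eligstr contains only Xs.

```stata
clear
input byte patid str12 estring
1 "XXXXX-------"
2 "--XXX---XXXX"
3 "-XXXXXX-----"
4 "-XXX-XXX-XXX"
end

* Step 1: scan each position, keeping a running count of consecutive Xs
generate int len    = length(estring)
generate int run    = 0
generate int maxrun = 0
summarize len, meanonly
local L = r(max)
forvalues j = 1/`L' {
    quietly replace run    = cond(substr(estring, `j', 1) == "X", run + 1, 0)
    quietly replace maxrun = max(maxrun, run)
}

* Step 2: build a string of maxrun Xs without a repeat function
* (assumes estring holds only "X" and "-")
generate eligstr = substr(subinstr(estring, "-", "X", .), 1, maxrun)

* Step 3: find each start with strpos(), then mask that run with
* the same number of "-" so later positions are preserved
generate work = estring
generate mask = subinstr(eligstr, "X", "-", .)
local k = 0
local more = 1
while `more' {
    local ++k
    quietly generate int estart`k' = strpos(work, eligstr) if maxrun > 0
    quietly replace estart`k' = . if estart`k' == 0
    quietly count if !missing(estart`k')
    if r(N) == 0 {
        * no observation has another matching run: stop
        drop estart`k'
        local more = 0
    }
    else {
        quietly replace work = subinstr(work, eligstr, mask, 1) if !missing(estart`k')
    }
}

drop len run work mask
list patid estring maxrun estart*, noobs
```

The numeric variables are created as int here only because the example is tiny. As reported in the summary, on the full 6 million observations creating them as double from the start avoided the cost of Stata converting storage types partway through.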

