
st: RE: String help


From: "Nick Cox" <n.j.cox@durham.ac.uk>
To: <statalist@hsphsun2.harvard.edu>
Subject: st: RE: String help
Date: Tue, 4 Oct 2005 21:08:12 +0100

You can make progress on this by looking
at the help for string functions. There 
is no royal road to geometry, or to this
kind of thing. The solutions tend to be 
pedestrian and literal. This could be 
refined, but it should help a bit. 

"PAGES" or "COST" can be the last word, 
so let's pick that off when it occurs. 

. rename string s 

. gen pages_or_cost = word(s,-1) if inlist(word(s,-1), "PAGES", "COST") 
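
To experiment with this step without touching the real dataset, a scratch copy can be typed in directly. The sketch below is an illustration only: the rows are taken from the example quoted at the end of this message, the variable is created as s (so no -rename- is needed), and the width str60 is an arbitrary assumption.

```stata
* scratch data for experimenting; -clear- drops whatever is in memory,
* so save the real dataset first
clear
input str60 s
"ABBOTT DIA 40410 CHLAMYDIA TSPK PAGES"
"COST"
"40410 CHLAMYDIAZYME PAGES"
"COST"
"COMPANY TOTAL PAGES"
"COST"
end

* same command as above: keep the last word only when it is PAGES or COST
gen pages_or_cost = word(s,-1) if inlist(word(s,-1), "PAGES", "COST")
list, clean noobs
```

Every row of this scratch data ends in "PAGES" or "COST", so pages_or_cost should be filled in throughout.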

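Zapping that will simplify things a bit. Create a copy to be safe. 
Note that -subinstr()- with a final argument of 1 removes the first 
occurrence, so this assumes that "PAGES" or "COST" does not also occur 
earlier in the string. 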

. gen s2 = subinstr(s, pages_or_cost,"",1) if pages_or_cost == word(s,-1) 

Now let's look for the drug number. We want the position of the 
first numeric digit. Some people would do this with regexps but as I 
warned my solution is pedestrian. First I find where the first "0" 
is by 

. gen index = index(s2,"0") if index(s2, "0") 

but I make sure that I get missing, not 0, if there is no occurrence. 

Then I see if any other numeric digit 1 ... 9 occurs earlier. 
Again, I need to be careful to ignore 0 results for -index()-, which 
mean "not found".  

. qui forval i = 1/9 { 
      replace index = min(index, index(s2, "`i'")) if index(s2, "`i'") 
  }
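
As an optional check, not part of the solution itself, the character at position index should be a numeric digit wherever index is not missing. A minimal sketch, assuming the variables s2 and index created above:

```stata
* optional sanity check: list any observation where the character
* found at position -index- is not one of the digits 0 to 9
list s2 index if index < . & !inrange(substr(s2, index, 1), "0", "9")
```

If the loop did its job, this should list nothing.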

Then like Caesar and ancient Gaul I attempt a division into three parts. 

. gen company = substr(s2,1,index - 1) 

. gen drug_number = substr(s2,index,5) 

. gen after = substr(s2,index + 5,.) 
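
For comparison, the regexp route mentioned earlier might look something like the sketch below. It assumes a Stata version that has -regexm()- and -regexs()- (Stata 9 or later) and that the drug number is always exactly five digits. The names company2, drug_number2 and after2 are made up here so that nothing created above is overwritten.

```stata
* a sketch of a regexp alternative, not a tested replacement for the above
* group 1: everything before the first digit
* group 2: five consecutive digits
* group 3: the rest of the string
local re "^([^0-9]*)([0-9][0-9][0-9][0-9][0-9])(.*)"

gen company2     = trim(regexs(1)) if regexm(s2, "`re'")
gen drug_number2 = regexs(2)       if regexm(s2, "`re'")
gen after2       = trim(regexs(3)) if regexm(s2, "`re'")
```

One difference to be aware of: where s2 contains no five-digit run (e.g. "COMPANY TOTAL"), the pattern does not match and all three new variables are left empty, whereas the -substr()- approach puts the whole of s2 into company. The -trim()- calls strip the leading and trailing spaces that the -substr()- results keep.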

Nick 
n.j.cox@durham.ac.uk 

Terra Curtis
 
> I am dealing with a string variable called 'string' like the 
> example below
> (this is copied from the data browser):
> 
> string
> ABBOTT DIA 40410 CHLAMYDIA TSPK PAGES
> COST
> 40410 CHLAMYDIAZYME PAGES
> COST
> 78920 INSTITUTIONAL PAGES
> COST
> 80000 VISION BL ANALYSER PAGES
> COST
> COMPANY TOTAL PAGES
> COST
> ABBOTT HPD 04200 AMIDATE PAGES
> COST
> 60700 AMINOSYN PAGES
> COST
> 53192 AMINOSYN II PAGES
> COST
> 76340 CALCIJEX PAGES
> COST
> 78920 INSTITUTIONAL PAGES
> COST
> 78920 MULTIPLE PRODUCTS PAGES
> COST
> COMPANY TOTAL PAGES
> COST
> 
> I want to split this up a certain way.  In some of the observations, a
> company name comes first, always the words directly before 
> any number in the
> string.  So first I want to split the string just at the 
> company name (and
> words before any numbers).  Then, I want to split it after 
> the 5 numbers.
> Lastly, I want to split it after the 5 numbers and before the 
> word "PAGES."
> When I am done, I want to have -- new variables, one with 
> company name, one
> with drug number (the 5 numbers), one with drug name (words 
> following the
> numbers, except "PAGES"), and one with either "PAGES" or 
> "COST" according to
> what is the last word in 'string.'  I guess this is a lot of 
> questions in one,
> but does anyone see an easy way to do this?  I'm new to 
> working with string variables.  
