Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Nick Cox <njcoxstata@gmail.com> |
To | statalist@hsphsun2.harvard.edu |
Subject | Re: st: Working with complex strings |
Date | Wed, 30 Nov 2011 15:26:12 +0000 |
Sounds good. But suppose there was no -p()- option --- or no -egen,
concat()-. It would not be a big deal to reinvent it.

Suppose we have strvar*, say strvar1-strvar10, to concatenate.

gen newstrvar = strvar1
foreach v of var strvar2-strvar10 {
    replace newstrvar = newstrvar + " " + `v'
}

Nick

On Wed, Nov 30, 2011 at 3:14 PM, Steve Nakoneshny <scnakone@ucalgary.ca> wrote:
> Nick,
>
> I hadn't known about the -p- option of -concat-. That will help me solve an unrelated problem I'm working on, thanks.
>
> Steve
>
>
> On 2011-11-30, at 2:08 AM, Nick Cox wrote:
>
>> Parsing on spaces can be more helpful than stated here. We just need
>> to reject "words" once we have found the first "word" that starts with
>> a numeric digit. That can be done in a loop. It also copes with the
>> possibility that numeric characters might be found within medication
>> names, but _not_ with the possibility that medication names start with
>> numeric characters.
>>
>> . split medication
>> variables created as string:
>> medication1  medication2  medication3  medication4
>>
>> . gen found = 0
>>
>> 4 here is empirical for this example. See how many variables -split- creates.
>>
>> . qui forval j = 1/4 {
>>   2.   replace found = 1 if inrange(substr(medication`j', 1, 1), "0", "9")
>>   3.   replace medication`j' = "" if found
>>   4. }
>>
>> . l
>>
>>      +--------------------------------------------------------------------------------------+
>>      |                   medication   medicati~1   medicati~2   medica~3   medica~4   found |
>>      |--------------------------------------------------------------------------------------|
>>   1. |       metoprolol 100 mg qday   metoprolol                                          1 |
>>   2. | metoprolol tatrate 150mg bid   metoprolol      tatrate                             1 |
>>   3. |         atenelol 150 mg qday     atenelol                                          1 |
>>   4. |              hctz 25 mg qday         hctz                                          1 |
>>   5. |               PEG interferon          PEG   interferon                             0 |
>>      |--------------------------------------------------------------------------------------|
>>   6. |            cimzia 50 mg qday       cimzia                                          1 |
>>      +--------------------------------------------------------------------------------------+
>>
>>
>> Then we put the words back together again:
>>
>> . egen medname = concat(medication?), p(" ")
>>
>> . l medication medname
>>
>>      +---------------------------------------------------+
>>      |                   medication              medname |
>>      |---------------------------------------------------|
>>   1. |       metoprolol 100 mg qday           metoprolol |
>>   2. | metoprolol tatrate 150mg bid   metoprolol tatrate |
>>   3. |         atenelol 150 mg qday             atenelol |
>>   4. |              hctz 25 mg qday                 hctz |
>>   5. |               PEG interferon       PEG interferon |
>>      |---------------------------------------------------|
>>   6. |            cimzia 50 mg qday               cimzia |
>>      +---------------------------------------------------+
>>
>>
>> On Wed, Nov 30, 2011 at 8:36 AM, Nick Cox <njcoxstata@gmail.com> wrote:
>>> -split- by default parses on spaces, which clearly is no good here
>>> given that medications can have compound names and dosages will not be
>>> discarded. Steve was evidently pointing to the -parse()- option, not
>>> suggesting that parsing on spaces was the answer.
>>>
>>> If we assume that (a) dose always starts with a number and (b) dose
>>> when specified always follows name of medication and (c) names never
>>> have numeric characters, then -split- can be used to parse on numeric
>>> characters. Here I used 1-9 but 0 should be added if it's ever the
>>> first numeric digit:
>>>
>>> . split medication, parse(1 2 3 4 5 6 7 8 9) limit(1)
>>> variable created as string:
>>> medication1
>>>
>>> . replace medication1 = trim(medication1)
>>> (5 real changes made)
>>>
>>> . l
>>>
>>>      +---------------------------------------------------+
>>>      |                   medication          medication1 |
>>>      |---------------------------------------------------|
>>>   1. |       metoprolol 100 mg qday           metoprolol |
>>>   2. | metoprolol tatrate 150mg bid   metoprolol tatrate |
>>>   3. |         atenelol 150 mg qday             atenelol |
>>>   4. |              hctz 25 mg qday                 hctz |
>>>   5. |               PEG interferon       PEG interferon |
>>>      |---------------------------------------------------|
>>>   6. |            cimzia 50 mg qday               cimzia |
>>>      +---------------------------------------------------+
>>>
>>> Another approach is to use -moss- (SSC):
>>>
>>> . moss medication, match("(.+) [1-9]+") regex
>>>
>>> . drop _count _pos1
>>>
>>> . rename _match1 medication2
>>>
>>> With this regular expression, -moss- misses names without dosages,
>>> which can just be copied across.
>>>
>>> . replace medication2 = medication if missing(medication2)
>>> (1 real change made)
>>>
>>> . l
>>>
>>>      +------------------------------------------------------------------------+
>>>      |                   medication          medication1          medication2 |
>>>      |------------------------------------------------------------------------|
>>>   1. |       metoprolol 100 mg qday           metoprolol           metoprolol |
>>>   2. | metoprolol tatrate 150mg bid   metoprolol tatrate   metoprolol tatrate |
>>>   3. |         atenelol 150 mg qday             atenelol             atenelol |
>>>   4. |              hctz 25 mg qday                 hctz                 hctz |
>>>   5. |               PEG interferon       PEG interferon       PEG interferon |
>>>      |------------------------------------------------------------------------|
>>>   6. |            cimzia 50 mg qday               cimzia               cimzia |
>>>      +------------------------------------------------------------------------+
>>>
>>> Nick
>>>
>>> On Wed, Nov 30, 2011 at 5:43 AM, Dudekula, Anwar <dudekulaan@upmc.edu> wrote:
>>>> Thank you very much
>>>>
>>>> I will work on it. Would the parse() option split metoprolol tatrate 150mg bid as
>>>>
>>>> metoprolol tatrate and 150mg bid
>>>>
>>>> Or
>>>>
>>>> metoprolol & tatrate & 150mg & bid
>>>>
>>>> Thank you
>>>> Anwar
>>>>
>>>> -----Original Message-----
>>>> From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Steve Nakoneshny
>>>> Sent: Wednesday, November 30, 2011 12:38 AM
>>>> To: statalist@hsphsun2.harvard.edu
>>>> Subject: Re: st: Working with complex strings
>>>>
>>>> - help split - would have answered this question.
>>>>
>>>> - split medication, parse( ) -
>>>>
>>>> should do what you want.
>>>
>>>
>>> On Nov 29, 2011, at 9:54 PM, "Dudekula, Anwar" <dudekulaan@upmc.edu> wrote:
>>>
>>>>> I am working with a deidentified hospital database with patient names (as string variable) and medications (as string variable) as follows
>>>>>
>>>>> Patients_name   medication
>>>>> ------------------------------------
>>>>> Patient-1       metoprolol 100 mg qday
>>>>> Patient-1       metoprolol tatrate 150mg bid
>>>>> Patient-1       atenelol 150 mg qday
>>>>> Patient-2       hctz 25 mg qday
>>>>> Patient-2       PEG interferon
>>>>> Patient-3       cimzia 50 mg qday
>>>>>
>>>>> Question: I am interested in the name of the medication only, not the dosage. Is it possible to split the medication string after the name, i.e.,
>>>>>
>>>>> 1) split metoprolol tatrate 150mg bid into metoprolol tatrate & 150mg bid
>>>>> 2) split metoprolol 100 mg qday into metoprolol & 100 mg qday

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
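
A further sketch, not taken from the thread: under the same assumptions (a)-(c) stated earlier in the thread, Stata's built-in regexm() and regexs() functions can probably do the extraction without installing -moss-. The variable name medname and the particular regular expression are illustrative choices, and the example data are simply the six rows from the original question.

```stata
clear
input str40 medication
"metoprolol 100 mg qday"
"metoprolol tatrate 150mg bid"
"atenelol 150 mg qday"
"hctz 25 mg qday"
"PEG interferon"
"cimzia 50 mg qday"
end

* Start from the full string, so names without a dosage are kept as they are
gen medname = medication

* Capture everything before the first run of spaces that is followed by a
* digit; rows with no such match (here "PEG interferon") are left untouched.
* Assumes names contain no digits and the dose starts with one.
replace medname = regexs(1) if regexm(medication, "^([^0-9]*[^0-9 ]) +[0-9]")

list medication medname
```

As with the -split- and -moss- approaches, this would go wrong for any medication whose name itself contains or starts with a digit, so it is worth checking the result against the original strings.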