Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

# Re: st: Working with complex strings

From: Nick Cox
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Working with complex strings
Date: Wed, 30 Nov 2011 15:26:12 +0000

```
Sounds good. But suppose there was no -p()- option --- or no -egen, concat()-.

It would not be a big deal to reinvent it. Suppose we have strvar*,
say strvar1-strvar10, to concatenate.

gen newstrvar = strvar1
foreach v of var strvar2-strvar10 {
    replace newstrvar = newstrvar + " " + `v'
}

Nick

On Wed, Nov 30, 2011 at 3:14 PM, Steve Nakoneshny <scnakone@ucalgary.ca> wrote:
> Nick,
>
> I hadn't known about the -p- option of -concat-. That will help me solve an unrelated problem I'm working on, thanks.
>
> Steve
>
>
> On 2011-11-30, at 2:08 AM, Nick Cox wrote:
>
>> Parsing on spaces can be more helpful than stated here. We just need
>> to reject "words" once we have found the first "word" that starts with
>> a numeric digit. That can be done in a loop. It also copes with the
>> possibility that numeric characters might be found within medication
>> names, but _not_ with the possibility that medication names start with
>> numeric characters.
>>
>> . split medication
>> variables created as string:
>> medication1  medication2  medication3  medication4
>>
>> . gen found = 0
>>
>> 4 here is empirical for this example. See how many variables -split- creates.
>>
>> . qui forval j = 1/4 {
>>  2. replace found = 1 if inrange(substr(medication`j', 1, 1), "0", "9")
>>  3. replace medication`j' = "" if found
>>  4. }
>>
>> . l
>>
>>     +--------------------------------------------------------------------------------------+
>>     |                   medication   medicati~1   medicati~2   medica~3   medica~4   found |
>>     |--------------------------------------------------------------------------------------|
>>  1. |       metoprolol 100 mg qday   metoprolol                                          1 |
>>  2. | metoprolol tatrate 150mg bid   metoprolol      tatrate                             1 |
>>  3. |         atenelol 150 mg qday     atenelol                                          1 |
>>  4. |              hctz 25 mg qday         hctz                                          1 |
>>  5. |               PEG interferon          PEG   interferon                             0 |
>>     |--------------------------------------------------------------------------------------|
>>  6. |            cimzia 50 mg qday       cimzia                                          1 |
>>     +--------------------------------------------------------------------------------------+
>>
>>
>> Then we put the words back together again:
>>
>> . egen medname = concat(medication?), p(" ")
>>
>> . l medication medname
>>
>>     +---------------------------------------------------+
>>     |                   medication              medname |
>>     |---------------------------------------------------|
>>  1. |       metoprolol 100 mg qday           metoprolol |
>>  2. | metoprolol tatrate 150mg bid   metoprolol tatrate |
>>  3. |         atenelol 150 mg qday             atenelol |
>>  4. |              hctz 25 mg qday                 hctz |
>>  5. |               PEG interferon       PEG interferon |
>>     |---------------------------------------------------|
>>  6. |            cimzia 50 mg qday               cimzia |
>>     +---------------------------------------------------+
>>
>>
>> On Wed, Nov 30, 2011 at 8:36 AM, Nick Cox <njcoxstata@gmail.com> wrote:
>>> -split- by default parses on spaces, which clearly is no good here
>>> given that medications can have compound names and dosages will not be
>>> discarded. Steve was evidently pointing to the -parse()- option, not
>>> suggesting that parsing on spaces was the answer.
>>>
>>> If we assume that (a) dose always starts with a number and (b) dose
>>> when specified always follows name of medication and (c) names never
>>> have numeric characters, then -split- can be used to parse on numeric
>>> characters. Here I used 1-9 but 0 should be added if it's ever the
>>> first numeric digit:
>>>
>>> . split medication, parse(1 2 3 4 5 6 7 8 9) limit(1)
>>> variable created as string:
>>> medication1
>>>
>>> . replace medication1 = trim(medication1)
>>>
>>> . l
>>>
>>>     +---------------------------------------------------+
>>>     |                   medication          medication1 |
>>>     |---------------------------------------------------|
>>>  1. |       metoprolol 100 mg qday           metoprolol |
>>>  2. | metoprolol tatrate 150mg bid   metoprolol tatrate |
>>>  3. |         atenelol 150 mg qday             atenelol |
>>>  4. |              hctz 25 mg qday                 hctz |
>>>  5. |               PEG interferon       PEG interferon |
>>>     |---------------------------------------------------|
>>>  6. |            cimzia 50 mg qday               cimzia |
>>>     +---------------------------------------------------+
>>>
>>> Another approach is to use -moss- (SSC):
>>>
>>> . moss medication, match("(.+) [1-9]+") regex
>>>
>>> . drop _count _pos1
>>>
>>> . rename _match1 medication2
>>>
>>> With this regular expression, -moss- misses names without dosages,
>>> which can just be copied across.
>>>
>>> . replace medication2 = medication if missing(medication2)
>>>
>>> . l
>>>
>>>     +------------------------------------------------------------------------+
>>>     |                   medication          medication1          medication2 |
>>>     |------------------------------------------------------------------------|
>>>  1. |       metoprolol 100 mg qday           metoprolol           metoprolol |
>>>  2. | metoprolol tatrate 150mg bid   metoprolol tatrate   metoprolol tatrate |
>>>  3. |         atenelol 150 mg qday             atenelol             atenelol |
>>>  4. |              hctz 25 mg qday                 hctz                 hctz |
>>>  5. |               PEG interferon       PEG interferon       PEG interferon |
>>>     |------------------------------------------------------------------------|
>>>  6. |            cimzia 50 mg qday               cimzia               cimzia |
>>>     +------------------------------------------------------------------------+
>>>
>>> Nick
>>>
>>> On Wed, Nov 30, 2011 at 5:43 AM, Dudekula, Anwar <dudekulaan@upmc.edu> wrote:
>>>> Thank you very much
>>>>
>>>> I will work on it. Would the parse() option split metoprolol tatrate 150mg bid as
>>>>
>>>> metoprolol tatrate and 150mg bid
>>>>
>>>> Or
>>>>
>>>> metoprolol & tatrate & 150mg &  bid
>>>>
>>>> Thank you
>>>> Anwar
>>>>
>>>> -----Original Message-----
>>>> From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Steve Nakoneshny
>>>> Sent: Wednesday, November 30, 2011 12:38 AM
>>>> To: statalist@hsphsun2.harvard.edu
>>>> Subject: Re: st: Working with complex strings
>>>>
>>>> - help split - would have answered this question.
>>>>
>>>> - split medication, parse( ) -
>>>>
>>>> should do what you want.
>>>
>>>
>>>  On Nov 29, 2011, at 9:54 PM, "Dudekula, Anwar" <dudekulaan@upmc.edu> wrote:
>>>
>>>>> I am working with a deidentified hospital database with patient names (as a string variable) and medications (as a string variable), as follows:
>>>>>
>>>>> Patients_name        medication
>>>>> ------------------------------------
>>>>> Patient-1            metoprolol 100 mg qday
>>>>> Patient-1            metoprolol tatrate 150mg bid
>>>>> Patient-1            atenelol 150 mg qday
>>>>> Patient-2            hctz 25 mg qday
>>>>> Patient-2            PEG interferon
>>>>> Patient-3            cimzia 50 mg qday
>>>>>
>>>>> Question: I am interested in the name of the medication only, not the dosage. Is it possible to split the medication string after the name, i.e.,
>>>>>
>>>>> 1) split  metoprolol tatrate 150mg bid into  metoprolol tatrate  &  150mg bid
>>>>> 2) split  metoprolol 100 mg qday into   metoprolol   &   100 mg qday

```
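The two main ideas in the thread can be pulled together in a short, self-contained sketch. It is not from the original posts, and it rests on the same assumptions Nick states: the dose starts with a digit, follows the name, and no name starts with a digit. The first variant uses Stata's built-in regexm() and regexs() functions to keep everything before the first digit. The second repeats the -split- loop but reads the number of new variables from r(nvars), which -split- returns, rather than hard-coding 4. The names word, medname_rx, and medname_loop are made up for illustration.

```stata
* Example data copied from the thread (spellings as posted).
clear
input str40 medication
"metoprolol 100 mg qday"
"metoprolol tatrate 150mg bid"
"atenelol 150 mg qday"
"hctz 25 mg qday"
"PEG interferon"
"cimzia 50 mg qday"
end

* Variant 1: capture the leading run of non-digit characters.
* Assumes names never contain digits before the dose.
gen medname_rx = trim(regexs(1)) if regexm(medication, "^([^0-9]*)")

* Variant 2: split on spaces, then blank every word from the first one
* that starts with a digit onwards. r(nvars) is left behind by -split-.
split medication, gen(word)
local n = r(nvars)
gen byte found = 0
forvalues j = 1/`n' {
    replace found = 1 if inrange(substr(word`j', 1, 1), "0", "9")
    replace word`j' = "" if found
}
egen medname_loop = concat(word*), punct(" ")
replace medname_loop = trim(medname_loop)

list medication medname_rx medname_loop
```

On these six rows both variants should give the same names as the listings in the thread. They differ when a name itself contains a digit after its first character: the regex variant cuts at that digit, while the loop variant only stops at a word that starts with one.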