Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: Working with complex strings

From	Steve Nakoneshny <[email protected]>
To	"[email protected]" <[email protected]>
Subject	Re: st: Working with complex strings
Date	Wed, 30 Nov 2011 08:14:22 -0700

Nick,

I hadn't known about the -p- option of -concat-. That will help me solve an unrelated problem I'm working on, thanks.

Steve


On 2011-11-30, at 2:08 AM, Nick Cox wrote:

> Parsing on spaces can be more helpful than stated here. We just need
> to reject "words" once we have found the first "word" that starts with
> a numeric digit. That can be done in a loop. It also copes with the
> possibility that numeric characters might be found within medication
> names, but _not_ with the possibility that medication names start with
> numeric characters.
> 
> . split medication
> variables created as string:
> medication1  medication2  medication3  medication4
> 
> . gen found = 0
> 
> The 4 here is specific to this example: check how many variables -split- creates and set the loop limit to match.
> 
> . qui forval j = 1/4 {
>  2. replace found = 1 if inrange(substr(medication`j', 1, 1), "0", "9")
>  3. replace medication`j' = "" if found
>  4. }
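> 
> As a hedged aside, the loop limit need not be typed by hand. The sketch below relies on -split- storing the number of new variables in r(nvars); the stub word, the local macro nwords and the flag seen_digit are names made up for this illustration.
> 
> ```stata
> * Sketch only: assumes a string variable named medication, as in this thread.
> split medication, generate(word)     // creates word1, word2, ...
> local nwords = r(nvars)              // how many variables -split- created
> 
> generate byte seen_digit = 0
> quietly forvalues j = 1/`nwords' {
>     * flag the first word that starts with a digit ...
>     replace seen_digit = 1 if inrange(substr(word`j', 1, 1), "0", "9")
>     * ... and blank that word and every later one
>     replace word`j' = "" if seen_digit
> }
> ```
> 
> A fresh stub also avoids a clash with any medication1, medication2, ... that already exist.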
> 
> . l
> 
>     +--------------------------------------------------------------------------------------+
>     |                   medication   medicati~1   medicati~2   medica~3   medica~4   found |
>     |--------------------------------------------------------------------------------------|
>  1. |       metoprolol 100 mg qday   metoprolol                                          1 |
>  2. | metoprolol tatrate 150mg bid   metoprolol      tatrate                             1 |
>  3. |         atenelol 150 mg qday     atenelol                                          1 |
>  4. |              hctz 25 mg qday         hctz                                          1 |
>  5. |               PEG interferon          PEG   interferon                             0 |
>     |--------------------------------------------------------------------------------------|
>  6. |            cimzia 50 mg qday       cimzia                                          1 |
>     +--------------------------------------------------------------------------------------+
> 
> 
> Then we put the words back together again:
> 
> . egen medname = concat(medication?), p(" ")
> 
> . l medication medname
> 
>     +---------------------------------------------------+
>     |                   medication              medname |
>     |---------------------------------------------------|
>  1. |       metoprolol 100 mg qday           metoprolol |
>  2. | metoprolol tatrate 150mg bid   metoprolol tatrate |
>  3. |         atenelol 150 mg qday             atenelol |
>  4. |              hctz 25 mg qday                 hctz |
>  5. |               PEG interferon       PEG interferon |
>     |---------------------------------------------------|
>  6. |            cimzia 50 mg qday               cimzia |
>     +---------------------------------------------------+
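> 
> In case the blanked words leave stray spaces behind in the concatenated result, a cautious tidy-up could look like the sketch below. It assumes medication1-medication4 exist as above; medname2 is a hypothetical variable name chosen so as not to collide with medname.
> 
> ```stata
> * Sketch only: rebuild the name, then remove any surplus blanks.
> egen medname2 = concat(medication?), punct(" ")
> * itrim() collapses runs of internal blanks; trim() strips leading and trailing ones
> replace medname2 = trim(itrim(medname2))
> ```
> 
> If there are no surplus blanks, the -replace- simply makes no changes.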
> 
> 
> On Wed, Nov 30, 2011 at 8:36 AM, Nick Cox <[email protected]> wrote:
>> -split- by default parses on spaces, which clearly is no good here
>> given that medications can have compound names and dosages will not be
>> discarded. Steve was evidently pointing to the -parse()- option, not
>> suggesting that parsing on spaces was the answer.
>> 
>> If we assume that (a) dose always starts with a number and (b) dose
>> when specified always follows name of medication and (c) names never
>> have numeric characters, then -split- can be used to parse on numeric
>> characters. Here I used 1-9 but 0 should be added if it's ever the
>> first numeric digit:
>> 
>> . split medication, parse(1 2 3 4 5 6 7 8 9) limit(1)
>> variable created as string:
>> medication1
>> 
>> . replace medication1 = trim(medication1)
>> (5 real changes made)
>> 
>> . l
>> 
>>     +---------------------------------------------------+
>>     |                   medication          medication1 |
>>     |---------------------------------------------------|
>>  1. |       metoprolol 100 mg qday           metoprolol |
>>  2. | metoprolol tatrate 150mg bid   metoprolol tatrate |
>>  3. |         atenelol 150 mg qday             atenelol |
>>  4. |              hctz 25 mg qday                 hctz |
>>  5. |               PEG interferon       PEG interferon |
>>     |---------------------------------------------------|
>>  6. |            cimzia 50 mg qday               cimzia |
>>     +---------------------------------------------------+
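>> 
>> For completeness, a sketch of the same idea with 0 included among the parse characters, under the same assumptions (a) to (c). The stub name is made up for this illustration so that it does not collide with medication1.
>> 
>> ```stata
>> * Sketch only: cut at the first digit, 0 included, and keep the first piece.
>> split medication, parse(0 1 2 3 4 5 6 7 8 9) limit(1) generate(name)
>> * the piece before the dose usually ends in a blank, so trim it
>> replace name1 = trim(name1)
>> ```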
>> 
>> Another approach is to use -moss- (SSC):
>> 
>> . moss medication, match("(.+) [1-9]+") regex
>> 
>> . drop _count _pos1
>> 
>> . rename _match1 medication2
>> 
>> With this regular expression, -moss- misses names without dosages,
>> which can just be copied across.
>> 
>> . replace medication2 = medication if missing(medication2)
>> (1 real change made)
>> 
>> . l
>> 
>>     +------------------------------------------------------------------------+
>>     |                   medication          medication1          medication2 |
>>     |------------------------------------------------------------------------|
>>  1. |       metoprolol 100 mg qday           metoprolol           metoprolol |
>>  2. | metoprolol tatrate 150mg bid   metoprolol tatrate   metoprolol tatrate |
>>  3. |         atenelol 150 mg qday             atenelol             atenelol |
>>  4. |              hctz 25 mg qday                 hctz                 hctz |
>>  5. |               PEG interferon       PEG interferon       PEG interferon |
>>     |------------------------------------------------------------------------|
>>  6. |            cimzia 50 mg qday               cimzia               cimzia |
>>     +------------------------------------------------------------------------+
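>> 
>> If installing -moss- is not an option, a similar result can probably be had from Stata's built-in regexm() and regexs(). This is a different technique from the -moss- call above, sketched under assumption (c) that names never contain numeric characters; medname_rx is a hypothetical variable name.
>> 
>> ```stata
>> * Sketch only: capture everything before the first digit.
>> * Names without a dose match in full, so no copying across is needed.
>> generate medname_rx = regexs(1) if regexm(medication, "^([^0-9]+)")
>> replace medname_rx = trim(medname_rx)
>> ```
>> 
>> Observations whose medication starts with a digit would be left missing, so those are worth checking by hand.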
>> 
>> Nick
>> 
>> On Wed, Nov 30, 2011 at 5:43 AM, Dudekula, Anwar <[email protected]> wrote:
>>> Thank you very much
>>> 
>>> I will work on it. Would the parse() option split metoprolol tatrate 150mg bid as
>>> 
>>> metoprolol tatrate and 150mg bid
>>> 
>>> Or
>>> 
>>> metoprolol & tatrate & 150mg &  bid
>>> 
>>> Thank you
>>> Anwar
>>> 
>>> -----Original Message-----
>>> From: [email protected] [mailto:[email protected]] On Behalf Of Steve Nakoneshny
>>> Sent: Wednesday, November 30, 2011 12:38 AM
>>> To: [email protected]
>>> Subject: Re: st: Working with complex strings
>>> 
>>> - help split - would have answered this question.
>>> 
>>> - split medication, parse( ) -
>>> 
>>> should do what you want.
>> 
>> 
>>  On Nov 29, 2011, at 9:54 PM, "Dudekula, Anwar" <[email protected]> wrote:
>> 
>>>> I am working with a deidentified hospital database with patient names (as a string variable) and medications (as a string variable), as follows:
>>>> 
>>>> Patients_name        medication
>>>> ------------------------------------
>>>> Patient-1            metoprolol 100 mg qday
>>>> Patient-1            metoprolol tatrate 150mg bid
>>>> Patient-1            atenelol 150 mg qday
>>>> Patient-2            hctz 25 mg qday
>>>> Patient-2            PEG interferon
>>>> Patient-3            cimzia 50 mg qday
>>>> 
>>>> Question: I am interested only in the names of the medications, not their dosages. Is it possible to split the medication string after the name, i.e.,
>>>> 
>>>> 1) split metoprolol tatrate 150mg bid into metoprolol tatrate & 150mg bid
>>>> 2) split metoprolol 100 mg qday into metoprolol & 100 mg qday
>>>> 
> 
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/


