Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: RE: extracting a specific portion of a string


From   Nick Cox <[email protected]>
To   [email protected]
Subject   Re: st: RE: extracting a specific portion of a string
Date   Thu, 17 Mar 2011 09:04:54 +0000

I agree with Eric and Travis. You need to look at the string
functions. This repeats Eric's main suggestions, but I am going to
spell out the underlying principles a bit more.

Much of the most valuable trickery was also exhibited in the very
recent thread started by Rebecca Pope.

First, as a small matter of personal taste, I am going to work in
lower case. Too much SHOUTING otherwise. It should be easy to
translate my code to the upper case version.

So, I presume a prior

gen lc = lower(v1)

For similar problems I always consider trying -strpos()- first.
-strpos()- returns the position of a substring, so it is non-zero when
the substring occurs and zero otherwise. If the exact position of the
substring is immaterial, then the twist that

strpos(strvar, "interesting substring") > 0

is 1 whenever "interesting substring" occurs and 0 otherwise gives
this function spin for this problem. Note the good taste of StataCorp
in not including a function -includes()- or -contains()- returning
true or false; such a function is superfluous, redundant and otiose
once you know this trick.
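
To make the trick concrete, here is a minimal sketch using -display- with
literal strings. The strings are lower-cased versions of values from the
example data; nothing here depends on the dataset in memory.

```stata
* position of "blood" within each string; 0 means "not found"
display strpos("blood,1stspecimen", "blood")
display strpos("mother'sblood", "blood")
display strpos("serum,1stspecimen", "blood")

* comparing with 0 turns the position into a 1/0 indicator
display strpos("mother'sblood", "blood") > 0
display strpos("serum,1stspecimen", "blood") > 0
```

The first three lines should show 1, 9 and 0, and the last two 1 and 0:
the position varies, but the comparison with zero only records whether
the substring is there at all.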

I would proceed step-by-step and generate indicators

gen is_blood = strpos(lc, "blood") > 0
gen is_serum = strpos(lc, "serum") > 0

and then classify e.g.

gen what_have_we_got = cond(is_blood & !is_serum, 1, ///
                               cond(is_serum & !is_blood, 2, ///
                               cond(is_serum & is_blood, 3, 4)))

What could go wrong? Spelling mistakes for one. What happens if it is
"blod" or "bloood", and so forth? That's why there is a ragbag
category in the variable just created that you should look at.

Sometimes you can fix a few spelling mistakes quickly.
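
As a sketch of such a quick fix, -subinstr()- can replace known
misspellings before the indicators are generated. The toy data and the
two misspellings "blod" and "bloood" are hypothetical; check what is
actually in your own ragbag category first.

```stata
* toy data with two hypothetical misspellings
clear
input str20 lc
"blood"
"blod,1stspecimen"
"bloood(lipemic)"
"serum,2ndspecimen"
end

* replace each misspelling; "." as last argument means all occurrences
replace lc = subinstr(lc, "bloood", "blood", .)
replace lc = subinstr(lc, "blod", "blood", .)

* the indicator now picks up the corrected values too
gen is_blood = strpos(lc, "blood") > 0
list
```

This is only practical when the list of misspellings is short and known.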

Sometimes you need the more powerful machinery of -regexm()- and -regexs()-.
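
For instance, here is a minimal sketch with toy data. -regexm()- reports
whether a regular expression matches, and -regexs(1)- returns whatever
the first parenthesised subexpression matched. The pattern blo+d, which
accepts one or more o's, and the variable names found and specimen are
my own illustration, not something taken from the thread.

```stata
clear
input str20 lc
"blood"
"mother'sblood"
"bloood(lipemic)"
"serum,2ndspecimen"
"plasma"
end

* regexs(1) returns the text matched by the first (...) group
gen found = regexs(1) if regexm(lc, "(blo+d|serum)")

* map every spelling variant of blood to a single value
gen specimen = "BLOOD" if regexm(found, "^blo+d$")
replace specimen = "SERUM" if found == "serum"
list
```

Observations that match neither alternative, like "plasma" here, are left
with empty strings, so they remain easy to inspect.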

On a detail that might confuse: Eric used -index()- and -strpos()-. In
essence, -index()- is the old name that still works, while -strpos()-
is the new name. It's the same function underneath the names.

Nick

On Thu, Mar 17, 2011 at 4:15 AM, Eric Booth <[email protected]> wrote:

> On Mar 16, 2011, at 10:43 PM, Travis Coan wrote:
>>
>> I would take a look at the -substr- function -- typing 'help substr' should get you there.
>>
>
> You should probably look at all the functions available in -help string_functions-.
> Note that -substr- alone wouldn't return the desired result in this example, e.g.:
>
> **********************!
> clear
> inp str20(v1)
> "BLOOD"
> "BLOOD(LIPEMIC)"
> "BLOOD(MODERATELYLY"
> "BLOOD, 2ND SPECIMEN"
> "BLOOD,1STSPECIMEN"
> "BLOOD,2NDSPECIMEN"
> "MOTHER'SBLOOD"
> "SERUM,1STSPECIMEN"
> "SERUM,2NDSPECIMEN"
> end
>
> g v2 = substr(v1, 1, 5)
> **note obs 7
>
> //using strpos and substr string functions//
> g str10 v4 = ""
> foreach x in "BLOOD" "SERUM" {
> g v`x' = strpos(v1, "`x'")
> replace v4 = substr(v1, v`x' , 5) if v`x'>0
> }
>
> //using index//
> g ind = 0
> replace ind = 1 if index(v1, "BLOOD")
> replace ind = 2 if index(v1, "SERUM")
> la def ii 1 "Blood" 2 "Serum", modify
> lab val ind ii
> li
> **********************!


Mendoza Aldana, Dr Jorge Antonio (WPRO) wrote:

>> My dataset has a string variable from which I need a specific portion. The content of the variable is like:
>>
>> BLOOD
>> BLOOD(LIPEMIC)
>> BLOOD(MODERATELYLY
>> BLOOD, 2ND SPECIMEN
>> BLOOD,1STSPECIMEN
>> BLOOD,2NDSPECIMEN
>> MOTHER'SBLOOD
>> SERUM,1STSPECIMEN
>> SERUM,2NDSPECIMEN
>>
>> and I need to generate a new variable containing either "BLOOD" or "SERUM"
>> I would appreciate very much if you can give me some hints on solving this.
>> I'm using Stata 11.1 on a Windows XP machine

>> Kind regards,
>> Jorge
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/statalist/faq
>> *   http://www.ats.ucla.edu/stat/stata/

© Copyright 1996–2018 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index