Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: Extracting substrings from variables.


From   Nick Cox <[email protected]>
To   [email protected]
Subject   Re: st: Extracting substrings from variables.
Date   Fri, 25 May 2012 18:42:59 +0100

Yes indeed; the regular expression is catching more (net, at least)
than your originally stated rules imply. I didn't look at it
carefully, as that was Brendan's idea. You need to be aware of the
precedence rules for operators: I don't know that StataCorp documents
those anywhere, but it seems that the alternatives are "^637", "642",
"O1". That is, the ^ anchor applies only to "637", so "642" and "O1"
can match anywhere in the string, which is not what you said.

Consider

. di regexm("  642", "^637|642|O1")
1

. di regexm("Stata rules OK O1", "^637|642|O1")
1

In these examples, an opening space does not rule out the regular
expression matching; nor do leading characters.
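
If the intent is that all three codes must start the string, one possible fix is to group the alternatives so that the anchor applies to the whole group. The following is a minimal sketch, assuming that Stata's regexm() handles parentheses around alternatives in the usual way; the pattern "^(637|642|O1)" does not appear in the thread and is only a suggestion.

```stata
* Ungrouped: ^ binds only to 637, so 642 and O1 may match anywhere
display regexm("  642", "^637|642|O1")
display regexm("Stata rules OK O1", "^637|642|O1")

* Grouped: ^ now applies to all three alternatives
display regexm("  642", "^(637|642|O1)")
display regexm("Stata rules OK O1", "^(637|642|O1)")
display regexm("O1 asdf", "^(637|642|O1)")
```

With the grouped pattern, the first two test strings should no longer match, while a string that really starts with "O1" still does.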

You need to look at the cases in which the two procedures give
different results:

list mdiag1 if (ht != 1 & preght1 != "") | (ht == 1 & preght1 == "")

Perhaps the regular expression is picking up cases with stuff you want
ignored, and perhaps the other way round.

If leading spaces exist, you need to worry that you are missing
" 637", "  637", etc.
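
If leading spaces should be ignored, one option is to strip them before applying the prefix rules. The sketch below uses a small made-up dataset; the variable names clean and ht2 are hypothetical, and whether leading blanks should be treated as noise is an assumption about your data that you would need to check.

```stata
* Toy data: a clean code, two codes with leading blanks, one non-match
clear
set obs 4
gen str12 mdiag1 = ""
replace mdiag1 = "637 asdf"   in 1
replace mdiag1 = "  637 asdf" in 2
replace mdiag1 = " O1 asdf"   in 3
replace mdiag1 = "8637 asdf"  in 4

* Strip leading blanks once, then apply the original prefix rules
gen str12 clean = ltrim(mdiag1)
gen byte ht2 = inlist(substr(clean, 1, 3), "637", "642") | substr(clean, 1, 2) == "O1"

list mdiag1 clean ht2
```

Here ht2 is a 0/1 indicator rather than 1/missing, which makes it easier to cross-tabulate against other variables.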

Nick

On Fri, May 25, 2012 at 3:26 PM, Amal Khanolkar <[email protected]> wrote:
> Hi again,
>
> It works now! I forgot to specify the '=1' in the gen command.
>
> However doing this the two ways (using gen with inlist and the regexs commands) I get slightly different numbers which shouldn't be the case....
>
>
> . gen ht=1 if inlist(substr(mdiag1, 1, 3), "637", "642") | substr(mdiag1,1, 2) == "O1"
> (2951413 missing values generated)
>
> . tab ht
>
>         ht |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>          1 |     40,043      100.00      100.00
> ------------+-----------------------------------
>      Total |     40,043      100.00
>
>
> . gen preght1 = regexs(0) if regexm(mdiag1, "^637|642|O1")
>
> . tab preght1
>
>    preght1 |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>        637 |      8,314       20.62       20.62
>        642 |     21,537       53.42       74.05
>         O1 |     10,462       25.95      100.00
> ------------+-----------------------------------
>      Total |     40,313      100.00
>
>
> Both ht & preght1 above are the same variable (or at least should be the same - not sure what's causing the difference of 270!)
>
>
> I also tried to combine/merge the many variables of preght I created all including the same diagnostic codes but from different time periods (named preght 1, preght2, 3, 4, 5, 6 etc....) using the egen command with the concat function - but it doesn't give me the right numbers - any other command that would do the job better?
>
> /Amal.
>
> ______
> From: [email protected] [[email protected]] on behalf of Nick Cox [[email protected]]
> Sent: 25 May 2012 16:04
> To: [email protected]
> Subject: Re: st: Extracting substrings from variables.
>
> Yes; my idea is that one of your parentheses ( or ) was missing! I've
> rechecked my example and it looks OK.
>
> if inlist(substr(m1diagx, 1, 3), "637", "642") | substr(m1diagx,
> 1, 2) == "O1"
>
> Stata is just like elementary algebra: parentheses () brackets [] and
> braces { } must all occur in pairs. You don't show us your code, and
> so you need to count for yourself.
>
> Nick
>
> On Fri, May 25, 2012 at 2:31 PM, Amal Khanolkar <[email protected]> wrote:
>> Thanks Brendan - it worked like a charm!  :)
>>
>> Nick - I tried your way using 'inlist' however I kept getting an error message that one bracket was missing - I tried several ways to try and solve the issue - but was unable to do so - any ideas?
>>
>> I agree with both of you - regexs can be annoying esp for me who came across it for the first time today :)
>>
>>
>> Thanks!
>>
>> /Amal.
>>
>>
>> ________________________________________
>> From: [email protected] [[email protected]] on behalf of Brendan Halpin [[email protected]]
>> Sent: 25 May 2012 14:07
>> To: [email protected]
>> Subject: Re: st: Extracting substrings from variables.
>>
>> On Fri, May 25 2012, Brendan Halpin wrote:
>>
>>> On Fri, May 25 2012, Amal Khanolkar wrote:
>>>
>>>> gen preght = regexs(0) if regexm(mdiag1x, "[^637] | [^642] | [^O1]")
>>>
>>> A quick and untested suggestion:
>>>
>>> . gen preght = regexs(0) if regexm(mdiag1x, "^(637)|(642)|(O1)")
>>
>> On testing, it seems the grouping parentheses are not necessary:
>>
>> ...................................................................
>> . input str10 mdiag1x
>>
>>        mdiag1x
>>  1.    "637 asdf"
>>  2.    "638 asdf"
>>  3.    "8637 asdf"
>>  4.    "642 asdf"
>>  5.    "O1 asdf"
>>  6. end
>>
>> . gen preght = regexs(0) if regexm(mdiag1x, "^637|642|O1")
>> (2 missing values generated)
>>
>> . gen hasdiag = regexm(mdiag1x, "^637|642|O1")
>>
>> . list
>>
>>     +------------------------------+
>>     |   mdiag1x   preght   hasdiag |
>>     |------------------------------|
>>  1. |  637 asdf      637         1 |
>>  2. |  638 asdf                  0 |
>>  3. | 8637 asdf                  0 |
>>  4. |  642 asdf      642         1 |
>>  5. |   O1 asdf       O1         1 |
>>     +------------------------------+
>> ...................................................................
