Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Extracting substrings from variable and combining variables.
From
Nick Cox <[email protected]>
To
[email protected]
Subject
Re: st: Extracting substrings from variable and combining variables.
Date
Fri, 1 Jun 2012 09:26:05 +0100
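That sounds predictable to me. "637" and "637" in the same observation
count as 2 when you add up the separate tabulations but, in their guise
as "637637", as 1 after concatenation. It's important to correct for
counting the same thing twice or more.
Consider this, where w is the number of codes in each concatenated value: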
clear
input str12 diag freq w
"637" 20922 1
"637637" 960 2
"637637637" 104 3
"637637637637" 3 4
"637642" 2 2
"642" 42108 1
"642637" 1 2
"642642" 748 2
"642642642" 7 3
"O1" 22634 1
"O1O1" 720 2
"O1O1O1" 17 3
"O1O1O1O1" 2 4
end
. tab diag [w=freq*w]
(frequency weights assumed)
        diag |      Freq.     Percent        Cum.
-------------+-----------------------------------
         637 |     20,922       23.01       23.01
      637637 |      1,920        2.11       25.12
   637637637 |        312        0.34       25.46
637637637637 |         12        0.01       25.48
      637642 |          4        0.00       25.48
         642 |     42,108       46.31       71.79
      642637 |          2        0.00       71.79
      642642 |      1,496        1.65       73.44
   642642642 |         21        0.02       73.46
          O1 |     22,634       24.89       98.35
        O1O1 |      1,440        1.58       99.94
      O1O1O1 |         51        0.06       99.99
    O1O1O1O1 |          8        0.01      100.00
-------------+-----------------------------------
       Total |     90,930      100.00
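
The same reconciliation can be checked on the data directly rather than from a hand-built table. What follows is a minimal sketch on made-up data (three variables instead of twelve, four invented observations). It assumes the preght variables are string and stored next to each other in the dataset. -egen, rownonmiss()- with the strok option counts the nonempty strings in each observation.

```stata
* toy data: invented observations, not from the original dataset
clear
input str3 preght1 str3 preght2 str3 preght3
"637" ""    ""
"637" "637" ""
"642" ""    "O1"
""    ""    ""
end

* number of nonempty diagnosis codes in each observation
egen ndiag = rownonmiss(preght1-preght3), strok

* concatenated version, as in the thread
egen preght = concat(preght1-preght3)

* total codes: what adding up the separate tabulations gives (5 here)
quietly summarize ndiag
display r(sum)

* observations with at least one code: what tabulating preght gives (3 here)
count if preght != ""
```

On the real data, using preght1-preght12 in place of preght1-preght3 should give 90,930 for the first number and 88,228 for the second, if the explanation above is right.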
On Fri, Jun 1, 2012 at 9:11 AM, Amal Khanolkar <[email protected]> wrote:
> Hi again,
>
> I tried to combine 12 such variables (examples of three below) to form one variable with the same 3 categories.
>
>
> tab preght1
>
>     preght1 |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>         637 |      8,314       20.76       20.76
>         642 |     21,268       53.11       73.88
>          O1 |     10,461       26.12      100.00
> ------------+-----------------------------------
>       Total |     40,043      100.00
>
>
> . tab preght2
>
>     preght2 |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>         637 |     11,202       33.51       33.51
>         642 |     15,191       45.44       78.95
>          O1 |      7,036       21.05      100.00
> ------------+-----------------------------------
>       Total |     33,429      100.00
>
>
>
> . tab preght4
>
>     preght4 |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>         637 |        797       18.02       18.02
>         642 |      1,747       39.51       57.53
>          O1 |      1,878       42.47      100.00
> ------------+-----------------------------------
>       Total |      4,422      100.00
>
> When I add up the 12 preght variables, I get a total of 90930 observations that should have my diagnosis of interest. However, when using egen as below I get only 88228!
>
> This is what I get when I run 'egen' with the 'concat' function:
>
> egen preght=concat(preght1 preght2 preght3 preght4 preght5 preght6 preght7 preght8 preght9 preght10 preght11 preght12)
> (2903228 missing values generated)
>
>
>       preght |      Freq.     Percent        Cum.
> -------------+-----------------------------------
>          637 |     20,922       23.71       23.71
>       637637 |        960        1.09       24.80
>    637637637 |        104        0.12       24.92
> 637637637637 |          3        0.00       24.92
>       637642 |          2        0.00       24.93
>          642 |     42,108       47.73       72.65
>       642637 |          1        0.00       72.65
>       642642 |        748        0.85       73.50
>    642642642 |          7        0.01       73.51
>           O1 |     22,634       25.65       99.16
>         O1O1 |        720        0.82       99.98
>       O1O1O1 |         17        0.02      100.00
>     O1O1O1O1 |          2        0.00      100.00
> -------------+-----------------------------------
>        Total |     88,228      100.00
>
>
> Thanks!
>
> Best regards,
>
> Amal Khanolkar, PhD candidate,
> ________________________________________
> From: [email protected] [[email protected]] on behalf of Nick Cox [[email protected]]
> Sent: 31 May 2012 17:34
> To: '[email protected]'
> Subject: RE: st: Extracting substrings from variable and combining variables.
>
> -egen, concat()- "didn't work": this can not be discussed without reference to exactly (a) what you want to do, (b) what you tried and (c) what happened.
>
> Nick
> [email protected]
>
> Amal Khanolkar
>
> Hi Nick & Brendan,
>
> Thanks so much for your help with the 'regex' commands in retrieving subjects with a common diagnosis from my dataset.
>
> I now have 12 such 'diagnostic' variables (preght1-preght12), all for, say, hypertension (12, because a patient might have received this diagnosis as the 1st, 7th or 12th diagnosis when admitted to hospital).
>
> I need to combine these 12 variables into one. I tried doing this using the 'egen' command with the concat function but it didn't work. Any tips on other commands I could try?
>
> The variables look like this and most of the 12 variables have the same 3 categories, but some have just 2 or 1:
>
> tab preght1
>
>     preght1 |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>         637 |      8,314       20.76       20.76
>         642 |     21,268       53.11       73.88
>          O1 |     10,461       26.12      100.00
> ------------+-----------------------------------
>       Total |     40,043      100.00
>
>
> . tab preght2
>
>     preght2 |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>         637 |     11,202       33.51       33.51
>         642 |     15,191       45.44       78.95
>          O1 |      7,036       21.05      100.00
> ------------+-----------------------------------
>       Total |     33,429      100.00
>
>
>
> . tab preght4
>
>     preght4 |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>         637 |        797       18.02       18.02
>         642 |      1,747       39.51       57.53
>          O1 |      1,878       42.47      100.00
> ------------+-----------------------------------
>       Total |      4,422      100.00
>
>
>
> . des preght1
>
>               storage   display    value
> variable name   type    format     label      variable label
> -------------------------------------------------------------------------------
> preght1         str3    %9s
>
>
>
> Thanks,
>
> /Amal.
>
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> From: [email protected] [[email protected]] on behalf of Nick Cox [[email protected]]
> Sent: 25 May 2012 20:22
> To: [email protected]
> Subject: Re: st: Extracting substrings from variables.
>
> As any leading spaces surely don't matter, consider using
>
> regexm(ltrim(mdiag1x), "^(637|642|O1)")
>
> Nick
>
> On Fri, May 25, 2012 at 7:17 PM, Brendan Halpin <[email protected]> wrote:
>> On Fri, May 25 2012, Nick Cox wrote:
>>
>>> . di regexm("Stata rules OK O1", "^637|642|O1")
>>> 1
>>
>> OK, I was wrong that the grouping parentheses were unnecessary. However,
>> the way I used them first was also wrong.
>>
>> Something like this is needed:
>>
>> . gen pright = regexs(0) if regexm(mdiag1x, "^(637|642|O1)")
>>
>> More evidence that Nick's reluctance about regexp is not unwise.
>>
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/