
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



RE: st: Extracting substrings from variable and combining variables.


From   Amal Khanolkar <[email protected]>
To   "[email protected]" <[email protected]>
Subject   RE: st: Extracting substrings from variable and combining variables.
Date   Fri, 1 Jun 2012 11:21:08 +0000

Thanks - this makes sense. Is there any way to correct for this in the egen command before combining the variables?

egen preght=concat(preght1 preght2 preght3 preght4 preght5 preght6 preght7 preght8 preght9 preght10 preght11 preght12)

I'm also curious to know how Stata gets categories with multiple '637's. Even though the total number of subjects is now 90930 as expected, each patient should in effect be diagnosed with a 637 only once. In other words, what do '637637' and '637637637637' mean?
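
A minimal sketch of one way to approach both questions, on made-up toy data with three variables instead of twelve. The names preght_all, preght_first, nfilled and has637 are hypothetical and do not come from the thread. The sketch assumes that -egen, concat()- simply joins whatever is non-empty across the row, so "637637" would mean that a patient has "637" recorded in two of the diagnosis slots.

```stata
clear
input str3 preght1 str3 preght2 str3 preght3
"637" ""    ""
""    "642" ""
"637" "637" ""
"O1"  ""    "O1"
""    ""    ""
"637" "642" ""
end

* concat() glues every non-empty entry together, so repeats appear as "637637"
egen preght_all = concat(preght1-preght3)

* One code per patient: keep the first non-empty entry, scanning left to right
gen str3 preght_first = ""
foreach v of varlist preght1-preght3 {
    replace preght_first = `v' if preght_first == "" & `v' != ""
}

* How many of the slots are filled for each patient
egen byte nfilled = rownonmiss(preght1-preght3), strok

* Patient-level indicator: counted once however often "637" is recorded
gen byte has637 = strpos(preght_all, "637") > 0

list, sep(0)
```

Keeping only the first entry discards the second code for a patient such as the last one ("637" and "642"), so separate indicators per code may be the safer choice when mixed diagnoses matter.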

/Amal.

Amal Khanolkar, PhD candidate,
Centre for Health Equity Studies (CHESS),
Karolinska Institutet,
106 91 Stockholm.

Ph# +46(0)8 162584/+46(0)73 0899409
www.chess.su.se
________________________________________
From: [email protected] [[email protected]] on behalf of Nick Cox [[email protected]]
Sent: 01 June 2012 10:26
To: [email protected]
Subject: Re: st: Extracting substrings from variable and combining variables.

That sounds predictable to me. "637" and "637" counts as 2 one way and
-- in its guise as "637637" -- as 1 the other way. It's important to
correct for counting the same thing twice or more.

Consider this:

clear
input str12 diag freq w
         "637"      20922     1
      "637637"         960     2
   "637637637"         104     3
 "637637637637"           3    4
      "637642"           2     2
         "642"      42108     1
      "642637"           1     2
      "642642"         748     2
   "642642642"           7     3
          "O1"      22634     1
        "O1O1"         720     2
      "O1O1O1"          17     3
    "O1O1O1O1"           2     4
end

. tab diag [w=freq*w]
(frequency weights assumed)

        diag |      Freq.     Percent        Cum.
-------------+-----------------------------------
         637 |     20,922       23.01       23.01
      637637 |      1,920        2.11       25.12
   637637637 |        312        0.34       25.46
637637637637 |         12        0.01       25.48
      637642 |          4        0.00       25.48
         642 |     42,108       46.31       71.79
      642637 |          2        0.00       71.79
      642642 |      1,496        1.65       73.44
   642642642 |         21        0.02       73.46
          O1 |     22,634       24.89       98.35
        O1O1 |      1,440        1.58       99.94
      O1O1O1 |         51        0.06       99.99
    O1O1O1O1 |          8        0.01      100.00
-------------+-----------------------------------
       Total |     90,930      100.00
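
The same reconciliation can be sketched without typing the weight column by hand. This is an illustrative check only, assuming the three codes are exactly "637", "642" and "O1"; the names n637, n642, nO1, ndiag, ndx, total_patients and total_diagnoses are hypothetical. It counts how many codes are packed into each concatenated string and then compares the number of patients with the number of recorded diagnoses.

```stata
clear
input str12 diag long freq
"637"          20922
"637637"         960
"637637637"      104
"637637637637"     3
"637642"           2
"642"          42108
"642637"           1
"642642"         748
"642642642"        7
"O1"           22634
"O1O1"           720
"O1O1O1"          17
"O1O1O1O1"         2
end

* Occurrences of each code = characters removed / length of the code
gen byte n637 = (strlen(diag) - strlen(subinstr(diag, "637", "", .))) / 3
gen byte n642 = (strlen(diag) - strlen(subinstr(diag, "642", "", .))) / 3
gen byte nO1  = (strlen(diag) - strlen(subinstr(diag, "O1", "", .))) / 2
gen byte ndiag = n637 + n642 + nO1

* Patients: each row of the original data counted once
quietly summarize freq
scalar total_patients = r(sum)

* Diagnoses: each non-empty slot counted once
gen long ndx = freq * ndiag
quietly summarize ndx
scalar total_diagnoses = r(sum)

display "patients:  " total_patients
display "diagnoses: " total_diagnoses
```

The two totals come out as 88228 and 90930, matching the -egen- table and the sum over the 12 separate tabulations respectively.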





On Fri, Jun 1, 2012 at 9:11 AM, Amal Khanolkar <[email protected]> wrote:
> Hi again,
>
> I tried to combine 12 such variables (examples of three below) to form one variable with the same 3 categories.
>
>
> tab preght1
>
>    preght1 |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>        637 |      8,314       20.76       20.76
>        642 |     21,268       53.11       73.88
>         O1 |     10,461       26.12      100.00
> ------------+-----------------------------------
>      Total |     40,043      100.00
>
>
> .                         tab preght2
>
>    preght2 |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>        637 |     11,202       33.51       33.51
>        642 |     15,191       45.44       78.95
>         O1 |      7,036       21.05      100.00
> ------------+-----------------------------------
>      Total |     33,429      100.00
>
>
>
> .                         tab preght4
>
>    preght4 |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>        637 |        797       18.02       18.02
>        642 |      1,747       39.51       57.53
>         O1 |      1,878       42.47      100.00
> ------------+-----------------------------------
>      Total |      4,422      100.00
>
> When I add up the 12 preght variables, I get a total of 90930 observations that should have my diagnosis of interest. However, when using egen as below I get only 88228!
>
> This is what I get when I run 'egen' with the 'concat' function:
>
> egen preght=concat(preght1 preght2 preght3 preght4 preght5 preght6 preght7 preght8 preght9 preght10 preght11 preght12)
> (2903228 missing values generated)
>
>
>      preght |      Freq.     Percent        Cum.
> -------------+-----------------------------------
>         637 |     20,922       23.71       23.71
>      637637 |        960        1.09       24.80
>   637637637 |        104        0.12       24.92
> 637637637637 |          3        0.00       24.92
>      637642 |          2        0.00       24.93
>         642 |     42,108       47.73       72.65
>      642637 |          1        0.00       72.65
>      642642 |        748        0.85       73.50
>   642642642 |          7        0.01       73.51
>          O1 |     22,634       25.65       99.16
>        O1O1 |        720        0.82       99.98
>      O1O1O1 |         17        0.02      100.00
>    O1O1O1O1 |          2        0.00      100.00
> -------------+-----------------------------------
>       Total |     88,228      100.00
>
>
> Thanks!
>
> Best regards,
>
> Amal Khanolkar, PhD candidate,
> ________________________________________
> From: [email protected] [[email protected]] on behalf of Nick Cox [[email protected]]
> Sent: 31 May 2012 17:34
> To: '[email protected]'
> Subject: RE: st: Extracting substrings from variable and combining variables.
>
> -egen, concat()- "didn't work": this can not be discussed without reference to exactly (a) what you want to do, (b) what you tried and (c) what happened.
>
> Nick
> [email protected]
>
> Amal Khanolkar
>
> Hi Nick & Brendan,
>
> Thanks so much for your help with the 'regex' commands in retrieving subjects with a common diagnosis from my dataset.
>
> I now have 12 such 'diagnostic' variables (preght1-12), all for, say, hypertension (12, as a patient might have received this diagnosis as the 1st or 7th or 12th diagnosis when admitted to hospital).
>
> I need to combine these 12 variables into one. I tried doing this using the 'egen' command with the concat function but it didn't work. Any tips on other commands I could try?
>
> The variables look like this and most of the 12 variables have the same 3 categories, but some have just 2 or 1:
>
>                         tab preght1
>
>    preght1 |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>        637 |      8,314       20.76       20.76
>        642 |     21,268       53.11       73.88
>         O1 |     10,461       26.12      100.00
> ------------+-----------------------------------
>      Total |     40,043      100.00
>
>
> .                         tab preght2
>
>    preght2 |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>        637 |     11,202       33.51       33.51
>        642 |     15,191       45.44       78.95
>         O1 |      7,036       21.05      100.00
> ------------+-----------------------------------
>      Total |     33,429      100.00
>
>
>
> .                         tab preght4
>
>    preght4 |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>        637 |        797       18.02       18.02
>        642 |      1,747       39.51       57.53
>         O1 |      1,878       42.47      100.00
> ------------+-----------------------------------
>      Total |      4,422      100.00
>
>
>
> . des  preght1
>
>              storage  display     value
> variable name   type   format      label      variable label
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> preght1         str3   %9s
>
>
>
> Thanks,
>
> /Amal.
>
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> From: [email protected] [[email protected]] on behalf of Nick Cox [[email protected]]
> Sent: 25 May 2012 20:22
> To: [email protected]
> Subject: Re: st: Extracting substrings from variables.
>
> As any leading spaces surely don't matter, consider using
>
> regexm(ltrim(mdiag1x), "^(637|642|O1)")
>
> Nick
>
> On Fri, May 25, 2012 at 7:17 PM, Brendan Halpin <[email protected]> wrote:
>> On Fri, May 25 2012, Nick Cox wrote:
>>
>>> . di regexm("Stata rules OK O1", "^637|642|O1")
>>> 1
>>
>> OK, I was wrong that the grouping parentheses were unnecessary. However,
>> the way I used them first was also wrong.
>>
>> Something like this is needed:
>>
>> . gen pright = regexs(0) if regexm(mdiag1x, "^(637|642|O1)")
>>
>> More evidence that Nick's reluctance about regexp is not unwise.
>>
>
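
The point about grouping can be checked directly. A short sketch using made-up strings apart from the first one, which is the example quoted above: without parentheses the ^ anchor applies only to the first alternative, and leading blanks defeat the anchor unless they are trimmed first.

```stata
* Ungrouped: "O1" may match anywhere in the string, so this displays 1
display regexm("Stata rules OK O1", "^637|642|O1")

* Grouped: the anchor applies to all three alternatives, so this displays 0
display regexm("Stata rules OK O1", "^(637|642|O1)")

* A string that starts with one of the codes: displays 1
display regexm("637.30", "^(637|642|O1)")

* Leading blanks: 0 without ltrim(), 1 with it
display regexm("  642A", "^(637|642|O1)")
display regexm(ltrim("  642A"), "^(637|642|O1)")
```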
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/

