Re: st: Extracting substrings from variable and combining variables.


From   Nick Cox <njcoxstata@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Extracting substrings from variable and combining variables.
Date   Fri, 1 Jun 2012 09:26:05 +0100

That sounds predictable to me. "637" and "637" count as 2 when you add
up the separate tabulations and -- in their guise as "637637" -- as 1
after -egen, concat()-. It's important to correct for counting the same
thing twice or more.
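
To see the two ways of counting side by side, here is a minimal sketch on made-up data. It assumes three diagnosis variables rather than your twelve; the values and the variable name ndiag are invented for illustration only.

```stata
* Made-up data: three diagnosis variables instead of twelve
clear
input str3 preght1 str3 preght2 str3 preght3
"637" ""    ""
"637" "637" ""
"642" ""    "637"
""    "O1"  ""
""    ""    ""
end

* One concatenated string per observation, however many codes it holds
egen preght = concat(preght1 preght2 preght3)

* Number of nonmissing diagnoses in each observation (strok allows strings)
egen ndiag = rownonmiss(preght1 preght2 preght3), strok

* Observations with any diagnosis: what -tab preght- adds up to
count if preght != ""

* Diagnoses recorded: what the separate tabulations add up to
quietly summarize ndiag, meanonly
display r(sum)
```

Here -count- reports 4 observations while the sum of ndiag is 6, because two observations carry two codes each. On real data the same idea would use preght1-preght12 in place of the three variables.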

Consider this:

clear
input str12 diag freq w
         "637"      20922     1
      "637637"         960     2
   "637637637"         104     3
 "637637637637"           3     4
      "637642"           2     2
         "642"      42108     1
      "642637"           1     2
      "642642"         748     2
   "642642642"           7     3
          "O1"      22634     1
        "O1O1"         720     2
      "O1O1O1"          17     3
    "O1O1O1O1"           2     4
end

. tab diag [w=freq*w]
(frequency weights assumed)

        diag |      Freq.     Percent        Cum.
-------------+-----------------------------------
         637 |     20,922       23.01       23.01
      637637 |      1,920        2.11       25.12
   637637637 |        312        0.34       25.46
637637637637 |         12        0.01       25.48
      637642 |          4        0.00       25.48
         642 |     42,108       46.31       71.79
      642637 |          2        0.00       71.79
      642642 |      1,496        1.65       73.44
   642642642 |         21        0.02       73.46
          O1 |     22,634       24.89       98.35
        O1O1 |      1,440        1.58       99.94
      O1O1O1 |         51        0.06       99.99
    O1O1O1O1 |          8        0.01      100.00
-------------+-----------------------------------
       Total |     90,930      100.00
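
Both totals can also be checked without -tabulate-. This is only a sketch: it re-enters the same frequencies as above, and ndiag is a name made up here for the number of diagnoses each row stands for.

```stata
clear
input str12 diag long freq byte w
         "637"      20922     1
      "637637"        960     2
   "637637637"        104     3
"637637637637"          3     4
      "637642"          2     2
         "642"      42108     1
      "642637"          1     2
      "642642"        748     2
   "642642642"          7     3
          "O1"      22634     1
        "O1O1"        720     2
      "O1O1O1"         17     3
    "O1O1O1O1"          2     4
end

* Observations: each concatenated string counted once
quietly summarize freq, meanonly
display r(sum)

* Diagnoses: each observation counted once per code it holds
generate long ndiag = freq * w
quietly summarize ndiag, meanonly
display r(sum)
```

The first total is 88228 and the second is 90930. The gap of 2,702 is the number of second and later codes held by observations that have more than one.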





On Fri, Jun 1, 2012 at 9:11 AM, Amal Khanolkar <Amal.Khanolkar@ki.se> wrote:
> Hi again,
>
> I tried to combine 12 such variables (examples of three below) to form one variable with the same 3 categories.
>
>
> tab preght1
>
>    preght1 |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>        637 |      8,314       20.76       20.76
>        642 |     21,268       53.11       73.88
>         O1 |     10,461       26.12      100.00
> ------------+-----------------------------------
>      Total |     40,043      100.00
>
>
> . tab preght2
>
>    preght2 |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>        637 |     11,202       33.51       33.51
>        642 |     15,191       45.44       78.95
>         O1 |      7,036       21.05      100.00
> ------------+-----------------------------------
>      Total |     33,429      100.00
>
>
>
> . tab preght4
>
>    preght4 |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>        637 |        797       18.02       18.02
>        642 |      1,747       39.51       57.53
>         O1 |      1,878       42.47      100.00
> ------------+-----------------------------------
>      Total |      4,422      100.00
>
> When I add up the 12 preght variables, I get a total of 90930 observations that should have my diagnosis of interest. However, when using egen as below I get only 88228!
>
> This is what I get when I run 'egen' with the 'concat' function:
>
> egen preght=concat(preght1 preght2 preght3 preght4 preght5 preght6 preght7 preght8 preght9 preght10 preght11 preght12)
> (2903228 missing values generated)
>
>
>      preght |      Freq.     Percent        Cum.
> -------------+-----------------------------------
>         637 |     20,922       23.71       23.71
>      637637 |        960        1.09       24.80
>   637637637 |        104        0.12       24.92
> 637637637637 |          3        0.00       24.92
>      637642 |          2        0.00       24.93
>         642 |     42,108       47.73       72.65
>      642637 |          1        0.00       72.65
>      642642 |        748        0.85       73.50
>   642642642 |          7        0.01       73.51
>          O1 |     22,634       25.65       99.16
>        O1O1 |        720        0.82       99.98
>      O1O1O1 |         17        0.02      100.00
>    O1O1O1O1 |          2        0.00      100.00
> -------------+-----------------------------------
>       Total |     88,228      100.00
>
>
> Thanks!
>
> Best regards,
>
> Amal Khanolkar, PhD candidate,
> ________________________________________
> From: owner-statalist@hsphsun2.harvard.edu [owner-statalist@hsphsun2.harvard.edu] on behalf of Nick Cox [n.j.cox@durham.ac.uk]
> Sent: 31 May 2012 17:34
> To: 'statalist@hsphsun2.harvard.edu'
> Subject: RE: st: Extracting substrings from variable and combining variables.
>
> -egen, concat()- "didn't work": this cannot be discussed without reference to exactly (a) what you want to do, (b) what you tried and (c) what happened.
>
> Nick
> n.j.cox@durham.ac.uk
>
> Amal Khanolkar
>
> Hi Nick & Brendan,
>
> Thanks so much for your help with the 'regex' commands in retrieving subjects with a common diagnosis from my dataset.
>
> I now have 12 such 'diagnostic' variables (preght1-12), all for, say, hypertension (12, as a patient might have received this diagnosis as the 1st or 7th or 12th diagnosis when admitted to hospital).
>
> I need to combine these 12 variables into one. I tried doing this using the 'egen' command with the concat function but it didn't work. Any tips on other commands I could try?
>
> The variables look like this and most of the 12 variables have the same 3 categories, but some have just 2 or 1:
>
> . tab preght1
>
>    preght1 |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>        637 |      8,314       20.76       20.76
>        642 |     21,268       53.11       73.88
>         O1 |     10,461       26.12      100.00
> ------------+-----------------------------------
>      Total |     40,043      100.00
>
>
> . tab preght2
>
>    preght2 |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>        637 |     11,202       33.51       33.51
>        642 |     15,191       45.44       78.95
>         O1 |      7,036       21.05      100.00
> ------------+-----------------------------------
>      Total |     33,429      100.00
>
>
>
> . tab preght4
>
>    preght4 |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>        637 |        797       18.02       18.02
>        642 |      1,747       39.51       57.53
>         O1 |      1,878       42.47      100.00
> ------------+-----------------------------------
>      Total |      4,422      100.00
>
>
>
> . des  preght1
>
>              storage  display     value
> variable name   type   format      label      variable label
> ----------------------------------------------------------------------
> preght1         str3   %9s
>
>
>
> Thanks,
>
> /Amal.
>
> ----------------------------------------------------------------------
> From: owner-statalist@hsphsun2.harvard.edu [owner-statalist@hsphsun2.harvard.edu] on behalf of Nick Cox [njcoxstata@gmail.com]
> Sent: 25 May 2012 20:22
> To: statalist@hsphsun2.harvard.edu
> Subject: Re: st: Extracting substrings from variables.
>
> As any leading spaces surely don't matter, consider using
>
> regexm(ltrim(mdiag1x), "^(637|642|O1)")
>
> Nick
>
> On Fri, May 25, 2012 at 7:17 PM, Brendan Halpin <brendan.halpin@ul.ie> wrote:
>> On Fri, May 25 2012, Nick Cox wrote:
>>
>>> . di regexm("Stata rules OK O1", "^637|642|O1")
>>> 1
>>
>> OK, I was wrong that the grouping parentheses were unnecessary. However,
>> the way I used them first was also wrong.
>>
>> Something like this is needed:
>>
>> . gen pright = regexs(0) if regexm(mdiag1x, "^(637|642|O1)")
>>
>> More evidence that Nick's reluctance about regexp is not unwise.
>>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/

