



Re: st: Extracting substrings from variable and combining variables.


From   Nick Cox <njcoxstata@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Extracting substrings from variable and combining variables.
Date   Mon, 4 Jun 2012 11:35:34 +0100

This helps clarify what you want. But as already shown in this thread
your data show that some people are both "637" and "642", so you can't
get a variable like this. A string variable can't be both "637" and
"642". At most you can take the composite string variable and edit it.

I already explained the double counting at
http://www.stata.com/statalist/archive/2012-06/msg00010.html so that's
not an issue.

Nick
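
[Editorial sketch] The point about overlapping diagnoses can be checked on a toy dataset. The example below is only an illustration under assumed data: the three observations, the code values, and the use of just two variables (mdiag1 and mdiag2) are hypothetical, not taken from the poster's file. It builds the indicator variables the same way as in the thread and then counts people who carry both "637" and "642".

```stata
* Toy data (hypothetical): three people, two diagnosis variables
clear
input str5 mdiag1 str5 mdiag2
"6371"  "6420"
"6420"  "6421"
"O149"  ""
end

gen byte has637 = 0
gen byte has642 = 0
gen byte hasO1  = 0

* Flag each person once per diagnosis group, whichever variable holds it
forval j = 1/2 {
    replace has637 = 1 if substr(mdiag`j', 1, 3) == "637"
    replace has642 = 1 if substr(mdiag`j', 1, 3) == "642"
    replace hasO1  = 1 if substr(mdiag`j', 1, 2) == "O1"
}

* People with both 637 and 642: no single three-category variable fits them
count if has637 & has642

* People with 642 at all: fewer than the number of 642 entries across variables
count if has642
```

In this toy data the first person has both codes, and the second person has "642" in both variables, so "642" occurs three times but in only two people. Adding up per-variable tabulations counts occurrences, while the indicators count people, which is why the two totals need not agree.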

On Mon, Jun 4, 2012 at 11:09 AM, Amal Khanolkar <Amal.Khanolkar@ki.se> wrote:
> Hi Nick,
>
> Sorry for the confusion: I missed your request for a better explanation of what I mean by combining.
>
> If I have the following 3 variables, preght1, 2 & 3:
>
> . tab preght1
>
>    preght1 |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>        637 |      8,314       20.76       20.76
>        642 |     21,268       53.11       73.88
>         O1 |     10,461       26.12      100.00
> ------------+-----------------------------------
>      Total |     40,043      100.00
>
> . tab preght2
>
>    preght2 |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>        637 |     11,202       33.51       33.51
>        642 |     15,191       45.44       78.95
>         O1 |      7,036       21.05      100.00
> ------------+-----------------------------------
>      Total |     33,429      100.00
>
> I'd like to generate preghtX, where I combine the above 3 categories from both preght1 and preght2 as below:
>
>
>    preghtX |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>        637 |     19,516       26.56       26.56
>        642 |     36,459       49.62       76.19
>         O1 |     17,497       23.81      100.00
> ------------+-----------------------------------
>      Total |     73,472      100.00
>
>
> I did try something very similar to what you suggested below:
>
>
>  forval j = 1/8 {
>  2.          replace hasO1 = 1 if hasO1 == 0 & substr(mdiag`j', 1, 2) == "O1"
>  3.          replace has637 = 1 if has637 == 0 & substr(mdiag`j', 1, 3) == "637"
>  4.          replace has642 = 1 if has642 == 0 & substr(mdiag`j', 1, 3) == "642"
>  5. }
> (10461 real changes made)
> (8314 real changes made)
> (21268 real changes made)
> (6753 real changes made)
> (11007 real changes made)
> (14844 real changes made)
> (3637 real changes made)
> (2092 real changes made)
> (5152 real changes made)
> (1718 real changes made)
> (579 real changes made)
> (1602 real changes made)
> (480 real changes made)
> (0 real changes made)
> (0 real changes made)
> (202 real changes made)
> (0 real changes made)
> (0 real changes made)
> (74 real changes made)
> (0 real changes made)
> (0 real changes made)
> (36 real changes made)
> (0 real changes made)
> (0 real changes made)
>
> .
> end of do-file
>
> . sum hasO1 has637 has642
>
>    Variable |       Obs        Mean    Std. Dev.       Min        Max
> -------------+--------------------------------------------------------
>       hasO1 |   2991456    .0078092    .0880242          0          1
>      has637 |   2991456    .0073516    .0854258          0          1
>      has642 |   2991456    .0143295    .1188451          0          1
>
> . tab  hasO1
>
>      hasO1 |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>          0 |  2,968,095       99.22       99.22
>          1 |     23,361        0.78      100.00
> ------------+-----------------------------------
>      Total |  2,991,456      100.00
>
> . tab has637
>
>     has637 |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>          0 |  2,969,464       99.26       99.26
>          1 |     21,992        0.74      100.00
> ------------+-----------------------------------
>      Total |  2,991,456      100.00
>
> . tab has642
>
>     has642 |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>          0 |  2,948,590       98.57       98.57
>          1 |     42,866        1.43      100.00
> ------------+-----------------------------------
>      Total |  2,991,456      100.00
>
> The reason I was a bit unsure of the above method is that the subjects coded as '1' above total 88219 and not 90930 as they should. I wasn't able to figure out how I was losing the 2711 additional subjects: whether Stata treated them as duplicates or something else.
>
> But thanks for your help! Just wanted to clear up why I didn't use the above method discussed last week.
>
> Best regards,
>
> /Amal.
> ________________________________________
> From: owner-statalist@hsphsun2.harvard.edu [owner-statalist@hsphsun2.harvard.edu] on behalf of Nick Cox [njcoxstata@gmail.com]
> Sent: 04 June 2012 11:20
> To: statalist@hsphsun2.harvard.edu
> Subject: Re: st: Extracting substrings from variable and combining variables.
>
> Previously I wrote
>
> " I don't know exactly what you want, so that rules out further
> suggestions from me for the time being. You would get better help by
> giving examples of what the variables you want would look like."
>
> You've not done this. All that I can pick up here is that you want to
> combine variables. I don't know what that "combining" means. So, this
> is another (but final) attempt from me to help.
>
> Note that -regexm()- and -regexs()- are functions, not commands. This
> is not just a piece of pedantry, as (1) referring to functions as
> commands may confuse at least some readers, and clarifies nothing; (2)
> thinking of these, always, as functions helps remind everyone that
> they are defined and documented distinctly.
>
> It seems that you have variables -mdiag1-mdiag8- and wish to extract
> diagnoses "O1", "637", "642". You expect those diagnoses to be leading
> substrings.  You can create a new composite variable this way.
>
> gen anydiag = ""
>
> foreach diag in O1 637 642 {
>         local len = length("`diag'")
>         forval j = 1/8 {
>                   replace anydiag = anydiag + "`diag'" if substr(mdiag`j', 1, `len') == "`diag'"
>         }
> }
>
> But we've already gone over similar ideas in this thread. I don't
> think you ever said why you can't work from that resulting composite
> variable.
>
> You can create new indicator variables this way
>
> gen hasO1 = 0
> gen has637 = 0
> gen has642 = 0
>
> forval j = 1/8 {
>         replace hasO1 = 1 if hasO1 == 0 & substr(mdiag`j', 1, 2) == "O1"
>         replace has637 = 1 if has637 == 0 & substr(mdiag`j', 1, 3) == "637"
>         replace has642 = 1 if has642 == 0 & substr(mdiag`j', 1, 3) == "642"
> }
>
> This can be done with regex machinery too as a matter of taste.
>
> Nick
>
> On Mon, Jun 4, 2012 at 9:42 AM, Amal Khanolkar <Amal.Khanolkar@ki.se> wrote:
>
>> Originally, I started using the 'regex' command to extract ICD codes from my variables of interest shown below (mdiag1, mdiag2, mdiag3, mdiag4 etc....). I'm extracting the same ICD codes from all the mdiag variables starting with the numbers/letters: 637, 642 and O1. Initially I extracted the ICD codes from each mdiag variable separately with the idea of combining them at the end. But that seems a bit more complicated now. Maybe, one solution could be to extract all ICD codes from all mdiag variables at the same time. There are 12 such mdiag variables.
>>
>> gen preght1 = regexs(0) if regexm(mdiag1, "^(637|642|O1)")
>>                        tab preght1
>>
>>                        gen preght2 = regexs(0) if regexm(mdiag2, "^(637|642|O1)")
>>                        tab preght2
>>
>>                        gen preght3 = regexs(0) if regexm(mdiag3, "^(637|642|O1)")
>>                        tab preght3
>>
>>                        gen preght4 = regexs(0) if regexm(mdiag4, "^(637|642|O1)")
>>                        tab preght4
>>
>>                        gen preght5 = regexs(0) if regexm(mdiag5, "^(637|642|O1)")
>>                        tab preght5
>>
>>                        gen preght6 = regexs(0) if regexm(mdiag6, "^(637|642|O1)")
>>                        tab preght6
>>
>>                        gen preght7 = regexs(0) if regexm(mdiag7, "^(637|642|O1)")
>>                        tab preght7
>>
>>                        gen preght8 = regexs(0) if regexm(mdiag8, "^(637|642|O1)")
>>                        tab preght8
>>
>> The above generates 8 preght variables and works great.
>>
>> Initially I tried to combine the (mdiagX, "^(637|642|O1) for each mdiag variable by enclosing them in separate brackets one after another. But it doesn't work. How do I modify the regexs/regexm commands to be able to tell Stata to pluck out the ICD codes for several variables in the same command line?
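
[Editorial sketch] The repeated -generate- statements in the original question can be written as one loop over the variable suffix, using the same -regexm()- and -regexs()- functions. The block below is a minimal illustration under assumed data: the two-observation dataset and the code values are hypothetical, and only two mdiag variables are used. With the real data the loop bound would be the number of mdiag variables.

```stata
* Toy data (hypothetical): two people, two diagnosis variables
clear
input str5 mdiag1 str5 mdiag2
"6371" "O149"
"A000" "6420"
end

* One pass per variable: keep the leading 637, 642 or O1 if present
forval j = 1/2 {
    gen str3 preght`j' = regexs(0) if regexm(mdiag`j', "^(637|642|O1)")
}

list
```

Each new variable holds the matched prefix, or an empty string where the diagnosis does not start with one of the three codes. This only shortens the code; it still produces one variable per mdiag variable, so the question of how to combine them remains as discussed in the replies.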
