Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

# RE: st: Extracting substrings from variable and combining variables.

 From Amal Khanolkar To "statalist@hsphsun2.harvard.edu" Subject RE: st: Extracting substrings from variable and combining variables. Date Mon, 4 Jun 2012 10:09:53 +0000

```
Hi Nick,

Sorry for the confusion: I missed your request for a better explanation of what I mean by combining.

If I have the following variables, preght1, preght2 and preght3 (only the first two are shown):

. tab preght1

    preght1 |      Freq.     Percent        Cum.
------------+-----------------------------------
        637 |      8,314       20.76       20.76
        642 |     21,268       53.11       73.88
         O1 |     10,461       26.12      100.00
------------+-----------------------------------
      Total |     40,043      100.00

. tab preght2

    preght2 |      Freq.     Percent        Cum.
------------+-----------------------------------
        637 |     11,202       33.51       33.51
        642 |     15,191       45.44       78.95
         O1 |      7,036       21.05      100.00
------------+-----------------------------------
      Total |     33,429      100.00

I'd like to generate preghtX, where I combine the above 3 categories from both preght1 and preght2 as below:

    preghtX |      Freq.     Percent        Cum.
------------+-----------------------------------
        637 |     19,516       26.56       26.56
        642 |     36,459       49.62       76.19
         O1 |     17,497       23.81      100.00
------------+-----------------------------------
      Total |     73,472      100.00

I did try something very similar to what you suggested below:

forval j = 1/8 {
    replace hasO1 = 1 if hasO1 == 0 & substr(mdiag`j', 1, 2) == "O1"
    replace has637 = 1 if has637 == 0 & substr(mdiag`j', 1, 3) == "637"
    replace has642 = 1 if has642 == 0 & substr(mdiag`j', 1, 3) == "642"
}

. sum hasO1 has637 has642

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       hasO1 |   2991456    .0078092    .0880242          0          1
      has637 |   2991456    .0073516    .0854258          0          1
      has642 |   2991456    .0143295    .1188451          0          1

. tab  hasO1

      hasO1 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |  2,968,095       99.22       99.22
          1 |     23,361        0.78      100.00
------------+-----------------------------------
      Total |  2,991,456      100.00

. tab has637

     has637 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |  2,969,464       99.26       99.26
          1 |     21,992        0.74      100.00
------------+-----------------------------------
      Total |  2,991,456      100.00

. tab has642

     has642 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |  2,948,590       98.57       98.57
          1 |     42,866        1.43      100.00
------------+-----------------------------------
      Total |  2,991,456      100.00

The reason I was a bit unsure about the above method is that the subjects coded as 1 above total 88,219 and not 90,930 as they should. I wasn't able to figure out how I was losing the additional 2,711 subjects: whether Stata treated them as duplicates, or something else.

But thanks for your help! I just wanted to clear up why I didn't use the method discussed last week.

Best regards,

/Amal.
________________________________________
From: owner-statalist@hsphsun2.harvard.edu [owner-statalist@hsphsun2.harvard.edu] on behalf of Nick Cox [njcoxstata@gmail.com]
Sent: 04 June 2012 11:20
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Extracting substrings from variable and combining variables.

Previously I wrote

" I don't know exactly what you want, so that rules out further
suggestions from me for the time being. You would get better help by
giving examples of what the variables you want would look like."

You've not done this. All that I can pick up here is that you want to
combine variables. I don't know what that "combining" means. So, this
is another (but final) attempt from me to help.

Note that -regexm()- and -regexs()- are functions, not commands. This
is not just a piece of pedantry: (1) referring to functions as
commands may confuse at least some readers, and clarifies nothing; (2)
thinking of these, always, as functions helps remind everyone that
they are defined and documented distinctly.

It seems that you have variables -mdiag1-mdiag8- and wish to extract
diagnoses "O1", "637", "642". You expect those diagnoses to be leading
substrings.  You can create a new composite variable this way.

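gen anydiag = ""

foreach diag in O1 637 642 {
    * length of the current prefix (2 for O1, 3 for 637 and 642)
    local len = length("`diag'")
    forval j = 1/8 {
        * /// continues the command on the next line (do-files only)
        replace anydiag = anydiag + "`diag'" ///
            if substr(mdiag`j', 1, `len') == "`diag'"
    }
}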

But we've already gone over similar ideas in this thread. I don't
think you ever said why you can't work from that resulting composite
variable.

You can create new indicator variables this way

gen hasO1 = 0
gen has637 = 0
gen has642 = 0

forval j = 1/8 {
    replace hasO1 = 1 if hasO1 == 0 & substr(mdiag`j', 1, 2) == "O1"
    replace has637 = 1 if has637 == 0 & substr(mdiag`j', 1, 3) == "637"
    replace has642 = 1 if has642 == 0 & substr(mdiag`j', 1, 3) == "642"
}

This can be done with regex machinery too as a matter of taste.

Nick

On Mon, Jun 4, 2012 at 9:42 AM, Amal Khanolkar <Amal.Khanolkar@ki.se> wrote:

> Originally, I started using the 'regex' command to extract ICD codes from my variables of interest shown below (mdiag1, mdiag2, mdiag3, mdiag4 etc....). I'm extracting the same ICD codes from all the mdiag variables starting with the numbers/letters: 637, 642 and O1. Initially I extracted the ICD codes from each mdiag variable separately with the idea of combining them at the end. But that seems a bit more complicated now. Maybe, one solution could be to extract all ICD codes from all mdiag variables at the same time. There are 12 such mdiag variables.
>
> gen preght1 = regexs(0) if regexm(mdiag1, "^(637|642|O1)")
> tab preght1
>
> gen preght2 = regexs(0) if regexm(mdiag2, "^(637|642|O1)")
> tab preght2
>
> gen preght3 = regexs(0) if regexm(mdiag3, "^(637|642|O1)")
> tab preght3
>
> gen preght4 = regexs(0) if regexm(mdiag4, "^(637|642|O1)")
> tab preght4
>
> gen preght5 = regexs(0) if regexm(mdiag5, "^(637|642|O1)")
> tab preght5
>
> gen preght6 = regexs(0) if regexm(mdiag6, "^(637|642|O1)")
> tab preght6
>
> gen preght7 = regexs(0) if regexm(mdiag7, "^(637|642|O1)")
> tab preght7
>
> gen preght8 = regexs(0) if regexm(mdiag8, "^(637|642|O1)")
> tab preght8
>
> The above generates 8 preght variables and works great.
>
> Initially I tried to combine the (mdiagX, "^(637|642|O1) for each mdiag variable by enclosing them in separate brackets one after another. But it doesn't work. How do I modify the regexs/regexm commands to be able to tell Stata to pluck out the ICD codes for several variables in the same command line?

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
```
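
One plausible reason for the 88,219 versus 90,930 gap raised in the thread is that an indicator counts each observation at most once per code, whereas adding up the per-variable -tabulate- frequencies counts an observation once for every mdiag variable in which the code appears. The sketch below is only an illustration of that idea on made-up data: the values, the use of two variables rather than eight, and the n* counter variables are all assumptions, not anything taken from the original dataset. It uses -regexm()- in a loop, which is one way to apply the same pattern to several variables.

```stata
* Toy data: hypothetical values, two diagnosis variables instead of eight
clear
input str6 mdiag1 str6 mdiag2
"O14"  "O13"
"6420" "Z349"
"637"  "6429"
"A09"  "O16"
end

* For each code, build an indicator (at most 1 per observation) and an
* occurrence count (1 for every variable whose value starts with the code)
foreach diag in O1 637 642 {
    gen byte has`diag' = 0
    gen byte n`diag' = 0
    forval j = 1/2 {
        replace has`diag' = 1 if regexm(mdiag`j', "^`diag'")
        replace n`diag' = n`diag' + regexm(mdiag`j', "^`diag'")
    }
}

* Observations in which the same code leads more than one variable
count if nO1 > 1

* The n* sums correspond to adding up per-variable frequencies;
* the has* sums count each observation once per code
tabstat hasO1 nO1 has637 n637 has642 n642, statistics(sum)
```

In this toy data the first observation carries an O1 code in both variables, so nO1 sums to 3 while hasO1 sums to 2. Running the same comparison on the real mdiag variables (with `forval j = 1/8`) would show whether repeated codes within an observation account for the missing 2,711.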