Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down at the end of May, and its replacement, **statalist.org**, is already up and running.


From: Amal Khanolkar <Amal.Khanolkar@ki.se>
To: "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject: RE: st: Extracting substrings from variable and combining variables.
Date: Fri, 1 Jun 2012 15:29:30 +0000

Thanks for the input and code. I didn't really understand what the code does ("foreach" etc.), but it does pluck out those that have the 3 diagnoses of interest and creates 3 separate variables, as follows:

    . tab has637

         has637 |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |  2,969,464       99.26       99.26
              1 |     21,992        0.74      100.00
    ------------+-----------------------------------
          Total |  2,991,456      100.00

    . tab has642

         has642 |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |  2,948,590       98.57       98.57
              1 |     42,866        1.43      100.00
    ------------+-----------------------------------
          Total |  2,991,456      100.00

    . tab hasO1

          hasO1 |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |  2,968,084       99.22       99.22
              1 |     23,372        0.78      100.00
    ------------+-----------------------------------
          Total |  2,991,456      100.00

- The above also gives a lower number and skips those recorded as duplicates.
- I think using the replace command to restructure preght is probably easier. However, did you mean that I should do it before that, i.e. using the original 12 variables and skipping egen altogether?

Thanks,
/Amal.

Amal Khanolkar, PhD candidate,
Centre for Health Equity Studies (CHESS),
Karolinska Institutet,
106 91 Stockholm.
Ph# +46(0)8 162584/+46(0)73 0899409
www.chess.su.se

________________________________________
From: owner-statalist@hsphsun2.harvard.edu [owner-statalist@hsphsun2.harvard.edu] on behalf of Nick Cox [n.j.cox@durham.ac.uk]
Sent: 01 June 2012 13:52
To: 'statalist@hsphsun2.harvard.edu'
Subject: RE: st: Extracting substrings from variable and combining variables.

-egen, concat()- concatenates strings (although it is happy to convert numbers to strings on the fly). Concatenation is jargon for chaining together, or addition of strings lengthwise. Thus if you have two string variables with values "foo" and "bar", their concatenation is "foobar". Nothing changes if the string values happen to be identical, so the concatenation of "637" and "637" is "637637".
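The concatenation behaviour described above can be checked on a tiny made-up dataset. This is only an illustrative sketch: the variable names d1, d2 and both, and the three rows of data, are invented for the example and do not come from the thread's dataset.

```stata
* Toy data: three subjects, two string diagnosis variables (hypothetical names)
clear
input str3 d1 str3 d2
"637" "637"
"637" "642"
"642" ""
end

* concat() chains the values together in variable order; repeats are kept
egen both = concat(d1 d2)
list
```

Under these assumptions the new variable holds "637637", "637642" and "642" for the three subjects, which is the same kind of pattern as the categories reported in the thread.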
(You can insert spaces or other punctuation, but leave that aside.) You can't override that; it is what -egen, concat()- is designed to do. -concat()- is not designed to select distinct strings once only, although that's programmable.

Also, you pose another riddle. When you say "correct for this", exactly what do you want instead? Note that some subjects have both "637" and "642", and so that's part of the information.

In your case you should be able to fix -preght- to your taste with a few -replace- commands and that's going to be easier than writing other code, or so I guess.

But in case it is helpful, here is code for three indicator variables (not tested).

    foreach s in 637 642 O1 {
        gen has`s' = 0
        qui forval j = 1/12 {
            replace has`s' = 1 if preght`j' == "`s'" & has`s' == 0
        }
    }

Nick
n.j.cox@durham.ac.uk

Amal Khanolkar

Thanks - this makes sense. Is there any way to correct for this in the egen command before combining the variables?

    egen preght=concat(preght1 preght2 preght3 preght4 preght5 preght6 preght7 preght8 preght9 preght10 preght11 preght12)

I'm also curious to know how Stata gets categories with multiple '637's. Even though the total number of subjects is now 90930 as expected, in effect each patient should be diagnosed with a 637 only once. In other words, what do '637637' and '637637637637' mean?

/Amal.

________________________________________
From: owner-statalist@hsphsun2.harvard.edu [owner-statalist@hsphsun2.harvard.edu] on behalf of Nick Cox [njcoxstata@gmail.com]
Sent: 01 June 2012 10:26
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Extracting substrings from variable and combining variables.

That sounds predictable to me. "637" and "637" counts as 2 one way and -- in its guise as "637637" -- as 1 the other way.
It's important to correct for counting the same thing twice or more. Consider this:

    clear
    input str12 diag freq w
    "637"          20922 1
    "637637"         960 2
    "637637637"      104 3
    "637637637637"     3 4
    "637642"           2 2
    "642"          42108 1
    "642637"           1 2
    "642642"         748 2
    "642642642"        7 3
    "O1"           22634 1
    "O1O1"           720 2
    "O1O1O1"          17 3
    "O1O1O1O1"         2 4
    end

    . tab diag [w=freq*w]
    (frequency weights assumed)

            diag |      Freq.     Percent        Cum.
    -------------+-----------------------------------
             637 |     20,922       23.01       23.01
          637637 |      1,920        2.11       25.12
       637637637 |        312        0.34       25.46
    637637637637 |         12        0.01       25.48
          637642 |          4        0.00       25.48
             642 |     42,108       46.31       71.79
          642637 |          2        0.00       71.79
          642642 |      1,496        1.65       73.44
       642642642 |         21        0.02       73.46
              O1 |     22,634       24.89       98.35
            O1O1 |      1,440        1.58       99.94
          O1O1O1 |         51        0.06       99.99
        O1O1O1O1 |          8        0.01      100.00
    -------------+-----------------------------------
           Total |     90,930      100.00

On Fri, Jun 1, 2012 at 9:11 AM, Amal Khanolkar <Amal.Khanolkar@ki.se> wrote:
> Hi again,
>
> I tried to combine 12 such variables (examples of three below) to form one variable with the same 3 categories.
>
> . tab preght1
>
>     preght1 |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>         637 |      8,314       20.76       20.76
>         642 |     21,268       53.11       73.88
>          O1 |     10,461       26.12      100.00
> ------------+-----------------------------------
>       Total |     40,043      100.00
>
> . tab preght2
>
>     preght2 |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>         637 |     11,202       33.51       33.51
>         642 |     15,191       45.44       78.95
>          O1 |      7,036       21.05      100.00
> ------------+-----------------------------------
>       Total |     33,429      100.00
>
> . tab preght4
>
>     preght4 |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>         637 |        797       18.02       18.02
>         642 |      1,747       39.51       57.53
>          O1 |      1,878       42.47      100.00
> ------------+-----------------------------------
>       Total |      4,422      100.00
>
> When I add up the 12 preght variables, I get a total of 90930 observations that should have my diagnosis of interest. However, when using egen as below I get only 88228!
>
> This is what I get when I run egen with the concat function:
>
> egen preght=concat(preght1 preght2 preght3 preght4 preght5 preght6 preght7 preght8 preght9 preght10 preght11 preght12)
> (2903228 missing values generated)
>
>       preght |      Freq.     Percent        Cum.
> -------------+-----------------------------------
>          637 |     20,922       23.71       23.71
>       637637 |        960        1.09       24.80
>    637637637 |        104        0.12       24.92
> 637637637637 |          3        0.00       24.92
>       637642 |          2        0.00       24.93
>          642 |     42,108       47.73       72.65
>       642637 |          1        0.00       72.65
>       642642 |        748        0.85       73.50
>    642642642 |          7        0.01       73.51
>           O1 |     22,634       25.65       99.16
>         O1O1 |        720        0.82       99.98
>       O1O1O1 |         17        0.02      100.00
>     O1O1O1O1 |          2        0.00      100.00
> -------------+-----------------------------------
>        Total |     88,228      100.00

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
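Since the thread includes a request to explain what the indicator-variable loop does, here is a commented sketch of the same idea on toy data. It is an illustration under stated assumptions, not the posters' actual session: the four rows of data are invented, only three preght variables are used instead of the twelve in the thread, and the file is meant to be run as a do-file (the // comments are not accepted when typed interactively).

```stata
* Toy data: 4 subjects, 3 diagnosis slots (made-up values; the thread has 12 slots)
clear
input str3 preght1 str3 preght2 str3 preght3
"637" "637" ""
"642" ""    "637"
"O1"  "O1"  "O1"
""    ""    ""
end

// Outer loop: one pass per diagnosis code of interest
foreach s in 637 642 O1 {
    // Start every subject at 0, meaning the code has not been seen yet
    generate byte has`s' = 0
    // Inner loop: check each diagnosis slot in turn
    forvalues j = 1/3 {
        // Set the indicator to 1 if this slot holds the code
        quietly replace has`s' = 1 if preght`j' == "`s'"
    }
}

* Each subject counts at most once per code, however often the code is recorded
tabulate has637
tabulate has642
tabulate hasO1
```

In this toy example the first subject carries 637 twice but is counted once in has637, and the second subject contributes to both has637 and has642. That is the sense in which indicator variables count subjects, whereas summing the separate tabulations counts recorded diagnoses.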

