The Stata listserver

st: RE: RE: correlate by group and collapse


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: RE: RE: correlate by group and collapse
Date   Thu, 11 Jul 2002 09:54:07 +0100

Roger Harbord 
> > 
> > I want to collapse my dataset by a group variable and retain the 
> > correlation coefficient of two variables.  In other 
> > words, I'd like to be able to do something like:
> > . collapse (correlation) var1 var2, by(group)
> > or maybe:
> > . by group: egen corr12=corr(var1 var2)
> > . collapse corr12, by(group)
> > 
> > However, collapse doesn't have correlation among its stats (it only 
> > allows a selection of univariate statistics) and egen doesn't have a 
> > corr function.
> > I know I can do:
> > . by group: correlate var1 var2
> > - but I want to save the results and do further analysis on 
> > them rather
> > than just displaying them.
> > 
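> > The best I've come up with is (supposing I have 100 groups):
> > . gen corr12=.
> > . for num 1/100, noheader: qui correlate var1 var2 if group==X \ 
> > qui replace corr12=r(rho) if study==X
> > 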
> > This seems kind of clumsy though, and it took me a while to work out 
> > that I needed _noheader_ and _quietly_ to stop my screen filling with 
> > output. It also becomes quite lengthy if I want several pairwise 
> > correlations. Is there a better way? 
> > 
> > I think I'd like egen to have a _corr_ and/or a _cov_ function - I 
> > would have thought it would be of wider interest than the calculation 
> > of U.S. marginal income tax rates, which is already 
> > implemented as egen 
> > function mtr! I've checked the extensions to egen in the STB package 
> > _egenodd_ and tried a couple of _findit_'s, but I didn't find 
> > anything 
> > suitable.

Nick Winter 

> I've attached below a program to do this with egen.  Save the whole
> thing as "_gcorr.ado" (that is, DO NOT separate out the GenCorr part as
> a separate file).
> 
> The syntax is:
> 
> 	[by varlist:] egen newvar = corr(var1 var2) [if exp] [in range]
> 		[, covariance]
> 
> The ", covariance" option generates covariances; otherwise it does
> correlations.
> 
> Nick Winter
> 
> **************** BEGINNING OF _gcorr.ado
> end

< snip > 
> 
> **************** END OF _gcorr.ado

Nick's -egen- solution solves this problem excellently. 

This is just a sidenote to opine that Roger's 
original solution is not so bad as he implies, 
and that it can be extended to the full problem  
with -forvalues- and -foreach-. And a simple but more general point 
is this: master -forvalues- and -foreach- and you have a tool for 
other problems and need not be dependent on 
programmers, who can indeed seem capricious 
sometimes in what they do and do not supply. 
(I gather that marginal tax rate is an everyday tool 
for lots of users.) 

Setting aside the -collapse-, Roger's -for- solution was 

gen corr12=.
for num 1/100, noheader: qui correlate var1 var2 if group==X \ 
qui replace corr12=r(rho) if study==X

The equivalent with -forval- is 

gen corr12 = . 

qui forval i = 1/100 { 
	corr var1 var2 if group == `i' 
	replace corr12 = r(rho) if study == `i' 
} 

This may seem no gain, but as was said in another thread recently, 
my main reservation about -for- is that it doesn't
grow gracefully when extended to more complicated
problems, whereas -foreach- and -forval- typically do.

(I've kept the distinction between -group- and -study-, 
which is immaterial to the main point here, whether 
or not it's a typo.) 
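
For anyone who wants to try the loop before adapting it, here is 
a minimal sketch on the auto data shipped with Stata. It assumes 
a Stata recent enough to have -sysuse-; price, mpg and foreign 
(coded 0 and 1) merely stand in for var1, var2 and group, and the 
same variable is used in both -if- conditions. 

sysuse auto, clear 
gen corr12 = . 

* one pass per level of foreign (0 = domestic, 1 = foreign) 
qui forval i = 0/1 { 
	corr price mpg if foreign == `i' 
	replace corr12 = r(rho) if foreign == `i' 
} 

* each car now carries the correlation for its own group 
tabulate foreign, summarize(corr12) 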

Now this can be extended to lots of variables: 

qui foreach x of var <varlist> { 
	foreach y of var <varlist> { 
		gen r`x'`y' = . 
		forval i = 1/100 { 
			corr `x' `y' if group == `i' 
			replace r`x'`y' = r(rho) if study == `i' 
		}
	} 
} 

Embedded in that is an assumption that variable names
are short enough that names like -r`x'`y'- remain legal
after substitution (at most 32 characters in Stata 7). 
At worst, that problem could be fixed by mass renaming. 

Also, there is wastefulness here by a factor of about 2, 
as correlations are symmetric and self-correlations of 
1 are of no interest. This could be tackled in 
various ways, one of which is to ignore that problem.

Another is to check for the existence of r`y'`x' 
before we calculate r`x'`y'. 
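
One way that check might look is sketched here, again on the 
auto data as a stand-in (so -sysuse-, price, mpg, weight and 
foreign are all assumptions for illustration, not part of the 
original problem). -capture confirm variable- leaves a nonzero 
_rc when the mirror-image variable does not yet exist, and 
comparing the two names skips the self-correlations. 

sysuse auto, clear 

qui foreach x of var price mpg weight { 
	foreach y of var price mpg weight { 
		* has the mirror image r`y'`x' been created already? 
		capture confirm variable r`y'`x' 
		* go ahead only if it has not, and if x and y differ 
		if _rc & ("`x'" != "`y'") { 
			gen r`x'`y' = . 
			forval i = 0/1 { 
				corr `x' `y' if foreign == `i' 
				replace r`x'`y' = r(rho) if foreign == `i' 
			} 
		} 
	} 
} 

describe r* 

One caveat: -confirm variable- accepts unambiguous abbreviations, 
so the check could be fooled if some other variable name began 
with the same characters as r`y'`x'. 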

There is a tutorial on -forvalues- and -foreach- 
in Stata Journal 2(2), 202-222 (2002). The slides 
of a talk on the same subject are accessible to non-subscribers 
at 

http://www.stata.com/support/meeting/8uk/fortitude.pdf

or 

http://fmwww.bc.edu/RePEc/usug2002/fortitude.pdf

(Exactly the same file.) 

Incidentally, 

1. -statsby- 

statsby "corr var1 var2" corr = r(rho) , by(group) 

is a good solution if only one correlation is 
of interest, but not, I think, for many. The reason 
is that -statsby- includes a built-in collapsing 
of the data set, so you would need to read in the 
original data set repeatedly to do it repeatedly. 

2. -egen- extras. Most of -egenodd- is already in Stata 7. 
Other user-written packages of -egen- functions are accessible 
via -findit-. 
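
On -statsby- again: in later releases of Stata the command became 
a prefix, so the quoted-command form above may need rewriting. 
A minimal sketch, assuming such a release and using the auto data 
as a stand-in (rho is just an illustrative name for the new 
variable): 

sysuse auto, clear 
statsby rho = r(rho), by(foreign) clear: correlate price mpg 
list 

The -clear- option replaces the data in memory with one 
observation per group, which is the built-in collapsing 
mentioned under 1. 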

Nick 
n.j.cox@durham.ac.uk 