# st: Average pairwise correlations for a large dataset

 From John Hund To statalist@hsphsun2.harvard.edu Subject st: Average pairwise correlations for a large dataset Date Tue, 4 Mar 2008 20:25:25 -0600

First thanks to all, esp. Nick and Maarten for previous help on parts of this...

I have a dataset that looks like this in long form:

compid ret month year retdate mktval decile
10006 -.05588235 1983 8 1983m8 262731.3 7
10006 .57480317 1983 9 1983m9 415950 7
10006 -.025 1983 10 1983m10 405551.3 7
10006 .05333333 1983 11 1983m11 426462 7
10006 -.01960784 1983 12 1983m12 419250 7
10001 -.07407408 1987 2 1987m2 6193.75 1
10001 .0368 1987 3 1987m3 6317.625 1
10001 .05142857 1987 6 1987m6 5822.125 1
10001 .0212766 1987 7 1987m7 5946 1
10001 .08333334 1987 8 1987m8 6441.5 1
10487 .10226244 1983 8 1983m8 327819.3 7
10487 .02479339 1983 9 1983m9 335947 7
10487 .08870967 1983 10 1983m10 365748.8 7
10487 .02074074 1983 11 1983m11 371167.3 7
:
:
:

where compid and retdate uniquely identify ret(urn) observations for a company's stock market returns in a particular month. Decile corresponds to the decile of market value for the company, and years run from 1930-2006. What I'm trying to obtain is the average of all pairwise correlations for company returns over subperiods defined by dates (i.e. 1970-1975) and by subgroups defined by decile. One solution would seem to be to run through the data with forval, dropping all observations that don't correspond to the subgroup/ subperiod intersection, then reshape to wide (or alternatively xpose) to give me variables that look like for instance:

retdate ret10006 ret10487
1983m8 -.05588235 .10226244
1983m9 .57480317 .02479339
:
:

and then hand the variables ret10006 and ret10487 to the mata function I wrote (just for speed) to calculate the average of all the correlations (attached below).

Reshaping the entire dataset first might be a possibility, but there are over 18,000 unique values for compid (for instance) so xpose-ing the entire dataset is very difficult. Similarly, it seems very inefficient to keep reloading the original data, but since reshape is global for all the observations, I don't know the alternative. I thought about using mkmat and then operating directly on the matrix (since in matlab, I'd use for and then extract a matrix, transpose it and calculate), but that doesn't seem to be a very good solution either (since later year matrices are likely to be on the order of 300 x 300 at least.

Is there a better way method than the brute force drop/reshape? There must be...I just don't really know how. Plus the brute force approach doesn't really make it easy to store the actual average correlation series.

Thanks for any of your help,
John

mata code, just FYI:

mata:
void calcavgcorr(string scalar varnames, string scalar touse)
{
st_view(DataMatrix, ., tokens(varnames), touse)
CorrMatrix = correlation(DataMatrix)
_diag(CorrMatrix,.)
x = vech(CorrMatrix)
st_numscalar("r(allcorrmean)", mean(x))
}

*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/