Re: st: RES: generating a variable with pre-specified correlations with other two (given) variables
From: Richard Williams <[email protected]>
To: [email protected], [email protected]
Date: Wed, 31 Aug 2011 08:46:32 -0500
At 07:00 AM 8/31/2011, Tirthankar Chakravarty wrote:
This question has come up a few times before: you want to create a
variable with a given pattern of correlation with _existing_
variables, which -corr2data- does not do. When the means are
normalised to zero, you can do this by solving a system of linear
equations in the appropriate expectations.
Suppose you generate a variable as
Z = a*X + b*Y  ---(0)
where a and b are constants to be determined. Then you can derive the
following identities under the zero mean assumption:
Cov(Z, X) = a*Var(X) + b*Cov(X, Y)  ---(1)
Cov(Z, Y) = b*Var(Y) + a*Cov(X, Y)  ---(2)
Here everything except a and b is known (you choose Cov(Z, X) and
Cov(Z, Y); the variances and Cov(X, Y) come from your data), so this
is a system of two linear equations in two unknowns, a and b. Solve it
and generate your variable as in equation (0).
So for example, if I have Cov(X, Y) = .6 and Var(X) = Var(Y) = 1, and
I want Cov(Z, X) = .4 and Cov(Z, Y) = .5, then a = 0.15625 and
b = 0.40625.
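For reference, equations (1) and (2) can be solved in closed form by Cramer's rule. The worked numbers below assume target covariances of .4 with X and .5 with Y; these are the targets that reproduce the a and b quoted above.

```latex
\[
\begin{aligned}
D &= \operatorname{Var}(X)\operatorname{Var}(Y) - \operatorname{Cov}(X,Y)^2 \\[4pt]
a &= \frac{\operatorname{Cov}(Z,X)\operatorname{Var}(Y) - \operatorname{Cov}(Z,Y)\operatorname{Cov}(X,Y)}{D} \\[4pt]
b &= \frac{\operatorname{Cov}(Z,Y)\operatorname{Var}(X) - \operatorname{Cov}(Z,X)\operatorname{Cov}(X,Y)}{D} \\[8pt]
D &= (1)(1) - (0.6)^2 = 0.64 \\
a &= \frac{(0.4)(1) - (0.5)(0.6)}{0.64} = \frac{0.10}{0.64} = 0.15625 \\
b &= \frac{(0.5)(1) - (0.4)(0.6)}{0.64} = \frac{0.26}{0.64} = 0.40625
\end{aligned}
\]
```

The system has a unique solution only when D is nonzero, that is, when X and Y are not perfectly correlated.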
/************************************/
mat mCov = (1, .6\ .6, 1)
// generate x and y
corr2data x y, cstorage(full) cov(mCov) n(100000) clear
// generate z based on current sample of x and y
g z = .15625*x+.40625*y
corr, covariance
/************************************/
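The two-equation system can also be solved with Stata's matrix commands rather than by hand. This is only a minimal sketch: the matrix names S, c and w are made up for illustration, and the targets of .4 and .5 are assumed values that match the a and b used above.

```stata
// Covariance matrix of the existing variables x and y
matrix S = (1, .6 \ .6, 1)
// Target covariances of z with x and with y (assumed values)
matrix c = (.4 \ .5)
// Solve S*(a \ b) = c for the weights a and b
matrix w = invsym(S)*c
// First element is a (.15625), second is b (.40625)
matrix list w
```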
I am going to tweak your example a bit. Instead of doing the algebra 
(and possibly screwing it up) let Stata do the work. Make mCov a 
combo of the correlations you observe in your data and the 
correlations you want for the new variable:
mat mCov = (1, .6, .4\ .6, 1, .5 \ .4, .5, 1)
corr2data x y z, cstorage(full) cov(mCov) n(100000) clear
reg z x y
Here are the regression results:
. reg z x y
      Source |       SS       df       MS              Number of obs =  100000
-------------+------------------------------           F(  2, 99997) =18084.56
       Model |  26562.2344     2  13281.1172           Prob > F      =  0.0000
    Residual |  73436.7656 99997  .734389687           R-squared     =  0.2656
-------------+------------------------------           Adj R-squared =  0.2656
       Total |  99998.9999 99999  .999999999           Root MSE      =  .85697
------------------------------------------------------------------------------
           z |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |     .15625   .0033875    46.13   0.000     .1496106    .1628894
           y |     .40625   .0033875   119.93   0.000     .3996106    .4128894
       _cons |  -1.06e-08     .00271    -0.00   1.000    -.0053115    .0053115
------------------------------------------------------------------------------
You could now do something like this, where realx and realy are the
variables in your actual data:
gen newvar = .15625*realx + .40625 * realy
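To avoid retyping the coefficients, the steps above could be chained together roughly as follows. This is a sketch under assumptions, not a tested recipe for any particular dataset: it assumes your own data are in memory, that realx and realy are standardized with a correlation of .6 (as in the example matrix), and the scalar names b_x and b_y are made up for illustration.

```stata
// Keep a copy of the real data, since -corr2data, clear- replaces it
preserve
matrix mCov = (1, .6, .4 \ .6, 1, .5 \ .4, .5, 1)
corr2data x y z, cstorage(full) cov(mCov) n(100000) clear
quietly regress z x y
// Store the coefficients; scalars survive -restore-
scalar b_x = _b[x]
scalar b_y = _b[y]
restore
// Apply the stored weights to the real variables
generate double newvar = scalar(b_x)*realx + scalar(b_y)*realy
```

Writing scalar(b_x) rather than b_x keeps Stata from reading the name as an abbreviation of a variable name.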
You can easily make this more complicated, e.g. include the standard 
deviations and the means, add more Xs, etc. The -reg- command will do 
all the algebra for you.
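One thing to keep in mind: with standardized x and y, the combination .15625*x + .40625*y has the requested covariances of .4 and .5, but its variance is only about .2656 (the R-squared above), so its correlations with x and y come out larger than .4 and .5. If correlations are what you need, one possible approach is to add noise that is uncorrelated with x and y and has variance 1 - R-squared. The sketch below assumes realx and realy are in memory and that the targets are .4 and .5; every other name (sx, sy, S, c, w, q, R2, e0, e1, e) is hypothetical.

```stata
// Standardize the existing variables (mean 0, sd 1)
egen sx = std(realx)
egen sy = std(realy)
// Observed correlation matrix of the two existing variables
quietly correlate sx sy
matrix S = r(C)
// Target correlations of newvar with realx and realy (assumed values)
matrix c = (.4 \ .5)
// Weights on sx and sy, and the share of variance they account for
matrix w = invsym(S)*c
matrix q = c'*w
scalar R2 = q[1,1]
// R2 above 1 means the targets are infeasible; sqrt() below then gives missing
// Noise made uncorrelated with sx and sy in this sample
set seed 12345
generate double e0 = rnormal()
quietly regress e0 sx sy
predict double e1, residuals
egen e = std(e1)
// Weighted combination plus scaled noise, so that newvar has unit variance
generate double newvar = w[1,1]*sx + w[2,1]*sy + sqrt(1 - scalar(R2))*e
correlate newvar realx realy
```

The final -correlate- should show values very close to the targets. Rescale newvar afterwards if you need a particular mean and standard deviation; correlations are unaffected by that.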
-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
OFFICE: (574)631-6668, (574)631-6463
HOME:   (574)289-5227
EMAIL:  [email protected]
WWW:    http://www.nd.edu/~rwilliam