Note: This FAQ is for users of Stata 11.
It is not relevant for more recent versions.
The following discussion was extracted from a question and response
that was posted on Statalist.

Title: Estimating correlations with survey data

Author: Bill Sribney, StataCorp

Date: November 2001; updated April 2005

The short answer has two parts:

(1) Use **correlate** with **aweight**s for point estimates of the correlation.

(2) Use **svy: regress** for *p*-values. Do **svy: regress y x** and
**svy: regress x y** and take the larger *p*-value, which is the
conservative thing to do.

Consider a fixed finite population of N elements from which the sample was drawn. The population (i.e., true) value of Pearson’s correlation coefficient rho for two variables X and Y is

$$
\rho = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})(X_i - \bar{X})}{\left\{\left[\sum_{i=1}^{N} (Y_i - \bar{Y})^2\right] \left[\sum_{i=1}^{N} (X_i - \bar{X})^2\right]\right\}^{1/2}}
$$

where the sums are over all elements of the population.

From my viewpoint, rho is a fixed population parameter that one estimates based on a sample.

Now, rho measures the linear association of X and Y. The beta coefficient from a linear regression of Y on X (or X on Y) for the entire population yields an equivalent parameter (i.e., if one knows the population standard deviations of X and Y, one can derive the linear regression slope parameter from the correlation parameter and vice versa). The null hypotheses rho = 0 and beta = 0 are equivalent. Here I am talking about population parameters, i.e., the true values of the parameters.
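
The link between the two parameters can be sketched as follows. The symbols $\sigma_X$ and $\sigma_Y$ (population standard deviations) and $\beta_{Y|X}$, $\beta_{X|Y}$ (population slopes of Y on X and of X on Y) are notation introduced here for illustration; they do not appear in the original post.

$$
\beta_{Y|X} = \rho \, \frac{\sigma_Y}{\sigma_X}, \qquad
\beta_{X|Y} = \rho \, \frac{\sigma_X}{\sigma_Y}, \qquad
\beta_{Y|X} \, \beta_{X|Y} = \rho^2
$$

With both standard deviations positive, either slope is zero exactly when rho is zero, which is why the two null hypotheses coincide.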

A mechanical equivalence also exists for the standard estimates computed from a simple random sample. One can derive the estimate of the regression slope coefficient from the estimate of the correlation coefficient and vice versa (again, assuming one has the standard deviations).

Furthermore, the *p*-value from a linear regression of Y on X (or X on
Y) is the same as a *p*-value for Pearson’s correlation
coefficient for a simple random sample under the assumption of normality of
the population.
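
A quick way to see this equality is to run all three commands on the same data. The sketch below uses the **auto** dataset that ships with Stata and treats it as if it were a simple random sample; the dataset and the variables price and mpg are chosen here purely for illustration.

```stata
* Illustration only: auto.dta is treated as a simple random sample
sysuse auto, clear

* OLS of Y on X and of X on Y: the slopes differ
regress price mpg
regress mpg price

* Pearson correlation with its significance level
pwcorr price mpg, sig
```

The *p*-value reported for the slope in each regression should match the significance level that **pwcorr** prints under the correlation.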

Now, consider a complex survey sample.

As I wrote above, the population (i.e., true) value of Pearson’s correlation coefficient is

$$
\rho = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})(X_i - \bar{X})}{\left\{\left[\sum_{i=1}^{N} (Y_i - \bar{Y})^2\right] \left[\sum_{i=1}^{N} (X_i - \bar{X})^2\right]\right\}^{1/2}}
$$

where the sums are over all persons in the population.

Clearly, this population value can be estimated by replacing the various
sums with weighted sums over the sample. Effectively, this is what the
**correlate** command will compute for you when you specify
**aweight**s.
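
Written out, the weighted estimator looks roughly like this. The notation is an assumption made here for illustration: $w_i$ is the sampling weight of sampled element $i$, $n$ is the sample size, and $\bar{y}_w$, $\bar{x}_w$ are weighted means.

$$
\hat{\rho} = \frac{\sum_{i=1}^{n} w_i (y_i - \bar{y}_w)(x_i - \bar{x}_w)}{\left\{\left[\sum_{i=1}^{n} w_i (y_i - \bar{y}_w)^2\right] \left[\sum_{i=1}^{n} w_i (x_i - \bar{x}_w)^2\right]\right\}^{1/2}},
\qquad
\bar{y}_w = \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i}
$$

Multiplying every weight by the same constant leaves this ratio unchanged, so the rescaling that Stata applies to **aweight**s does not affect the point estimate.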

But what about a *p*-value?

The above comments about the equivalence of the hypotheses rho = 0 and beta
= 0 might make one think, "Great, I can just use **svy: regress** to
get a *p*-value for the correlation coefficient." Well, this is indeed
the case, and it is indeed what I recommend, but there are some issues here
that are not at all straightforward. For the linearization variance
estimator used by **svy: regress** and **regress, robust**, the
*p*-value for the slope of the regression of Y on X is not the same as
the *p*-value for the regression of X on Y, unlike the case for the OLS
regression estimator. (Try it!)
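
One way to try it without survey data is to use **regress** with the robust variance estimator. The sketch below again assumes the **auto** dataset and the variables price and mpg, picked only as an example.

```stata
* Illustration only: same data, robust (linearization) standard errors
sysuse auto, clear

* Y on X with robust variance
regress price mpg, vce(robust)

* X on Y with robust variance
regress mpg price, vce(robust)
```

Compare the *p*-values on the two slope coefficients: with the robust estimator they are generally not equal, whereas dropping **vce(robust)** makes them agree.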

It’s really difficult to put into words why this is the case.

For a simple random sample from a normal population, the *p*-values for
OLS regression Y on X, X on Y, and the correlation coefficient are all the
same. This is really amazing if you think about it. After all, the slopes
of Y on X and X on Y for OLS regressions are not inverses. An OLS
regression of Y on X *IS* different from a regression of X on Y, so the
simple random sample result should be considered the odd result. That the
*p*-values of **svy: regress y x** and **svy: regress x
y** are different should not be surprising.

Now, I said that the null hypotheses rho = 0 and beta = 0 are equivalent.
These are hypotheses about the true population values, so how you sampled
the population isn’t an issue. Since you can test beta = 0 using
**svy: regress**, can’t this be used as a *p*-value for
the test of rho = 0 since these are equivalent hypotheses? Well... yes! So
the bottom line is that you can use either **svy: regress y x** or
**svy: regress x y** as a test of rho = 0. You will likely get
slightly different but very similar *p*-values. They are slightly
different tests of the same null hypothesis.

Using linearization methods, we could produce an **svy** command that
directly estimates the variance of the weighted estimator for rho (i.e., an
estimate of the variance of the estimates produced by **correlate** with
weights).

However, this method would not necessarily lead to any better
*p*-values; they may in fact be worse! The estimator for rho is a
ratio with the denominator containing a square root. An **svy** command
estimating the variance of this monster would use an approximation based on
the first-derivative linearization of this nonlinear function, and hence the
variance estimates of the estimator of rho would be very crude. Therefore,
the resulting *p*-values would also be very crude.
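
For readers who want to see the general shape of such an approximation, a first-derivative (delta-method) linearization can be sketched as below. The notation is assumed here for illustration: $\hat{T}$ is the vector of weighted sample totals that enter the estimator, $g$ is the function that turns them into the correlation estimate, and $\hat{V}(\hat{T})$ is the design-based covariance matrix of those totals.

$$
\hat{\rho} = g(\hat{T}), \qquad
\widehat{\mathrm{Var}}(\hat{\rho}) \approx
\left(\frac{\partial g}{\partial T}\right)^{\!\top}
\hat{V}(\hat{T})
\left(\frac{\partial g}{\partial T}\right)\Bigg|_{T = \hat{T}}
$$

The approximation keeps only the linear term of the Taylor expansion of $g$, so its quality depends on how nearly linear $g$ is around $\hat{T}$.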

This is my advice: use **correlate** with **aweight**s for point
estimates of the correlation (as you are, no doubt, doing) and use
**svy: regress** for *p*-values. Do **svy: regress y
x** and **svy: regress x y** and take the larger
*p*-value; that is the conservative thing to do. This is a fine
(and perhaps superior) *p*-value for the test of rho = 0.
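
Put together, the recipe might look like the following sketch. It assumes the **nhanes2** example dataset fetched with **webuse** and its variables psu, strata, finalwgt, height, and weight; none of these come from the original post, so substitute your own design variables and analysis variables. The scalar names p_yx and p_xy are likewise made up for this example.

```stata
* Illustration only: nhanes2 and its variable names are assumptions
webuse nhanes2, clear
svyset psu [pweight = finalwgt], strata(strata)

* Point estimate: weighted correlation
correlate height weight [aweight = finalwgt]

* p-value for the slope of Y on X, using the design degrees of freedom
svy: regress height weight
scalar p_yx = 2*ttail(e(df_r), abs(_b[weight]/_se[weight]))

* p-value for the slope of X on Y
svy: regress weight height
scalar p_xy = 2*ttail(e(df_r), abs(_b[height]/_se[height]))

* Conservative choice: report the larger of the two
display "conservative p-value = " max(p_yx, p_xy)
```

The two scalars reproduce the *p*-values shown in the regression tables, so reading them off the output and picking the larger one by eye works just as well.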