Note: This FAQ is for users of Stata 11; it is not relevant for more recent versions, in which svy: sem can be used to compute covariance and correlation matrices, along with the corresponding p-values, for survey data.

The following discussion was extracted from a question and response that was posted on Statalist.

How can I estimate correlations and their level of significance with survey data?

Title:  Estimating correlations with survey data
Author: Bill Sribney, StataCorp

The short answer has two parts:

(1) Use correlate with aweights for point estimates of the correlation.

(2) Use svy: regress for p-values. Run svy: regress y x and svy: regress x y and take the larger p-value, which is the conservative thing to do.
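For instance, if y and x are the variables of interest and wgt, psu, and strata are placeholder names for the weight and design variables (all assumed here purely for illustration), the recipe looks like this:

        . svyset psu [pweight=wgt], strata(strata)   // declare the survey design
        . correlate y x [aweight=wgt]                // weighted point estimate of rho
        . svy: regress y x                           // one test of rho = 0
        . svy: regress x y                           // the other; take the larger p-value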

Consider a fixed finite population of N elements from which the sample was drawn. The population (that is, true) value of Pearson’s correlation coefficient \(\rho\) for two variables X and Y is

$$ \rho =\frac{\sum ^N _{i=1}(Y_i - \bar{Y})(X_i - \bar{X})} {\sqrt{\sum ^N _{i=1}(Y_i - \bar{Y})^2} \sqrt{\sum ^N _{i=1}(X_i - \bar{X})^2}} $$

where the sums are over all elements of the population.

From my viewpoint, \(\rho\) is a fixed population parameter that one estimates based on a sample.

Now, \(\rho\) measures the linear association of X and Y. The \(\beta\) coefficient from a linear regression of Y on X (or X on Y) for the entire population yields an equivalent parameter: if one knows the population standard deviations of X and Y, one can derive the linear regression slope parameter from the correlation parameter and vice versa. The null hypotheses \(\rho = 0\) and \(\beta = 0\) are therefore equivalent. Here I am talking about population parameters, that is, the true values of the parameters.
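Explicitly, if \(\sigma_X\) and \(\sigma_Y\) denote the population standard deviations, the two slopes and the correlation are linked by

$$ \beta_{Y|X} = \rho\,\frac{\sigma_Y}{\sigma_X} \qquad \mbox{and} \qquad \beta_{X|Y} = \rho\,\frac{\sigma_X}{\sigma_Y} $$

so either slope is zero exactly when \(\rho\) is zero. Note, too, that the two slopes are not inverses of each other; their product is \(\rho^2\).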

A mechanical equivalence also exists for the standard estimates computed from a simple random sample. One can derive the estimate of the regression slope coefficient from the estimate of the correlation coefficient and vice versa (again, assuming one has the standard deviations).

Furthermore, the p-value from a linear regression of Y on X (or X on Y) is the same as a p-value for Pearson’s correlation coefficient for a simple random sample under the assumption of normality of the population.

Now, consider a complex survey sample.

As I wrote above, the population (that is, true) value of Pearson’s correlation coefficient is

$$ \rho =\frac{\sum ^N _{i=1}(Y_i - \bar{Y})(X_i - \bar{X})} {\sqrt{\sum ^N _{i=1}(Y_i - \bar{Y})^2} \sqrt{\sum ^N _{i=1}(X_i - \bar{X})^2}} $$

where the sums are over all persons in the population.

Clearly, this population value can be estimated by replacing the various sums with weighted sums over the sample. Effectively, this is what the correlate command will compute for you when you specify aweights.
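Concretely, writing \(w_j\) for the weight of sampled element \(j\) and \(\bar{y}_w\) and \(\bar{x}_w\) for the weighted sample means, the weighted estimator is

$$ \hat{\rho} = \frac{\sum_j w_j (y_j - \bar{y}_w)(x_j - \bar{x}_w)} {\sqrt{\sum_j w_j (y_j - \bar{y}_w)^2} \sqrt{\sum_j w_j (x_j - \bar{x}_w)^2}} $$

which is what correlate y x [aweight=wgt] reports (the rescaling that aweights apply cancels in the ratio).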

But what about a p-value?

The above comments about the equivalence of the hypotheses \(\rho = 0\) and \(\beta = 0\) might make one think, "Great, I can just use svy: regress to get a p-value for the correlation coefficient." Well, this is indeed the case, and it is indeed what I recommend, but some issues here are not at all straightforward. For the linearization variance estimator used by svy: regress and regress, robust, the p-value for the slope of the regression of Y on X is not the same as the p-value for the regression of X on Y, unlike the case for the conventional OLS variance estimator. (Try it!)
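You can see this quickly with the auto dataset shipped with Stata (a sketch; the variable pair is arbitrary):

        . sysuse auto, clear
        . regress mpg weight
        . regress weight mpg           // same slope t statistic and p-value as above
        . regress mpg weight, robust
        . regress weight mpg, robust   // p-value now differs from the robust fit above

The first two regressions report identical p-values for the slope; the two robust (linearization) fits do not.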

It’s really difficult to put into words why this is the case.

For a simple random sample from a normal population, the p-values for OLS regression Y on X, X on Y, and the correlation coefficient are all the same. This is really amazing if you think about it. After all, the slopes of Y on X and X on Y for OLS regressions are not inverses. An OLS regression of Y on X *IS* different from a regression of X on Y, so the simple random sample result should be considered the odd result. That the p-values of svy: regress y x and svy: regress x y are different should not be surprising.
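Here is the algebra behind the coincidence: under normality, the t statistic for the slope in either OLS regression can be written purely in terms of the sample correlation r,

$$ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} $$

and this expression is symmetric in X and Y, so the regression of Y on X, the regression of X on Y, and the test of the correlation itself all produce the same t statistic and p-value.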

Now, I said that the null hypotheses \(\rho = 0\) and \(\beta = 0\) are equivalent. These are hypotheses about the true population values, so how you sampled the population isn’t an issue. Because you can test \(\beta = 0\) using svy: regress, can’t that p-value serve as a test of \(\rho = 0\), given that these are equivalent hypotheses? Well... yes! So the bottom line is that you can use either svy: regress y x or svy: regress x y as a test of \(\rho = 0\). You will likely get slightly different but very similar p-values. They are slightly different tests of the same null hypothesis.

Using linearization methods, we could produce an svy command that directly estimates the variance of the weighted estimator for \(\rho\) (that is, an estimate of the variance of the estimates produced by correlate with aweights).

However, this method would not necessarily lead to better p-values; they may in fact be worse! The estimator for \(\rho\) is a ratio whose denominator contains square roots. An svy command estimating the variance of this monster would use an approximation based on the first-derivative linearization of this nonlinear function, and hence the variance estimates of the estimator of \(\rho\) would be very crude. The resulting p-values would be correspondingly crude.
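To sketch the idea in notation (mine, not anything Stata prints): \(\hat{\rho}\) is a smooth but strongly nonlinear function \(g(\hat{\theta})\) of a vector \(\hat{\theta}\) of weighted sums, and the linearization variance estimator is the delta-method quantity

$$ \widehat{\mathrm{Var}}(\hat{\rho}) \approx \nabla g(\hat{\theta})' \, \widehat{V}(\hat{\theta}) \, \nabla g(\hat{\theta}) $$

where \(\widehat{V}(\hat{\theta})\) is the design-based covariance estimate of the weighted sums. The square roots in the denominator of \(g\) are what make the first-order approximation crude.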

This is my advice: use correlate with aweights for point estimates of the correlation (as you are, no doubt, doing), and use svy: regress for p-values. Run svy: regress y x and svy: regress x y and take the larger p-value; that is the conservative thing to do. It is a fine (and perhaps superior) p-value for the test of \(\rho = 0\).
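As a worked illustration (a sketch; nhanes2 ships with Stata, and the pair height and weight is chosen only for demonstration):

        . webuse nhanes2, clear
        . svyset psu [pweight=finalwgt], strata(strata)
        . correlate height weight [aweight=finalwgt]   // weighted point estimate
        . svy: regress height weight
        . svy: regress weight height

correlate gives the weighted point estimate of \(\rho\); the larger of the two svy: regress p-values is the conservative test of \(\rho = 0\).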