Note: This FAQ is for users of Stata 11. It is not relevant for more recent versions.

The following discussion was extracted from a question and response that was posted on Statalist.

How can I estimate correlations and their level of significance with survey data?

Title   Estimating correlations with survey data
Author  Bill Sribney, StataCorp

There are two options:

(1) use correlate with aweights for point estimates of the correlation.

(2) use svy: regress for p-values. Run svy: regress y x and svy: regress x y and take the larger p-value, which is the conservative thing to do.

Consider a fixed finite population of N elements from which the sample was drawn. The population (i.e., true) value of Pearson’s correlation coefficient rho for two variables X and Y is

\[
\rho = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})(X_i - \bar{X})}
            {\left\{\left[\sum_{i=1}^{N} (Y_i - \bar{Y})^2\right]
                    \left[\sum_{i=1}^{N} (X_i - \bar{X})^2\right]\right\}^{1/2}}
\]

where the sums are over all elements of the population.

From my viewpoint, rho is a fixed population parameter that one estimates based on a sample.

Now, rho measures the linear association of X and Y. The beta coefficient from a linear regression of Y on X (or X on Y) for the entire population yields an equivalent parameter (i.e., if one knows the population standard deviations of X and Y, one can derive the linear regression slope parameter from the correlation parameter and vice versa). The null hypotheses rho = 0 and beta = 0 are equivalent. Here I am talking about population parameters, i.e., the true values of the parameters.
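
In the notation above, with sigma_X and sigma_Y denoting the population standard deviations of X and Y, the two parameters are linked by

\[
\beta_{Y|X} = \rho \, \frac{\sigma_Y}{\sigma_X}, \qquad
\beta_{X|Y} = \rho \, \frac{\sigma_X}{\sigma_Y}
\]

so each slope is zero exactly when rho is zero.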

A mechanical equivalence also exists for the standard estimates computed from a simple random sample. One can derive the estimate of the regression slope coefficient from the estimate of the correlation coefficient and vice versa (again, assuming one has the standard deviations).

Furthermore, for a simple random sample from a normal population, the p-value from a linear regression of Y on X (or X on Y) is the same as the p-value for the test of Pearson’s correlation coefficient.
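
As a quick illustration, the following commands all report the same p-value on (non-survey) data; the auto dataset and the mpg/weight variables are just convenient stand-ins:

        . sysuse auto, clear
        . pwcorr mpg weight, sig
        . regress mpg weight
        . regress weight mpg

The significance level printed by pwcorr, sig and the p-values on the slope coefficients from the two regressions all agree.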

Now, consider a complex survey sample.

As I wrote above, the population (i.e., true) value of Pearson’s correlation coefficient is

\[
\rho = \frac{\sum_{i=1}^{N} (Y_i - \bar{Y})(X_i - \bar{X})}
            {\left\{\left[\sum_{i=1}^{N} (Y_i - \bar{Y})^2\right]
                    \left[\sum_{i=1}^{N} (X_i - \bar{X})^2\right]\right\}^{1/2}}
\]

where the sums are over all persons in the population.

Clearly, this population value can be estimated by replacing the various sums with weighted sums over the sample. Effectively, this is what the correlate command will compute for you when you specify aweights.
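
For example, here is a minimal sketch using the NHANES II practice dataset (the particular variables are arbitrary stand-ins):

        . webuse nhanes2, clear
        . correlate bpsystol weight [aweight=finalwgt]

The off-diagonal entry of the reported matrix is the weighted point estimate of rho.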

But what about a p-value?

The above comments about the equivalence of the hypotheses rho = 0 and beta = 0 might make one think, "Great, I can just use svy: regress to get a p-value for the correlation coefficient." Well, this is indeed the case, and it is indeed what I recommend, but there are some issues here that are not at all straightforward. For the linearization variance estimator used by svy: regress and regress, robust, the p-value for the slope of the regression of Y on X is not the same as the p-value for the regression of X on Y, unlike the case for the OLS regression estimator. (Try it!)
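
For instance, here is a sketch of that "try it" using the NHANES II practice dataset (any complex survey dataset would do):

        . webuse nhanes2, clear
        . svyset psuid [pweight=finalwgt], strata(stratid)
        . svy: regress bpsystol weight
        . svy: regress weight bpsystol

The p-values on the two slope coefficients will be close but not identical.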

It’s really difficult to put into words why this is the case.

For a simple random sample from a normal population, the p-values for OLS regression Y on X, X on Y, and the correlation coefficient are all the same. This is really amazing if you think about it. After all, the slopes of Y on X and X on Y for OLS regressions are not inverses. An OLS regression of Y on X *IS* different from a regression of X on Y, so the simple random sample result should be considered the odd result. That the p-values of svy: regress y x and svy: regress x y are different should not be surprising.

Now, I said that the null hypotheses rho = 0 and beta = 0 are equivalent. These are hypotheses about the true population values, so how you sampled the population isn’t an issue. Since you can test beta = 0 using svy: regress, can’t this be used as a p-value for the test of rho = 0 since these are equivalent hypotheses? Well... yes! So the bottom line is that you can use either svy: regress y x or svy: regress x y as a test of rho = 0. You will likely get slightly different but very similar p-values. They are slightly different tests of the same null hypothesis.

Using linearization methods, we could produce an svy command that directly estimates the variance of the weighted estimator for rho (i.e., an estimate of the variance of the estimates produced by correlate with weights).

However, this method would not necessarily lead to any better p-values; they may in fact be worse! The estimator for rho is a ratio whose denominator contains a square root. An svy command estimating the variance of this monster would use an approximation based on the first-derivative linearization of this nonlinear function, so the variance estimates, and hence the resulting p-values, would be very crude.

This is my advice: use correlate with aweights for point estimates of the correlation (as you are, no doubt, doing) and use svy: regress for p-values. Run svy: regress y x and svy: regress x y and take the larger p-value; that is the conservative thing to do. This is a fine (and perhaps superior) p-value for the test of rho = 0.
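
Putting that advice together as a minimal sketch, where y, x, psuvar, stratvar, and wgtvar are hypothetical placeholders for your own analysis and design variables:

        . svyset psuvar [pweight=wgtvar], strata(stratvar)
        . correlate y x [aweight=wgtvar]     // weighted point estimate of rho
        . svy: regress y x                   // one test of beta = 0
        . svy: regress x y                   // the same hypothesis, other direction

Report the larger of the two svy: regress p-values as your conservative test of rho = 0.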