Home  /  Resources & support  /  FAQs  /  Estimating correlations with survey data using svy:sem

How can I estimate correlation coefficients and their p-values for complex survey data?

Title   Estimating correlations with survey data using svy:sem
Authors Mia Lv, StataCorp
Jeff Pitblado, StataCorp

In Stata, the standard commands for computing correlations—such as correlate and pwcorr—do not support survey design data or allow the use of sampling weights (pweights). In modern Stata, we can now use svy: sem to directly estimate correlations between variables, along with their p-values, for complex survey datasets.

Here is a quick example:

webuse nhanes2l, clear

* Declare the survey data characteristics such as sampling units, weights, strata and finite population corrections 
     svyset psu [pweight=finalwgt] ,strata(strata) svy: sem (<- bmi bpsystol tcresult), standardized

The output of the above svy: sem command is

. svy: sem (<- bmi bpsystol tcresult), standardized
(running sem on estimation sample)

Survey: Structural equation model                Number of obs   =      10,351
Number of strata = 31                            Population size = 117,157,513
Number of PSUs   = 62                            Design df       =          31
Linearized
Standardized Coefficient std. err. t P>|t| [95% conf. interval]
mean(bmi) 5.262723 .0618202 85.13 0.000 5.13664 5.388806
mean(bpsystol) 5.93213 .0824527 71.95 0.000 5.763967 6.100293
mean(tcresult) 4.398236 .0482051 91.24 0.000 4.299921 4.496551
var(bmi) 1 . . .
var(bpsystol) 1 . . .
var(tcresult) 1 . . .
cov(bmi,
bpsystol) .3667122 .0113745 32.24 0.000 .3435138 .3899106
cov(bmi,
tcresult) .2007772 .0124536 16.12 0.000 .175378 .2261764
cov(bpsystol,
tcresult) .2639261 .0112336 23.49 0.000 .2410151 .2868371

The third section of the table displays the estimated covariances (which, with the standardized option, are correlations) between each pair of variables. The P>|t| column provides p-values for a test of each estimate against zero separately. In this example, all three p-values are smaller than 0.001, indicating statistically significant correlation coefficients.

Why this code computes correlations

The sem command fits a structural equation model (SEM). When specifying a model with only exogenous variables on the right side of “<-” [for example, (<- height weight race)], you are asking Stata to estimate the covariance matrix among these variables.

The standardized option displays standardized values and transforms the covariances into correlations, so the output gives you the correlation coefficients directly.

Because this command includes the svy: prefix, Stata also incorporates the survey design information specified in the svyset command. This ensures that the correlations, their standard errors, and p-values are correctly adjusted for the complex survey design, including sampling weights, PSU, strata, etc.

How to compare two correlations

After estimating the correlations (standardized covariances) among your variables with svy: sem, you may want to test whether two correlations are statistically different from each other. For example, in medical research, people may want to compare whether body mass index (bmi) has a stronger correlation with systolic blood pressure (bpsystol) than with serum cholesterol level (tcresult). The estat stdize: test command allows you to do that by performing a hypothesis test on the estimated parameters from your SEM model.

We can run the following command to test that the null hypothesis that the correlation between bmi and bpsystol is equal to the correlation between bmi and tcresult in the population:

. estat stdize: test _b[cov(bmi,bpsystol)]= _b[cov(bmi,tcresult)]

Adjusted Wald test

 ( 1)  [/]cov(bmi,bpsystol) - [/]cov(bmi,tcresult) = 0

       F(  1,    31) =  110.33
            Prob > F =    0.0000

The test results show a significant difference between the two correlations. Therefore, we conclude that body mass index has a stronger correlation with systolic blood pressure than with serum cholesterol level.

We need to specify the estat stdize prefix because the test command operates on the unstandardized parameter estimates by default even when we specified standardized with the previous svy:sem estimation command. However, when comparing correlations (which are standardized covariances), we need to use the standardized parameter estimates, not the unstandardized ones. estat stdize temporarily replaces the parameter vector with the standardized estimates (correlations) so that we can perform the hypothesis test on the standardized covariances (correlations).

If you are unsure what labels you should use with test to refer to those coefficients, you can redo the svy: sem estimation by typing sem with the option coeflegend. Here is the command and its output:

. sem, coeflegend

Survey: Structural equation model                Number of obs   =      10,351
Number of strata = 31                            Population size = 117,157,513
Number of PSUs   = 62                            Design df       =          31

Coefficient Legend
mean(bmi) 25.27584 _b[/mean(bmi)]
mean(bpsystol) 126.9458 _b[/mean(bpsystol)]
mean(tcresult) 213.0977 _b[/mean(tcresult)]
var(bmi) 23.06695 _b[/var(bmi)]
var(bpsystol) 457.947 _b[/var(bpsystol)]
var(tcresult) 2347.472 _b[/var(tcresult)]
cov(bmi,bpsystol) 37.69017 _b[/cov(bmi,bpsystol)]
cov(bmi,tcresult) 46.72074 _b[/cov(bmi,tcresult)]
cov(bpsystol,tcresult) 273.6466 _b[/cov(bpsystol,tcresult)]

Now we can see the labels for each coefficient estimated by the previous svy:sem model. Then we can use the correct labels in the test command.