
Title | Estimating correlations with survey data using svy:sem | |
Authors |
Mia Lv, StataCorp Jeff Pitblado, StataCorp |
In Stata, the standard commands for computing correlations—such as correlate and pwcorr—do not support survey design data or allow the use of sampling weights (pweights). For older Stata versions, we had FAQ: How can I estimate correlations and their level of significance with survey data?, which explained how to work around using correlate with aweights to get the point estimates and svy: regress to obtain the p-values. In modern Stata, we can now use svy: sem to directly estimate correlations between variables, along with their p-values, for complex survey datasets.
Here is a quick example:
webuse nhanes2l, clear * Declare the survey data characteristics such as sampling units, weights, strata and finite population corrections svyset psu [pweight=finalwgt] ,strata(strata) svy: sem (<- bmi bpsystol tcresult), standardized
The output of the above svy: sem command is
. svy: sem (<- bmi bpsystol tcresult), standardized (running sem on estimation sample) Survey: Structural equation model Number of obs = 10,351 Number of strata = 31 Population size = 117,157,513 Number of PSUs = 62 Design df = 31
Linearized | ||
Standardized | Coefficient std. err. t P>|t| [95% conf. interval] | |
mean(bmi) | 5.262723 .0618202 85.13 0.000 5.13664 5.388806 | |
mean(bpsystol) | 5.93213 .0824527 71.95 0.000 5.763967 6.100293 | |
mean(tcresult) | 4.398236 .0482051 91.24 0.000 4.299921 4.496551 | |
var(bmi) | 1 . . . | |
var(bpsystol) | 1 . . . | |
var(tcresult) | 1 . . . | |
cov(bmi, | ||
bpsystol) | .3667122 .0113745 32.24 0.000 .3435138 .3899106 | |
cov(bmi, | ||
tcresult) | .2007772 .0124536 16.12 0.000 .175378 .2261764 | |
cov(bpsystol, | ||
tcresult) | .2639261 .0112336 23.49 0.000 .2410151 .2868371 | |
The third section of the table displays the estimated covariances (which, with the standardized option, are correlations) between each pair of variables. The P>|t| column provides p-values for a test of each estimate against zero separately. In this example, all three p-values are smaller than 0.001, indicating statistically significant correlation coefficients.
The sem command fits a structural equation model (SEM). When specifying a model with only exogenous variables on the right side of “<-” [for example, (<- height weight race)], you are asking Stata to estimate the covariance matrix among these variables.
The standardized option displays standardized values and transforms the covariances into correlations, so the output gives you the correlation coefficients directly.
Because this command includes the svy: prefix, Stata also incorporates the survey design information specified in the svyset command. This ensures that the correlations, their standard errors, and p-values are correctly adjusted for the complex survey design, including sampling weights, PSU, strata, etc.
In the following, we use the collect suite of commands to build customizable tables that look like the output tables produced by correlate and pwcorr:
webuse nhanes2l, clear * Declare the survey data characteristics such as sampling units, weights, strata, and finite population corrections svyset psu [pweight=finalwgt] ,strata(strata) svy: sem (<- bmi bpsystol tcresult), standardized collect clear * Collect estimation results reported in r(table) collect get r() * Get list of variables for looping to add tags for custom table layouts local vlist = e(oxvars) local k_vlist : list sizeof vlist forvalues i = 1/`k_vlist' { local v1 : word `i' of `vlist' * Add row and col tags for the diagonal elements of the estimated * correlation matrix, use result[_r_b] in -fortags()- to ignore * the missing valued p-values collect addtags row[`v1'] col[`v1'], fortags(result[_r_b]#colname[var(`v1')]) forvalues j = `=`i'+1'/`k_vlist' { local v2 : word `j' of `vlist' * add row and col tags for the lower diagonal elements of * the estimated correlation matrix and collect addtags row[`v2'] col[`v1'], fortags(colname[cov(`v1',`v2')]) } local label: variable label `v1' collect label levels row `v1' `"`label'"', modify collect label levels col `v1' `"`label'"', modify } * Apply the style etable to our table collect style use etable, replace collect style header row, level(label) collect style header col, level(label) * The first layout: Stack correlations on their p-values collect layout (row#result[_r_b _r_p]) (col#stars[value]) * The second layout: show correlations with stars and stars note collect stars, shownote collect layout (row#result[_r_b]) (col#stars)
At the end of the above code, I organized the results into two different table layouts. In the first layout, the correlations are stacked on top of their p-values:
. collect layout (row#result[_r_b _r_p]) (col#stars[value]) Collection: default Rows: row#result[_r_b _r_p] Columns: col#stars[value] Table 1: 5 x 3
Body mass index (BMI) Systolic blood pressure Serum cholesterol (mg/dL) |
Body mass index (BMI) 1.000 |
Systolic blood pressure 0.367 1.000 |
0.00 |
Serum cholesterol (mg/dL) 0.201 0.264 1.000 |
0.00 0.00 |
In the second layout, stars for significant results are inserted after each correlation, and the star notes are displayed:
. collect stars, shownote . collect layout (row#result[_r_b]) (col#stars) Collection: default Rows: row#result[_r_b] Columns: col#stars Table 1: 3 x 5
Body mass index (BMI) Systolic blood pressure Serum cholesterol (mg/dL) |
Body mass index (BMI) 1.000 |
Systolic blood pressure 0.367 ** 1.000 |
Serum cholesterol (mg/dL) 0.201 ** 0.264 ** 1.000 |
In the above code, we apply the predefined style for etable to our customized table using the command
. collect style use etable, replace
This style is the system default for tables generated with the etable command. We apply this style because this style looks nice, and it already has rules for displaying significant stars (one star for p-values less than 0.05 and two stars for p-values less than 0.01). So there is no need to add stars using collect stars separately.
After this style is applied, the new star items are created and tagged with result[_r_b] and a new tag: stars[label]. All other existing items, such as those numbers tagged with result[_r_b] or result[_r_p], are tagged with stars[value].
This is why, when you specify the tag stars[value] in the first layout with
. collect layout (row#result[_r_b _r_p]) (col#stars[value])
the table does not include the stars.
However, the table displays the significant stars along with the correlations when we include both levels from the stars dimension (not specifying any levels means including all the levels when autolevels are not defined for this dimension):
. collect layout (row#result[_r_b]) (col#stars)
Why do we need to include the stars dimension in the second layout? Because for some combinations of the levels of row and col and result[_r_b], collect layout will not find a unique value. In other words, it will find multiple values (stars and correlation coefficient). When that happens, nothing will be displayed for those cells. For a more detailed introduction to tags and specifying layout in collections, please refer to [TABLES] Collection principles.
Tables in collections can be exported to other file formats easily. Please see FAQ: What methods can we use to export a customizable table from Stata to another format? for more detailed information on how to export such tables.
After estimating the correlations (standardized covariances) among your variables with svy: sem, you may want to test whether two correlations are statistically different from each other. For example, in medical research, people may want to compare whether body mass index (bmi) has a stronger correlation with systolic blood pressure (bpsystol) than with serum cholesterol level (tcresult). The estat stdize: test command allows you to do that by performing a hypothesis test on the estimated parameters from your SEM model.
We can run the following command to test that the null hypothesis that the correlation between bmi and bpsystol is equal to the correlation between bmi and tcresult in the population:
. estat stdize: test _b[cov(bmi,bpsystol)]= _b[cov(bmi,tcresult)] Adjusted Wald test ( 1) [/]cov(bmi,bpsystol) - [/]cov(bmi,tcresult) = 0 F( 1, 31) = 110.33 Prob > F = 0.0000
The test results show a significant difference between the two correlations. Therefore, we conclude that body mass index has a stronger correlation with systolic blood pressure than with serum cholesterol level.
We need to specify the estat stdize prefix because the test command operates on the unstandardized parameter estimates by default even when we specified standardized with the previous svy:sem estimation command. However, when comparing correlations (which are standardized covariances), we need to use the standardized parameter estimates, not the unstandardized ones. estat stdize temporarily replaces the parameter vector with the standardized estimates (correlations) so that we can perform the hypothesis test on the standardized covariances (correlations).
If you are unsure what labels you should use with test to refer to those coefficients, you can redo the svy: sem estimation by typing sem with the option coeflegend. Here is the command and its output:
. sem, coeflegend Survey: Structural equation model Number of obs = 10,351 Number of strata = 31 Population size = 117,157,513 Number of PSUs = 62 Design df = 31
Coefficient Legend | ||
mean(bmi) | 25.27584 _b[/mean(bmi)] | |
mean(bpsystol) | 126.9458 _b[/mean(bpsystol)] | |
mean(tcresult) | 213.0977 _b[/mean(tcresult)] | |
var(bmi) | 23.06695 _b[/var(bmi)] | |
var(bpsystol) | 457.947 _b[/var(bpsystol)] | |
var(tcresult) | 2347.472 _b[/var(tcresult)] | |
cov(bmi,bpsystol) | 37.69017 _b[/cov(bmi,bpsystol)] | |
cov(bmi,tcresult) | 46.72074 _b[/cov(bmi,tcresult)] | |
cov(bpsystol,tcresult) | 273.6466 _b[/cov(bpsystol,tcresult)] | |
Now we can see the labels for each coefficient estimated by the previous svy:sem model. Then we can use the correct labels in the test command.