Stata 15 help for corr2data

[D] corr2data -- Create dataset with specified correlation structure

Syntax

corr2data newvarlist [, options]

    options               Description
    -------------------------------------------------------------------------
    Main
      clear               replace the current dataset
      double              generate variable type as double; default is float
      n(#)                generate # observations; default is current number
      sds(vector)         standard deviations of generated variables
      corr(matrix|vector) correlation matrix
      cov(matrix|vector)  covariance matrix
      cstorage(full)      store correlation/covariance structure as a
                            symmetric k*k matrix
      cstorage(lower)     store correlation/covariance structure as a lower
                            triangular matrix
      cstorage(upper)     store correlation/covariance structure as an upper
                            triangular matrix
      forcepsd            force the covariance/correlation matrix to be
                            positive semidefinite
      means(vector)       means of generated variables; default is means(0)

    Options
      seed(#)             seed for random-number generator
    -------------------------------------------------------------------------

Menu

Data > Create or change data > Other variable-creation commands > Create dataset with specified correlation

Description

corr2data adds new variables with a specified covariance (correlation) structure to the existing dataset or creates a new dataset with a specified covariance (correlation) structure. Singular covariance (correlation) structures are permitted.

The purpose of corr2data is to let you perform analyses from summary statistics (correlations or covariances, and possibly the means) when those statistics are all you know and are sufficient to obtain results. For example, they are sufficient for t tests, analysis of variance, principal components, regression, and factor analysis. The recommended process is

    . clear                                (clear memory)
    . corr2data ..., n(#) cov(...) ...     (create artificial data)
    . regress ...                          (use artificial data appropriately)

However, for factor analyses and principal components, the commands factormat and pcamat allow you to skip the step of using corr2data; see [MV] factor and [MV] pca.
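The recommended process can be sketched for a regression as follows. This is a minimal illustration, not output from a real study: the variable names y, x1, and x2, the matrices M, S, and R, and every numeric value are hypothetical stand-ins for summary statistics you might take from a published table.

```stata
* Start with an empty dataset so that n(150) creates a new one
clear

* Hypothetical summary statistics for y, x1, x2 (illustrative values only)
matrix M = (10, 5, 3)                                // means
matrix S = (2, 1, 1.5)                               // standard deviations
matrix R = (1, .6, .4 \ .6, 1, .2 \ .4, .2, 1)       // correlation matrix

* Create artificial data that reproduce these statistics
corr2data y x1 x2, n(150) means(M) sds(S) corr(R)

* Check that the artificial data carry the requested statistics
summarize
correlate y x1 x2

* Use the artificial data for an analysis that depends only on them
regress y x1 x2
```

The value passed to n() should be the sample size behind the original summary statistics, because standard errors and test statistics depend on it.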

The data created by corr2data are artificial; they are not the original data, and they are not a sample from an underlying population with the specified summary statistics. See [D] drawnorm if you want to generate a random sample. In a sample, the summary statistics will differ from the population values and will differ from one sample to the next.

The dataset corr2data creates is suitable for one purpose only: performing analyses when all that is known are summary statistics and those summary statistics are sufficient for the analysis at hand. The artificial data trick the analysis command into producing the desired result. The analysis command, being by assumption a function of the summary statistics only, extracts from the artificial data the same summary statistics you specified and then bases its calculation on them.

If you doubt whether the analysis depends only on the specified summary statistics, you can generate different artificial datasets by using different seeds of the random-number generator (see the seed() option below) and compare the results, which should be the same within rounding error.
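The check described above might look like the following sketch. The matrix C, the scalars b1 and b2, and the seed values are illustrative choices; the double option is used here only to keep storage rounding small.

```stata
* Correlation structure shared by both artificial datasets
matrix C = (1, .5 \ .5, 1)

* First artificial dataset
clear
corr2data x y, n(200) corr(C) seed(1) double
regress y x
scalar b1 = _b[x]

* Second artificial dataset, different seed
* (clear drops the data but keeps matrices and scalars)
clear
corr2data x y, n(200) corr(C) seed(2) double
regress y x
scalar b2 = _b[x]

* The two slopes should agree up to rounding error
display "relative difference in slopes: " reldif(b1, b2)
```

If the relative difference is not close to zero, the analysis depends on more than the summary statistics you supplied, and the artificial data should not be used for it.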

Options

        +------+
    ----+ Main +-------------------------------------------------------------

clear specifies that it is okay to replace the dataset in memory, even though the current dataset has not been saved on disk.

double specifies that the new variables be stored as Stata doubles, meaning 8-byte reals. If double is not specified, variables are stored as floats, meaning 4-byte reals. See [D] data types.

n(#) specifies the number of observations to be generated; the default is the current number of observations. If n(#) is not specified or is the same as the current number of observations, corr2data adds the new variables to the existing dataset; otherwise, corr2data replaces the dataset in memory.
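The two behaviors of n(#) can be sketched as follows, assuming the auto dataset shipped with Stata (74 observations) is available through sysuse. The variable names u, v, p, and q and the matrix C are illustrative.

```stata
matrix C = (1, .5 \ .5, 1)

* n() not specified: u and v are added to the 74 existing observations
sysuse auto, clear
corr2data u v, corr(C)
describe, short

* n() differs from the current number of observations: the dataset is
* replaced; clear is needed because the data in memory have changed
corr2data p q, n(50) corr(C) clear
describe, short
```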

sds(vector) specifies the standard deviations of the generated variables. sds() may not be specified with cov().

corr(matrix|vector) specifies the correlation matrix. If neither corr() nor cov() is specified, the default is orthogonal data.

cov(matrix|vector) specifies the covariance matrix. If neither corr() nor cov() is specified, the default is orthogonal data.

cstorage(full|lower|upper) specifies the storage mode for the correlation or covariance structure in corr() or cov(). The following storage modes are supported:

full specifies that the correlation or covariance structure is stored (recorded) as a symmetric k*k matrix.

lower specifies that the correlation or covariance structure is recorded as a lower triangular matrix. With k variables, the matrix should have k(k+1)/2 elements in the following order:

C(11) C(21) C(22) C(31) C(32) C(33) ... C(k1) C(k2) ... C(kk)

upper specifies that the correlation or covariance structure is recorded as an upper triangular matrix. With k variables, the matrix should have k(k+1)/2 elements in the following order:

C(11) C(12) C(13) ... C(1k) C(22) C(23) ... C(2k) ... C(k-1,k-1) C(k-1,k) C(kk)

Specifying cstorage(full) is optional if the matrix is square. cstorage(lower) or cstorage(upper) is required for the vectorized storage methods. See storage modes for examples.
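As an illustration of the vectorized storage modes, the sketch below writes one 3 x 3 correlation matrix in lower and in upper form and passes each to corr2data. The matrix names Clow and Cup, the variable names, and the correlation values are hypothetical.

```stata
* Target correlations: C(21) = .3, C(31) = .2, C(32) = .4

* Lower storage: C(11) C(21) C(22) C(31) C(32) C(33)
matrix Clow = (1, .3, 1, .2, .4, 1)

* Upper storage: C(11) C(12) C(13) C(22) C(23) C(33)
matrix Cup = (1, .3, .2, 1, .4, 1)

clear
corr2data a b c, n(500) corr(Clow) cstorage(lower)
correlate a b c
matrix Rlow = r(C)

clear
corr2data a b c, n(500) corr(Cup) cstorage(upper)
correlate a b c
matrix Rup = r(C)

* Both storage modes should describe the same correlation structure
display "max relative difference: " mreldif(Rlow, Rup)
```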

forcepsd modifies the matrix C to be positive semidefinite (psd) and to thus be a proper covariance matrix. If C is not positive semidefinite, it will have negative eigenvalues. By setting the negative eigenvalues to 0 and reconstructing, we obtain the least-squares positive-semidefinite approximation to C. This approximation is a singular covariance matrix.
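The adjustment described above can be reproduced by hand with Stata's matrix commands. This is a sketch under stated assumptions: the matrix C below is an illustrative symmetric matrix with one negative eigenvalue, and the manual reconstruction follows the description in this entry rather than the command's internal code, so the final comparison is a check, not a guarantee.

```stata
* Symmetric matrix that is not positive semidefinite (illustrative values)
matrix C = (1, .9, -.9 \ .9, 1, .9 \ -.9, .9, 1)

* Eigen-decomposition: columns of X are eigenvectors, v holds eigenvalues
matrix symeigen X v = C
matrix list v

* Set negative eigenvalues to 0 and reconstruct the matrix
forvalues j = 1/3 {
    matrix v[1,`j'] = max(v[1,`j'], 0)
}
matrix Cpsd = X * diag(v) * X'
matrix list Cpsd

* Ask corr2data to make the adjustment itself
clear
corr2data x y z, n(100) cov(C) forcepsd double
correlate x y z, covariance
matrix V = r(C)

* Compare the covariance of the generated data with the manual result
display "max relative difference: " mreldif(V, Cpsd)
```

Note that the diagonal of the adjusted matrix generally differs from the diagonal of C, so the variances of the generated variables need not match the values originally supplied.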

means(vector) specifies the means of the generated variables. The default is means(0).

        +---------+
    ----+ Options +----------------------------------------------------------

seed(#) specifies the seed of the random-number generator used to generate data. # defaults to 0. The random numbers generated inside corr2data do not affect the seed of the standard random-number generator.

Examples

Create new dataset with 2000 observations having mean and standard deviation for x of 2 and .5 and for y of 3 and 2
    . corr2data x y, n(2000) means(2 3) sds(.5 2)

Display summary statistics
    . summarize

Setup
    . clear
    . matrix C = (1, .5 \ .5, 1)

Create new dataset with 2000 observations with variables x and y correlated as defined by matrix C
    . corr2data x y, n(2000) corr(C)

Display correlation matrix
    . correlate x y


© Copyright 1996–2018 StataCorp LLC