help corr2data dialog: corr2data
-------------------------------------------------------------------------------
Title
[D] corr2data -- Create dataset with specified correlation structure
Syntax
corr2data newvarlist [, options]
options                  Description
-------------------------------------------------------------------------
Main
  clear                  replace the current dataset
  double                 generate variable type as double; default is
                           float
  n(#)                   # of observations to be generated; default is
                           current number
  sds(vector)            standard deviations of generated variables
  corr(matrix|vector)    correlation matrix
  cov(matrix|vector)     covariance matrix
  cstorage(full)         correlation/covariance structure is stored as a
                           symmetric k*k matrix
  cstorage(lower)        correlation/covariance structure is stored as a
                           lower triangular matrix
  cstorage(upper)        correlation/covariance structure is stored as an
                           upper triangular matrix
  forcepsd               force the covariance/correlation matrix to be
                           positive semidefinite
  means(vector)          means of generated variables; default is
                           means(0)

Options
  seed(#)                seed for random-number generator
-------------------------------------------------------------------------
Menu
Data > Create or change data > Other variable-creation commands > Create
dataset with specified correlation
Description
corr2data adds new variables with specified covariance (correlation)
structure to the existing dataset or creates a new dataset with a
specified covariance (correlation) structure. Singular covariance
(correlation) structures are permitted. Its purpose is to let you perform
analyses from summary statistics (correlations or covariances, and
possibly the means) when those summary statistics are all you know and
are sufficient to obtain results. For example, these summary statistics
are sufficient for t tests, analysis of variance, principal component
analysis, regression, and factor analysis. The recommended process is
. clear                                (clear memory)
. corr2data ..., n(#) cov(...) ...     (create artificial data)
. regress ...                          (use artificial data appropriately)
However, for factor analyses and principal components, the commands
factormat and pcamat allow you to skip the step of using corr2data; see
[MV] factor and [MV] pca.
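As a minimal sketch of the recommended process, suppose a published table
reports only the correlations, means, and standard deviations of three
variables measured on 100 observations. The variable names (y, x1, x2) and
every numeric value below are made up for illustration; they do not come
from any real study.
. clear
. * hypothetical summary statistics, for illustration only
. matrix C = (1, .5, .3 \ .5, 1, .4 \ .3, .4, 1)
. corr2data y x1 x2, n(100) corr(C) means(10 2 3) sds(2 1 1.5)
. regress y x1 x2
Setting n() to the original sample size matters here because regress uses
the number of observations when it computes standard errors.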
The data created by corr2data are artificial; they are not the original
data, and they are not a sample from an underlying population with the
summary statistics specified. See [D] drawnorm if you want to generate a
random sample. In a sample, the summary statistics will differ from the
population values and will differ from one sample to the next.
The dataset corr2data creates is suitable for one purpose only:
performing analyses when all that is known are summary statistics and
those summary statistics are sufficient for the analysis at hand. The
artificial data trick the analysis command into producing the desired
result. The analysis command, being by assumption only a function of the
summary statistics, extracts from the artificial data the summary
statistics, which are the same summary statistics you specified, and then
makes its calculation based on those statistics.
If you doubt whether the analysis depends only on the specified summary
statistics, you can generate different artificial datasets by using
different seeds of the random-number generator (see the seed() option
below) and compare the results, which should be the same within rounding
error.
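One way to run this check is sketched below with an illustrative 2 x 2
correlation matrix. The matrix, the variable names, and the two seed values
are arbitrary choices, not values taken from this help file.
. clear
. * illustrative correlation matrix
. matrix C = (1, .5 \ .5, 1)
. corr2data x y, n(500) corr(C) seed(1)
. regress y x
. clear
. * same specification, different seed
. corr2data x y, n(500) corr(C) seed(2)
. regress y x
If regress depends only on the specified statistics, the two regression
tables should agree up to rounding error, even though the individual
observations differ. The second clear is needed because n(500) would
otherwise match the current number of observations, and corr2data would
try to add x and y to a dataset that already contains them.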
Options
    +------+
----+ Main +-------------------------------------------------------------
clear specifies that it is okay to replace the dataset in memory, even
though the current dataset has not been saved on disk.
double specifies that the new variables be stored as Stata doubles,
meaning 8-byte reals. If double is not specified, variables are
stored as floats, meaning 4-byte reals. See [D] data types.
n(#) specifies the number of observations to be generated; the default is
the current number of observations. If n(#) is not specified or is
the same as the current number of observations, corr2data adds the
new variables to the existing dataset; otherwise, corr2data replaces
the dataset in memory.
sds(vector) specifies the standard deviations of the generated variables.
sds() may not be specified with cov().
corr(matrix|vector) specifies the correlation matrix. If neither corr()
nor cov() is specified, the default is orthogonal data.
cov(matrix|vector) specifies the covariance matrix. If neither corr()
nor cov() is specified, the default is orthogonal data.
cstorage(full|lower|upper) specifies the storage mode for the correlation
or covariance structure in corr() or cov(). The following storage
modes are supported:
full specifies that the correlation or covariance structure is stored
(recorded) as a symmetric k*k matrix.
lower specifies that the correlation or covariance structure is
recorded as a lower triangular matrix. With k variables, the matrix
should have k(k+1)/2 elements in the following order:
C(11) C(21) C(22) C(31) C(32) C(33) ... C(k1) C(k2) ... C(kk)
upper specifies that the correlation or covariance structure is
recorded as an upper triangular matrix. With k variables, the matrix
should have k(k+1)/2 elements in the following order:
C(11) C(12) C(13) ... C(1k) C(22) C(23) ... C(2k) ...
C(k-1k-1) C(k-1k) C(kk)
Specifying cstorage(full) is optional if the matrix is square.
cstorage(lower) or cstorage(upper) is required for the vectorized
storage methods. See storage modes for examples.
forcepsd modifies the matrix C to be positive semidefinite (psd) and to
thus be a proper covariance matrix. If C is not positive
semidefinite, it will have negative eigenvalues. By setting the
negative eigenvalues to 0 and reconstructing, we obtain the
least-squares positive-semidefinite approximation to C. This
approximation is a singular covariance matrix.
means(vector) specifies the means of the generated variables. The
default is means(0).
    +---------+
----+ Options +----------------------------------------------------------
seed(#) specifies the seed of the random-number generator used to
generate data. # defaults to 0. The random numbers generated inside
corr2data do not affect the seed of the standard random-number
generator.
Examples
Create new dataset with 2000 observations in which x has mean 2 and
standard deviation .5 and y has mean 3 and standard deviation 2
. corr2data x y, n(2000) means(2 3) sds(.5 2)
Display summary statistics
. summarize
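The vectorized storage modes can be sketched in the same way. Here a
hypothetical 3 x 3 correlation structure is entered as a row vector in
lower-triangular order, C(11) C(21) C(22) C(31) C(32) C(33); the values and
the variable names x, y, and z are illustrative only.
. clear
. * lower-triangular elements of an illustrative correlation matrix
. matrix L = (1, .5, 1, .3, .4, 1)
. corr2data x y z, n(2000) corr(L) cstorage(lower)
. correlate x y z
With this ordering, correlate should report a correlation of .5 between x
and y, .3 between x and z, and .4 between y and z.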
Setup
. clear
. matrix C = (1, .5 \ .5, 1)
Create new dataset with 2000 observations with variables x and y
correlated as defined by matrix C
. corr2data x y, n(2000) corr(C)
Display correlation matrix
. correlate x y
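The forcepsd option can be tried on a matrix that is not positive
semidefinite. The matrix B below is a made-up example: its determinant is
negative, so it has at least one negative eigenvalue and is not a valid
correlation matrix as given.
. clear
. * illustrative matrix that is not positive semidefinite
. matrix B = (1, .9, -.9 \ .9, 1, .9 \ -.9, .9, 1)
. corr2data a b c, n(2000) corr(B) forcepsd
. correlate a b c
Because forcepsd replaces B with an approximation, the correlations
displayed will generally differ from the entries of B.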
Also see
Manual: [D] corr2data
Help: [D] drawnorm, [D] data types