**[D] corr2data** -- Create dataset with specified correlation structure

__Syntax__

**corr2data** *newvarlist* [**,** *options*]

*options*                          Description
-------------------------------------------------------------------------
Main
**clear**                          replace the current dataset
__d__**ouble**                     generate variable type as **double**; default is **float**
**n(***#***)**                     generate *#* observations; default is current number
__sd__**s(***vector***)**          standard deviations of generated variables
**corr(***matrix*|*vector***)**    correlation matrix
**cov(***matrix*|*vector***)**     covariance matrix
__cs__**torage(**__f__**ull)**     store correlation/covariance structure as a symmetric k*k matrix
__cs__**torage(**__l__**ower)**    store correlation/covariance structure as a lower triangular matrix
__cs__**torage(**__u__**pper)**    store correlation/covariance structure as an upper triangular matrix
**forcepsd**                       force the covariance/correlation matrix to be positive semidefinite
__m__**eans(***vector***)**        means of generated variables; default is **means(0)**

Options
**seed(***#***)**                  seed for random-number generator
-------------------------------------------------------------------------

__Menu__

**Data > Create or change data > Other variable-creation commands > Create**
**dataset with specified correlation**

__Description__

**corr2data** adds new variables with specified covariance (correlation)
structure to the existing dataset or creates a new dataset with a
specified covariance (correlation) structure. Singular covariance
(correlation) structures are permitted. The purpose of this is to allow
you to perform analyses from summary statistics (correlations/covariances
and maybe the means) when these summary statistics are all you know and
summary statistics are sufficient to obtain results. For example, these
summary statistics are sufficient for performing t tests, analysis of
variance, principal component analysis, regression, and factor analysis. The
recommended process is

**. clear** (clear memory)
**. corr2data** ...**,** **n(***#***)** **cov(**...**)** ... (create artificial data)
**. regress** ... (use artificial data appropriately)

However, for factor analyses and principal components, the commands
**factormat** and **pcamat** allow you to skip the step of using **corr2data**; see
**[MV] factor** and **[MV] pca**.
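
As a minimal sketch of the recommended process, suppose a published table reports only the correlation between two variables together with their means and standard deviations. The numbers and variable names below are hypothetical and chosen only for illustration.

```stata
* Hypothetical summary statistics:
*   corr(y, x) = .5; mean(y) = 2, sd(y) = .5; mean(x) = 3, sd(x) = 2
clear
matrix C = (1, .5 \ .5, 1)

* Create artificial data that reproduce these summary statistics
corr2data y x, n(2000) corr(C) means(2 3) sds(.5 2)

* Run an analysis that depends only on the summary statistics
regress y x
```

Under these assumptions, the slope on **x** should equal corr(y, x) * sd(y) / sd(x) = 0.125 within rounding error. Because standard errors depend on the sample size, **n()** would normally be set to the number of observations behind the reported statistics.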

The data created by **corr2data** are artificial; they are not the original
data, and they are not a sample from an underlying population with the
summary statistics specified. See **[D] drawnorm** if you want to generate a
random sample. In a sample, the summary statistics will differ from the
population values and will differ from one sample to the next.
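
The distinction can be illustrated with a short comparison. This is only a sketch; the correlation of .5, the sample size, and the seed value are arbitrary choices.

```stata
clear
matrix C = (1, .5 \ .5, 1)

* drawnorm draws a random sample, so the sample correlation
* is close to, but generally not exactly, .5
drawnorm x y, n(2000) corr(C) seed(12345)
correlate x y

* corr2data builds artificial data whose correlation matches C
clear
corr2data x y, n(2000) corr(C)
correlate x y
```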

The dataset **corr2data** creates is suitable for one purpose only:
performing analyses when all that is known are summary statistics and
those summary statistics are sufficient for the analysis at hand. The
artificial data trick the analysis command into producing the desired
result. The analysis command, being by assumption only a function of the
summary statistics, extracts from the artificial data the summary
statistics, which are the same summary statistics you specified, and then
makes its calculation based on those statistics.

If you doubt whether the analysis depends only on the specified summary
statistics, you can generate different artificial datasets by using
different seeds of the random-number generator (see the **seed()** option
below) and compare the results, which should be the same within rounding
error.
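
A check of this kind might look like the following sketch, in which the seed values 1 and 2 and the correlation matrix are arbitrary.

```stata
clear
matrix C = (1, .5 \ .5, 1)

* First artificial dataset
corr2data x y, n(2000) corr(C) seed(1)
regress y x

* Second artificial dataset, generated with a different seed
clear
corr2data x y, n(2000) corr(C) seed(2)
regress y x
```

If the analysis depends only on the specified summary statistics, as linear regression does, the two sets of output should agree within rounding error even though the individual observations differ.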

__Options__

+------+
----+ Main +-------------------------------------------------------------

**clear** specifies that it is okay to replace the dataset in memory, even
though the current dataset has not been saved on disk.

**double** specifies that the new variables be stored as Stata **double**s,
meaning 8-byte reals. If **double** is not specified, variables are
stored as **float**s, meaning 4-byte reals. See **[D] data types**.

**n(***#***)** specifies the number of observations to be generated; the default is
the current number of observations. If **n(***#***)** is not specified or is
the same as the current number of observations, **corr2data** adds the
new variables to the existing dataset; otherwise, **corr2data** replaces
the dataset in memory.
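
The two behaviors of **n()** can be sketched with the **auto** dataset that ships with Stata. The variable names **z1** and **z2** are hypothetical.

```stata
* auto.dta has 74 observations
sysuse auto, clear

* n() omitted: z1 and z2 are added to the 74 existing observations
corr2data z1 z2
correlate z1 z2

* n() differs from the current number of observations: the dataset in
* memory is replaced, so clear is specified to allow this
corr2data z1 z2, n(200) clear
describe
```

Because neither **corr()** nor **cov()** is specified, the generated variables are orthogonal, and the displayed correlation between **z1** and **z2** should be 0 within rounding error.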

**sds(***vector***)** specifies the standard deviations of the generated variables.
**sds()** may not be specified with **cov()**.

**corr(***matrix*|*vector***)** specifies the correlation matrix. If neither **corr()**
nor **cov()** is specified, the default is orthogonal data.

**cov(***matrix*|*vector***)** specifies the covariance matrix. If neither **corr()**
nor **cov()** is specified, the default is orthogonal data.

**cstorage(full**|**lower**|**upper)** specifies the storage mode for the correlation
or covariance structure in **corr()** or **cov()**. The following storage
modes are supported:

**full** specifies that the correlation or covariance structure is stored
(recorded) as a symmetric k*k matrix.

**lower** specifies that the correlation or covariance structure is
recorded as a lower triangular matrix. With k variables, the matrix
should have k(k+1)/2 elements in the following order:

C(11) C(21) C(22) C(31) C(32) C(33) ... C(k1) C(k2) ... C(kk)

**upper** specifies that the correlation or covariance structure is
recorded as an upper triangular matrix. With k variables, the matrix
should have k(k+1)/2 elements in the following order:

C(11) C(12) C(13) ... C(1k) C(22) C(23) ... C(2k) ...
C(k-1,k-1) C(k-1,k) C(kk)

Specifying **cstorage(full)** is optional if the matrix is square.
**cstorage(lower)** or **cstorage(upper)** is required for the vectorized
storage methods. See storage modes for examples.
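
As an illustration of the vectorized form, the sketch below supplies a hypothetical 3-variable correlation structure as a lower triangle, in the order C(11) C(21) C(22) C(31) C(32) C(33).

```stata
clear

* Lower triangle of the symmetric matrix
*   (1, .5, .3 \ .5, 1, .2 \ .3, .2, 1)
matrix L = (1, .5, 1, .3, .2, 1)

corr2data x y z, n(500) corr(L) cstorage(lower)
correlate x y z
```

With this ordering, the correlations of **x** with **y**, **x** with **z**, and **y** with **z** should come out as .5, .3, and .2.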

**forcepsd** modifies the matrix C to be positive semidefinite (psd) and to
thus be a proper covariance matrix. If C is not positive
semidefinite, it will have negative eigenvalues. By setting the
negative eigenvalues to 0 and reconstructing, we obtain the
least-squares positive-semidefinite approximation to C. This
approximation is a singular covariance matrix.
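
The following sketch shows one case in which **forcepsd** would be needed. The matrix is hypothetical; it is symmetric with unit diagonal but has eigenvalues 1.9, 1.9, and -.8, so it is not positive semidefinite.

```stata
clear
matrix C = (1, .9, -.9 \ .9, 1, .9 \ -.9, .9, 1)

* Inspect the eigenvalues of C; one of them is negative
matrix symeigen X v = C
matrix list v

* forcepsd replaces C with its positive-semidefinite approximation
corr2data v1 v2 v3, n(200) cov(C) forcepsd

* The covariances of the generated data reflect the approximation,
* so they are expected to differ from the entries of C
correlate v1 v2 v3, covariance
```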

**means(***vector***)** specifies the means of the generated variables. The
default is **means(0)**.

+---------+
----+ Options +----------------------------------------------------------

**seed(***#***)** specifies the seed of the random-number generator used to
generate data. *#* defaults to 0. The random numbers generated inside
**corr2data** do not affect the seed of the standard random-number
generator.

__Examples__

Create new dataset with 2000 observations having mean and standard
deviation for **x** of 2 and .5 and for **y** of 3 and 2
**. corr2data x y, n(2000) means(2 3) sds(.5 2)**

Display summary statistics
**. summarize**

Setup
**. clear**
**. matrix C = (1, .5 \ .5, 1)**

Create new dataset with 2000 observations with variables **x** and **y**
correlated as defined by matrix **C**
**. corr2data x y, n(2000) corr(C)**

Display correlation matrix
**. correlate x y**