Re: st: Simulate and corr2data - solutions & comment

From	[email protected] (Roberto G. Gutierrez, StataCorp)
To	[email protected]
Subject	Re: st: Simulate and corr2data - solutions & comment
Date	Wed, 21 Jan 2004 10:23:40 -0600
In a recent exchange on the uses of -corr2data- vs. -drawnorm-, Richard
Williams <[email protected]> asks why -corr2data- now accepts a
-seed()- option, which seems counter-intuitive given -corr2data-'s intended
purpose of producing a "made up" dataset with fixed first and second moments.

I just rewrote the introduction to the help file for -corr2data-, which now
says:

    BEGIN QUOTE

    corr2data creates a new dataset -- or adds new variables to the existing
    data in memory -- that have exactly the correlation (covariance) structure
    specified and, optionally, exactly the means also specified.  The purpose
    of this is to allow you to perform analyses from summary statistics when
    summary statistics are all that is known and summary statistics are
    sufficient to obtain results.  For example, summary statistics are
    sufficient for performing t-tests, anova, principal components, regression,
    and factor analyses.  The recommended process is

        . clear                                      (clear memory)

        . corr2data ..., n(#) cov(..) ...            (create artificial data)

        . factor ...                      (use artificial data appropriately)

    The data created by corr2data is artificial; it is not the original data
    and it is not a sample from an underlying population with the summary
    statistics specified.  See help drawnorm if you want to generate a random
    SAMPLE from a population with a specified correlation (covariance) and
    mean.  In such a sample, the summary statistics will differ from the
    population values and from one sample to the next.

    The data corr2data creates is suitable for one purpose only:  performing
    analyses when all that is known are the summary statistics and those
    summary statistics are sufficient for the analysis at hand.  The
    artificial data in effect tricks the analysis command into producing the
    desired result:  The analysis command, being by assumption only a function
    of the summary statistics, extracts from the artificial data the summary
    statistics -- which are the same summary statistics you specified -- and
    then makes its calculation based on those statistics.

    END QUOTE


Some background
---------------

There are various methods of statistical analysis that rely solely on the
mean vector and the estimated variance-covariance (or correlation) matrix
(VCE) as calculated from a set of observations on some number of variables.
That is, you could have two different datasets of different length, but if the
estimated means and VCE are the same, then the results of the statistical
analyses will be identical for both datasets.

In these cases, it is common in the literature to not give access to the 
whole dataset, but merely state the mean vector and VCE, as they are
sufficient for the analysis at hand.

Commands in Stata (such as -factor-), however, require a dataset from which to
calculate the mean vector and VCE.  Suppose you have access to a mean vector
and VCE from (say) some published work but you do not have access to the
entire dataset.  In order to replicate the analysis in Stata (or to perform
additional analysis not included in that work), you would then need to create
a dataset with the mean and VCE EXACTLY equal to what you have at hand.  This
is what -corr2data- is for.  You use -corr2data- to create the dataset, then
run the analysis command.

The dataset you create with -corr2data-, however, has no statistical
properties other than being an artifact such that when you plug its
observations into the calculator for mean and VCE, you get what you want.  The
observations are not normally distributed, nor is the distribution even
spherically symmetric.  The dataset will most likely even violate one or more
assumptions of the method you are applying to it.  But, you don't care about
that, because you only care about replicating an analysis not on that data
itself, but on a dataset with the same mean and VCE that presumably does meet
the assumptions of the method.  Since you will get the same results from the
analysis, it does not matter.
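
As a minimal sketch of that workflow, assume a made-up mean vector and
covariance matrix for three variables.  The numbers and the names m, V, and
x1-x3 below are illustrative only and do not come from any published work:

```stata
* Hypothetical summary statistics (made-up values for illustration)
clear
matrix m = (10, 20, 30)
matrix V = (4, 1, 2 \ 1, 9, 3 \ 2, 3, 16)

* Create an artificial dataset whose sample means and covariances match m and V
corr2data x1 x2 x3, n(200) means(m) cov(V)

* Check that the summary statistics come back as specified
summarize x1 x2 x3
correlate x1 x2 x3, covariance

* Run an analysis that depends only on those summary statistics
regress x1 x2 x3
```

-summarize- and -correlate, covariance- should report back the means and
covariances that were specified, up to storage precision.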

Difference between -corr2data- and -drawnorm-
---------------------------------------------

-drawnorm-, on the other hand, will sample from a multivariate normal
distribution with POPULATION mean and variance equal to what you specify.
Since this is a bona fide random sample, the sample mean and VCE will not be
exactly equal to what you specify -- you will have some random variability.
The observations, however, will be normally distributed and such samples are
appropriate for simulations where the goal is to assess the effect of the
randomness on statistical results.
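
For contrast, here is a sketch of the -drawnorm- side, again with made-up
values for m and V and an arbitrary seed.  The sample statistics reported at
the end should be close to, but not exactly equal to, the population values
specified:

```stata
* Hypothetical population parameters (made-up values for illustration)
clear
matrix m = (10, 20)
matrix V = (4, 1 \ 1, 9)

* drawnorm uses Stata's random number seed; 12345 is an arbitrary choice
set seed 12345
drawnorm y1 y2, n(200) means(m) cov(V)

* Sample means and covariances vary randomly around m and V
summarize y1 y2
correlate y1 y2, covariance
```

Rerunning with a different -set seed- gives a different sample, and so
different sample means and covariances.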

corr2data's new seed() option
-----------------------------

Before the 06 jan 2004 update, -corr2data- would always produce
the exact same made-up dataset given the same mean and VCE specification,
regardless of what Stata's random number seed was set to.  The reason for
this is explained above -- it doesn't matter what the dataset looks like, only
that it exactly produces a given mean and VCE.

Recently, however, users have wondered whether certain statistical analyses 
were appropriate for use with -corr2data-.  

Consider the following hypothetical situation:  You are reading a paper that
gives only the mean vector and VCE for the data and not the data itself.  
The authors of the paper give a factor analysis, but you want a principal 
components analysis instead.  

You can use -corr2data- to create a dataset and replicate the results of the
factor analysis, but you wonder if a principal components analysis would
depend only on the mean and VCE and not on higher order moments or other
features of the data.  If you could create two different datasets with the
same mean and VCE, and if the results of the principal components analysis do
not differ between the two, then you have pretty much proved to yourself that
principal components works on only the mean and VCE.  That is, -corr2data- is
appropriate for doing principal components from only a mean and VCE
specification.  In fact, principal components does pass this test.  Note that
the above is not a formal mathematical proof, but is sufficient to put most
minds at ease.

The -seed()- option allows you to change the dataset so that you can perform
the above test.  Change the seed, you change the data, but the mean and VCE
are the same.  Note, however, that this seed is unique to -corr2data- and has
nothing to do with Stata's random number seed.  Also, the randomness of the
data induced by changing the seed in -corr2data- has no interesting
distributional properties other than merely giving you different data each
time.
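
The test described above might look like the following sketch.  The
correlation matrix C, the variable names, and the seed values 1 and 2 are all
made up for illustration, and the eigenvalue comparison assumes a Stata
version in which -pca- stores its eigenvalues in e(Ev):

```stata
* Hypothetical correlation matrix (made-up values for illustration)
matrix C = (1, .6, .3 \ .6, 1, .5 \ .3, .5, 1)

* First artificial dataset
clear
corr2data x1 x2 x3, n(100) corr(C) seed(1)
list in 1/5
pca x1 x2 x3
matrix ev1 = e(Ev)

* Second artificial dataset: same specification, different seed
clear
corr2data x1 x2 x3, n(100) corr(C) seed(2)
list in 1/5
pca x1 x2 x3
matrix ev2 = e(Ev)

* Relative difference between the two sets of eigenvalues
display mreldif(ev1, ev2)
```

The listed observations should differ between the two runs, while the two
-pca- outputs should agree and the displayed relative difference should be
essentially zero.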

--Bobby
[email protected]