The Stata listserver

Re: st: Simulate and corr2data - solutions & comment


From   Richard Williams <[email protected]>
To   [email protected]
Subject   Re: st: Simulate and corr2data - solutions & comment
Date   Wed, 21 Jan 2004 07:37:20 -0500

At 11:22 AM 1/21/2004 +0000, Allan Reese wrote:
> Given corr2data's intended purpose, I don't think this [giving the same
> results every time]  is really a bug.  corr2data is meant to generate
> data where only the means, correlations, sds and N are required for the
> analysis.

As it is documented, it is a feature.  But I suspect a misapplied logic
here.  corr2data creates a pair of variables with given mean and
covariances.  But any use of that sample must look at other features of
the sample; if you "only need the means ..." then you do not need the
observations.

Indeed, in SPSS you do not need the observations; you just input the means, correlations, and standard deviations. Stata doesn't let you do that, so you have to create fake data with the specified means, etc. The observations are a necessary evil for getting the means, etc., into Stata, since you can't input them directly.

As for using that sample, you have to know what you can and cannot do. If you want to run a regression using all the cases, with some or all of the variables, fine. But do you want to compute an interaction term? Use the log of income instead of income? Select a subsample of minorities only? Sorry, you can't do it. Well, actually, you can do it, but it will be wrong! SPSS is good in that it won't even let you try such things. Stata will let you do it, but the results won't mean anything.


Apart from the problem of parsing the English, I can't make head or tail
of this advice.  At the core seems to be a reminder that the mathematical
binormal distribution is characterized completely by the first two
moments; beyond that I don't know what "meaningless aspects of the
representation of the data" might be.

Again, a good example is trying to compute an interaction term, or trying to get subgroup means for blacks and whites. If what you want is something you couldn't get just from the means, sds, etc., then whatever Stata gives you will be meaningless. Put another way, whatever you want to do ought to be something you could do from the summary statistics alone, without the original data.
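
As a small illustration of "something you could do just from the summary statistics", here is a sketch in Python (not Stata; the thread itself has no code) of a bivariate regression computed from means, sds, and a correlation only. The function name and the numbers are made up for the example and do not come from any published analysis.

```python
def ols_from_summary(mean_x, mean_y, sd_x, sd_y, r):
    """Slope and intercept of y regressed on x, using summary statistics only.

    For a one-predictor OLS fit, slope = r * sd_y / sd_x and the fitted line
    passes through the point (mean_x, mean_y). No observations are needed.
    """
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical summary statistics, chosen only for illustration.
slope, intercept = ols_from_summary(
    mean_x=12.0, mean_y=30.0, sd_x=3.0, sd_y=10.0, r=0.45
)
print("slope:", slope)
print("intercept:", intercept)
```

Nothing comparable exists for an interaction term or a subgroup mean: those depend on the individual observations, which the summary statistics do not pin down.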



> corr2data is great if, say, you want to replicate a published regression
> analysis where the means, sds and correlations are in the paper.

English!  You ain't *replicating* the analysis, but simulating it.  You
replicate the method to understand the process and perhaps investigate its
robustness and sensitivity - for which purpose you require repeated
different samples.

Perhaps, but my point is that whatever sample you create with corr2data, the results you get will be identical, at least if you are using it correctly. Every sample you create with it will have exactly the same means, sds, and correlations, so any analysis that requires only those quantities will produce the same results. If you are doing something that produces different results, it is basically illegitimate and based on meaningless aspects of the data.
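
To make this concrete, here is a rough Python sketch (not Stata, and not a claim about how corr2data works internally). It builds two different fake data sets whose sample means, sds, and correlation match the same targets exactly, up to floating-point rounding. The helper names, seeds, and target values are all assumptions made for the illustration.

```python
import math
import random


def standardize(values):
    """Rescale to mean 0 and sample standard deviation 1 (n - 1 divisor)."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    sd = math.sqrt(sum(c * c for c in centered) / (n - 1))
    return [c / sd for c in centered]


def exact_pair(n, mean_x, mean_y, sd_x, sd_y, r, seed):
    """Fake (x, y) data whose sample moments hit the targets exactly."""
    rng = random.Random(seed)
    z1 = standardize([rng.gauss(0, 1) for _ in range(n)])
    z2 = standardize([rng.gauss(0, 1) for _ in range(n)])
    # Strip from z2 whatever is correlated with z1, then rescale again,
    # so that z1 and z2 are exactly uncorrelated in this sample.
    b = sum(p * q for p, q in zip(z1, z2)) / (n - 1)
    z2 = standardize([q - b * p for p, q in zip(z1, z2)])
    x = [mean_x + sd_x * p for p in z1]
    y = [mean_y + sd_y * (r * p + math.sqrt(1 - r * r) * q)
         for p, q in zip(z1, z2)]
    return x, y


def fit_slope(x, y):
    """OLS slope of y on x, computed from the observations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def mean_y_for_high_x(x, y):
    """Mean of y in the subsample where x is above its own mean."""
    mx = sum(x) / len(x)
    picked = [b for a, b in zip(x, y) if a > mx]
    return sum(picked) / len(picked)


# Two different fake data sets built from the same (hypothetical) targets.
x_a, y_a = exact_pair(200, 50.0, 30.0, 10.0, 8.0, 0.5, seed=1)
x_b, y_b = exact_pair(200, 50.0, 30.0, 10.0, 8.0, 0.5, seed=2)

slope_a, slope_b = fit_slope(x_a, y_a), fit_slope(x_b, y_b)
sub_a, sub_b = mean_y_for_high_x(x_a, y_a), mean_y_for_high_x(x_b, y_b)

print("regression slopes:", slope_a, slope_b)
print("subgroup means of y:", sub_a, sub_b)
```

Both slopes equal r * sd_y / sd_x apart from rounding, because the regression uses only the moments. The subgroup means come out different, because they depend on where the individual made-up observations happen to fall.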

HOWEVER, that is not to say that what you want to do is illegit. corr2data is like creating a population with known parameters, and you want to sample from that population and do bootstrapping or whatever. I think that is fine. But you wouldn't achieve that goal by creating 1000 different data sets with corr2data, because there would be no sampling variability in any of them; every one of them would have exactly the same means, etc. Drawing a subsample from a data set created by corr2data, or using drawnorm, is what you want (I think!)
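
A quick way to see the difference is the Python sketch below (again not Stata; the function names, seeds, and sizes are assumptions for illustration). It compares three ways of getting 20 "samples": regenerating exact-moment data each time, subsampling one exact-moment data set, and drawing fresh random normal values, which is roughly the role drawnorm plays.

```python
import random
import statistics


def exact_sample(n, mean, sd, seed):
    """Fake data whose sample mean and sd equal the targets exactly."""
    rng = random.Random(seed)
    raw = [rng.gauss(0, 1) for _ in range(n)]
    m, s = statistics.mean(raw), statistics.stdev(raw)
    return [mean + sd * (v - m) / s for v in raw]


def spread(values):
    """Range of a list of numbers: max minus min."""
    return max(values) - min(values)


# 1) Regenerate exact-moment data 20 times: the mean never moves.
exact_means = [
    statistics.mean(exact_sample(1000, 50.0, 10.0, seed)) for seed in range(20)
]

# 2) Treat one exact-moment data set as the population and subsample it.
population = exact_sample(1000, 50.0, 10.0, seed=0)
rng = random.Random(12345)
sub_means = [statistics.mean(rng.sample(population, 100)) for _ in range(20)]

# 3) Draw fresh random normal samples each time.
draw_means = [
    statistics.mean([rng.gauss(50.0, 10.0) for _ in range(100)])
    for _ in range(20)
]

print("spread of means, exact-moment data sets:", spread(exact_means))
print("spread of means, subsamples:", spread(sub_means))
print("spread of means, random draws:", spread(draw_means))
```

The first spread is zero apart from rounding error, while the other two show ordinary sampling variability, which is what bootstrapping or a simulation study needs.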


I remain unconvinced that the default action of producing the *same*
arbitrary data sample whenever corr2data is called is sensible or helpful.
The usual practice with any random process is to generate (pseudo-)
independent samples, unless the user puts in a fixed seed.

Perhaps Stata can explain its rationale, but my thinking would be that one sample is as good as another, so there is no need to generate different ones. This issue would never even come up if you were using SPSS: the means, etc., are what they are, and for many types of analyses it doesn't matter what data set generated them. When it does matter, corr2data won't help you out.
