Statalist The Stata Listserver


[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

Re: st: Can't generate perfect theoretic standard normal.


From   "D H" <dave1063@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Can't generate perfect theoretic standard normal.
Date   Thu, 10 May 2007 12:46:14 -0700

I think I can summarize the situation.

I have to agree that these variations from theoretic perfection
probably don't make much of a difference in most contexts.
Nonetheless, here are some options.

corr2data gives us perfect variance, but imperfect skew, unlike
invnorm.  The kurtosis is also imperfect.  The imperfect skew is
visible in a histogram, FWIW.

invnorm gives perfect skew, imperfect variance and imperfect kurtosis.
The variance is easy to fix: just divide the variable by its own
standard deviation.  Kurtosis can be addressed by increasing the
sample size sufficiently, though hitting 3.00000 might require
500,000,000 observations, as Jeph Herrin has shown (thanks Jeph!).

The late Gunnar Blom's recommendation of (_n - 3/8) / (_N + 3/4),
yields perfect variance, but imperfect kurtosis and skew.  (_n/_N+1)
gives (almost) perfect skew, imperfect kurtosis and imperfect variance
that can be easily corrected.

Here's another approach for smaller sample sizes.  The inflection
point of a Gaussian variable is 1 sigma.  So increasing the kurtosis
involves making the central data points smaller and the points beyond
+/-1 sigma larger.  Consider the following code:

version 8.2
set memory 10m
set obs 20000
generate double perfuniform = _n/(_N+1)
generate double perfnorm = invnorm(perfuniform)

gen double perfnorm5=perfnorm
scalar ep = .00001
scalar kurt = 0
while kurt<2.999999 {
 quietly summarize perfnorm5
 scalar sdnorm5=r(sd)
 * Increase kurtosis:
 quietly replace perfnorm5 = perfnorm5*(1-ep) ///
   if perfnorm5<sdnorm5 & perfnorm5>-sdnorm5
 quietly replace perfnorm5 = perfnorm5*(1+ep) ///
   if perfnorm5>sdnorm5 & perfnorm5<-sdnorm5
 quietly sum perfnorm5, detail
 scalar kurt=r(kurtosis)
 di "kurtosis: " kurt "   ep: " ep "
}
di "kurtosis: " kurt "   ep: " ep "
sum perfnorm5, detail

scalar ep = .000001
while kurt>3.000001 {
 quietly summarize perfnorm5
 scalar sdnorm5=r(sd)
 * Decrease kurtosis:
 quietly replace perfnorm5 = perfnorm5*(1+ep) ///
   if perfnorm5<sdnorm5 & perfnorm5>-sdnorm5
 quietly replace perfnorm5 = perfnorm5*(1-ep) ///
   if perfnorm5>sdnorm5 & perfnorm5<-sdnorm5
 quietly sum perfnorm5, detail
 scalar kurt=r(kurtosis)
}
di "kurtosis: " kurt "   ep: " ep "
quietly sum perfnorm5
replace perfnorm5 = perfnorm5/r(sd)
sum perfnorm5, detail

Here's what we get:

. sum perfnorm5, detail

                         perfnorm5
-------------------------------------------------------------
     Percentiles      Smallest
1%    -2.329282      -3.897047
5%    -1.647359      -3.725188
10%     -1.28356        -3.6213       Obs               20000
25%    -.6713475       -3.54596       Sum of Wgt.       20000

50%    -6.93e-17                      Mean          -1.38e-17
                       Largest       Std. Dev.             1
75%     .6713475        3.54596
90%      1.28356         3.6213       Variance              1
95%     1.647359       3.725188       Skewness      -8.25e-16
99%     2.329282       3.897047       Kurtosis              3


Perfection!  Or so I hope.  The histogram looks ok to my eyes.

I should disclose that I have not read about kurtosis adjustment
anywhere, and my technique (which occurred to me this morning) may
have problems that I have not considered.

Does this matter?  Probably not in most cases.  But consider a smaller
sample size:

set obs 200
generate double perfuniform = _n/(_N+1)
generate double perfnorm = invnorm(perfuniform)
sum perfnorm, det

                         perfnorm
-------------------------------------------------------------
     Percentiles      Smallest
1%    -2.250142      -2.577553
5%    -1.623964      -2.328219
10%    -1.270418      -2.172065       Obs                 200
25%    -.6706013      -2.055808       Sum of Wgt.         200

50%     6.94e-17                      Mean           1.43e-17
                       Largest       Std. Dev.      .9797342
75%     .6706013       2.055808
90%     1.270418       2.172065       Variance       .9598791
95%     1.623964       2.328219       Skewness       4.38e-16
99%     2.250142       2.577553       Kurtosis       2.746629

Conceivably, it might make sense to improve upon a kurtosis of 2.747
and a variance of .96.

Great thanks to Jeph Herrin, Nick Cox and Richard Williams for their comments.

On 5/10/07, Nick Cox <n.j.cox@durham.ac.uk> wrote:
I can't take the minute differences in this thread very seriously
but I think I recall, for what it's worth, that the late Gunnar Blom
recommended (in this notation) (_n - 3/8) / (_N + 3/4)
as a plotting position for (near) Gaussian data.

Nick
n.j.cox@durham.ac.uk


*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index