Statalist The Stata Listserver



RE: st: Can't generate perfect theoretic standard normal.


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   RE: st: Can't generate perfect theoretic standard normal.
Date   Thu, 10 May 2007 22:15:48 +0100

D H [can't trace his, her or its name] is concentrating on an artificial 
set-up: uniform quantiles defined to be regularly spaced. However, going 
to very large sample sizes, and, in the case of skewness and kurtosis 
calculations, looking at sums of third and fourth powers of various quantities, 
means that the comparison is as much a matter of numerical analysis as 
of any statistical issue. 

FWIW, the skewness and kurtosis estimators used by Stata are biased. You 
might expect the biases to be swamped by the large sample sizes, but 
no one seems to have mentioned the principle. 
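The bias is easy to see by simulation. Here is a sketch in Python rather than Stata (my addition, not part of the thread; b2 = m4/m2^2 is the moment ratio that -summarize, detail- reports, and under normality its expectation is 3(n-1)/(n+1), well below 3 for small n):

```python
# Simulation sketch (my addition): the moment-ratio kurtosis b2 = m4/m2^2
# is biased downward in normal samples; E[b2] = 3(n-1)/(n+1) under normality.
import random

def b2(x):
    n = len(x)
    m = sum(x) / n
    m2 = sum((v - m) ** 2 for v in x) / n
    m4 = sum((v - m) ** 4 for v in x) / n
    return m4 / m2 ** 2

random.seed(2007)
n, reps = 10, 20000
mean_b2 = sum(b2([random.gauss(0, 1) for _ in range(n)])
              for _ in range(reps)) / reps
print(round(mean_b2, 3))   # close to 3*(n-1)/(n+1) = 2.4545..., far below 3
```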

The late Irving Kaplansky, better known as a leader in the theories of
groups, rings and fields than for his contributions to statistics, 
which flowed out of WWII commitments, published a quartet of 
examples in 1945 with rather similar kurtosis but salutarily different 
tail weight! Here is a program. Try not only 

. kaplansky 

but also 

. kaplansky, ysc(log) legend(pos(7)) yla(.3 .1 .03 .01 .003 .001 .0003 .0001, ang(h))

*! NJC 1.0.0 22 Nov 2004 
program kaplansky 
	version 8 
	syntax [, * ] 
	
	twoway function ///
	y1 = (1/ (3 * sqrt(_pi))) * (9/4 + x^4) * exp(-x^2) ///
	, ra(0 4) ///
	|| function ///
	y2 = (3 / (2 * sqrt(2 * _pi))) * ///
	exp(-0.5 * x^2) - (1 / (6 * sqrt(_pi))) * (9/4 + x^4) * exp(-x^2) ///
	, ra(0 4) /// 
	|| function ///
	y3 = (1 / (6 * sqrt(_pi))) * (exp(-0.25 * x^2) + 4 * exp(-x^2))  ///
	, ra(0 4) ///
	|| function ///
	y4 = ((3 * sqrt(3)) / (16 * sqrt(_pi))) * (2 + x^2) * exp(-0.75 * x^2) ///
	, ra(0 4) ///
	|| function /// 
	y5 = normden(x) ///
	, ra(0 4) ///
	legend(order(- "kurtosis" 1 "2.75" 2 "3.125" 3 "4.5" 4 "2.667" 5 "3") ///
	pos(1) ring(0)) ytitle(probability density) ///  
	subtitle("Irving Kaplansky. 1945. A common error concerning kurtosis." "Journal of the American Statistical Association 40: 259") ///
	`options' 
end	
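As a numerical cross-check on the kurtosis figures in the legend (a sketch in Python, not part of the program; the first and fourth densities are the easiest to verify by quadrature):

```python
# Numerical check (my addition): recover the kurtosis of two of Kaplansky's
# densities by quadrature. Both have mean 0 and variance 1 by construction,
# so the ratio m4/m2^2 is the kurtosis directly.
from math import exp, pi, sqrt

def kurtosis(f, lo=-10.0, hi=10.0, n=40000):
    # composite trapezoid; the densities are negligible outside [-10, 10]
    h = (hi - lo) / n
    def integral(g):
        s = 0.5 * (g(lo) + g(hi))
        s += sum(g(lo + i * h) for i in range(1, n))
        return s * h
    total = integral(f)                                 # ~1 (normalization)
    m2 = integral(lambda x: x * x * f(x)) / total       # variance
    m4 = integral(lambda x: x ** 4 * f(x)) / total      # fourth moment
    return m4 / m2 ** 2

f1 = lambda x: (1 / (3 * sqrt(pi))) * (9 / 4 + x ** 4) * exp(-x * x)
f4 = lambda x: (3 * sqrt(3) / (16 * sqrt(pi))) * (2 + x * x) * exp(-0.75 * x * x)

print(round(kurtosis(f1), 4), round(kurtosis(f4), 4))   # 2.75 and 8/3 = 2.6667
```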

Given this and many other considerations, I would not place much trust 
or interest in whether kurtosis is close to 3. 

Nick 
n.j.cox@durham.ac.uk 

D H
 
> I think I can summarize the situation.
> 
> I have to agree that these variations from theoretic perfection
> probably don't make much of a difference in most contexts.
> Nonetheless, here are some options.
> 
> corr2data gives us perfect variance, but imperfect skew, unlike
> invnorm.  The kurtosis is also imperfect.  The imperfect skew is
> visible in a histogram, FWIW.
> 
> invnorm gives perfect skew, imperfect variance and imperfect kurtosis.
>  The variance is easy to fix: just divide the variable by its own
> standard deviation.  Kurtosis can be addressed by increasing the
> sample size sufficiently, though hitting 3.00000 might require
> 500,000,000 observations, as Jeph Herrin has shown (thanks Jeph!).
> 
> The late Gunnar Blom's recommendation of (_n - 3/8) / (_N + 1/4)
> yields perfect variance, but imperfect kurtosis and skew.  _n/(_N+1)
> gives (almost) perfect skew, imperfect kurtosis and imperfect variance
> that can be easily corrected.
> 
> Here's another approach for smaller sample sizes.  The inflection
> point of a Gaussian variable is 1 sigma.  So increasing the kurtosis
> involves making the central data points smaller and the points beyond
> +/-1 sigma larger.  Consider the following code:
> 
> version 8.2
> set memory 10m
> set obs 20000
> generate double perfuniform = _n/(_N+1)
> generate double perfnorm = invnorm(perfuniform)
> 
> gen double perfnorm5=perfnorm
> scalar ep = .00001
> scalar kurt = 0
> while kurt<2.999999 {
>   quietly summarize perfnorm5
>   scalar sdnorm5=r(sd)
>   * Increase kurtosis:
>   quietly replace perfnorm5 = perfnorm5*(1-ep) ///
>     if perfnorm5<sdnorm5 & perfnorm5>-sdnorm5
>   quietly replace perfnorm5 = perfnorm5*(1+ep) ///
>     if perfnorm5>sdnorm5 | perfnorm5<-sdnorm5
>   quietly sum perfnorm5, detail
>   scalar kurt=r(kurtosis)
>   di "kurtosis: " kurt "   ep: " ep
> }
> di "kurtosis: " kurt "   ep: " ep
> sum perfnorm5, detail
> 
> scalar ep = .000001
> while kurt>3.000001 {
>   quietly summarize perfnorm5
>   scalar sdnorm5=r(sd)
>   * Decrease kurtosis:
>   quietly replace perfnorm5 = perfnorm5*(1+ep) ///
>     if perfnorm5<sdnorm5 & perfnorm5>-sdnorm5
>   quietly replace perfnorm5 = perfnorm5*(1-ep) ///
>     if perfnorm5>sdnorm5 | perfnorm5<-sdnorm5
>   quietly sum perfnorm5, detail
>   scalar kurt=r(kurtosis)
> }
> di "kurtosis: " kurt "   ep: " ep
> quietly sum perfnorm5
> replace perfnorm5 = perfnorm5/r(sd)
> sum perfnorm5, detail
> 
> Here's what we get:
> 
> . sum perfnorm5, detail
> 
>                           perfnorm5
> -------------------------------------------------------------
>       Percentiles      Smallest
>  1%    -2.329282      -3.897047
>  5%    -1.647359      -3.725188
> 10%     -1.28356        -3.6213       Obs               20000
> 25%    -.6713475       -3.54596       Sum of Wgt.       20000
> 
> 50%    -6.93e-17                      Mean          -1.38e-17
>                         Largest       Std. Dev.             1
> 75%     .6713475        3.54596
> 90%      1.28356         3.6213       Variance              1
> 95%     1.647359       3.725188       Skewness      -8.25e-16
> 99%     2.329282       3.897047       Kurtosis              3
> 
> 
> Perfection!  Or so I hope.  The histogram looks ok to my eyes.
> 
> I should disclose that I have not read about kurtosis adjustment
> anywhere, and my technique (which occurred to me this morning) may
> have problems that I have not considered.
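The iteration above translates readily outside Stata. A Python sketch of the same idea (my translation, not the poster's code; n and the step ep are changed for speed, and the tail condition is written as a single "outside +/-1 sd" test):

```python
# Python sketch (my translation of the poster's idea): shrink scores inside
# +/-1 sd and stretch those outside until the moment-ratio kurtosis m4/m2^2
# reaches 3, then rescale to unit variance.
from statistics import NormalDist

def var_kurt(x):
    n = len(x)
    m = sum(x) / n
    m2 = sum((v - m) ** 2 for v in x) / n
    m4 = sum((v - m) ** 4 for v in x) / n
    return m2, m4 / m2 ** 2

n, ep = 2000, 1e-4                       # smaller n, bigger step than the post
inv = NormalDist().inv_cdf
x = [inv(i / (n + 1)) for i in range(1, n + 1)]   # "perfect" normal scores

var, kurt = var_kurt(x)
while kurt < 3 - 1e-6:                   # raw scores have kurtosis below 3
    sd = var ** 0.5
    x = [v * (1 - ep) if abs(v) < sd else v * (1 + ep) for v in x]
    var, kurt = var_kurt(x)

sd = var ** 0.5
x = [v / sd for v in x]                  # unit variance; kurtosis unchanged
print(round(kurt, 5))
```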
> 
> Does this matter?  Probably not in most cases.  But consider a smaller
> sample size:
> 
> set obs 200
> generate double perfuniform = _n/(_N+1)
> generate double perfnorm = invnorm(perfuniform)
> sum perfnorm, det
> 
>                           perfnorm
> -------------------------------------------------------------
>       Percentiles      Smallest
>  1%    -2.250142      -2.577553
>  5%    -1.623964      -2.328219
> 10%    -1.270418      -2.172065       Obs                 200
> 25%    -.6706013      -2.055808       Sum of Wgt.         200
> 
> 50%     6.94e-17                      Mean           1.43e-17
>                         Largest       Std. Dev.      .9797342
> 75%     .6706013       2.055808
> 90%     1.270418       2.172065       Variance       .9598791
> 95%     1.623964       2.328219       Skewness       4.38e-16
> 99%     2.250142       2.577553       Kurtosis       2.746629
> 
> Conceivably, it might make sense to improve upon a kurtosis of 2.747
> and a variance of .96.
> 
> Great thanks to Jeph Herrin, Nick Cox and Richard Williams 
> for their comments.
> 
> On 5/10/07, Nick Cox <n.j.cox@durham.ac.uk> wrote:
> > I can't take the minute differences in this thread very seriously
> > but I think I recall, for what it's worth, that the late Gunnar Blom
> > recommended (in this notation) (_n - 3/8) / (_N + 1/4)
> > as a plotting position for (near) Gaussian data.

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


