

From: Dirk Enzmann <dirk.enzmann@uni-hamburg.de>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: sign test output
Date: Fri, 18 Jan 2013 22:48:56 +0100

http://hj.se/download/18.3bf8114412e804c78638000150/1299244445855/WP2010-8.pdf

Dirk

Date: Thu, 17 Jan 2013 13:14:39 +0100
From: Maarten Buis <maartenlbuis@gmail.com>
Subject: Re: st: sign test output

On Thu, Jan 17, 2013 at 11:21 AM, Nahla Betelmal wrote:
> from my readings in statistics, I know that in order to decide
> whether to use parametric or non-parametric tests, the data normality
> distribution should be checked first.
>
> Shapiro-Wilk is used to test normality, when the number of
> observations is less than 30. Otherwise, we should use
> Kolmogorov-Smirnov for large sample (as in my sample).

Unfortunately that is incorrect. Normality tests need huge samples before the p-value means what it is supposed to mean. An analogy I have heard in a different context, but which applies to this situation very well, is: to go out to sea in a row boat to check whether the sea is safe for the QE II. Using a normality test with only 346 observations is not a good idea.

Nick and I discussed the issue of the performance of tests for Gaussianity recently on Statalist:

http://www.stata.com/statalist/archive/2012-09/msg01040.html
http://www.stata.com/statalist/archive/2012-09/msg01013.html

The bottom line was: you need at least somewhere between 10,000 and 100,000 observations before the tests we discussed (Jarque-Bera and Doornik-Hansen) perform somewhat acceptably, but in such large datasets you need to worry whether deviations from Gaussianity that are statistically significant are also substantively significant.

I have adapted the simulation from the discussion above for the Kolmogorov-Smirnov test. It shows that the Kolmogorov-Smirnov test does not perform acceptably for any of these sample sizes.
*------------------- begin simulation -------------------
clear all
program define sim, rclass
    drop _all
    set obs `=1e5'
    gen double x = rnormal()
    forvalues i = 2/5 {
        // mean and sd of the first 10^i observations
        sum x in 1/`=1e`i''
        // test those same observations against a normal
        // distribution with the estimated mean and sd
        ksmirnov x = normal((x-r(mean))/r(sd)) in 1/`=1e`i''
        return scalar p`i' = r(p)
        return scalar p_cor`i' = r(p_cor)
    }
end

simulate p2p=r(p2) p2c=r(p_cor2) ///
         p3p=r(p3) p3c=r(p_cor3) ///
         p4p=r(p4) p4c=r(p_cor4) ///
         p5p=r(p5) p5c=r(p_cor5) ///
         , reps(2e4): sim

gen id = _n
reshape long p2 p3 p4 p5, i(id) j(dist) string

label var p2 "N=100"
label var p3 "N=1,000"
label var p4 "N=10,000"
label var p5 "N=100,000"

gen byte distr = cond(dist=="p",1,2)
label define distr 1 "p-value" ///
                   2 "corrected p-value", replace
label value distr distr

simpplot p?, by(distr) scheme(s2color) legend(cols(4))
*-------------------- end simulation --------------------
(For more on examples I sent to the Statalist see: http://www.maartenbuis.nl/example_faq )

This simulation needs the -simpplot- package in order to run. It can be installed by typing -ssc install simpplot- in Stata.
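A numerical check can sit alongside the plot. If the p-values behaved as they should when the data really are Gaussian, about 5% of them would fall below 0.05. The sketch below is only an illustration under assumptions that are not in the post above: the program name -ksim-, the seed, the single sample size of 1,000 and the 1,000 replications are all made up for the example.

```stata
* A minimal sketch, not part of the original simulation.
* ksim, the seed, N = 1,000 and reps(1000) are illustrative assumptions.
clear all
set seed 12345

program define ksim, rclass
    drop _all
    set obs 1000
    gen double x = rnormal()
    // estimate mean and sd from the sample, as in the simulation above
    quietly summarize x
    local m = r(mean)
    local s = r(sd)
    // one-sample Kolmogorov-Smirnov test against the fitted normal
    quietly ksmirnov x = normal((x - `m')/`s')
    return scalar p = r(p)
end

simulate p = r(p), reps(1000) nodots: ksim

// share of replications rejecting at the 5% level;
// a well-behaved test would give a mean close to 0.05
gen byte reject = p < 0.05
summarize reject
```

The mean of -reject- is the simulated rejection rate; how far it sits from 0.05 gives a rough single-number summary of what -simpplot- shows across the whole range of p-values.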

--
========================================
Dr. Dirk Enzmann
Institute of Criminal Sciences
Dept. of Criminology
Rothenbaumchaussee 33
D-20148 Hamburg
Germany

phone: +49-(0)40-42838.7498 (office)
       +49-(0)40-42838.4591 (Mrs Billon)
fax:   +49-(0)40-42838.2344
email: dirk.enzmann@uni-hamburg.de
http://www2.jura.uni-hamburg.de/instkrim/kriminologie/Mitarbeiter/Enzmann/Enzmann.html
========================================

