Statalist The Stata Listserver



st: Significance stars


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: Significance stars
Date   Fri, 16 Mar 2007 15:48:05 -0000

A bizarre but checkable anecdote is that the practice
of significance starring was introduced purely by 
accident. Seek out copies of 

Yates, F. 1937. 
The Design and Analysis of Factorial Experiments. 
Technical Communication No 35, 
Imperial Bureau of Soil Science, Harpenden. 

and see how * and ** were used to mark 
footnotes that explained significance at different levels. 

It seems particularly ironic that it was Yates
whose little pebble started the landslide here. 
Yates himself was no friend of significance testing
and even criticised Fisher (his mentor and collaborator)
in his obituary notice for over-emphasis on tests. 

It may satisfy purely tribal imitation, namely 
doing just what other people do, but starring seems 
objectionable on several grounds: 

1. If the P-value is worth printing, it is evidence 
in itself and need not be degraded by categorisation. 
If the implication is that a table of many P-values 
is too detailed or too charmless to be readily assimilated without
decoration, then it should be replaced by a graphical
display (which could include numerical labels). 

2. Starring might be defended on the grounds that it indicates 
which hypotheses we would reject at a variety of different
levels. But that would be playing several different games
at once. Good conservative practice, if you believe that 
significance testing is a good idea, is to use one threshold 
level that you regard as appropriate, not two or more 
simultaneously. And once you entertain several hypotheses
simultaneously, as is usually implicit in contemplation 
of a table with several P-values, multiplicity complicates
the issue mightily (as indeed is often, but not always, recognised). 

3. All calculations in (for example) a regression are
conditional on assumptions being satisfied, assumptions
that we usually should regard as suspect at the best of
times. Loosely, we would normally regard coefficient 
estimates as being more reliable than standard errors, which
in turn are more reliable than P-values. Why many analysts should
habitually choose to subject the least reliable part of
the modelling results to the most intense scrutiny is a 
deep puzzle. 

4. Significance is, or should be, always a lesser deal 
than strength of relationship or magnitude of effect. 
(If not, your sample size is too small.) Only the other
day someone asked me privately to add starring to one
of my own programs and gave as exemplar some output 
in which a correlation of 0.0753 was starred. Your 
view may well differ, but I have never yet
found a correlation of that magnitude worth any
consideration. Being assured that it really is not
zero is not very interesting or helpful to me. 
Thus starring seems to me to encourage the wrong
kind of scrutiny. 
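
To put rough numbers on the correlation in point 4, here is a small
Python sketch. It is not the calculation behind the output I was sent: 
it assumes a two-sided test of zero correlation and uses Fisher's z 
transformation with a normal approximation, and the function names 
approx_p_value and smallest_significant_n are made up for this 
illustration. It shows the share of variance that goes with a 
correlation of 0.0753 and roughly how large a sample must be before 
such a correlation earns a star at the 0.05 level. 

```python
import math


def approx_p_value(r, n):
    """Approximate two-sided P-value for H0: correlation = 0.

    Assumption: Fisher's z transformation with a normal approximation,
    i.e. atanh(r) * sqrt(n - 3) is treated as standard normal under H0.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    # erfc(|z| / sqrt(2)) is the two-sided tail area of a standard normal
    return math.erfc(abs(z) / math.sqrt(2))


def smallest_significant_n(r, alpha=0.05, n_max=1_000_000):
    """Smallest sample size (up to n_max) at which r gives P < alpha."""
    for n in range(4, n_max + 1):
        if approx_p_value(r, n) < alpha:
            return n
    return None  # not significant for any n up to n_max


r = 0.0753  # the starred correlation mentioned in point 4

# Strength of relationship: proportion of variance in common
print(f"r = {r}, r^2 = {r * r:.4f}")

# The same r at different (hypothetical) sample sizes
for n in (100, 500, 1000, 5000):
    print(f"n = {n:5d}  approximate P = {approx_p_value(r, n):.4f}")

print("smallest n with P < 0.05:", smallest_significant_n(r))
```

Whatever the sample size, the correlation and its square stay the
same; only the P-value moves. That is the sense in which a star says
more about n than about the strength of the relationship. 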

If my history is correct, we have had 70 years 
of starring, and it is 37 years since Peter Sprent
epitomised starring as "more appropriate to a hotel
guide-book than a serious scientific paper" (JRSS A
1970 p.143). What will we see in the next 37 or 70 
years? 

Nick 
n.j.cox@durham.ac.uk 
