A bizarre but checkable anecdote is that the practice
of significance starring was introduced purely by
accident. Seek out copies of
Yates, F. 1937.
The Design and Analysis of Factorial Experiments.
Technical Communication No 35,
Imperial Bureau of Soil Science, Harpenden.
and see how * and ** were used to mark
footnotes that explained significance at different levels.
It seems particularly ironic that it was Yates
whose little pebble started the landslide here.
Yates himself was no friend of significance testing
and even criticised Fisher (his mentor and collaborator)
in his obituary notice for over-emphasis on tests.
Starring may satisfy purely tribal imitation, namely
doing just what other people do, but it seems
objectionable on several grounds:
1. If the P-value is worth printing, it is evidence
in itself and need not be degraded by categorisation.
If the implication is that a table of many P-values
is too detailed or too charmless to be readily assimilated without
decoration, then it should be replaced by a graphical
display (which could include numerical labels).
2. Starring might be defended on the grounds that it indicates
which hypotheses we would reject at a variety of different
levels. But that would be playing several different games
at once. Good conservative practice, if you believe that
significance testing is a good idea, is to use one threshold
level that you regard as appropriate, not two or more
simultaneously. And once you entertain several hypotheses
simultaneously, as is usually implicit in contemplation
of a table with several P-values, multiplicity complicates
the issue mightily (as indeed is often, but not always, recognised).
3. All calculations in (for example) a regression are
conditional on assumptions being satisfied, assumptions
that we usually should regard as suspect at the best of
times. Loosely, we would normally regard coefficient
estimates as being more reliable than standard errors, which
in turn are more reliable than P-values. Why many analysts should
habitually choose to subject the least reliable part of
the modelling results to the most intense scrutiny is a
deep puzzle.
4. Significance is, or should be, always a lesser deal
than strength of relationship or magnitude of effect.
(If not, your sample size is too small.) Only the other
day someone asked me privately to add starring to one
of my own programs and gave as exemplar some output
in which a correlation of 0.0753 was starred. Your
view may well differ, but I have never yet
found a correlation of that magnitude worth any
consideration. Being assured that it really is not
zero is not very interesting or helpful to me.
Thus starring seems to me to encourage the wrong
kind of scrutiny.
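Points 2 and 4 can be made concrete with a little arithmetic.
What follows is only a rough sketch in Python, not output from
any program mentioned here. The function names are made up for
illustration. The P-value uses the Fisher z approximation (an
assumption; an exact t test would differ slightly), and the
multiplicity calculation assumes independent tests, which a
table of P-values from one model will rarely satisfy.

```python
import math
from statistics import NormalDist


def correlation_p_value(r, n):
    """Approximate two-sided P-value for H0: rho = 0.

    Uses the Fisher z transformation: atanh(r) is treated as
    roughly normal with standard error 1 / sqrt(n - 3).
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def familywise_error_rate(alpha, m):
    """Chance of at least one false rejection among m tests.

    Assumes the m tests are independent and every null is true.
    """
    return 1 - (1 - alpha) ** m


r = 0.0753  # the starred correlation mentioned in point 4

# Magnitude: the share of variance accounted for is tiny.
print(f"r = {r}, r^2 = {r * r:.4f}")

# Significance: the same r earns a star once n is large enough.
for n in (100, 500, 1000, 5000):
    print(f"n = {n:5d}  P = {correlation_p_value(r, n):.4f}")

# Multiplicity: several P-values scanned at the 5% level.
for m in (1, 5, 10, 20):
    print(f"m = {m:2d}  FWER = {familywise_error_rate(0.05, m):.3f}")
```

On these assumptions the same correlation moves from above 0.05
to below 0.05 purely because n grows, while r^2 stays below 1%
throughout. With ten independent tests at the 5% level, the
chance of at least one spurious star is about 40%.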
If my history is correct, we have had 70 years
of starring, and it is 37 years since Peter Sprent
epitomised starring as "more appropriate to a hotel
guide-book than a serious scientific paper" (JRSS A
1970 p.143). What will we see in the next 37 or 70
years?
Nick
n.j.cox@durham.ac.uk