Thanks to Kit Baum, a new package -hsmode- is now
downloadable from SSC. Stata 9 is required, given the
use of Mata inside. Use -ssc- as normal to install.
What follows is mostly, but not entirely, drawn from the
online help. There are yet more details and comments in the help.
-hsmode- calculates half-sample modes based on recursive
selection of the half-sample with the shortest length.
Although the idea has longer roots, the implementation
is based particularly on the ideas of Bickel and Frühwirth (2006).
The mode is often disparaged or neglected by comparison with
its siblings the mean and median, but it can be of distinct interest
or even use, especially whenever distributions are unimodal but
asymmetric. (Modes also have a long history, as readers of Thucydides
will recall.)
If a variable is categorical or counted, the mode can usually
be read off a frequency table, subject to the occurrence of ties.
The same approach can be applied to any variable, subject to
the resolution of measurement. Thus -mpg- in the auto data is
presented with a resolution of 1 mpg, and the mode of 18, calculated
just by counting, is supported by graphical analysis as a
fair estimate. (It is, in fact, confirmed by -hsmode-.)
An automated way of getting modes from counts alone is given
in -modes- from SJ 3-2, which requires Stata 8. An earlier
version in STB-50 requires Stata 6.
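To make the counting approach concrete, here is a minimal Python sketch. It is not the code of -modes-; the function name modes_by_count and the sample values are made up for illustration. It returns every value tied for the highest frequency, since ties are possible.

```python
from collections import Counter

def modes_by_count(values):
    """Return (sorted list of tied modes, their frequency).

    Hypothetical helper for illustration only; not the -modes- program.
    """
    counts = Counter(values)
    if not counts:
        raise ValueError("need at least one value")
    top = max(counts.values())
    # Keep every value that reaches the highest frequency, so ties are visible.
    return sorted(v for v, c in counts.items() if c == top), top

# Made-up values, not the auto data.
print(modes_by_count([18, 18, 19, 20, 18, 19]))  # ([18], 3)
print(modes_by_count([1, 1, 2, 2, 3]))           # ([1, 2], 2)
```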
The issue is how to get at an estimate of the mode whenever
a variable is measured with a resolution such that counting is
not a reliable method, especially if all or almost all measurements
are distinct. Many people will have been brought up to look
at a histogram and read off an approximate value, and may have
the impression that not much more can or should be done. Looking at
a graph is naturally always a good idea to put any estimate
of mode in context. A more modern approach is to get a kernel
estimate of the density and read off the mode as the location of
its peak. Either of these approaches suffers from some arbitrariness,
for example over bin origin and width or kernel type and width.
This shouldn't usually matter, but sometimes a direct method
would be useful.
Less obvious than looking for a peak in density, but still worth
a try, is to look for a shoulder on a quantile plot. See -quantile-
or (preferably) -qplot- from SJ 6-4 (Stata 8 required).
Kernel estimation is an excellent method, especially when bimodality
or multimodality is a possibility. The suggestion, however, is
that -kdensity- (in Stata terms) is best kept as an independent method
of assessing modality.
The idea of estimating the mode as the midpoint of the shortest
interval that contains a fixed number of observations goes back at
least to Dalenius (1965). See also Robertson and Cryer (1974),
Bickel (2002) and Bickel and Frühwirth (2006) on other estimators of the mode.
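As a rough illustration of the shortest-interval idea, here is a minimal Python sketch. The function name shortest_interval_midpoint, the parameter m (the fixed number of observations the interval must contain) and the rule of taking the first interval when several tie for shortest are all assumptions made for this example; none of it is taken from the papers cited.

```python
def shortest_interval_midpoint(values, m):
    """Midpoint of the shortest interval containing m sorted observations.

    Illustrative sketch only; ties go to the lowest-ranked interval
    (an assumption, not a rule from the literature).
    """
    x = sorted(values)
    if not 2 <= m <= len(x):
        raise ValueError("m must be between 2 and the number of values")
    # Interval i runs from x[i] to x[i + m - 1] and so contains m values.
    k = min(range(len(x) - m + 1), key=lambda i: x[i + m - 1] - x[i])
    return (x[k] + x[k + m - 1]) / 2

# Made-up data: the shortest interval holding 3 values is [5, 8].
print(shortest_interval_midpoint([1, 5, 6, 8, 20, 30, 50], 3))  # 6.5
```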
The order statistics of a sample of n values of x are defined by
x(1) <= x(2) <= ... <= x(n-1) <= x(n).
The half-sample mode is defined in -hsmode- using two rules.
Rule 1. If n = 1, the half-sample mode is x(1). If n = 2, the
half-sample mode is (x(1) + x(2)) / 2. If n = 3, the half-sample
mode is (x(1) + x(2)) / 2 if x(1) and x(2) are closer than x(2)
and x(3), (x(2) + x(3)) / 2 if the opposite is true, and x(2) otherwise.
Rule 2. If n >= 4, we apply recursive selection until left with 3
or fewer values. First let h_1 = floor(n / 2). The shortest half of
the data from rank k to rank k + h_1 is identified to minimise
x(k + h_1) - x(k)
over k = 1, ..., n - h_1. Then the shortest half of those h_1 + 1
values is identified using h_2 = floor(h_1 / 2), and so on.
To finish, use Rule 1.
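The two rules can be sketched in a few lines of Python. This is a reading of the rules as stated here, not the Mata code inside -hsmode-. The function name half_sample_mode is made up, and one detail is an assumption: when two or more candidate halves are equally short under Rule 2, the sketch takes the one with the lowest rank k.

```python
def half_sample_mode(values):
    """Half-sample mode following Rules 1 and 2 as described in the text.

    Illustrative sketch only. Assumption: ties between equally short
    halves are resolved by taking the lowest k.
    """
    x = sorted(values)
    if not x:
        raise ValueError("need at least one value")

    # Rule 2: keep selecting the shortest half until 3 or fewer values remain.
    h = len(x) // 2                      # h_1 = floor(n / 2)
    while len(x) >= 4:
        # 0-based k here; the window x[k], ..., x[k + h] holds h + 1 values.
        k = min(range(len(x) - h), key=lambda i: x[i + h] - x[i])
        x = x[k:k + h + 1]
        h //= 2                          # h_2 = floor(h_1 / 2), and so on

    # Rule 1: finish with 1, 2 or 3 values.
    if len(x) == 1:
        return x[0]
    if len(x) == 2:
        return (x[0] + x[1]) / 2
    lower_gap, upper_gap = x[1] - x[0], x[2] - x[1]
    if lower_gap < upper_gap:
        return (x[0] + x[1]) / 2
    if upper_gap < lower_gap:
        return (x[1] + x[2]) / 2
    return x[1]

# Made-up data, n = 7: h_1 = 3 selects 1, 5, 6, 8; h_2 = 1 selects 5, 6.
print(half_sample_mode([1, 5, 6, 8, 20, 30, 50]))  # 5.5
```

With real-valued data, exact ties in interval length are rare, so the tie rule seldom matters; with heavily rounded data it can, and the help for -hsmode- is the place to check how the program itself behaves.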
Bickel, D.R. 2002. Robust estimators of the mode and skewness of
continuous data. Computational Statistics & Data Analysis 39: 153-163.
Bickel, D.R. and R. Frühwirth. 2006. On a fast, robust estimator of
the mode: comparisons to other estimators with applications.
Computational Statistics & Data Analysis 50: 3500-3530.
Dalenius, T. 1965. The mode - A neglected statistical parameter.
Journal, Royal Statistical Society A 128: 110-117.
Robertson, T. and J.D. Cryer. 1974. An iterative procedure for
estimating the mode. Journal, American Statistical Association
69: 1012-1016.
Nick
n.j.cox@durham.ac.uk