st: egen ... = mode()


From   Joly.Patrick@ic.gc.ca
To   statalist@hsphsun2.harvard.edu
Subject   st: egen ... = mode()
Date   Fri, 20 Jun 2003 17:15:56 -0400

I find the warning message issued by -egen ... = mode()- to be tremendously
misleading and, in many cases, just plain wrong.  E.g.

   clear
   set seed 123
   set obs 3
   g byte group = _n in 1/3
   expand 5
   g byte var = int(10*uniform()+1)
   replace var = . if group==3
   sort group

. egen mode = mode(var), min by(group)
Warning: multiple modes encountered.  Generating missing values for the
mode.  Use the maximum, minimum, or nummode()
options to control this behavior.
(5 missing values generated)

. li

     +--------------------+
     | group   var   mode |
     |--------------------|
  1. |     1     7      7 |
  2. |     1     6      7 |
  3. |     1    10      7 |
  4. |     1     7      7 |
  5. |     1     5      7 |
     |--------------------|
  6. |     2     5      1 |
  7. |     2     8      1 |
  8. |     2     9      1 |
  9. |     2     1      1 |
 10. |     2     1      1 |
     |--------------------|
 11. |     3     .      . |
 12. |     3     .      . |
 13. |     3     .      . |
 14. |     3     .      . |
 15. |     3     .      . |
     +--------------------+

Well first of all, I *did* specify option -min- thank you very much, and
secondly, there are no multiple modes *anywhere*!!  What's going on?

Looking at the code, I see that part of the confusion arises because the
-assert- statement used for identifying multiple modes is based on whether
any missing values were generated for the mode.  This doesn't work well
because missing values can occur for reasons other than multiple modes.  Not
only that, the test is also entangled with the status of option `missing' --
which I believe it shouldn't be.

To illustrate, let's recap the instances that may cause the mode to be
missing (based on the code):

   `missing' not specified
   -----------------------

      Case 1:   multiple modes found (min, max, nummode not specified)

      Case 2:   `varlist' is missing for all observations in a
                   group or all of the data

   `missing' specified
   -------------------

      Case 3:   multiple modes found (min, max, nummode not specified)

      Case 4:   '.' is the mode in a group or all of the data


Actually, Case 3 is something of a fib.  It cannot occur because mode()
requires one of minmode, maxmode or nummode() to be specified along
with `missing' (I am not sure why, though).

Nevertheless, in Cases 2 and 4, no warning is required since the result
would be the same whether any of `missing', min, max, or nummode are
specified.  Cases 1 and (hypothetically) 3 are the ones that users
should be warned about, or at least informed of.

The main issue is that a warning is being issued in Case 2 (when it
shouldn't be) and, to a lesser extent, wouldn't be issued in Case 3 (when it
should be, hypothetically speaking). :-)


Proposed solution
-----------------

The first step is to agree that the -if "`missing'" == "" { ... }- condition
must go.  It is orthogonal to the determination of the number of modes.

The -capture assert ...- statement can then be altered to discriminate
between Cases 1 (warning) and 2 (no warning), by changing:

      capture assert !missing(`g') if `touse'
   to
      capture assert !missing(`g') if `touse' & !missing(`varlist')

i.e., there are multiple modes if the mode is missing anywhere `varlist'
isn't (if `missing' is not specified).  However, this new -assert ...- won't
handle instances of (hypothetical) Case 3.  False positives would be
generated since, if the "true" mode is 'missing', `g' will be missing where
`varlist' is non-missing.
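
As a rough sketch of the difference between the two assertions, the data
from the listing above can be typed in directly (so nothing depends on the
random-number generator) and both checks run against the result.  The
variables `mode` and `var` here merely stand in for `g' and `varlist' inside
_gmode.ado, and `touse' is left out on the assumption that every observation
is in the sample.

```stata
clear
input byte group byte var
1  7
1  6
1 10
1  7
1  5
2  5
2  8
2  9
2  1
2  1
3  .
3  .
3  .
3  .
3  .
end

* Same call as in the example, written with the by: prefix
bysort group: egen mode = mode(var), minmode

* Current check: fails because the mode is missing in group 3,
* where var itself is missing throughout (Case 2)
capture assert !missing(mode)
local rc_current = _rc
display "unrestricted check: _rc = `rc_current'"

* Restricted check: only looks at observations where var is non-missing
capture assert !missing(mode) if !missing(var)
local rc_restricted = _rc
display "restricted check:   _rc = `rc_restricted'"
```

With these data the unrestricted check returns 9 and the restricted one
returns 0, which is the Case 1 versus Case 2 distinction described above.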

A general solution to all of this, which would work in all cases (IMHO), is
to test directly for multiple modes, which should be easy since we already
have the temporary variable `uniq' for this purpose.

The block

      if "`missing'" == "" {
            capture assert !missing(`g') if `touse'
            if _rc == 9 {
                  di "{p}{txt}Warning: multiple modes encountered."
                  di "Generating missing values for the mode.  Use the"
                  di "{cmd:maximum}, {cmd:minimum}, or {cmd:nummode()}"
                  di "options to control this behavior.{p_end}"
            }
      }

can be replaced by

      capture assert `uniq'==1 if `touse' & `freq'!=0
      if _rc == 9 {
            di "{p}{txt}Warning: multiple modes encountered."
            di "Generating missing values for the mode.  Use the"
            di "{cmd:maximum}, {cmd:minimum}, or {cmd:nummode()}"
            di "options to control this behavior.{p_end}"
      }

This way, mode() will only complain of multiple modes when there are, in
fact, multiple modes.
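
For anyone who wants to try the idea outside _gmode.ado, here is a minimal
stand-alone sketch of a direct test for ties.  It does not use the ado-file's
own temporary variables; the names freq, maxfreq, tagged and ntied are made
up for illustration, and the data are altered so that group 1 really does
have two modes (6 and 7).

```stata
clear
input byte group byte var
1 7
1 7
1 6
1 6
1 5
2 1
2 1
2 8
2 9
2 5
3 .
3 .
3 .
3 .
3 .
end

* freq: how often each non-missing value occurs within its group
bysort group var: gen long freq = _N if !missing(var)

* maxfreq: highest frequency in the group (missing if var is all missing)
bysort group: egen long maxfreq = max(freq)

* tag one observation per distinct (group, value) pair that attains maxfreq
egen byte tagged = tag(group var) if freq == maxfreq & !missing(var)

* ntied: number of distinct values tied for the mode in each group
bysort group: egen byte ntied = total(tagged)

* complain only when some group has more than one modal value
capture assert ntied <= 1
if _rc == 9 {
      display as text "Warning: multiple modes encountered."
}
```

Here ntied is 2 in group 1, 1 in group 2 and 0 in group 3, so the warning
fires because of group 1 only; the all-missing group 3 does not trigger it.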


Patrick Joly
joly.patrick@ic.gc.ca
pat.joly@utoronto.ca


P.S.:  In case you are wondering what all that talk of hypothetical Case 3
is about: I began by listing the possible scenarios, only to realize after
testing that _gmode.ado won't allow Case 3.  I left it in for
generality.