Statalist


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RE: st: Collapse with sum function and missing values


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   RE: st: Collapse with sum function and missing values
Date   Wed, 10 Feb 2010 12:41:52 -0000

Michael isn't the first to be surprised by this behaviour. 

Raoul Reulen remembered a 2004 thread to which I contributed, which was more than I did. Start at http://www.stata.com/statalist/archive/2004-07/msg00779.html

It's a commonplace of mathematics, philosophy, law, love, sport for all I know, that combining reasonable rules can sometimes produce unreasonable results, or so it may seem. Sometimes the rules don't then seem so reasonable. Other times a closer analysis makes them seem reasonable after all. 

Here's my version of the facts. It's an extension of Joseph McDonnell's answer, which I think has the nub of the matter. 

Stata is here following two specific rules:

1. When computing sums, initialise the result at 0 and add pertinent numbers one by one until done. The result is the sum. 

2. When computing sums, ignore missings. 

The surprise to some that a set of values which is all missings sums to 0 may arise because they feel sure what the answer should be, namely missing. This kind of principle is fine when all reasonable people can agree on what the answer should be, but designing a large program is not best based on a series of ad hoc decisions on what answers should be. It is better based on a small number of rules that hang together. The price is a few surprises, but overall users are better served by consistency than by caprice. 

Consider a related problem. What is the sum of an empty set? Let's denote an empty set by { }. 

I can think of three non-crazy answers. 

1. The question is silly, or at least unanswerable in principle. 

2. The answer is that no one can say. 

3. There is a determinate answer -- which is in fact 0. 

These three verbal answers correspond one-to-one with three ways that Stata can reply to a question which the user regards as numerical: 

1. An error message. 

2. A result of missing. 

3. A determinate result. 

The reason for the answer of 0 is given by extending the problem. Imagine combining that empty set with any non-empty set, say {1,2,3}. What is the sum of the combined set? The Stataish answer is that 

sum of {} + sum of {1,2,3} = sum of {1,2,3} 

which is clearly 6. Hence the sum of {} is 0. Insisting that sum of {} is missing prevents you ever combining that result helpfully with anything else. 

This problem, which I called "related", is in fact really the same problem in Stata's mind. Since missings are ignored, it's as if they don't exist. Thus a set of values that is all missing is equivalent to an empty set. 

If you look carefully you will find this behaviour elsewhere in Stata, so it should be less of a surprise. 

-summarize- 
===========

. sysuse auto
(1978 Automobile Data)

. su rep78 if rep78 == .

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       rep78 |         0

. ret li

scalars:
                  r(N) =  0
              r(sum_w) =  0
                r(sum) =  0

-egen- 
======

Only recently, and rather reluctantly, did Stata add a -missing- option to -egen, rowtotal() for users who want this answer: 

rowtotal(varlist) [, missing]
            may not be combined with by.  It creates the (row) sum of the variables in
            varlist, treating missing as 0.  If missing is specified and all values in
            varlist are missing for an observation, newvar is set to missing.

The default was the initial designed behaviour. 

Mata
====

Mata has an explicit variant on empty sets, e.g.

. mata : sum(J(0,0,.))
  0

. mata : sum(J(0,0,42))
  0


It's also true that for example 

. di . + 1 + 2 + 3

yields missing as a result (not 6). 

I think that's to be justified in the following way. There is a difference between ignoring observations in a dataset which are missing -- which is essence is what a user almost always wants with statistical software -- and deciding that you didn't mean what you said when you typed . + 1 + 2 + 3, but that you really meant to type 0 + 1 + 2 + 3. The first interpretation, with several observations, is reasonable, but the second, with an expression, would be impertinent. 

Finally, it is entirely true that Stata's choices are not the only consistent choices. Some other languages insist that missings are transmitted infectiously, so that for example their equivalent of Stata's 

. == . 

yields missing, not true, so that you need different functionality to deal with that. I imagine that their defence is similar, namely that a few rules being applied consistently is the best design philosophy.

In either case, work-arounds to get what you want may be needed and are usually straightforward, as other contributions to the thread have shown clearly. 

Nick 
n.j.cox@durham.ac.uk 

Joseph McDonnell

I suspect it's a design decision to accommodate the fact that you
could be trying to sum a combination of non-missing and missing values
e.g. if row 8 had the value of 40 instead of missing. You would
probably(?) want to see the value 40 in the collapsed data set.

You can differentiate between the zeroes in David's example by using
the count function.
. collapse (sum) x (count) n=x, by(group)

. list

     +----------------+
     | group    x   n |
     |----------------|
  1. |     1   33   3 |
  2. |     2   63   3 |
  3. |     3    0   0 |
  4. |     4    0   2 |
     +----------------+


On Wed, Feb 10, 2010 at 12:20 PM, Michael Mitchell

>   I am befuddled by the following example that uses the -collapse-
> command with the -(sum)- function...
>
> . clear
>
> . input group x
>
>          group           x
>  1. 1 10
>  2. 1 11
>  3. 1 12
>  4. 2 20
>  5. 2 21
>  6. 2 22
>  7. 3 .
>  8. 4 .
>  9. 4 .
>  10. end
>
> .
> . collapse (sum) x , by(group)
>
> . list
>
>     +------------+
>     | group    x |
>     |------------|
>  1. |     1   33 |
>  2. |     2   63 |
>  3. |     3    0 |
>  4. |     4    0 |
>     +------------+
>
>  Shouldn't the value of -x- for groups 3 and 4 be missing, not zero.
> To me, the sum of a series of missing values is a missing value. I am
> doing a collapse for about 100 variables (100 x values) and need the
> value to be defined as missing (not 0) in such cases. Any ideas?

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index