# RE: st: Collapse & Missing Values

 From "Nick Cox" To Subject RE: st: Collapse & Missing Values Date Wed, 28 Sep 2005 21:53:49 +0100

```This is really a feature, although I may
have to work hard to convince you of that.

I take the issue to be how, when asked to
treat _cumulative_ sums with -sum()-, Stata
regards missing values. Without even looking
at the code, I guess this is how -collapse-
is doing its job here.

It is of course true that when asked to produce
sums by use of the addition operator + one missing
value is enough to make Stata shrug its shoulders
and confess "no idea":

. di 42 + .
.

. di 1 + 2 + 3 + 4 + 5 + 6 + 7 +  .
.

and in cases like this Stata usually does what
you want. If you don't want this, you have
to code your way around it, say by

local sum = 0
foreach v in 1 2 3 4 5 6 7 . {
local sum = `sum' + cond(mi(`v'), 0, `v')
}
di `sum'

However, if this were always the rule
in Stata, doing serious data analysis would
be much more difficult -- unless there were
never _any_ missing data. For example,
many techniques are based on means and
thus on sums and it would be really irritating
if Stata kept saying "no idea" just because
there was even one missing value. One alternative
might be that you would _always_ have to specify
-if- not missing whenever that was an issue, and
that would be painful.

A few experiments, such as

. di sum(.)

lead to the inference that, as it were,
the result of an application of sum()
is _always_ born as 0. Just as the result of any series of

1 + 2 + 3 + 4 + 5 + 6 + 7,

is based tacitly on the idea that you "start"
at 0, that is true for -sum()- regardless
of how many missings there are in a series
of values. In the case of all missings,
-sum()- just never changes its mind, which
was imprinted with 0 on that tabula rasa before
it ever saw any data.

Another principle that is important here
is that most applications of -sum()- are
based on a pass through all the observations
specified to work not on some individual values,
but on sets of values in variables. So, Stata has
no idea in advance of how many non-missings there
are. It just works its way through according
to the current -sort- order.

I guess what Eric would in effect like Stata to do
is to keep track of all the occurrences of
missing so that -sum()- would produce say

. + . + . + . + . + . + 42 = 42

but

. + . + . + . + . + . + . = .

Thus, at the end of a set that were all missing,
-sum()- would be morally compelled to say,
"No, that initial guess of 0 doesn't apply here.
These values are all missing, so the sum must
be missing. I changed my mind!"

Well, I guess we all see why Eric appears to want this,
but it's just not the way Stata's -sum()- is implemented.
Also, it complicates software design mightily once
you allow inconsistencies or complicate rules even
slightly.

So, to put it yet another way, the result of -sum()-
with all missings being 0 is a side-effect of its
tolerance of missing values, which in most circumstances
really is what you would want.

Nick
n.j.cox@durham.ac.uk

Jian Zhang

> I run into the same problem.
> What I know is the following:
> id var1 var2
> 1   12    .
> 1   14    2
> 2   16    .
> 2   17    .
> if you type
> gen var3=var1+var2,
> stata will give you:
>
> id var1 var2 var3
> 1   12    .   .
> 1   14    2   16
> 2   16    .   .
> 2   17    .   .
>
> However, if you type
> sort id
> by id: egen var3=sum(var2),
> stata will give you
> id var1 var2 var3
> 1   12    .   2
> 1   14    2   2
> 2   16    .   0
> 2   17    .   0
>
> I wonder who can explain the way state deals with missing values? Is
> there any general rule for stata?

Eric Wruck

> > I just learned, rather inconveniently, that collapse
> doesn't work the
> > way I'd like when encountering missing values.  Here's an example:
> > . l
> >
> >       +----------------------+
> >       |        date   amount |
> >       |----------------------|
> >    1. | 10-Oct-1990      200 |
> >    2. | 10-Oct-1990      -75 |
> >    3. | 10-Oct-1990       64 |
> >    4. | 11-Oct-1990        . |
> >    5. | 12-Oct-1990      107 |
> >       |----------------------|
> >    6. | 12-Oct-1990        . |
> >       +----------------------+
> >
> > . collapse (sum) net_amt=amount, by(date)
> >
> > . l
> >
> >       +-----------------------+
> >       |        date   net_amt |
> >       |-----------------------|
> >    1. | 10-Oct-1990       189 |
> >    2. | 11-Oct-1990         0 |
> >    3. | 12-Oct-1990       107 |
> >       +-----------------------+
> >
> > .
> > The problem is for the single 11-Oct-1990 observation.  After
> > collapsing, the missing value becomes a zero; in this instance I
> > would have preferred it remain missing.  The 12-Oct-1990
> treatment is
> > fine & what I expected.  I suppose I could delete
> observations before
> > performing the collapse but it would be better if there was some
> > other option.  Is there?

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
```