Stata The Stata listserver
[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

RE: st: Collapse & Missing Values


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   RE: st: Collapse & Missing Values
Date   Wed, 28 Sep 2005 21:53:49 +0100

This is really a feature, although I may 
have to work hard to convince you of that. 

I take the issue to be how, when asked to 
treat _cumulative_ sums with -sum()-, Stata 
regards missing values. Without even looking 
at the code, I guess this is how -collapse- 
is doing its job here. 

It is of course true that when asked to produce 
sums by use of the addition operator + one missing 
value is enough to make Stata shrug its shoulders 
and confess "no idea": 

. di 42 + . 
.

. di 1 + 2 + 3 + 4 + 5 + 6 + 7 +  . 
.

and in cases like this Stata usually does what 
you want. If you don't want this, you have
to code your way around it, say by 

local sum = 0 
foreach v in 1 2 3 4 5 6 7 . { 
	local sum = `sum' + cond(mi(`v'), 0, `v') 
} 
di `sum' 

However, if this were always the rule 
in Stata, doing serious data analysis would 
be much more difficult -- unless there were 
never _any_ missing data. For example, 
many techniques are based on means and 
thus on sums and it would be really irritating 
if Stata kept saying "no idea" just because
there was even one missing value. One alternative
might be that you would _always_ have to specify 
-if- not missing whenever that was an issue, and
that would be painful. 

A few experiments, such as 

. di sum(.) 

lead to the inference that, as it were, 
the result of an application of sum() 
is _always_ born as 0. Just as the result of any series of 
additions, such as 

1 + 2 + 3 + 4 + 5 + 6 + 7, 

is based tacitly on the idea that you "start" 
at 0, that is true for -sum()- regardless
of how many missings there are in a series 
of values. In the case of all missings, 
-sum()- just never changes its mind, which 
was imprinted with 0 on that tabula rasa before 
it ever saw any data. 

Another principle that is important here 
is that most applications of -sum()- are 
based on a pass through all the observations
specified to work not on some individual values, 
but on sets of values in variables. So, Stata has 
no idea in advance of how many non-missings there 
are. It just works its way through according
to the current -sort- order. 

I guess what Eric would in effect like Stata to do 
is to keep track of all the occurrences of 
missing so that -sum()- would produce say 

. + . + . + . + . + . + 42 = 42 

but 

. + . + . + . + . + . + . = .

Thus, at the end of a set that were all missing, 
-sum()- would be morally compelled to say, 
"No, that initial guess of 0 doesn't apply here. 
These values are all missing, so the sum must
be missing. I changed my mind!" 

Well, I guess we all see why Eric appears to want this, 
but it's just not the way Stata's -sum()- is implemented. 
Also, it complicates software design mightily once
you allow inconsistencies or complicate rules even 
slightly. 

So, to put it yet another way, the result of -sum()- 
with all missings being 0 is a side-effect of its 
tolerance of missing values, which in most circumstances
really is what you would want. 

Nick 
n.j.cox@durham.ac.uk 

Jian Zhang
 
> I run into the same problem.
> What I know is the following: 
> id var1 var2
> 1   12    .
> 1   14    2
> 2   16    . 
> 2   17    .
> if you type 
> gen var3=var1+var2,
> stata will give you:
> 
> id var1 var2 var3
> 1   12    .   .
> 1   14    2   16
> 2   16    .   .
> 2   17    .   .
> 
> However, if you type 
> sort id
> by id: egen var3=sum(var2),
> stata will give you
> id var1 var2 var3
> 1   12    .   2
> 1   14    2   2
> 2   16    .   0
> 2   17    .   0
> 
> I wonder who can explain the way state deals with missing values? Is 
> there any general rule for stata?

Eric Wruck 

> > I just learned, rather inconveniently, that collapse 
> doesn't work the 
> > way I'd like when encountering missing values.  Here's an example:
> > . l
> > 
> >       +----------------------+
> >       |        date   amount |
> >       |----------------------|
> >    1. | 10-Oct-1990      200 |
> >    2. | 10-Oct-1990      -75 |
> >    3. | 10-Oct-1990       64 |
> >    4. | 11-Oct-1990        . |
> >    5. | 12-Oct-1990      107 |
> >       |----------------------|
> >    6. | 12-Oct-1990        . |
> >       +----------------------+
> > 
> > . collapse (sum) net_amt=amount, by(date)
> > 
> > . l
> > 
> >       +-----------------------+
> >       |        date   net_amt |
> >       |-----------------------|
> >    1. | 10-Oct-1990       189 |
> >    2. | 11-Oct-1990         0 |
> >    3. | 12-Oct-1990       107 |
> >       +-----------------------+
> > 
> > .
> > The problem is for the single 11-Oct-1990 observation.  After 
> > collapsing, the missing value becomes a zero; in this instance I 
> > would have preferred it remain missing.  The 12-Oct-1990 
> treatment is 
> > fine & what I expected.  I suppose I could delete 
> observations before 
> > performing the collapse but it would be better if there was some 
> > other option.  Is there?

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index