[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

From |
"Nick Cox" <n.j.cox@durham.ac.uk> |

To |
<statalist@hsphsun2.harvard.edu> |

Subject |
RE: st: Collapse & Missing Values |

Date |
Wed, 28 Sep 2005 21:53:49 +0100 |

This is really a feature, although I may have to work hard to convince you of that. I take the issue to be how, when asked to treat _cumulative_ sums with -sum()-, Stata regards missing values. Without even looking at the code, I guess this is how -collapse- is doing its job here. It is of course true that when asked to produce sums by use of the addition operator + one missing value is enough to make Stata shrug its shoulders and confess "no idea": . di 42 + . . . di 1 + 2 + 3 + 4 + 5 + 6 + 7 + . . and in cases like this Stata usually does what you want. If you don't want this, you have to code your way around it, say by local sum = 0 foreach v in 1 2 3 4 5 6 7 . { local sum = `sum' + cond(mi(`v'), 0, `v') } di `sum' However, if this were always the rule in Stata, doing serious data analysis would be much more difficult -- unless there were never _any_ missing data. For example, many techniques are based on means and thus on sums and it would be really irritating if Stata kept saying "no idea" just because there was even one missing value. One alternative might be that you would _always_ have to specify -if- not missing whenever that was an issue, and that would be painful. A few experiments, such as . di sum(.) lead to the inference that, as it were, the result of an application of sum() is _always_ born as 0. Just as the result of any series of additions, such as 1 + 2 + 3 + 4 + 5 + 6 + 7, is based tacitly on the idea that you "start" at 0, that is true for -sum()- regardless of how many missings there are in a series of values. In the case of all missings, -sum()- just never changes its mind, which was imprinted with 0 on that tabula rasa before it ever saw any data. Another principle that is important here is that most applications of -sum()- are based on a pass through all the observations specified to work not on some individual values, but on sets of values in variables. So, Stata has no idea in advance of how many non-missings there are. It just works its way through according to the current -sort- order. I guess what Eric would in effect like Stata to do is to keep track of all the occurrences of missing so that -sum()- would produce say . + . + . + . + . + . + 42 = 42 but . + . + . + . + . + . + . = . Thus, at the end of a set that were all missing, -sum()- would be morally compelled to say, "No, that initial guess of 0 doesn't apply here. These values are all missing, so the sum must be missing. I changed my mind!" Well, I guess we all see why Eric appears to want this, but it's just not the way Stata's -sum()- is implemented. Also, it complicates software design mightily once you allow inconsistencies or complicate rules even slightly. So, to put it yet another way, the result of -sum()- with all missings being 0 is a side-effect of its tolerance of missing values, which in most circumstances really is what you would want. Nick n.j.cox@durham.ac.uk Jian Zhang > I run into the same problem. > What I know is the following: > id var1 var2 > 1 12 . > 1 14 2 > 2 16 . > 2 17 . > if you type > gen var3=var1+var2, > stata will give you: > > id var1 var2 var3 > 1 12 . . > 1 14 2 16 > 2 16 . . > 2 17 . . > > However, if you type > sort id > by id: egen var3=sum(var2), > stata will give you > id var1 var2 var3 > 1 12 . 2 > 1 14 2 2 > 2 16 . 0 > 2 17 . 0 > > I wonder who can explain the way state deals with missing values? Is > there any general rule for stata? Eric Wruck > > I just learned, rather inconveniently, that collapse > doesn't work the > > way I'd like when encountering missing values. Here's an example: > > . l > > > > +----------------------+ > > | date amount | > > |----------------------| > > 1. | 10-Oct-1990 200 | > > 2. | 10-Oct-1990 -75 | > > 3. | 10-Oct-1990 64 | > > 4. | 11-Oct-1990 . | > > 5. | 12-Oct-1990 107 | > > |----------------------| > > 6. | 12-Oct-1990 . | > > +----------------------+ > > > > . collapse (sum) net_amt=amount, by(date) > > > > . l > > > > +-----------------------+ > > | date net_amt | > > |-----------------------| > > 1. | 10-Oct-1990 189 | > > 2. | 11-Oct-1990 0 | > > 3. | 12-Oct-1990 107 | > > +-----------------------+ > > > > . > > The problem is for the single 11-Oct-1990 observation. After > > collapsing, the missing value becomes a zero; in this instance I > > would have preferred it remain missing. The 12-Oct-1990 > treatment is > > fine & what I expected. I suppose I could delete > observations before > > performing the collapse but it would be better if there was some > > other option. Is there? * * For searches and help try: * http://www.stata.com/support/faqs/res/findit.html * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**Follow-Ups**:**RE: st: Collapse & Missing Values***From:*"Eric G. Wruck" <ewruck@econalytics.com>

- Prev by Date:
**st: Re: Matrix Kronecker Product Question** - Next by Date:
**st: breaking strings** - Previous by thread:
**RE: st: Collapse & Missing Values** - Next by thread:
**RE: st: Collapse & Missing Values** - Index(es):

© Copyright 1996–2014 StataCorp LP | Terms of use | Privacy | Contact us | What's new | Site index |