[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

From |
n j cox <n.j.cox@durham.ac.uk> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: preserving missing values in collapse (sum) |

Date |
Mon, 22 Oct 2007 16:14:16 +0100 |

As I understand it, you just need to recapitulate the 2004 thread, with only one difference: your groups are defined by two variables, not one. From your sample data, I would go . l +----------------------------+ | youthid group duration | |----------------------------| 1. | 11 1 15 | 2. | 11 1 . | 3. | 11 2 31 | 4. | 11 2 10 | 5. | 11 5 . | |----------------------------| 6. | 12 2 5 | 7. | 12 2 8 | 8. | 12 4 42 | 9. | 12 6 55 | +----------------------------+ . bysort youthid group (duration) : gen allmissing = missing(duration[1]) . l +---------------------------------------+ | youthid group duration allmis~g | |---------------------------------------| 1. | 11 1 15 0 | 2. | 11 1 . 0 | 3. | 11 2 10 0 | 4. | 11 2 31 0 | 5. | 11 5 . 1 | |---------------------------------------| 6. | 12 2 5 0 | 7. | 12 2 8 0 | 8. | 12 4 42 0 | 9. | 12 6 55 0 | +---------------------------------------+ . collapse (sum) duration (min) allmissing , by(youthid group) . l +---------------------------------------+ | youthid group duration allmis~g | |---------------------------------------| 1. | 11 1 15 0 | 2. | 11 2 41 0 | 3. | 11 5 0 1 | 4. | 12 2 13 0 | 5. | 12 4 42 0 | |---------------------------------------| 6. | 12 6 55 0 | +---------------------------------------+ . replace duration = . if allmissing (1 real change made, 1 to missing) . l +---------------------------------------+ | youthid group duration allmis~g | |---------------------------------------| 1. | 11 1 15 0 | 2. | 11 2 41 0 | 3. | 11 5 . 1 | 4. | 12 2 13 0 | 5. | 12 4 42 0 | |---------------------------------------| 6. | 12 6 55 0 | +---------------------------------------+ You may want to -fillin- to get closer to your matrix. In a real example, 3 would also appear, I guess. . fillin youthid group . l +-------------------------------------------------+ | youthid group duration allmis~g _fillin | |-------------------------------------------------| 1. | 11 1 15 0 0 | 2. | 11 2 41 0 0 | 3. | 11 4 . . 1 | 4. | 11 5 . 1 0 | 5. | 11 6 . . 1 | |-------------------------------------------------| 6. | 12 1 . . 1 | 7. | 12 2 13 0 0 | 8. | 12 4 42 0 0 | 9. | 12 5 . . 1 | 10. | 12 6 55 0 0 | +-------------------------------------------------+ n j cox wrote:

I don't think you mean what you say. Your output requires a -collapse, by(youthid group)-. Sorry that I didn't spell that out. Nick n.j.cox@durham.ac.uk Melonie wrote: Thank you, Nick. I had considered that tweak. However, I need to collapse by (youthid), which then gives me a value of allmissing for each youth, but does not tag the missing group for replacement: youthid 11 12 grp1dur 15 0 grp2dur 41 13 grp3ddur 0 0 grp4dur 0 42 grp5dur 0 0 grp6dur 0 55 allmissing 1 0 n j cox wrote:Below my signature is the key part of the posting referred to. You can tweak that approach by bysort youthid group (duration) : gen allmissing = missing(duration[1]) and then -collapse- as before. Nick n.j.cox@durham.ac.uk OK. Consider this toy dataset: . l +------------------+ | response group | |------------------| 1. | 1 1 | 2. | 2 1 | 3. | 3 2 | 4. | . 2 | 5. | 0 3 | |------------------| 6. | . 3 | 7. | . 4 | 8. | . 4 | +------------------+ I'd regard 0 as a sensible sum for group 3 but not for group 4, but as you say . collapse (sum) response, by(group) . l +------------------+ | group response | |------------------| 1. | 1 3 | 2. | 2 3 | 3. | 3 0 | 4. | 4 0 | +------------------+ One way round this is to tag groups for which _all_ values are missing. One way to do that is . bysort group (response) : gen allmissing = mi(response[1]) That is, within groups, sort on response. If the first value within each group is missing, then all of them in that group must be. The principle here is close to one explained at http://www.stata.com/support/faqs/data/diff.html You might also like to look at http://www.stata.com/support/faqs/data/anyall.html . l +-----------------------------+ | response group allmis~g | |-----------------------------| 1. | 1 1 0 | 2. | 2 1 0 | 3. | 3 2 0 | 4. | . 2 0 | 5. | 0 3 0 | |-----------------------------| 6. | . 3 0 | 7. | . 4 1 | 8. | . 4 1 | +-----------------------------+ Now we make sure the -collapse- carries along the important information: . collapse (sum) response (min) allmissing, by(group) . l +-----------------------------+ | group response allmis~g | |-----------------------------| 1. | 1 3 0 | 2. | 2 3 0 | 3. | 3 0 0 | 4. | 4 0 1 | +-----------------------------+ and convert back: . replace response = . if allmissing (1 real change made, 1 to missing) . l +-----------------------------+ | group response allmis~g | |-----------------------------| 1. | 1 3 0 | 2. | 2 3 0 | 3. | 3 0 0 | 4. | 4 . 1 | +-----------------------------+ Melonie Sullivan I have data on history of placements into different -group-s by -youthid-: there are multiple placement records for each youth. I need to create a variable equal to the sum of -duration- of all placements into each -group- for each youth. -collapse (sum)- seems to be the appropriate procedure, but it treats missing values as zeroes. This causes a problem if a youth has only one placement in a given group with unknown duration. Example: youthid group duration 11 1 15 11 1 . 11 2 31 11 2 10 11 5 . 12 2 5 12 2 8 12 4 42 12 6 55 I create a duration variable for each group (-generate grp1dur = duration if group==1-, etc.) and -collapse (sum)- by -youthid- and I want to get this: youthid 11 12 grp1dur 15 0 grp2dur 41 13 grp3ddur 0 0 grp4dur 0 42 grp5dur . 0 grp6dur 0 55 But collapse gives me a zero on grp5dur for youth #11, though youth #11 had placement in that group, albeit of an unknown duration. The other zeroes are correct; the youth had zero days in that placement group. The problem has been addressed here before, best in the following post by Nick Cox: http://www.stata.com/statalist/archive/2004-07/msg00783.html However, this is not solving my particular problem, because my data essentially looks like a big stack of Nick's "toy datasets" -- one for each of 1800 youth in my data. So collapsing by (youthid) gives the same value of Nick's allmissing for each youth, since the allmissing tags missing durations for groups within youths.

* * For searches and help try: * http://www.stata.com/support/faqs/res/findit.html * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**Follow-Ups**:**Re: st: preserving missing values in collapse (sum)***From:*Melonie Sullivan <meloniebeth@yahoo.com>

**References**:**Re: st: preserving missing values in collapse (sum)***From:*n j cox <n.j.cox@durham.ac.uk>

**Re: st: preserving missing values in collapse (sum)***From:*n j cox <n.j.cox@durham.ac.uk>

- Prev by Date:
**Re: st: preserving missing values in collapse (sum)** - Next by Date:
**st: RE: Hausman test** - Previous by thread:
**Re: st: preserving missing values in collapse (sum)** - Next by thread:
**Re: st: preserving missing values in collapse (sum)** - Index(es):

© Copyright 1996–2015 StataCorp LP | Terms of use | Privacy | Contact us | What's new | Site index |