Statalist


[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

Re: st: preserving missing values in collapse (sum)


From   Melonie Sullivan <meloniebeth@yahoo.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: preserving missing values in collapse (sum)
Date   Mon, 22 Oct 2007 11:08:43 -0700 (PDT)

Okay, so far so good, thanks. But now how do I get
that information into a matrix of this form - one line
for each youth:

youthid y x1 x2 x3 x4 x5 x6
11      0 15 41  0  0  .  0
12      1  0  13 0  42 0 55

where y=dependent variable, x1=duration if group=1,
x2=duration if group=2, etc. If I take your solution,
then generate x1, x2...., and do a -list- I still get
a 6x6 matrix of x for each youth that looks like this:

youthid x1  x2  x3  x4  x5  x6
11      15   .   .   .   .   .
11      .   41   .   .   .   .
11      .    .   .   .   .   .
11      .    .   .   .   .   .
11      .    .   .   .   .   .
11      .    .   .   .   .   .



--- n j cox <n.j.cox@durham.ac.uk> wrote:

> As I understand it, you just need to recapitulate
> the 2004 thread, with only one difference:
> your groups are defined by two variables,
> not one.
> 
>  From your sample data, I would go
> 
> . l
> 
>       +----------------------------+
>       | youthid   group   duration |
>       |----------------------------|
>    1. |      11       1         15 |
>    2. |      11       1          . |
>    3. |      11       2         31 |
>    4. |      11       2         10 |
>    5. |      11       5          . |
>       |----------------------------|
>    6. |      12       2          5 |
>    7. |      12       2          8 |
>    8. |      12       4         42 |
>    9. |      12       6         55 |
>       +----------------------------+
> 
> . bysort youthid group (duration) : gen allmissing =
> missing(duration[1])
> 
> . l
> 
>       +---------------------------------------+
>       | youthid   group   duration   allmis~g |
>       |---------------------------------------|
>    1. |      11       1         15          0 |
>    2. |      11       1          .          0 |
>    3. |      11       2         10          0 |
>    4. |      11       2         31          0 |
>    5. |      11       5          .          1 |
>       |---------------------------------------|
>    6. |      12       2          5          0 |
>    7. |      12       2          8          0 |
>    8. |      12       4         42          0 |
>    9. |      12       6         55          0 |
>       +---------------------------------------+
> 
> . collapse (sum) duration (min)  allmissing ,
> by(youthid group)
> 
> . l
> 
>       +---------------------------------------+
>       | youthid   group   duration   allmis~g |
>       |---------------------------------------|
>    1. |      11       1         15          0 |
>    2. |      11       2         41          0 |
>    3. |      11       5          0          1 |
>    4. |      12       2         13          0 |
>    5. |      12       4         42          0 |
>       |---------------------------------------|
>    6. |      12       6         55          0 |
>       +---------------------------------------+
> 
> . replace duration = . if allmissing
> (1 real change made, 1 to missing)
> 
> . l
> 
>       +---------------------------------------+
>       | youthid   group   duration   allmis~g |
>       |---------------------------------------|
>    1. |      11       1         15          0 |
>    2. |      11       2         41          0 |
>    3. |      11       5          .          1 |
>    4. |      12       2         13          0 |
>    5. |      12       4         42          0 |
>       |---------------------------------------|
>    6. |      12       6         55          0 |
>       +---------------------------------------+
> 
> You may want to -fillin- to get closer to
> your matrix. In a real example, 3 would also
> appear, I guess.
> 
> . fillin youthid group
> 
> . l
> 
>      
> +-------------------------------------------------+
>       | youthid   group   duration   allmis~g  
> _fillin |
>      
> |-------------------------------------------------|
>    1. |      11       1         15          0       
>  0 |
>    2. |      11       2         41          0       
>  0 |
>    3. |      11       4          .          .       
>  1 |
>    4. |      11       5          .          1       
>  0 |
>    5. |      11       6          .          .       
>  1 |
>      
> |-------------------------------------------------|
>    6. |      12       1          .          .       
>  1 |
>    7. |      12       2         13          0       
>  0 |
>    8. |      12       4         42          0       
>  0 |
>    9. |      12       5          .          .       
>  1 |
>   10. |      12       6         55          0       
>  0 |
>      
> +-------------------------------------------------+
> 
> n j cox wrote:
> > I don't think you mean what you say. Your output
> > requires a -collapse, by(youthid group)-. Sorry
> > that I didn't spell that out.
> > 
> > Nick
> > n.j.cox@durham.ac.uk
> > 
> > Melonie wrote:
> > 
> > Thank you, Nick. I had considered that tweak.
> However,
> > I need to collapse by (youthid), which then gives
> me a
> > value of allmissing for each youth, but does not
> tag
> > the missing group for replacement:
> > 
> > youthid   11    12
> > grp1dur   15     0
> > grp2dur   41    13
> > grp3ddur   0     0
> > grp4dur    0    42
> > grp5dur    0     0
> > grp6dur    0    55
> > allmissing 1     0
> > 
> > n j cox wrote:
> > 
> >> Below my signature is the key part of the posting
> referred to.
> >>
> >> You can tweak that approach by
> >>
> >> bysort youthid group (duration) : gen allmissing
> = missing(duration[1])
> >>
> >> and then -collapse- as before.
> >>
> >> Nick
> >> n.j.cox@durham.ac.uk
> >>
> >> OK. Consider this toy dataset:
> >>
> >> . l
> >>
> >>      +------------------+
> >>      | response   group |
> >>      |------------------|
> >>   1. |        1       1 |
> >>   2. |        2       1 |
> >>   3. |        3       2 |
> >>   4. |        .       2 |
> >>   5. |        0       3 |
> >>      |------------------|
> >>   6. |        .       3 |
> >>   7. |        .       4 |
> >>   8. |        .       4 |
> >>      +------------------+
> >>
> >> I'd regard 0 as a sensible sum for group
> >> 3 but not for group 4, but as you say
> >>
> >> . collapse (sum) response, by(group)
> >>
> >> . l
> >>
> >>      +------------------+
> >>      | group   response |
> >>      |------------------|
> >>   1. |     1          3 |
> >>   2. |     2          3 |
> >>   3. |     3          0 |
> >>   4. |     4          0 |
> >>      +------------------+
> >>
> >> One way round this is to tag groups
> >> for which _all_ values are missing.
> >> One way to do that is
> >>
> >> . bysort group (response) : gen allmissing =
> mi(response[1])
> >>
> >> That is, within groups, sort on response. If the
> first
> >> value within each group is missing, then all of
> them
> >> in that group must be.
> >>
> >> The principle here is close to one explained at
> >> http://www.stata.com/support/faqs/data/diff.html
> >> You might also like to look at
> >>
> http://www.stata.com/support/faqs/data/anyall.html
> >>
> >> . l
> >>
> >>      +-----------------------------+
> >>      | response   group   allmis~g |
> >>      |-----------------------------|
> >>   1. |        1       1          0 |
> >>   2. |        2       1          0 |
> >>   3. |        3       2          0 |
> >>   4. |        .       2          0 |
> >>   5. |        0       3          0 |
> >>      |-----------------------------|
> >>   6. |        .       3          0 |
> >>   7. |        .       4          1 |
> >>   8. |        .       4          1 |
> >>      +-----------------------------+
> >>
> >> Now we make sure the -collapse- carries along
> >> the important information:
> >>
> >> . collapse (sum) response (min) allmissing,
> by(group)
> >>
> >> . l
> >>
> >>      +-----------------------------+
> >>      | group   response   allmis~g |
> >>      |-----------------------------|
> >>   1. |     1          3          0 |
> >>   2. |     2          3          0 |
> >>   3. |     3          0          0 |
> >>   4. |     4          0          1 |
> >>      +-----------------------------+
> >>
> >> and convert back:
> >>
> >> . replace response = . if allmissing
> >> (1 real change made, 1 to missing)
> >>
> >> . l
> >>
> >>      +-----------------------------+
> >>      | group   response   allmis~g |
> >>      |-----------------------------|
> >>   1. |     1          3          0 |
> >>   2. |     2          3          0 |
> >>   3. |     3          0          0 |
> >>   4. |     4          .          1 |
> >>      +-----------------------------+
> >>
> >> Melonie Sullivan
> >>
> >> I have data on history of placements into
> different
> >> -group-s by -youthid-: there are multiple
> placement
> >> records for each youth. I need to create a
> variable
> >> equal to the sum of -duration- of all placements
> into
> >> each -group- for each youth. -collapse (sum)-
> seems to be
> >> the appropriate procedure, but it treats missing
> >> values as zeroes. This causes a problem if a
> youth has
> >> only one placement in a given group with unknown
> >> duration. Example:
> >>
> >>
> >> youthid         group        duration
> >> 11                 1            15
> >> 11                 1             .
> >> 11                 2            31
> >> 11                 2            10
> >> 11                 5             .
> >> 12                 2             5
> >> 12                 2             8
> >> 12                 4            42
> >> 12                 6            55
> >>
> >> I create a duration variable for each group
> (-generate
> >> grp1dur = duration if group==1-, etc.) and
> -collapse
> >> (sum)- by -youthid- and I want to get this:
> >>
> >> youthid   11    12
> >> grp1dur   15     0
> >> grp2dur   41    13
> >> grp3ddur   0     0
> >> grp4dur    0    42
> >> grp5dur    .     0
> >> grp6dur    0    55
> >>
> >> But collapse gives me a zero on grp5dur for youth
> #11,
> >> though youth #11 had placement in that group,
> albeit
> >> of an unknown duration. The other zeroes are
> correct;
> >> the youth had zero days in that placement group.
> >>
> >> The problem has been addressed here before, best
> in
> >> the following post by Nick Cox:
> >>
> >>
>
http://www.stata.com/statalist/archive/2004-07/msg00783.html
> >>
> >> However, this is not solving my particular
> problem,
> >> because my data essentially looks like a big
> stack of
> >> Nick's "toy datasets" -- one for each of 1800
> youth in
> >> my data. So collapsing by (youthid) gives the
> same
> >> value of Nick's allmissing for each youth, since
> the
> >> allmissing tags missing durations for groups
> within
> >> youths.
> >>
> >>
> >>
> > 
> *
> *   For searches and help try:
> *  
> http://www.stata.com/support/faqs/res/findit.html
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
> 


__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index