Statalist


[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

Re: st: preserving missing values in collapse (sum)


From   Melonie Sullivan <meloniebeth@yahoo.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: preserving missing values in collapse (sum)
Date   Mon, 22 Oct 2007 07:59:29 -0700 (PDT)

Okay, -collapse(sum) by (youthid group) results in
something like the matrix below for each youth. But I
need a single record for each youth with sum of
durations for each group. If I understand your
solution, I then type a series of -replace x=. if
allmissing- where x=grp1dur, grp2dur, etc. But then I
need to collapse again to get a single sum for each
group for each youth, and I am back where I started.

youthid   11  11  11  11  11  11
grp1dur   15   0   0   0  0   0 
grp2dur    0  41   0   0  0   0
grp3ddur   0   0   0   0  0   0
grp4dur    0   0   0   0  0   0
grp5dur    0   0   0   0  0   0
grp6dur    0   0   0   0  0   0
allmissing 0   0   0   0  1   0

--- n j cox <n.j.cox@durham.ac.uk> wrote:

> I don't think you mean what you say. Your output
> requires a -collapse, by(youthid group)-. Sorry
> that I didn't spell that out.
> 
> Nick
> n.j.cox@durham.ac.uk
> 
> Melonie wrote:
> 
> Thank you, Nick. I had considered that tweak.
> However,
> I need to collapse by (youthid), which then gives me
> a
> value of allmissing for each youth, but does not tag
> the missing group for replacement:
> 
> youthid   11    12
> grp1dur   15     0
> grp2dur   41    13
> grp3ddur   0     0
> grp4dur    0    42
> grp5dur    0     0
> grp6dur    0    55
> allmissing 1     0
> 
> n j cox wrote:
> 
> > Below my signature is the key part of the posting
> referred to.
> > 
> > You can tweak that approach by
> > 
> > bysort youthid group (duration) : gen allmissing =
> missing(duration[1])
> > 
> > and then -collapse- as before.
> > 
> > Nick
> > n.j.cox@durham.ac.uk
> > 
> > OK. Consider this toy dataset:
> > 
> > . l
> > 
> >      +------------------+
> >      | response   group |
> >      |------------------|
> >   1. |        1       1 |
> >   2. |        2       1 |
> >   3. |        3       2 |
> >   4. |        .       2 |
> >   5. |        0       3 |
> >      |------------------|
> >   6. |        .       3 |
> >   7. |        .       4 |
> >   8. |        .       4 |
> >      +------------------+
> > 
> > I'd regard 0 as a sensible sum for group
> > 3 but not for group 4, but as you say
> > 
> > . collapse (sum) response, by(group)
> > 
> > . l
> > 
> >      +------------------+
> >      | group   response |
> >      |------------------|
> >   1. |     1          3 |
> >   2. |     2          3 |
> >   3. |     3          0 |
> >   4. |     4          0 |
> >      +------------------+
> > 
> > One way round this is to tag groups
> > for which _all_ values are missing.
> > One way to do that is
> > 
> > . bysort group (response) : gen allmissing =
> mi(response[1])
> > 
> > That is, within groups, sort on response. If the
> first
> > value within each group is missing, then all of
> them
> > in that group must be.
> > 
> > The principle here is close to one explained at
> > http://www.stata.com/support/faqs/data/diff.html
> > You might also like to look at
> > http://www.stata.com/support/faqs/data/anyall.html
> > 
> > . l
> > 
> >      +-----------------------------+
> >      | response   group   allmis~g |
> >      |-----------------------------|
> >   1. |        1       1          0 |
> >   2. |        2       1          0 |
> >   3. |        3       2          0 |
> >   4. |        .       2          0 |
> >   5. |        0       3          0 |
> >      |-----------------------------|
> >   6. |        .       3          0 |
> >   7. |        .       4          1 |
> >   8. |        .       4          1 |
> >      +-----------------------------+
> > 
> > Now we make sure the -collapse- carries along
> > the important information:
> > 
> > . collapse (sum) response (min) allmissing,
> by(group)
> > 
> > . l
> > 
> >      +-----------------------------+
> >      | group   response   allmis~g |
> >      |-----------------------------|
> >   1. |     1          3          0 |
> >   2. |     2          3          0 |
> >   3. |     3          0          0 |
> >   4. |     4          0          1 |
> >      +-----------------------------+
> > 
> > and convert back:
> > 
> > . replace response = . if allmissing
> > (1 real change made, 1 to missing)
> > 
> > . l
> > 
> >      +-----------------------------+
> >      | group   response   allmis~g |
> >      |-----------------------------|
> >   1. |     1          3          0 |
> >   2. |     2          3          0 |
> >   3. |     3          0          0 |
> >   4. |     4          .          1 |
> >      +-----------------------------+
> > 
> > Melonie Sullivan
> > 
> > I have data on history of placements into
> different
> > -group-s by -youthid-: there are multiple
> placement
> > records for each youth. I need to create a
> variable
> > equal to the sum of -duration- of all placements
> into
> > each -group- for each youth. -collapse (sum)-
> seems to be
> > the appropriate procedure, but it treats missing
> > values as zeroes. This causes a problem if a youth
> has
> > only one placement in a given group with unknown
> > duration. Example:
> > 
> > 
> > youthid         group        duration
> > 11                 1            15
> > 11                 1             .
> > 11                 2            31
> > 11                 2            10
> > 11                 5             .
> > 12                 2             5
> > 12                 2             8
> > 12                 4            42
> > 12                 6            55
> > 
> > I create a duration variable for each group
> (-generate
> > grp1dur = duration if group==1-, etc.) and
> -collapse
> > (sum)- by -youthid- and I want to get this:
> > 
> > youthid   11    12
> > grp1dur   15     0
> > grp2dur   41    13
> > grp3ddur   0     0
> > grp4dur    0    42
> > grp5dur    .     0
> > grp6dur    0    55
> > 
> > But collapse gives me a zero on grp5dur for youth
> #11,
> > though youth #11 had placement in that group,
> albeit
> > of an unknown duration. The other zeroes are
> correct;
> > the youth had zero days in that placement group.
> > 
> > The problem has been addressed here before, best
> in
> > the following post by Nick Cox:
> > 
> >
>
http://www.stata.com/statalist/archive/2004-07/msg00783.html
> > 
> > However, this is not solving my particular
> problem,
> > because my data essentially looks like a big stack
> of
> > Nick's "toy datasets" -- one for each of 1800
> youth in
> > my data. So collapsing by (youthid) gives the same
> > value of Nick's allmissing for each youth, since
> the
> > allmissing tags missing durations for groups
> within
> > youths.
> > 
> > 
> > 
> *
> *   For searches and help try:
> *  
> http://www.stata.com/support/faqs/res/findit.html
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
> 


__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index