Statalist


[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

Re: st: preserving missing values in collapse (sum)


From   n j cox <n.j.cox@durham.ac.uk>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: preserving missing values in collapse (sum)
Date   Mon, 22 Oct 2007 14:27:19 +0100

Below my signature is the key part of the posting referred to.

You can tweak that approach by

bysort youthid group (duration) : gen allmissing = missing(duration[1])

and then -collapse- as before.

Nick
n.j.cox@durham.ac.uk

OK. Consider this toy dataset:

. l

     +------------------+
     | response   group |
     |------------------|
  1. |        1       1 |
  2. |        2       1 |
  3. |        3       2 |
  4. |        .       2 |
  5. |        0       3 |
     |------------------|
  6. |        .       3 |
  7. |        .       4 |
  8. |        .       4 |
     +------------------+

I'd regard 0 as a sensible sum for group
3 but not for group 4, but as you say

. collapse (sum) response, by(group)

. l

     +------------------+
     | group   response |
     |------------------|
  1. |     1          3 |
  2. |     2          3 |
  3. |     3          0 |
  4. |     4          0 |
     +------------------+

One way round this is to tag groups
for which _all_ values are missing.
One way to do that is

. bysort group (response) : gen allmissing = mi(response[1])

That is, within groups, sort on response. If the first
value within each group is missing, then all of them
in that group must be.

The principle here is close to one explained at
http://www.stata.com/support/faqs/data/diff.html
You might also like to look at
http://www.stata.com/support/faqs/data/anyall.html

. l

     +-----------------------------+
     | response   group   allmis~g |
     |-----------------------------|
  1. |        1       1          0 |
  2. |        2       1          0 |
  3. |        3       2          0 |
  4. |        .       2          0 |
  5. |        0       3          0 |
     |-----------------------------|
  6. |        .       3          0 |
  7. |        .       4          1 |
  8. |        .       4          1 |
     +-----------------------------+

Now we make sure the -collapse- carries along
the important information:

. collapse (sum) response (min) allmissing, by(group)

. l

     +-----------------------------+
     | group   response   allmis~g |
     |-----------------------------|
  1. |     1          3          0 |
  2. |     2          3          0 |
  3. |     3          0          0 |
  4. |     4          0          1 |
     +-----------------------------+

and convert back:

. replace response = . if allmissing
(1 real change made, 1 to missing)

. l

     +-----------------------------+
     | group   response   allmis~g |
     |-----------------------------|
  1. |     1          3          0 |
  2. |     2          3          0 |
  3. |     3          0          0 |
  4. |     4          .          1 |
     +-----------------------------+

Melonie Sullivan

I have data on history of placements into different
-group-s by -youthid-: there are multiple placement
records for each youth. I need to create a variable
equal to the sum of -duration- of all placements into
each -group- for each youth. -collapse (sum)- seems to be
the appropriate procedure, but it treats missing
values as zeroes. This causes a problem if a youth has
only one placement in a given group with unknown
duration. Example:


youthid         group        duration
11                 1            15
11                 1             .
11                 2            31
11                 2            10
11                 5             .
12                 2             5
12                 2             8
12                 4            42
12                 6            55

I create a duration variable for each group (-generate
grp1dur = duration if group==1-, etc.) and -collapse
(sum)- by -youthid- and I want to get this:

youthid   11    12
grp1dur   15     0
grp2dur   41    13
grp3ddur   0     0
grp4dur    0    42
grp5dur    .     0
grp6dur    0    55

But collapse gives me a zero on grp5dur for youth #11,
though youth #11 had placement in that group, albeit
of an unknown duration. The other zeroes are correct;
the youth had zero days in that placement group.

The problem has been addressed here before, best in
the following post by Nick Cox:

http://www.stata.com/statalist/archive/2004-07/msg00783.html

However, this is not solving my particular problem,
because my data essentially looks like a big stack of
Nick's "toy datasets" -- one for each of 1800 youth in
my data. So collapsing by (youthid) gives the same
value of Nick's allmissing for each youth, since the
allmissing tags missing durations for groups within
youths.


*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index