# Re: st: preserving missing values in collapse (sum)

 From n j cox To statalist@hsphsun2.harvard.edu Subject Re: st: preserving missing values in collapse (sum) Date Mon, 22 Oct 2007 16:14:16 +0100

```As I understand it, you just need to recapitulate
the 2004 thread, with only one difference:
your groups are defined by two variables,
not one.

From your sample data, I would go

. l

+----------------------------+
| youthid   group   duration |
|----------------------------|
1. |      11       1         15 |
2. |      11       1          . |
3. |      11       2         31 |
4. |      11       2         10 |
5. |      11       5          . |
|----------------------------|
6. |      12       2          5 |
7. |      12       2          8 |
8. |      12       4         42 |
9. |      12       6         55 |
+----------------------------+

. bysort youthid group (duration) : gen allmissing = missing(duration[1])

. l

+---------------------------------------+
| youthid   group   duration   allmis~g |
|---------------------------------------|
1. |      11       1         15          0 |
2. |      11       1          .          0 |
3. |      11       2         10          0 |
4. |      11       2         31          0 |
5. |      11       5          .          1 |
|---------------------------------------|
6. |      12       2          5          0 |
7. |      12       2          8          0 |
8. |      12       4         42          0 |
9. |      12       6         55          0 |
+---------------------------------------+

. collapse (sum) duration (min)  allmissing , by(youthid group)

. l

+---------------------------------------+
| youthid   group   duration   allmis~g |
|---------------------------------------|
1. |      11       1         15          0 |
2. |      11       2         41          0 |
3. |      11       5          0          1 |
4. |      12       2         13          0 |
5. |      12       4         42          0 |
|---------------------------------------|
6. |      12       6         55          0 |
+---------------------------------------+

. replace duration = . if allmissing
(1 real change made, 1 to missing)

. l

+---------------------------------------+
| youthid   group   duration   allmis~g |
|---------------------------------------|
1. |      11       1         15          0 |
2. |      11       2         41          0 |
3. |      11       5          .          1 |
4. |      12       2         13          0 |
5. |      12       4         42          0 |
|---------------------------------------|
6. |      12       6         55          0 |
+---------------------------------------+

You may want to -fillin- to get closer to
your matrix. In a real example, 3 would also
appear, I guess.

. fillin youthid group

. l

+-------------------------------------------------+
| youthid   group   duration   allmis~g   _fillin |
|-------------------------------------------------|
1. |      11       1         15          0         0 |
2. |      11       2         41          0         0 |
3. |      11       4          .          .         1 |
4. |      11       5          .          1         0 |
5. |      11       6          .          .         1 |
|-------------------------------------------------|
6. |      12       1          .          .         1 |
7. |      12       2         13          0         0 |
8. |      12       4         42          0         0 |
9. |      12       5          .          .         1 |
10. |      12       6         55          0         0 |
+-------------------------------------------------+

n j cox wrote:
```
```I don't think you mean what you say. Your output
requires a -collapse, by(youthid group)-. Sorry
that I didn't spell that out.

Nick
n.j.cox@durham.ac.uk

Melonie wrote:

Thank you, Nick. I had considered that tweak. However,
I need to collapse by (youthid), which then gives me a
value of allmissing for each youth, but does not tag
the missing group for replacement:

youthid   11    12
grp1dur   15     0
grp2dur   41    13
grp3ddur   0     0
grp4dur    0    42
grp5dur    0     0
grp6dur    0    55
allmissing 1     0

n j cox wrote:

```
```Below my signature is the key part of the posting referred to.

You can tweak that approach by

bysort youthid group (duration) : gen allmissing = missing(duration[1])

and then -collapse- as before.

Nick
n.j.cox@durham.ac.uk

OK. Consider this toy dataset:

. l

+------------------+
| response   group |
|------------------|
1. |        1       1 |
2. |        2       1 |
3. |        3       2 |
4. |        .       2 |
5. |        0       3 |
|------------------|
6. |        .       3 |
7. |        .       4 |
8. |        .       4 |
+------------------+

I'd regard 0 as a sensible sum for group
3 but not for group 4, but as you say

. collapse (sum) response, by(group)

. l

+------------------+
| group   response |
|------------------|
1. |     1          3 |
2. |     2          3 |
3. |     3          0 |
4. |     4          0 |
+------------------+

One way round this is to tag groups
for which _all_ values are missing.
One way to do that is

. bysort group (response) : gen allmissing = mi(response[1])

That is, within groups, sort on response. If the first
value within each group is missing, then all of them
in that group must be.

The principle here is close to one explained at
http://www.stata.com/support/faqs/data/diff.html
You might also like to look at
http://www.stata.com/support/faqs/data/anyall.html

. l

+-----------------------------+
| response   group   allmis~g |
|-----------------------------|
1. |        1       1          0 |
2. |        2       1          0 |
3. |        3       2          0 |
4. |        .       2          0 |
5. |        0       3          0 |
|-----------------------------|
6. |        .       3          0 |
7. |        .       4          1 |
8. |        .       4          1 |
+-----------------------------+

Now we make sure the -collapse- carries along
the important information:

. collapse (sum) response (min) allmissing, by(group)

. l

+-----------------------------+
| group   response   allmis~g |
|-----------------------------|
1. |     1          3          0 |
2. |     2          3          0 |
3. |     3          0          0 |
4. |     4          0          1 |
+-----------------------------+

and convert back:

. replace response = . if allmissing
(1 real change made, 1 to missing)

. l

+-----------------------------+
| group   response   allmis~g |
|-----------------------------|
1. |     1          3          0 |
2. |     2          3          0 |
3. |     3          0          0 |
4. |     4          .          1 |
+-----------------------------+

Melonie Sullivan

I have data on history of placements into different
-group-s by -youthid-: there are multiple placement
records for each youth. I need to create a variable
equal to the sum of -duration- of all placements into
each -group- for each youth. -collapse (sum)- seems to be
the appropriate procedure, but it treats missing
values as zeroes. This causes a problem if a youth has
only one placement in a given group with unknown
duration. Example:

youthid         group        duration
11                 1            15
11                 1             .
11                 2            31
11                 2            10
11                 5             .
12                 2             5
12                 2             8
12                 4            42
12                 6            55

I create a duration variable for each group (-generate
grp1dur = duration if group==1-, etc.) and -collapse
(sum)- by -youthid- and I want to get this:

youthid   11    12
grp1dur   15     0
grp2dur   41    13
grp3ddur   0     0
grp4dur    0    42
grp5dur    .     0
grp6dur    0    55

But collapse gives me a zero on grp5dur for youth #11,
though youth #11 had placement in that group, albeit
of an unknown duration. The other zeroes are correct;
the youth had zero days in that placement group.

The problem has been addressed here before, best in
the following post by Nick Cox:

http://www.stata.com/statalist/archive/2004-07/msg00783.html

However, this is not solving my particular problem,
because my data essentially looks like a big stack of
Nick's "toy datasets" -- one for each of 1800 youth in
my data. So collapsing by (youthid) gives the same
value of Nick's allmissing for each youth, since the
allmissing tags missing durations for groups within
youths.

```

```*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
```