# Re: st: average value among differing numbers of variables

 From wgould@stata.com (William Gould, Stata) To statalist@hsphsun2.harvard.edu Subject Re: st: average value among differing numbers of variables Date Thu, 17 Jul 2003 09:13:36 -0500

> This is a data management question. The data that I'm looking at (daily U.S.
> weather) has the following structure.
>
>   day1 flag1 day2 flag2 day3 flag3 day4 flag4 day5 flag5 ... day31 flag31
>      0     s   a2     a    0     s    0     s   a5     a       a31
>      0     s    0     s   b3     a   b4         b5             b31
>     c1          0     s    0     s    0     s   c5     a       c31
>
> the "s" flag means that the measured element (say inches of rain) is
> accumulated over those days, which are assigned a 0 value, and the
> accumulated amount is reported in the day flagged with "a". i would like to
> replace the 0 value for the accumulation days with the average of the
> accumulated value over those days.
>
> given the notations above, specifically, i would like to replace 0, 0, b3
> (in the second row) with b3/3; 0, 0, 0, c5 (in the third row) with c5/4, and
> so on. note that, as in the first row there can be more than one
> accumulation series per row.

I do now know what the solution to this problem is yet -- I will -- but I
do know these kinds of problems are easier viewed the long way:

orig_obs   i    day   flag
1   1      0      s
1   2     a2      a
1   3      0      s
1   4      0      s
1   5     a5      a
...
1  31    a31
2   1      0      s
2   2      0      s
2   3     b3      a
2   4      0      s
2   5     c5      a
...
2  31    c31
etc.

-reshape- can make the data look like that -- we'll worry about the details
later.

I am unsure from Radu's description whether variable flag ever contains
anything on that "a" and "s".  Radu sort of implies " " is also possible,
in which case the measurement would be for that day.  If so, I could
view that as an "a" observation:  One accumulates the single day and
divides by one.  So just in case there are any blanks,

. replace flag = "a" if flag=="s"

and then, just to verify that these data are as they have been explained to
be, let's verify that flag is now always "s" or "a":

. assert flag=="s" | flag=="a"

Now the problem is getting easier:

1.  We start accumulation at the first observation for each orig_obs
group.

2.  We continue accumulation up until we see an "a".

3.  Then, we start accumulation again.

So let's add a new variable begin that will record 1 every time an accumulation
begins:

. by orig_obs: gen begin = cond(_n==1 | flag[_n-1]=="a", 1, 0)

Now our dataset looks like,

orig_obs   i    day   flag   begin
1   1      0      s       1
1   2     a2      a       0
1   3      0      s       1
1   4      0      s       0
1   5     a5      a       0
...
1  31    a31      a       <could be 0 or 1>
2   1      0      s       1
2   2      0      s       0
2   3     b3      a       0
2   4      0      s       1
2   5     c5      a       0
...
2  31    c31      a       <could be 0 or 1>

Understand what I did:  I merely created a variable equal to 1 marking
the beginning of each group in which we need to distribute sum.  If I now
sum variable begin, I will have group numbers:

. gen group = sum(begin)

The data set now looks like:

orig_obs   i    day   flag   begin    group
1   1      0      s       1        1
1   2     a2      a       0        1
1   3      0      s       1        2
1   4      0      s       0        2
1   5     a5      a       0        2
...
1  31    a31      a       ?       10  <- i just made up 10
2   1      0      s       1       11
2   2      0      s       0       11
2   3     b3      a       0       11
2   4      0      s       1       12
2   5     c5      a       0       12
...
2  31    c31      a       ?        ?
etc.

Now the problem is easy:  within group, replace all the observations
with the value of the last observation in the group, divided by the
number of observations in the group:

. sort group
. by group: gen newday = day[_N]/_N

I have my solution.  Actually, I could have typed

. by group: replace day = day[N]/_N

but I want to -list- the result and make sure it looks right to me.  Then
I can replace day:

. replace day = newday
. drop newday

Now all I need to do is switch the data back to being in the wide format.

So here is the complete solution:

. gen orig_obs = _n
. reshape long day flag, i(i)

. replace flag = "a" if flag=="s"
. assert  flag=="s" | flag=="a"

. sort orig_obs i
. by orig_obs: gen begin = cond(_n==1 | flag[_n-1]=="a", 1, 0)
. gen group = sum(begin)

. sort group
. by group: gen newday = day[_N]/_N
. list                                <-- look to make sure right
. replace day = newday
. drop newday group

. reshape wide

-- Bill
wgould@stata.com
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/