Stata The Stata listserver
[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

Re: st: average value among differing numbers of variables


From   wgould@stata.com (William Gould, Stata)
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: average value among differing numbers of variables
Date   Thu, 17 Jul 2003 09:13:36 -0500

Radu Ban <rban@nber.org> writes, 

> This is a data management question. The data that I'm looking at (daily U.S.
> weather) has the following structure.
>
>   day1 flag1 day2 flag2 day3 flag3 day4 flag4 day5 flag5 ... day31 flag31
>      0     s   a2     a    0     s    0     s   a5     a       a31  
>      0     s    0     s   b3     a   b4         b5             b31
>     c1          0     s    0     s    0     s   c5     a       c31
>
> the "s" flag means that the measured element (say inches of rain) is
> accumulated over those days, which are assigned a 0 value, and the
> accumulated amount is reported in the day flagged with "a". i would like to
> replace the 0 value for the accumulation days with the average of the
> accumulated value over those days.
>
> given the notations above, specifically, i would like to replace 0, 0, b3
> (in the second row) with b3/3; 0, 0, 0, c5 (in the third row) with c5/4, and
> so on. note that, as in the first row there can be more than one
> accumulation series per row.

I do now know what the solution to this problem is yet -- I will -- but I 
do know these kinds of problems are easier viewed the long way:

      orig_obs   i    day   flag
             1   1      0      s
             1   2     a2      a
             1   3      0      s
             1   4      0      s
             1   5     a5      a
           ...
             1  31    a31      
             2   1      0      s
             2   2      0      s
             2   3     b3      a
             2   4      0      s
             2   5     c5      a
           ...
             2  31    c31
           etc.

-reshape- can make the data look like that -- we'll worry about the details 
later.

I am unsure from Radu's description whether variable flag ever contains
anything on that "a" and "s".  Radu sort of implies " " is also possible, 
in which case the measurement would be for that day.  If so, I could 
view that as an "a" observation:  One accumulates the single day and 
divides by one.  So just in case there are any blanks, 

        . replace flag = "a" if flag=="s"

and then, just to verify that these data are as they have been explained to 
be, let's verify that flag is now always "s" or "a":

        . assert flag=="s" | flag=="a"

Now the problem is getting easier:

    1.  We start accumulation at the first observation for each orig_obs 
        group.

    2.  We continue accumulation up until we see an "a".

    3.  Then, we start accumulation again.

So let's add a new variable begin that will record 1 every time an accumulation
begins:

        . by orig_obs: gen begin = cond(_n==1 | flag[_n-1]=="a", 1, 0)

Now our dataset looks like, 

      orig_obs   i    day   flag   begin
             1   1      0      s       1
             1   2     a2      a       0
             1   3      0      s       1
             1   4      0      s       0
             1   5     a5      a       0
           ...
             1  31    a31      a       <could be 0 or 1>
             2   1      0      s       1
             2   2      0      s       0
             2   3     b3      a       0
             2   4      0      s       1
             2   5     c5      a       0
           ...
             2  31    c31      a       <could be 0 or 1>

Understand what I did:  I merely created a variable equal to 1 marking 
the beginning of each group in which we need to distribute sum.  If I now 
sum variable begin, I will have group numbers:

        . gen group = sum(begin)

The data set now looks like:

      orig_obs   i    day   flag   begin    group
             1   1      0      s       1        1
             1   2     a2      a       0        1
             1   3      0      s       1        2
             1   4      0      s       0        2
             1   5     a5      a       0        2
           ...
             1  31    a31      a       ?       10  <- i just made up 10
             2   1      0      s       1       11
             2   2      0      s       0       11
             2   3     b3      a       0       11
             2   4      0      s       1       12
             2   5     c5      a       0       12
           ...
             2  31    c31      a       ?        ?
           etc.

Now the problem is easy:  within group, replace all the observations 
with the value of the last observation in the group, divided by the 
number of observations in the group:

        . sort group
        . by group: gen newday = day[_N]/_N

I have my solution.  Actually, I could have typed 

         . by group: replace day = day[N]/_N

but I want to -list- the result and make sure it looks right to me.  Then 
I can replace day:

         . replace day = newday
         . drop newday

Now all I need to do is switch the data back to being in the wide format.

So here is the complete solution:

        . gen orig_obs = _n
        . reshape long day flag, i(i)

        . replace flag = "a" if flag=="s"
        . assert  flag=="s" | flag=="a"

        . sort orig_obs i
        . by orig_obs: gen begin = cond(_n==1 | flag[_n-1]=="a", 1, 0)
        . gen group = sum(begin)

        . sort group
        . by group: gen newday = day[_N]/_N
	. list                                <-- look to make sure right
        . replace day = newday
        . drop newday group

        . reshape wide

-- Bill
wgould@stata.com
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index