Stata The Stata listserver
[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

st: RE: average value among differing numbers of variables

From   "Nick Cox" <>
To   <>
Subject   st: RE: average value among differing numbers of variables
Date   Thu, 17 Jul 2003 12:56:25 +0100

Radu Ban

> This is a data management question. The data that I'm
> looking at (daily
> U.S. weather) has the following structure.
> 	day1	flag1	day2	flag2	day3	flag3	day4
> flag4	day5	flag5	...	day31	flag31
> 	0	s	a2	a	0	s	0
> s	a5	a		a31
> 	0	s	0	s	b3	a	b4
> 	b5			b31
> 	c1		0	s	0	s	0
> s	c5	a		c31
> the "s" flag means that the measured element (say inches of
> rain) is
> accumulated over those days, which are assigned a 0 value, and the
> accumulated amount is reported in the day flagged with "a".
> i would like to
> replace the 0 value for the accumulation days with the
> average of the
> accumulated value over those days.
> given the notations above, specifically, i would like to
> replace 0, 0, b3
> (in the second row) with b3/3; 0, 0, 0, c5 (in the third
> row) with c5/4,
> and so on. note that, as in the first row there can be more
> than one
> accumulation series per row.
> i figured out that each type of accumulation, a_ij(starting
> at day i ending
> at day j) must be identified, so that in the end i can use:
> forval j = 2/31 {
> 	forval i = 1/`j' {
> 		egen daymean = rmean(day`i'-day`j') if a_`i'`j' == 1
> 		replace day`i' = daymean
> 		drop daymean
> 	}
> }
> but i'm not sure how to define all a_ij

Ernest Berkhout

That looks indeed like a rather complex problem, for which i don't
have a
golden key. The problem with your present setup is that for instance
rmean(day1-day2) will not work because flag1 is in that list as well.

My approach would be to first reshape the data to long form (which is
comfortable to most datamanipulation issues), so that you have the
variables day (which should better be renamed 'temperature' or
flag, an indicator for day 1 to 31, and an indicator for a/b/c.

 From there it should be possible to construct some group-variable
indicates each group of records where temperature should be filled in
their accumulated value from which it should be calculated. In your
on the second row you have a group where day1, day2 and day3 belong
together, day4 is a group and day5 is a group. Let your grouping
assign a value of 1 to the first 3 records (day1, day2, day3), a value
of 2
to the record of day4, etc.

 From there you could work somethinh out using the 'by groupvar:'
and explicit subscripting, as in "by groupvar: replace
temperature=temperature[_N] if temperature<." You might want to
accumulated temperatures to averages before this.

There are probably a lot of other ways to tackle the problem, but i
the reshape-to-long is the most important thing here.

NJC >>> I endorse Ernest's suggestion. The existence of the -egen,
functions is just about the only exception to the Stata principle that
doing it long is easier than doing it wide.

I presume that each observation is a month's worth of rainfall
data. I set on one side the small issue of what happens if the
month contains 28, 29, 30 days, probably no issue.

I created something with I think is the same structure
like this:

set obs 100
forval i = 1/31 {
	gen str flag`i' = cond(uniform() < 0.7, "s", "a")
	gen day`i' = cond(flag`i' == "s", 0, int(10*uniform()))

You need a unique identifier if you don't have one

gen id = _n

and then it's a common-or-garden -reshape-

reshape long day flag , i(id)

The variable names are now a little unclear,
so I clean up

rename day rainfall
rename _j day

Now the averaging is to be done for each
block of observations ending with a "s" flag.
I look at -tsspell-, which is my favourite
tool for making spells, Hogwarts training
notwithstanding, and the nearest equivalent is an option
for identifying spells based on something happening
in the _first_ observation of each spell (like
a birth, earthquake, eruption, election, Stata release...).
That will serve fine so long as we look _backwards_
through time:

gen seq = - _n
tsset seq

-- the -tsset- being necessary for -tsspell- to apply --

tsspell, fcond(flag == "a")

-tsspell- leaves behind three variables, of
which _spell is of direct use here:

egen mean = mean(rainfall), by(_spell)

Now reverse time to its proper order

sort id day

The variables created by -tsspell- may have
other uses.

-tsspell- is on SSC.


P.S. 1 inch = 25.4 mm. I suspect the USA is the
only country still measuring rainfall in inches.

*   For searches and help try:

© Copyright 1996–2017 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index