
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: RE: Calculating variable-averages of time-spans (laid out case by case via variables)


From   Wolfgang Feudenheim <[email protected]>
To   [email protected]
Subject   Re: st: RE: Calculating variable-averages of time-spans (laid out case by case via variables)
Date   Fri, 11 Mar 2011 18:05:24 +0100

Nope, I am actually fine with regular spacing.
However, as I did not find any longitudinal example-dataset with regular spacing (which obviously was the cause of a misunderstanding), it might be a good idea to add one of this kind to the library. 

____________
Whatever works! 

I understood you to want a solution that didn't depend on regular spacing. 

Anyway, see also my latest, just sent to the list. 

Nick 
[email protected] 

Am 11.03.2011 um 17:59 schrieb Nick Cox:

> It is not difficult, just a little complicated. 
> 
> Take this solution 
> 
> egen mean_workload = ///
>     mean(cond((year == burnoutyear) | (year == burnoutyear - 1), wks_work, .)), by(idcode)
> 
> Let's take out the middle argument 
> 
> egen mean_workload = mean( !!! ) , by(idcode)
> 
> That says: take the mean of !!!, and do it separately for each identifier given by -idcode-. 
> 
> So, what is !!! 
> 
> cond((year == burnoutyear) | (year == burnoutyear - 1),  wks_work, .)
> 
> -cond()- is a function that here takes three arguments 
> 
> 1. true_or_false question 
> 
> 2. result if answer is true 
> 
> 3. result if answer is false 
> 
> 1. The true_or_false question is 
> 
> (year == burnoutyear) | (year == burnoutyear - 1)
> 
> i.e. is the year the same as the burnout year or the one before. 
> 
> 2. The result if true, i.e. for the observations specified above, is just the variable -wks_work-. 
> 
> 3. The result if false, i.e. for the others, is missing. 
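
A minimal sketch of the three-argument -cond()- described above, using made-up numbers rather than values from the thread's dataset. Each line can be typed directly at the Stata prompt.

```stata
* Condition true (1): the second argument is returned, so this shows 10
display cond(1, 10, .)
* Condition false (0): the third argument is returned, so this shows . (missing)
display cond(0, 10, .)
* Same shape as the expression in the thread, with illustrative constants:
* the year 72 is compared against a hypothetical burnout year of 73
display cond((72 == 73) | (72 == 73 - 1), 51, .)
```

The last line shows 51, because 72 equals 73 - 1 and the condition is therefore true.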
> 
> So -egen- is instructed to take the mean of something that is either the values in the observations you care about, or missing. 
> 
> But it's standard in Stata that missings are just ignored when you do statistics. So, the missings do not interfere with the taking of the mean. They don't influence either the sum of values or the number of values. 
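
To see that missings do not enter the mean, here is a small sketch on invented toy data (the variable names `id` and `x` and all values are assumptions for illustration only).

```stata
clear
input id x
1 51
1  3
1  .
2 14
2 21
2  .
end
* The mean is taken over nonmissing values only: 2 values per id, not 3
egen m = mean(x), by(id)
list, sepby(id)
```

With these toy values `m` is (51 + 3)/2 = 27 for `id` 1 and (14 + 21)/2 = 17.5 for `id` 2, and it is filled in on every observation of each group, including those where `x` is missing.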
> 
> And as there is no -if- qualifier here, the mean ends up being assigned to all the observations in each panel. 
> 
> -egen- repeats all this for each panel. That is what -egen, by()- does. 
> 
> The manual doesn't document -by()- as an option, but it works. A more manual-like solution is this: 
> 
> bysort idcode: egen mean_workload = ///
>     mean(cond((year == burnoutyear) | (year == burnoutyear - 1), wks_work, .))
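
A short check that the two spellings agree, run as a do-file on an invented toy panel. The data values below are made up for this sketch and are not read from nlswork.dta; the names `mean_a` and `mean_b` are hypothetical.

```stata
clear
input idcode year wks_work burnoutyear
1 70 27 73
1 71 10 73
1 72 51 73
1 73  3 73
2 71 14 72
2 72 21 72
2 73 40 72
end
* Variant 1: by() given as an option of -egen-
egen mean_a = ///
    mean(cond((year == burnoutyear) | (year == burnoutyear - 1), wks_work, .)), by(idcode)
* Variant 2: the by: prefix
bysort idcode: egen mean_b = ///
    mean(cond((year == burnoutyear) | (year == burnoutyear - 1), wks_work, .))
* Both variants should give the same value on every observation
assert mean_a == mean_b
list, sepby(idcode)
```

On this toy panel both variables hold 27 for `idcode` 1 (years 72 and 73) and 17.5 for `idcode` 2 (years 71 and 72).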
> 
> Nick 
> [email protected] 
> 
> 
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Wolfgang Feudenheim
> Sent: 11 March 2011 16:14
> To: [email protected]
> Subject: Re: st: RE: Calculating variable-averages of time-spans (laid out case by case via variables)
> 
> Alright,
> 
> thanks a lot for these helpful comments. I went for your "at length" solution, Robert. Thanks also for your explanations, Nick; the tagging bit especially is fantastic, since I had not come across that yet. It will take some more time to figure out how these one-liners work, though...
> 
> I will apply this technique to my economic dataset right away.
> 
> Am 11.03.2011 um 14:39 schrieb Robert Picard:
> 
>> Here's another one liner:
>> 
>> bysort idcode (year): egen m = ///
>>  sum(((wks_work + wks_work[_n-1]) / 2) * (year == burnoutyear))
>> 
>> Robert
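
The one-liner above works because (year == burnoutyear) evaluates to 1 or 0, so only the burnout-year observation contributes to the total. It averages the burnout-year observation with the observation just before it in -year- order, whatever year that is. The sketch below is a swapped-in variant, not Robert's code: it does the subscripting in -generate- rather than inside -egen-, since the -egen- help advises against explicit subscripts there. The toy data and the name `pairmean` are assumptions for illustration.

```stata
clear
input idcode year wks_work burnoutyear
1 70 27 73
1 71 10 73
1 72 51 73
1 73  3 73
2 71 14 72
2 72 21 72
2 73 40 72
end
* Average the burnout-year value with the immediately preceding observation
bysort idcode (year): gen double pairmean = ///
    (wks_work + wks_work[_n-1]) / 2 if year == burnoutyear
* Missing sorts last, so the first value in each panel is the nonmissing one
bysort idcode (pairmean): replace pairmean = pairmean[1]
list, sepby(idcode)
```

If the burnout year is the first observation of a panel, `wks_work[_n-1]` is missing and so is `pairmean` for that panel.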
>> 
>> 
>> On Fri, Mar 11, 2011 at 2:25 PM, Nick Cox <[email protected]> wrote:
>>> Here is another way to do it:
>>> 
>>> 
>>> egen mean_workload = ///
>>>     mean(cond((year == burnoutyear) | (year == burnoutyear - 1), wks_work, .)), by(idcode)
>>> 
>>> Nick
>>> [email protected]
>>> 
>>> Nick Cox
>>> 
>>> I have no idea what a "burnout" is here, but I guess I don't need to know. Just curious, though....
>>> 
>>> The context is that in this dataset -idcode- and -year- are joint identifiers.
>>> 
>>> You want to identify pairs of observations that (1) have the same -idcode- and (2) are the burnout-year or the one before:
>>> 
>>> gen tag = (year == burnoutyear) | (year == (burnoutyear - 1))
>>> 
>>> Then you need to average within those groups
>>> 
>>> egen mean_workload = mean(wks_work) if tag, by(idcode)
>>> 
>>> and spread to all values within that -idcode-
>>> 
>>> bysort idcode (mean_workload) : replace mean_workload = mean_workload[1]
>>> 
>>> This would work too (two lines instead of three)
>>> 
>>> gen tag = (year == burnoutyear) | (year == (burnoutyear - 1))
>>> egen mean_workload2 = mean(wks_work/tag) , by(idcode)
>>> 
>>> That's slightly cute or perverse, according to taste. If you divide by 0, the result is missing and will be ignored by -egen, mean()-. Dividing by 1 manifestly leaves values as they are.
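
The division trick can be checked at the prompt with any number; 51 here is just an illustrative value.

```stata
* tag == 1: dividing by 1 leaves the value unchanged, so this shows 51
display 51/1
* tag == 0: division by zero yields missing in Stata, so this shows .
display 51/0
```

No error is raised for the division by zero; the result is simply missing, which is why -egen, mean()- then skips it.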
>>> 
>>> A possible reduction to one line follows, as an exercise!
>>> 
>>> Note that what you asked for was the mean across all the observations that satisfied the criteria. You didn't spell out to Stata that you wanted the calculation done separately by -idcode- (as above).
>>> 
>>> Looking at the data suggests that something else was wrong too with what you asked. Note that -egen- doesn't guarantee to keep the same -sort- order within its operations, just to return the data to the same -sort- order as when it started. So, it is unwise to assume otherwise.
>>> 
>>> I note that the mean of a sum is that sum, not the mean of the constituent values.
>>> 
>>> Nick
>>> [email protected]
>>> 
>>> Wolfgang Feudenheim
>>> 
>>> I am currently working on an analysis of economic data in OECD-countries. For each country, I separately fixed a key-year. For this specific year and the two preceding years I want to read out averages of economic indicators such as "GDP/capita" etc.
>>> 
>>> In the following, I try to illustrate my problems with the help of the example dataset
>>> "National Longitudinal Survey.  Young Women 14-26 years of age in 1968" by pretending I was interested in the average workload before the occurrence of a burnout (sorry, I couldn't come up with a more positive scenario ...). Unfortunately, the time data are not available on a year-by-year basis but in irregular steps. Therefore, I just observe one specific year and its preceding year. I am running my analysis on
>>> 
>>> -Stata/IC 11.1 for Mac (64-bit Intel)
>>> -Born 04 Nov 2010
>>> 
>>> Here is the code:
>>> 
>>> use http://www.stata-press.com/data/r11/nlswork.dta, clear
>>> *Add Burnout-Values to Dataset*
>>> gen burnoutyear=.
>>> replace burnoutyear=73 if idcode==1
>>> replace burnoutyear=72 if idcode==2
>>> 
>>> *Generate Variable for all observations of one person (idcode) that presents the average of weeks worked in burnout-year and*
>>> *burnout-preceding year*
>>> egen avworkload_b=mean(wks_work[_n]+wks_work[_n-1]) if (year==burnoutyear)&(idcode[_n]==idcode[_n-1])
>>> 
>>> 
>>> The problems that occur are the following:
>>> 
>>> 1. For both "idcode==1" and "idcode==2", the wrong result, namely "25", is displayed. The average values should, however, be "27" and "17.5".
>>> 
>>> 2. The variable "avworkload_b" is only inserted into the dataset for the year indicated by "burnoutyear" for the respective "idcode". I want to have this value displayed for all years of each "idcode".
>>> 
> 
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
> 



