Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: new variables creation with unbalanced panel

From	Eric Booth <[email protected]>
To	"<[email protected]>" <[email protected]>
Subject	Re: st: new variables creation with unbalanced panel
Date	Wed, 13 Oct 2010 17:34:28 +0000

<>

Nick/Ulrich:

Yes, thanks for the helpful pointers.    

@Nick:   When generating random/fake data I always just open -help random- file and copy and paste the formula from the line:
	" To generate random integers over [a,b], use a+int((b-a+1)*runiform())."
and fill in my a and b.  The laziness on my part of not solving the arithmetic probably only adds to the confusion of others.

@Uli:  Yes, I learned something here.  The use of your "tag" variable in the bysort prefix and the "w" to avoid the complicated looping is very useful.  Thanks.  Hopefully this will be easier to implement for Laura.

- Eric

__
Eric A. Booth
Public Policy Research Institute
Texas A&M University
[email protected]
Office: +979.845.6754
Fax: +979.845.0249
http://ppri.tamu.edu

On Oct 13, 2010, at 11:45 AM, Ulrich Kohler wrote:

> Ok, do you mean then:
> 
> -----------------------------------------example.do
> 
> // Create the data (taken from E. Booth)
> clear
> set obs 20
> set seed 731
> gen str1 ctr  = "A" in 1/5
> replace ctr = "B" in 6/18
> replace ctr = "C" in 19/20
> 
> bys ctr: g year = _n
> gen x = 1+int((10-1+1)*runiform())
> 
> replace x = . in 2/4 
> replace x = . in 7/8
> replace x = . in 12/13
> gen y = 1+int((100-1+1)*runiform())
> 
> // Variable 1
> gen misx = mi(x)
> bysort misx ctr (year): gen diff_x = x-x[_n-1] if !misx
> 
> // Variable 2 
> bysort ctr (year): gen tag = 1 if !misx
> by ctr (year): replace tag = sum(tag)
> bysort ctr tag: gen w = _N
> bysort ctr tag: egen meaninit = mean(y)
> bysort ctr (year): ///
>  gen avg_y = (y + meaninit[_n-1]*w[_n-1])/(w[_n-1]+1) if !mi(x)
> 
> drop tag w meaninit
> 
> // Variable 3
> bysort misx ctr (year): gen initial_x = x[_n-1] if !misx
> 
> // Show results
> sort ctr year
> list 
> 
> -----------------------------------------------------
> 
> 
> This is also not a oneliner for variable 2 but does not require a loop. 
> It calculates the mean as a weighted average of the one value and the
> mean of tagged values. 
> 
> Uli
> 
> 
> 
> 
> 
> 
> Am Mittwoch, den 13.10.2010, 14:51 +0000 schrieb Eric Booth:
>> <>
>> 
>> The biggest challenge in Laura's questions are that, in creating the avg_y and diff_x, there is overlap in use of the observations of x.
>> So,  for observation 5 (for country A, year 5), she wants the difference between A5 and A2 after calculating the difference between A2 and A1 in obs 2.  I found it difficult to allow for this overlap when working down the columns with egen and subscripting.  (If Laura did not need this overlap, you could just create an panel indicator for A1 and A2 and gen the difference & average, and then create an indicator for A3 to A5 and gen the diff and avg, and so on.)  I have no idea about the wisdom of this approach, but it is what Laura has requested in her post.
>> 
>> I think Ulrich Kohler's solution doesn't take into account this overlap, nor will it work when there are a several missing x in a row (as is the case in Laura's first example (and it's still indicated her latest email that "variable x has many missing values while y is available for all years.")
>> 
>> I finally got this to give the output that Laura describes, but the code is not straightforward at all (particularly the answer to question (2) in Laura's email).  To complicate things, I tend to be overly redundant when writing subscripting conditions, so some of the double checking if something is missing or not could probably be removed.   Hopefully other's will have suggestions on how to do this with less effort, but this does produce what Laura describes:
>> 
>> **************!
>> **NOTE: watch for wrapping issues**
>> 
>> clear
>> set obs 20 
>> 
>> 
>> //create fake dataset//
>> g str5 country  = "A" in 1/5
>> replace country = "B" in 6/18
>> replace country = "C" in 19/20
>> 	/*
>> variable x has many missing values while 
>> y is available for all  years.
>> 	*/
>> bys country: g year = _n
>> g x = 1+int((10-1+1)*runiform())
>> 	**create some missings in x**
>> 	replace x = . in 2/4 
>> 	replace x = . in 7/8
>> 	replace x = . in 12/13
>> g y = 1+int((100-1+1)*runiform())
>> 
>> 
>> 
>> 
>> **newvars**
>> g initial_x =.
>> g diff_x = .
>> 
>> 
>> **I created these vars in the 
>> **opposite order than you described them
>> 
>> qui su year
>> forval n = 1/`r(max)' {
>> 
>> /*
>> 3) a variable that for each country, 
>> for each available x takes the value of the 
>> previous observed x.
>> */
>> 
>> bys country: replace initial_x = x[_n-`n']  ///
>> 	if !mi(x[_n-`n']) & mi(initial_x[_n]) &  ///
>> 	!mi(x[_n])
>> 
>> /*
>> 1) a variable that for each country 
>> will equal the difference between two 
>> consecutive available x values.
>> */
>> 	
>> bys country: replace diff_x =  ///
>> 	x[_n] - x[_n-`n'] if !mi(x[_n]) ///
>> 	& mi(diff_x[_n]) 
>> }
>> 
>> 
>> /*
>> 2) a variable that for each country 
>> will take the value the average of all y's 
>> available between  two consecutive 
>> available x values.
>> */
>> g avgy = .
>> 
>> egen pid = group(country year x) if !mi(x)
>> 	**fillin missing**
>> qui su year
>> forval n = 1/`r(max)' {
>> bys country: replace pid = pid[_n-`n'] ///
>> 	if !mi(pid[_n-`n']) & mi(pid[_n])
>> }
>> 
>> levelsof pid, local(panels)
>> token `"`panels'"'
>> while "`1'" != "" & "`2'"!= "" {
>> 	di "var avgy_p`1' for `1':`2'"
>> cap drop i`1'
>> cap drop o
>> g o = 1 if pid[_n] == pid[_n+1] ///
>> 	& mi(x[_n+1])
>> egen avgy_p`1' = mean(y) ///
>> 	if pid==`1' | pid==`2' ///
>> 	& o !=1
>> replace avgy_p`1' = . if pid!=`2'
>> g i`1' = 1 if country[_n]!=country[_n-1] ///
>> 	& !mi(avgy_p`1')
>> qui su i`1'
>> if "`r(max)'"=="1"  drop avgy_p`1' 
>> cap drop i`1'
>> cap replace avgy = avgy_p`1'  ///
>> 	if mi(avgy) & !mi(avgy_p`1') ///
>> 	& !mi(x)
>> cap drop avgy_p`1'
>> cap drop o
>> 		mac shift
>> 		}
>> drop pid
>> l 
>> **************!
>> 
>> 
>> 
>> - Eric 
>> __
>> Eric A. Booth
>> Public Policy Research Institute
>> Texas A&M University
>> [email protected]
>> Office: +979.845.6754
>> 
>> 
>> On Oct 13, 2010, at 2:17 AM, Ulrich Kohler wrote:
>> 
>>> Am Dienstag, den 12.10.2010, 19:41 -0700 schrieb Laura:
>>>> Hi, I am very new at using STATA. Sorry for the previous post, in which the 
>>>> panel looked all wrong.
>>>> I am using an unbalanced panel of the form:
>>>> 
>>>> country year x y diff_x average_y initial_x 
>>>> A 1 xA1 yA1    
>>>> A 2 xA2 yA2 xA2-xA1 AVG(yA2:yA1) xA1 
>>>> A 3  yA3    
>>>> A 4  yA4    
>>>> A 5 xA5 yA5 xA5-xA2 AVG(yA5:yA2) xA2 
>>>> B 1  yB1    
>>>> B 2 xB2 yB2    
>>>> B 3  yB3    
>>>> B 4 xB4 yB4 xB4-xB2 AVG(yB4:YB2) xB2 
>>>> 
>>>> Basically, variable x has many missing values while y is available for all 
>>>> years.
>>>> 
>>>> I need to create 3 types of new variables (as in the last 3 columns of the 
>>>> table):
>>>> 
>>>> 1) a variable that for each country will equal the difference between two 
>>>> consecutive available x values.(diff_x)
>>>> 2) a variable that for each country will take the value the average of all y's 
>>>> available between  two consecutive available x values. (average_y)
>>>> 3) a variable that for each country, for each available x takes the value of the 
>>>> 
>>>> previous observed x. (initial_x)
>>> 
>>> 
>>> Laura,
>>> 
>>> Something like this:
>>> 
>>> ------------------------example.do
>>> clear
>>> input ctr year x y 
>>> 1 1  2 3 
>>> 1 2  5 7 
>>> 1 3  3 4 
>>> 1 4  . 6
>>> 1 5  4 6 
>>> 2 1  . 5    
>>> 2 2  3 2    
>>> 2 3  . 4    
>>> 2 4  1 7 
>>> end
>>> 
>>> tsset ctr year
>>> 
>>> gen diff_x = d1.x
>>> 
>>> by ctr (year), sort: ///
>>> gen isconseq = (x<. & x[_n+1] <.) | (x<. & x[_n-1] <.) 
>>> 
>>> egen avgy = mean(y) if isconseq
>>> 
>>> gen initial_x = l1.x
>>> 
>>> --------------------------------------
>>> 
>>> More details are in -help tsvarlist- and -help egen-. Being new to Stata
>>> I would also recommend to learn about Recoding with -by- and explicit
>>> subscripts. Section 5.1.3 in Kohler/Kreuter, Data Analysis Using Stata,
>>> 2nd edtion is one place to start.
>>> 
>>> Many regards
>>> Uli


*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

References:
- st: new variables creation with unbalanced panel
  - From: Laura <[email protected]>
- Re: st: new variables creation with unbalanced panel
  - From: Ulrich Kohler <[email protected]>
- Re: st: new variables creation with unbalanced panel
  - From: Eric Booth <[email protected]>
- Re: st: new variables creation with unbalanced panel
  - From: Ulrich Kohler <[email protected]>

Prev by Date: st: RE: RE: correlation, panel data
Next by Date: Re: st: gologit2 model
Previous by thread: Re: st: new variables creation with unbalanced panel
Next by thread: Re: st: new variables creation with unbalanced panel
Index(es):
- Date
- Thread