
AW: st: panel data analysis


From   "Martin Weiss" <martin.weiss1@gmx.de>
To   <statalist@hsphsun2.harvard.edu>
Subject   AW: st: panel data analysis
Date   Tue, 15 Jun 2010 17:56:26 +0200

<> 

" First question: try using the xi: i.variable command in your regression
analysis, so, eg:

xi: regression command dependent_var other_var
i.variable_you_want_dummies_for"


Whether you do -xi- as standalone or within the context of another command
should not matter much, though:


***
sysuse auto, clear
xi: reg price weight i.rep78
d _I*
sysuse auto, clear
xi i.rep78
d _I*
***




"etc. alternatively-you could make the groups yourself, so e.g for an
education variable with 3 levels (lower, middle, high etc), you could use
something like: 

gen name_new_var=(education==1)
gen name_new_var2=(education==2)

each of the above statements will create binary variables equal to 1/0,
1=level of education variable indicated
etc"




So the -generate- statements could be replaced by a call to -tab education,
gen(name_new_var)-





"I'm not sure how much stats experience you have-so remember in your
regressions (if you set your dummies manually) to choose a reference
category and leave it out of your equation in order to have something with
which to compare the remaining levels with. If you use the xi command, stata
will do this automatically."



Stata will chuck out the redundant level either way, whether -xi- is
involved in the call or not:

***
sysuse auto, clear
tab rep78 , gen(mydum)
reg price weight mydum?
***



HTH
Martin


-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu
[mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Katya Mauff
Sent: Tuesday, June 15, 2010 17:35
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: panel data analysis

First question: try using the -xi- prefix with i.variable in your regression
analysis, e.g.:

xi: regression_command dependent_var other_var i.variable_you_want_dummies_for

etc. Alternatively, you could make the groups yourself; e.g. for an
education variable with 3 levels (lower, middle, high), you could use
something like:

gen name_new_var=(education==1)
gen name_new_var2=(education==2)

Each of the above statements will create a binary 1/0 variable, equal to 1
when education takes the indicated level, etc.

I'm not sure how much stats experience you have, so remember in your
regressions (if you set your dummies manually) to choose a reference
category and leave it out of your equation, in order to have something
against which to compare the remaining levels. If you use the -xi- command,
Stata will do this automatically.

Second question: you don't seem to understand the objective of doing the
panel data analysis. 
Briefly (and very simplistically): 
You have measurements taken on several occasions for each group, i.e. a
RESPONSE PROFILE for each group. You wish to analyze repeated-measurement
data using a regression approach that models the group effect as a RANDOM
effect.

Why? The chosen groups are assumed to be a random sample from a population
of groups; if the study were repeated, a different sample would be included.

Ordinarily, you would not have repeated measures, and the random error in
your model would be the variance in your response not explained by your
explanatory variables. With repeated measures, you have two different
sources of variation: variation within your groups (i.e. variation in your
response over the repeated measures within firm i), and variation between
your groups.

The fe model looks at your within-subject effects: it centers your responses
around the respective group means. The be model looks at your
between-subject effects: it averages the responses over occasions for each
group.

The re model is a random intercept model: it accounts for both within and
between variation, and allows you to determine how much of the overall
variance in your model is due to variation between your groups, and how much
is due to variation within them. This is the model you want to fit.
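As a sketch of fitting all three (the variable and panel-identifier names
here are placeholders for your own; it assumes your data have been -xtset-):

***
xtset firm year
xtreg y tenure, fe   // within (fixed-effects) estimator
xtreg y tenure, be   // between estimator
xtreg y tenure, re   // random-intercept (GLS) estimator
***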

The estimates from the random-intercept model are weighted averages of the
estimates from the between- and within-subject models. They are more
efficient (i.e. they vary less, as can be seen from the estimated standard
errors) than the estimators obtained from the between or within models,
because the random-intercept model makes use of both between- and
within-subject information. However, the underlying assumption is that all
three models estimate the same true population parameters.

You would not use the reported R^2 values you refer to in order to decide
between the fe/be and re models; all they indicate is the proportion of
between/within/overall variance explained by the included explanatory
variables.

The objective in fitting all three is to determine whether the estimated
between and within effects are the same. If they are not, the assumption
underlying your model is invalid. (Note: the coefficient values of your
variables in each model cannot be used to judge this; do not assume that the
above assumption is violated just because you see different values! The
assumption refers to whether or not the TRUE POPULATION parameters are
equal.)

Testing explicitly whether the between and within effects are the same is a
lengthy process to explain, so I'll do so assuming a simple model with a
single continuous explanatory variable as an example:

Fit your fe, be and re models, calculate the average (mean) value of your
explanatory variable for each group (try: egen mean=mean(var), by(i) ), then
calculate the difference between your var and this mean (e.g. gen
diff=var-mean) for every observation. Then fit an re model with mean and
diff as your new explanatory variables (which represent the between-group
and within-group relationships between the variable and your response,
respectively).

Following the model fit, assuming you are using -xtreg-, try -lincom
diff-mean-, which tests whether the population effects are the same (the
null hypothesis is that they are). If they are significantly different, that
indicates correlation between the explanatory variable and the
subject-specific error term.
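A sketch of the whole procedure, assuming panel identifier -firm-, a single
continuous regressor -x- and response -y- (all placeholder names):

***
xtset firm year
egen xmean = mean(x), by(firm)   // between part: group mean of x
gen xdiff = x - xmean            // within part: deviation from group mean
xtreg y xmean xdiff, re
lincom xdiff - xmean             // H0: between effect = within effect
***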

If you have more than one variable, do this for the first, and keep the
variable only if the null hypothesis is not rejected. Then add the next
variable and repeat the process for it, again keeping it only if the null is
not rejected, and so on.

The alternative is to fit the re model, and then check all your residual
assumptions etc. It is preferable to fit the model with the random effects
rather than without. 

Your assumptions are that your between-group error terms are normal with
mean zero and constant variance, and independent across groups, and that
your e_ij are independently and identically normally distributed with mean
zero and constant variance. Please also note that your response variable is
assumed to be normal for these models; check this (and if it is not, try
log-transforming it). For your question below, to check for
heteroskedasticity, plot your residuals against your fitted values; if the
result looks random, your constant-variance assumption is satisfied.
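A sketch of that residual check using -xtmixed- (placeholder names; -xtreg-
does not provide these predictions):

***
xtmixed y x1 x2 || firm:, mle
predict fit, fitted          // fitted values including the random intercept
predict rstd, rstandard      // standardized residuals
scatter rstd fit, yline(0)   // should look like a patternless cloud
***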

Signs of collinearity include abnormally large standard errors. Also bear in
mind that interaction terms will automatically introduce collinearity into
your model. Check your variables beforehand (try -corr var1 var2-) and think
about which variables are likely to carry the same, or too similar,
information. Multicollinearity does not violate your underlying model
assumptions, but it may invalidate your inference.

If you are using -xtmixed- rather than -xtreg-: it is really meant for
multilevel models, but it will allow you to generate predicted values and
standardized residuals, where -xtreg- will not. It will also allow you to
make model comparisons with the -lrtest- command if you specify the -mle-
option in your model call. The poorness of your model fit may be due to the
inadequacy of the explanatory variables you have currently included; you may
need to add variables, and a means of model comparison would then be
valuable to you.
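For instance, a likelihood-ratio comparison of two nested -xtmixed- models
might look like this (placeholder names; both models must be fit with -mle-):

***
xtmixed y x1 || firm:, mle
estimates store small
xtmixed y x1 x2 || firm:, mle
estimates store large
lrtest small large   // does adding x2 significantly improve the fit?
***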

Regarding your question about the negative intercept, see if you can follow
the interpretation below:

As an example, if you let y_ij be the response for group i at time j, a
random-intercept model with two continuous explanatory variables would look
something like:

y_ij = mu + alpha_i + e_ij

where mu = beta_0 + beta_1*x_1 + beta_2*x_2

In this model, you have fixed and random effects. What I mean by fixed
effects: a fixed effect is an effect whose levels would stay the same should
the experiment be repeated; e.g. if you did the study again, you would look
at the same levels of education that you did here.

Mu is the estimated average response for particular values of x_1 and x_2. 

Beta_0 is your intercept term: the estimated average response holding all
other variables constant (in this case, equal to zero); it can be positive
or negative. Beta_1 and beta_2 are the estimated average changes in your
response for a 1-unit change in x_1 and x_2 respectively. The beta effects
thus represent the "average" effects of the respective variables for the
population of groups as a whole, i.e. regardless of specific group.


The random effects represent the deviation from the "average" fixed effect
due to group. These deviations are represented by the alpha_i, for each
group i.

Two variance components: the random effects alpha_i are normally distributed
with mean zero and constant variance sigma^2_1, and the e_ij (within-group
residuals) are iid normally distributed with mean zero and constant variance
sigma^2_2.

So, alpha_i ~ N(0, sigma^2_1)

      e_ij ~ N(0, sigma^2_2)

If you had a categorical variable with, say, three levels, and you included
dummies for levels 2 and 3 in your model so it became:

y_ij = beta_0 + beta_1*x_1 + beta_2*x_2 + beta_3*level_2_dummy +
beta_4*level_3_dummy + alpha_i + e_ij

then beta_0 would represent the average response for level 1 of your
categorical variable, holding the continuous variables constant; beta_1 and
beta_2 would have the same interpretation as before; and beta_3 and beta_4
would indicate the estimated average change in your response for level 2 vs
level 1 and level 3 vs level 1, respectively.

If you wanted the beta_1 and beta_2 effects to differ across levels of your
categorical variable, you would have to fit interaction terms, determine
their significance and whether their inclusion in the model is justified,
and then interpret them as well.

There is one final option: you may use the -pa- option instead of -re-. Pa
is population-averaged (the same as using -xtgee- models), which means it
does not include random effects and so does not explicitly specify a
subject-specific distribution. The difference lies in what is being
estimated, although the two are often very similar.

The bottom line for someone thinking about using the GEE (or population
average estimator) is to think about whether the averaging procedure makes
sense for the type of inference that you want to make. 

If you want to estimate e.g. how a particular variable affects a specific
group's response, then the re model is better.

If you want to look at how the average group is affected by a change in that
variable, then use the pa model. Sometimes, the results are very similar,
but large variation between groups makes the difference greater. 
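A minimal sketch of the two specifications side by side (placeholder names):

***
xtreg y x, re   // subject-specific (random-intercept) estimates
xtreg y x, pa   // population-averaged (GEE-type) estimates
***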

I would recommend you at least do some background reading on mixed models
before attempting any further analysis.

Hope that helps.

Katya




>>> Danielle Koopmans  06/15/10 12:49 PM >>>
Hello,

it's my first time here and I have some questions. I have a dataset with
variables on 32 firms over a timespan of 10 years; I am examining whether
tenure (of a specific person) and other variables have an influence on the
profitability (y) of firm i.
First I have a question about the xi command, because I have some
categorical variables: I wanted to create dummies for the variable years
(xi i . years) and for education (xi i . education). I tried this command
but nothing happened, not even a note that I did something wrong.

Second, because I have a dataset with cross-sectional data and time series,
I ran panel-data regressions (fe, be and re) with i=firms and t=years. It
looks like this, but with a lot more variables (financial variables,
education dummy, age, etc.):

y        tenure   year t   firm i
8,45     6,614    1995     1
7,39     7,616    1996     1
3,10     0,611    1997     1
9,93     1,633    1998     1
12,39    2,611    1999     1
19,24    3,614    2000     1
0,49     4,614    2001     1
1,13     0,611    2002     1
4,69     1,611    2003     1
9,14     2,614    2004     1
12,64    3,614    2005     1
3,69     2,633    1995     2
7,43     3,636    1996     2
10,30    4,636    1997     2
11,64    5,636    1998     2
10,01    6,636    1999     2


The R^2 differs between the models:

Be

R-sq:  within  = 0.0298
       between = 0.5908
       overall = 0.2349

Re

R-sq:  within  = 0.2412
       between = 0.1820
       overall = 0.2260

Fe

R-sq:  within  = 0.2611
       between = 0.0194
       overall = 0.0007
These results don't seem good to me, but which model should I choose, and
which R^2 do I have to look at: within, between or overall? My constant is
also negative in the fe model; how come?

And how do I check for heteroskedasticity, serial correlation
(Durbin-Watson test?) and collinearity?

Hopefully someone can help me on this. This is all very new to me.
Danielle

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


 

 
