Re: st: panel data analysis

From   "Katya Mauff" <>
To   <>
Subject   Re: st: panel data analysis
Date   Tue, 15 Jun 2010 17:35:01 +0200

First question: try using the xi: prefix with i.variable in your regression command, e.g.:

xi: regression command dependent_var other_var  i.variable_you_want_dummies_for

Alternatively, you could make the groups yourself. For example, for an education variable with 3 levels (lower, middle, high), you could use something like:

gen name_new_var=(education==1)
gen name_new_var2=(education==2)

Each of the above statements creates a binary (0/1) variable that equals 1 when education is at the indicated level and 0 otherwise.

I'm not sure how much stats experience you have, so remember in your regressions (if you set your dummies manually) to choose a reference category and leave it out of your equation, so that you have something to compare the remaining levels with. If you use the xi command, Stata will do this automatically.
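To make the two routes concrete, here is a minimal sketch on simulated data. The variable names (y, x, education, edu_mid, edu_high) and the simulated coefficients are made up for illustration; they are not from the original dataset.

```stata
* Sketch only: simulated data, hypothetical variable names
clear
set seed 12345
set obs 300
gen education = 1 + floor(3*runiform())   // 1 = lower, 2 = middle, 3 = high
gen x = rnormal()
gen y = 1 + 0.5*x + 0.3*(education==2) + 0.8*(education==3) + rnormal()

* Route 1: xi builds the dummies and omits the lowest level (1) as reference
xi: regress y x i.education

* Route 2: build the dummies by hand and leave level 1 out yourself.
* The "if" keeps a missing education missing instead of coding it as 0.
gen edu_mid  = (education==2) if !missing(education)
gen edu_high = (education==3) if !missing(education)
regress y x edu_mid edu_high
```

Both regressions should report the same coefficients, since in each case level 1 of education is the omitted reference category.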

Second question: you don't seem to understand the objective of doing the panel data analysis. 
Briefly (and very simplistically): 
You have measurements taken at several occasions for each group = a RESPONSE PROFILE for each group. You wish to analyze repeated measurement data using a regression approach that models the group effect as a RANDOM effect. 

Why? -The chosen groups are assumed to be a random sample from a population of groups. If the study had to be repeated, a different sample would be included.

Ordinarily, you would not have repeated measures, and the random error in your model would be the variance in your response that has not been explained by your explanatory variables. With repeated measures, you have two different sources of variation: variation within your groups (i.e. variation in your response over the repeated measures within firm i), and variation between your groups.

The fe model looks at your within subject effects- it centers your responses around the respective group means. The be model looks at your between subject effects,-it averages the responses over occasion for each group. 

The re model is a random intercept model: it accounts for both within and between variation, and allows you to determine how much of the overall variance in your model is due to variation between your groups, and how much is due to variation within them. This is the model you want to fit.

The estimates from the random-intercept model are weighted averages of the estimates from the between and within subject models. These estimates are more efficient (i.e. vary less, as can be seen from the estimated standard errors) than those obtained from the between or within models, because they make use of both between- and within-subject information. However, the underlying assumption is that all the models estimate the same true population parameters.

You would not use the reported R^2 values you are referring to in order to decide between the fe/be and re models-all they are indicating is the proportion of between/within/overall variance explained by the included explanatory variables. 

The objective in fitting all three is to determine whether the estimated between and within effects are the same or not. If they are not, the assumption underlying your re model is invalid. (Note: the coefficient values of your variables in each model cannot by themselves show whether the effects differ. Don't assume that the above assumption is violated just because you get different values! The assumption refers to whether or not the TRUE POPULATION parameters are equal.)

Testing explicitly whether the between and within effects are the same is a lengthy process to explain. I've done so assuming a simple model with a single continuous explanatory variable as an example:

Fit your fe, be and re models, calculate the average (mean) value of your explanatory variable for each group (try: egen mean=mean(var), by(i) ), then calculate the difference between your var and this mean (e.g. gen diff=var-mean) for every observation. Then fit an re model, with the mean and diff as your new explanatory variables (which represent your between group relationship between the variable and your response and the within group relationship ... respectively). 

Following the model fit, assuming you are using xtreg, try (lincom diff-mean), which tests whether the population effects are the same (the null hypothesis is that they are). If they are significantly different, it would indicate correlation between the explanatory variable and the subject-specific error term.
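The steps above can be sketched as follows. This is an illustration on simulated data, not the original dataset: the names firm, year, tenure, y, tenure_mean and tenure_diff are assumptions, and the data are generated so that the firm effect is unrelated to tenure.

```stata
* Sketch only: simulated panel of 32 firms x 10 years, hypothetical names
clear
set seed 2010
set obs 32
gen firm = _n
gen alpha = rnormal(0, 2)              // firm-level random intercept
expand 10
bysort firm: gen year = 1994 + _n      // 1995 to 2004
gen tenure = 10*runiform()
gen y = 5 + 0.8*tenure + alpha + rnormal(0, 3)
xtset firm year

* The three models discussed above
xtreg y tenure, fe
xtreg y tenure, be
xtreg y tenure, re

* Split tenure into its firm mean (between part) and the
* deviation from that mean (within part)
egen tenure_mean = mean(tenure), by(firm)
gen tenure_diff = tenure - tenure_mean

* re model with both parts, then test whether the two effects are equal
xtreg y tenure_mean tenure_diff, re
lincom tenure_diff - tenure_mean
```

lincom reports the estimated difference with its standard error and p-value; a small p-value would be evidence against equal between and within effects.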

If you have more than one variable, do this for the first, and keep the variable only if the null hypothesis is not rejected. Add the next variable, and repeat the above process for this new variable, keep only if etc... 

The alternative is to fit the re model, and then check all your residual assumptions etc. It is preferable to fit the model with the random effects rather than without. 

Your assumptions are that your between-group error terms are normal, mean zero, constant variance, and independent for different groups, and that your e_ij are independently and identically normally distributed, mean zero, constant variance. Please also note that your response variable is assumed to be normal for these models. Check this (if it is not, try log-transforming it). For your question below about heteroskedasticity: plot your residuals vs your fitted values; if the scatter shows no pattern, your constant variance assumption looks reasonable.

Signs of collinearity include abnormally large standard errors-also, bear in mind that interaction terms will automatically introduce collinearity in your model. Check your variables before- (try corr var1 var2) and think about which variables are likely to represent the same/too similar information. Multicollinearity does not violate your underlying model assumptions-but it may invalidate your inference. 

xtmixed is really more for multilevel models than xtreg is, but it will also fit this model, and its predict command gives you standardized residuals as well as fitted values, whereas predict after xtreg gives only fitted values and raw residuals. It will also allow you to make model comparisons using the lrtest command if you specify the mle option in your model call. The poorness of your model fit may be due to the inadequacy of the explanatory variables you have currently included. You may need to add variables, and a means of model comparison would then be valuable to you.
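As a rough sketch of the xtmixed route, the code below fits two nested random-intercept models by maximum likelihood, compares them with lrtest, and draws the residuals-vs-fitted plot mentioned earlier. The data are simulated and every variable name (firm, year, tenure, age, y, fit, rs) is hypothetical. In recent Stata releases the command is called mixed, with the same model syntax.

```stata
* Sketch only: simulated panel, hypothetical variable names
clear
set seed 2010
set obs 32
gen firm = _n
gen alpha = rnormal(0, 2)              // firm-level random intercept
expand 10
bysort firm: gen year = 1994 + _n
gen tenure = 10*runiform()
gen age = 30 + 20*runiform()
gen y = 5 + 0.8*tenure + 0.1*age + alpha + rnormal(0, 3)

* Two nested random-intercept models, both fit by maximum likelihood
xtmixed y tenure || firm:, mle
estimates store small
xtmixed y tenure age || firm:, mle
estimates store big

* Likelihood-ratio test: does adding age improve the fit?
lrtest small big

* Fitted values and standardized residuals from the larger model
predict fit, fitted
predict rs, rstandard
scatter rs fit, yline(0)
```

A funnel or curve in the scatter would suggest that the constant-variance assumption or the mean structure needs another look.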

Regarding your question about the negative intercept, see if you can follow the interpretation below:

As an example, if you let y_ij be the response for group i at time j, a random intercept model with two continuous explanatory variables would look something like:

y_ij = mu + alpha_i + e_ij

where mu = beta_0 + beta_1*x_1 + beta_2*x_2

In this model, you have fixed and random effects. What I mean by fixed effects: a fixed effect is an effect whose levels will stay the same should the experiment be repeated, e.g. if you did the study again, you would look at the same levels of education that you did here.

Mu is the estimated average response for particular values of x_1 and x_2. 

Beta_0 is your intercept term (the estimated average response when all the explanatory variables are equal to zero; it can be positive or negative), and beta_1 and beta_2 are the estimated average changes in your response for a 1 unit change in x_1 and x_2 respectively. The beta effects thus represent the "average" effects of the respective variables for the population of groups as a whole, i.e. regardless of specific group.

The random effects represent the deviation from the "average" fixed effect due to group. These deviations are represented by the alpha_i, for each group i.

Two variance components: The random effects alpha_i are normally distributed, with mean zero, and constant variance sigma^2_1, and the e_ij (within group residuals) are iid normally distributed, with mean zero and constant variance sigma^2_2. 

So, alpha_i ~ N(0, sigma^2_1)

      e_ij ~ N(0, sigma^2_2)
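To see the two variance components in practice, the sketch below fits an re model to simulated data and reads the estimates back from xtreg's stored results. In xtreg's output, sigma_u is the estimated standard deviation of the alpha_i and sigma_e that of the e_ij. The data and variable names are made up for illustration.

```stata
* Sketch only: simulated panel, hypothetical variable names
clear
set seed 2010
set obs 32
gen firm = _n
gen alpha = rnormal(0, 2)              // true sd of alpha_i is 2
expand 10
bysort firm: gen year = 1994 + _n
gen tenure = 10*runiform()
gen y = 5 + 0.8*tenure + alpha + rnormal(0, 3)   // true sd of e_ij is 3
xtset firm year

xtreg y tenure, re

* Estimated standard deviations of the two components
display "sigma_u (between firms) = " e(sigma_u)
display "sigma_e (within firms)  = " e(sigma_e)

* Share of the total variance that is due to differences between firms
display "between-firm share      = " e(sigma_u)^2 / (e(sigma_u)^2 + e(sigma_e)^2)
```

The last quantity is what xtreg reports as rho in its output table.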

If you had a categorical variable, with say three levels, and you included levels 2 and 3 in your model so it became:

y_ij = beta_0 + beta_1*variable_1 + beta_2*variable_2 + beta_3*var_3_level_2 + beta_4*var_3_level_3 + alpha_i + e_ij

beta_0 would represent the average response for level 1 of your categorical variable, holding the continuous variables constant, beta_1 and beta_2 would have the same interpretation as before, but for level 1 of your categorical variable only, and beta_3 and beta_4 would indicate the estimated average change in your response for level_2 vs level 1 etc. 

If you wanted the beta_1 and beta_2 effects for different levels of your categorical variable, you would have to fit interaction terms, determine their significance etc and whether their inclusion in the model is justified/not, and then interpret them as well. 

There is one final option: you may use the pa option, instead of the re option. Pa is population averaged, (the same as using xtgee models), which means that it does not explicitly specify the distribution of the population, because it does not include random effects. The difference lies in what is being estimated, although the two are often very similar. 

The bottom line for someone thinking about using the GEE (or population average estimator) is to think about whether the averaging procedure makes sense for the type of inference that you want to make. 

If you want to estimate e.g. how a particular variable affects a specific group's response, then the re model is better.

If you want to look at how the average group is affected by a change in that variable, then use the pa model. Sometimes, the results are very similar, but large variation between groups makes the difference greater. 
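For comparison, here is a small sketch that fits the re and pa versions side by side on simulated data, together with the xtgee call that the pa option corresponds to under its default settings (Gaussian family, identity link, exchangeable correlation). All variable and scalar names are hypothetical.

```stata
* Sketch only: simulated panel, hypothetical variable names
clear
set seed 2010
set obs 32
gen firm = _n
gen alpha = rnormal(0, 2)              // firm-level random intercept
expand 10
bysort firm: gen year = 1994 + _n
gen tenure = 10*runiform()
gen y = 5 + 0.8*tenure + alpha + rnormal(0, 3)
xtset firm year

* Random-intercept model
xtreg y tenure, re
scalar b_re = _b[tenure]

* Population-averaged model
xtreg y tenure, pa
scalar b_pa = _b[tenure]

* The same population-averaged model written as a GEE
xtgee y tenure, family(gaussian) link(identity) corr(exchangeable)
scalar b_gee = _b[tenure]

display "re: " b_re "   pa: " b_pa "   xtgee: " b_gee
```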

I would recommend you at least do some background reading on mixed models before attempting any further analysis.

Hope that helps.


>>> Danielle Koopmans  06/15/10 12:49 PM >>>

it's my first time here and I have some questions. I have a dataset
with variables of 32 firms over a timespan of 10 years, I am examining
whether tenure (of a specific person) and other variables have an
influence on the profitability (y) of firm i.
First I have a question about the xi command because I have some
categorical variables: I wanted to create dummies for the variable
years xi i . years and for education xi i . education, I tried this
command but nothing happened, not even a note that I did something

Second, because I have a dataset with cross-section data and
timeseries I ran a paneldata regression (fe, be and re) with i=firms
and t=years. It looks like this but then with a lot more variables
like financial variables, education dummy, age etc  :

y        tenure   year t   firm i
8,45     6,614    1995     1
7,39     7,616    1996     1
3,10     0,611    1997     1
9,93     1,633    1998     1
12,39    2,611    1999     1
19,24    3,614    2000     1
0,49     4,614    2001     1
1,13     0,611    2002     1
4,69     1,611    2003     1
9,14     2,614    2004     1
12,64    3,614    2005     1
3,69     2,633    1995     2
7,43     3,636    1996     2
10,30    4,636    1997     2
11,64    5,636    1998     2
10,01    6,636    1999     2

The R^2 differs between the models:


R-sq:  within  = 0.0298
       between = 0.5908
       overall = 0.2349


R-sq:  within  = 0.2412
       between = 0.1820
       overall = 0.2260


R-sq:  within  = 0.2611
       between = 0.0194
       overall = 0.0007
These results don't seem good to me, but which model should I
choose and which R^2 do I have to look at: within, between or overall?
My constant is also negative in the fe model, how come?

And how to check for heteroskedasticity, serial correlation
(Durbin-Watson test?) and collinearity?

Hopefully someone can help me on this. This is all very new to me.
