Re: st: AW: RE: panel data regression - number of observations

From   Nick Cox <>
Subject   Re: st: AW: RE: panel data regression - number of observations
Date   Sat, 23 Jun 2012 22:52:57 +0100

"Braunfels, Philipp (Stud. SBE / Alumni)"
<> emailed me privately.

Philipp: You are asked not to take threads private. This is explained
in detail in the Statalist FAQ.

This looks like the same question to me, although you started with
monthly data, and now they look annual to me.

That aside, panels without gaps or missings are precisely those with
complete coverage. If the only problems are with missing or absent y

egen count =  count(y), by(company)
su count, meanonly
keep if count == r(max)

reduces your dataset to complete panels. There are other ways to do
it, but -tsspell- (SSC; you're asked to say where user-written
programs come from; also in the FAQ) is not needed at all.
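
If leading or trailing missings are acceptable and only internal gaps
should disqualify a panel (as in the example quoted below, where ID001
should go but ID002 and ID003 should stay), one possible sketch is the
following. It assumes a numeric -Return-, a time variable -Year- that
increases by 1 each period, and at most one observation per company
and period. The variable names first, last and nonmiss are made up
for illustration.

```stata
* Sketch only: assumes numeric Return, Year in steps of 1,
* and at most one observation per company and Year
* first and last Year with a nonmissing Return in each panel
egen first = min(cond(!missing(Return), Year, .)), by(company)
egen last  = max(cond(!missing(Return), Year, .)), by(company)
* number of nonmissing Returns in each panel
egen nonmiss = count(Return), by(company)
* a panel has an internal gap if it has fewer nonmissing values
* than periods between its first and last nonmissing value
drop if nonmiss < last - first + 1
```

This catches both missing values and observations that are absent
from the dataset. Panels with no nonmissing -Return- at all are
dropped too, because first and last are then missing and Stata treats
missing as larger than any number.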


On Sat, Jun 23, 2012 at 7:53 PM, Braunfels, Philipp (Stud. SBE /
Alumni) <> wrote:
> Thanks for the prompt reply, Nick!
> You really helped me with the answer. Although I understand the "egen count" command much better now, I realized that it might not be the solution to my problem. I am currently trying to eliminate all companies in my panel that exhibit gaps in their data sequences. For example, I would like to drop company ID001 but not ID002 and ID003 from the dataset below:
> company   Year   Return
> ID001     1990   5%
> ID001     1991   .
> ID001     1992   .
> ID001     1993   7%
> ID002     1990   .
> ID002     1991   .
> ID002     1992   8%
> ID002     1993   9%
> ID003     1990   1%
> ID003     1991   2%
> ID003     1992   3%
> ID003     1993   4%
> I just recognized that you wrote a nice command, "tsspell", to identify consecutive runs. However, I could not figure out how to use this command to drop all panels with non-consecutive values (in the example above, this means deleting only ID001 while keeping ID002 and ID003!).

Nick Cox []

> You don't _have_ to run -xtreg- first. It's just an easy way of
> identifying observations with non-missing values on all variables
> through e(sample).
> Missings are not, it seems, the main issue here: it's the observations
> that aren't there that mean that some of your panels are incomplete.
> But no doubt you can work out a protocol for identifying the
> observations you want before you do any regressions. You might as
> well then -drop- the others, at least temporarily.
> I don't understand your last question.
> By the way, -egen- is a command, not a function. Things like its
> -count()- are called -egen- functions.

> On Sat, Jun 23, 2012 at 4:40 PM, Braunfels, Philipp (Stud. SBE /

>> Thank you for your explanation. Can you maybe tell me why I first have to run <xtreg> before using the egen function? The reason I ask is that I want to run several regressions with different x variables (y will be the same for all regressions). Since the only variable with potentially missing values is the y variable, I would like to ask whether I always need to first run the regression, then run the egen function, and then rerun the regression using "count==###", or whether it is fine to run xtreg once, then run the egen function and use the "count" variable for all subsequent regressions that are based on the same y variable.
>> Furthermore, I was wondering whether "sample" is to be replaced by my time variable or whether this expression is fixed (I tried to look it up on the web and it seemed that it is a fixed expression?!).
>> From: [] on behalf of Nick Cox
>> I'll pass on #2.
>> The answer to #1 is No. This is easy enough to check. For example, from this code
>> webuse grunfeld
>> xtset
>> d
>> xtreg invest year
>> drop if year < 1940 & company == 1
>> xtreg invest year
>> you will see that -xtreg- uses what it can, including incomplete panels.
>> Indeed incompleteness is, roughly speaking, like beauty, in the mind of the beholder, and not something Stata generally knows or cares about.
>> You can certainly insist on using what you regard as complete panels. Your syntax will work if and only if a variable called -time- has the value 124 for only those observations you want to include. You can do something like this
>> xtreg y x1 x2 x3
>> egen count = total(e(sample)), by(company)
>> xtreg y x1 x2 x3 if count == 120
>> or 124 if you prefer.

Braunfels, Philipp

>> I have a large panel dataset, covering 10 years of monthly data. One of my variables is stock return, for which I have approximately 3000 companies. However, the return observations are not complete for all 3000 companies (some have just 2 or 5 years of data, for example). In this respect I have two questions:
>> 1) If I run a panel regression, does Stata automatically exclude all companies for which the return data are not complete? (E.g. I regress years 1990-2000 and for 500 companies the data from 1990-1993 are missing. Are these companies completely excluded from the analysis?) If Stata does not exclude companies with an insufficient number of returns over time, how can I account for this (maybe using something like <xtreg y x1 x2 x3 if time==124, fe>, where 124 are the monthly observations for the 10 years I cover)?
>> 2) The Hausman test suggests that I should use FE. The output gives me three values for R-square (between, within, overall). But when I use the command <estimates table output, stats(r2)> I am given a different R-square from those reported in the <xtreg,fe> output. So where does the R-square from <estimates table output, stats(r2)> come from? (I also ran an <areg> regression and its R-square differs as well!)
