James W. Hardin
London Users Group meeting
6 June 1997
Panel Data
In a panel dataset, we have observations $y_{it}$ for our dependent
variable, with $i = 1, \ldots, n$ and $t = 1, \ldots, n_i$, such that the
observations with a common value of i are believed to be correlated.
The i subscript is sometimes referred to as the individual, panel,
subject, cluster, or group. The t subscript denotes the observation
within a particular panel and is called the replication, time, or
repeated measure. There are $\sum_{i=1}^{n} n_i$ observations in the
general unbalanced case.
Various authors refer to such data as longitudinal data, cross-sectional
data, panel data, or cross-sectional time-series data.
There are two sources of variability from which we might build an estimator: the variability within a cluster (fixed effects) and the variability between clusters.
Fixed Effects Estimators
To model fixed effects, one transforms the estimating equation in order to get rid of the fixed effects.
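For a linear model, one common way to carry out such a transformation is to subtract panel means (the "within" transformation). The following is a minimal sketch under that assumption; the model $y_{it} = x_{it}\beta + \alpha_i$, the helper names `within_transform` and `ols_slope`, and the toy data are all hypothetical and are not taken from the talk.

```python
from collections import defaultdict

def within_transform(panel_ids, values):
    """Subtract each panel's mean from that panel's observations."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i, v in zip(panel_ids, values):
        sums[i] += v
        counts[i] += 1
    return [v - sums[i] / counts[i] for i, v in zip(panel_ids, values)]

def ols_slope(x, y):
    """Least-squares slope for one covariate with no intercept."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

# Hypothetical unbalanced panel: y_it = 2 * x_it + alpha_i (no noise).
ids = [1, 1, 1, 2, 2, 3, 3, 3]
x = [0.0, 1.0, 2.0, 1.0, 3.0, 0.0, 2.0, 5.0]
alpha = {1: 10.0, 2: -4.0, 3: 7.0}  # assumed fixed effects
y = [2.0 * xi + alpha[i] for i, xi in zip(ids, x)]

# Demeaning within panel removes alpha_i from the estimating equation.
x_w = within_transform(ids, x)
y_w = within_transform(ids, y)
beta_hat = ols_slope(x_w, y_w)
```

Because the fixed effects are constant within a panel, they vanish after demeaning, and the slope is recovered from within-panel variation alone.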
Random Effects Estimators
There are two obvious ways to approach building a random effects estimator. One may first assume that
$$y_{it} = x_{it}\beta + \nu_i + \epsilon_{it}$$
where $\nu_i$ is a random value from some distribution F.
Alternatively, one may assume that
$$y_{it} = x_{it}\beta + \epsilon_{it}$$
and impose some restrictions on the covariance of the $\epsilon_{it}$.
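In more general terms, we can write the model in terms of link and variance functions as
$$h\{E(y_{it} \mid x_{it}, \nu_i)\} = x_{it}\beta + \nu_i$$
where $\nu_i \sim F$, or we may assume that we have
$$h\{E(y_{it} \mid x_{it})\} = x_{it}\beta$$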
When are the two approaches the same?
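They are the same if all of the random effects $\nu_i$ are zero or when the
link function h is the identity. This is because, while
$E(x_{it}\beta + \nu_i) = x_{it}\beta$ when $E(\nu_i) = 0$, it is not in
general true that the same link function will have the property
$E\{h^{-1}(x_{it}\beta + \nu_i)\} = h^{-1}(x_{it}\beta)$.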
Note that the two approaches are the same for linear regression which uses the identity link. They are not the same for logistic or probit models that we examine later.
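A small numerical check can make this concrete. The sketch below assumes a mean-zero random effect taking the values -2 and +2 with equal probability and a linear predictor of 1; both choices are hypothetical and serve only to illustrate the point. It compares averaging over the random effect under the identity link and under the logit link.

```python
import math

def expit(z):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-z))

xb = 1.0                  # assumed linear predictor x*beta
nu_values = [-2.0, 2.0]   # assumed random effect values, equally likely, mean zero

# Identity link: averaging over the random effect returns x*beta itself.
identity_mean = sum(xb + nu for nu in nu_values) / len(nu_values)
identity_gap = identity_mean - xb

# Logit link: the average of the inverse link is not the inverse link
# evaluated at x*beta.
logit_mean = sum(expit(xb + nu) for nu in nu_values) / len(nu_values)
logit_gap = logit_mean - expit(xb)
```

Here `identity_gap` is zero while `logit_gap` is not, which is why the two approaches coincide for linear regression but not for logistic models.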
Random Effects Estimators (logit)
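The two approaches for logit are the subject-specific (SS) model
$$\operatorname{logit}\Pr(y_{it} = 1 \mid x_{it}, \nu_i) = x_{it}\beta_{SS} + \nu_i$$
where we assume that $\nu_i \sim F$, or alternatively, we may look at the population-averaged (PA) model
$$\operatorname{logit}\Pr(y_{it} = 1 \mid x_{it}) = x_{it}\beta_{PA}$$
along with appropriate assumptions on the covariance of the
$y_{it}$ terms (nuisance parameters).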
Multilevel models
There are also hybrid models that will estimate the probability that Y=1 averaged over the observations with the same covariate patterns. One method for doing this is Goldstein's multilevel models. These models at their simplest level are random effects models, but allow the researcher more flexibility in modeling the outcome.
Other Models
There are also other types of models one can use for analyzing
panel data. The first is called the transitional model and
models the probability distribution of the outcome at time t,
$y_{it}$, as a function of the covariates at time t, $x_{it}$, and
the individual's outcome history $y_{i,t-1}, \ldots, y_{i1}$.
Another model is called the response conditional model, which accommodates correlation by modelling the response probability for each individual in the panel as a function of covariates for that individual and the responses of the other individuals in that cluster.
Problems with SS Models
Problems with PA Models
Comparison of SS and PA coefficients
Imagine a study where the dependent variable is whether a student performs acceptably on a standardized test. There are several students under the direction of each teacher in the study. One of the covariates is whether the student's instructor assigns Stata to that individual student for classroom instruction.
Usually, one would expect an instructor either to use Stata in teaching all of his or her students or not to use it at all. However, imagine that an instructor is free to assign Stata to some of the students in the classroom but not to all of them. Then the use of Stata is not a cluster level variable.
Interpretation of the coefficient for the SS model
The SS model now allows direct observation and estimation of the average log odds ratio effect of the change in using Stata to teach upon exam performance. Mathematically, we collapse across students after we take the difference in log odds at time points where the instructor did and did not use Stata in the classroom. The coefficient then represents the common log odds ratio for passing the exam of the Stata effect across students.
Interpretation of the coefficient for the PA model
The PA coefficient, mathematically, first averages to find the mean risk and then computes the log odds. The PA model ignores the fact that the effect of the change in using Stata for an instructor has been measured, and persists in estimating only the odds ratio between Stata and non-Stata instructors. Instructors who changed would appear in both groups.
Now imagine that there really are no instructors who assign Stata to only a subset of the class, so that Stata use really is a cluster level variable.
One cannot directly observe a change in utilizing Stata. The PA model measures the log odds ratio between the two groups of instructors, whereas the SS model is supposed to report the effect of the change in the instructor's usage of Stata. However, no such change was measured, so the interpretation is entirely model-based as it is a type of extrapolation with no data to check the validity of the extrapolation. Note that the conditional likelihood approach for this same model won't allow estimation of the Stata effect.
Problems with Conditional models
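Note that for the logit estimator, the unconditional fixed effects
estimator is inconsistent, but the conditional estimator is consistent.
Write the fixed effects logit model as
$$\Pr(y_{it} = 1 \mid x_{it}) = \frac{\exp(\alpha_i + x_{it}\beta)}{1 + \exp(\alpha_i + x_{it}\beta)}$$
and let $\ln L^c$ denote the conditional log-likelihood below.
$$\ln L^c = \sum_{i} \ln \Pr\Big(Y_{i1} = y_{i1}, \ldots, Y_{in_i} = y_{in_i} \,\Big|\, \sum_{t} y_{it}\Big)$$
So, the conditional likelihood is conditioned on the number of ones in the set (panel). Consider an example where there are a large number of panels, each with two time period observations. The unconditional log-likelihood is given by
$$\ln L = \sum_{i} \big\{ \ln \Pr(Y_{i1} = y_{i1}) + \ln \Pr(Y_{i2} = y_{i2}) \big\}$$
The observations are independent, so the likelihood function is the
product of the probabilities (we show above the log-likelihood). Note that
for each pair of observations, we have the possibilities
$$y_{i1} = y_{i2} = 0 \ \ (y_{i1} + y_{i2} = 0) \qquad \text{and} \qquad y_{i1} = y_{i2} = 1 \ \ (y_{i1} + y_{i2} = 2)$$
The ith term of $L^c$ for either of these outcomes is just 1. The
log of that is zero, so either of these outcomes contributes nothing to
the log-likelihood.
Now, suppose that $y_{i1} = 0$ and $y_{i2} = 1$, so that $y_{i1} + y_{i2} = 1$.
We have
$$\Pr(0, 1 \mid y_{i1} + y_{i2} = 1) = \frac{\Pr(0, 1)}{\Pr(0, 1) + \Pr(1, 0)}$$
which gives that
$$\Pr(0, 1 \mid y_{i1} + y_{i2} = 1) = \frac{\exp(x_{i2}\beta)}{\exp(x_{i1}\beta) + \exp(x_{i2}\beta)}$$
which is free of $\alpha_i$.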
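The cancellation can be checked numerically. This is a minimal sketch assuming a two-period fixed effects logit in which the panel effect enters the linear predictor additively; the function name `prob_01_given_sum_one` and the numeric values are hypothetical. It evaluates the conditional probability of the outcome (0, 1), given exactly one success, for several values of the panel effect.

```python
import math

def expit(z):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def prob_01_given_sum_one(alpha, xb1, xb2):
    """Pr(y1=0, y2=1 | y1+y2=1) with panel effect alpha in both periods."""
    p1 = expit(alpha + xb1)
    p2 = expit(alpha + xb2)
    p01 = (1.0 - p1) * p2   # outcome (0, 1)
    p10 = p1 * (1.0 - p2)   # outcome (1, 0)
    return p01 / (p01 + p10)

xb1, xb2 = 0.4, -0.3        # assumed linear predictors for the two periods

# Expression that involves only the covariates and beta.
closed_form = math.exp(xb2) / (math.exp(xb1) + math.exp(xb2))

# The same conditional probability for very different panel effects.
conditional_probs = [prob_01_given_sum_one(a, xb1, xb2) for a in (-3.0, 0.0, 2.5)]
```

Every entry of `conditional_probs` agrees with `closed_form`, whatever panel effect is used.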
Monte Carlo Simulations
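There are two simulations that we ran, both generating SS random effects data:
$$y^*_{it} = x_{it}\beta + \nu_i + \epsilon_{it}$$
$y^*_{it}$ is an unobserved latent variable.
$\nu_i$ is the random effect.
$\epsilon_{it}$ is the error term.
$$y_{it} = \begin{cases} 1 & \text{if } y^*_{it} > c \\ 0 & \text{otherwise} \end{cases}$$
where c is some cutoff value.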
Other Estimators
Simulation 1
The covariates were of three kinds: one constant within panel (a cluster level variable), one constant across panels (varying only with time), and one random within and across panels.
r = 1000 is the number of simulations for a given model.
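One replication of a design like this might be generated as in the sketch below. It is only an illustration: the function name `simulate_panel_probit`, the standard normal distributions for the covariates, random effect, and error term, the coefficient values, the panel dimensions, and the cutoff of zero are all assumptions, since the talk does not state them.

```python
import random

def simulate_panel_probit(n_panels, n_times, beta, sigma_nu, cutoff=0.0, seed=12345):
    """Generate one dataset from a latent-variable random effects model."""
    rng = random.Random(seed)
    x1 = [rng.gauss(0, 1) for _ in range(n_panels)]  # constant within panel
    x2 = [rng.gauss(0, 1) for _ in range(n_times)]   # constant across panels
    rows = []
    for i in range(n_panels):
        nu_i = rng.gauss(0, sigma_nu)                # random effect for panel i
        for t in range(n_times):
            x3 = rng.gauss(0, 1)                     # random within and across panels
            eps = rng.gauss(0, 1)                    # error term
            y_star = (beta[0] + beta[1] * x1[i] + beta[2] * x2[t]
                      + beta[3] * x3 + nu_i + eps)   # unobserved latent variable
            rows.append({"i": i, "t": t, "x1": x1[i], "x2": x2[t],
                         "x3": x3, "y": int(y_star > cutoff)})
    return rows

# Hypothetical settings: 25 panels of 4 observations each.
data = simulate_panel_probit(n_panels=25, n_times=4,
                             beta=(0.0, 1.0, 1.0, 1.0), sigma_nu=1.0)
```

A full simulation would repeat this r times with different seeds and fit each estimator to every generated dataset.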
Simulation 2
The covariates were of two kinds: one constant across panels (varying only with time) and one random within and across panels.
r = 500 is the number of simulations for a given model.
The main differences for the second simulation were the removal of the cluster level variable and the focus on smaller datasets.
Random Effects Likelihood
Problems with SS Random Effects Probit
Simulation Results
The probit estimator differed little from the SS-RE model in terms of RMSE.
However, hypothesis tests that use the reported standard errors will give misleading results.
Simulation Results
Probit with robust standard errors
Simulation Results
SS Random Effects Probit
The major computational problem with the SS Random Effects Probit model is
the need to evaluate the integral using quadrature. It is for these numeric
reasons that this estimator did not perform better. However, it dominated
the other estimators for small values of
. One gains substantial improvement by increasing the
number of Hermite points to about 8 to 10, but not much improvement after
that. Guilkey and Murphy found it necessary to increase this to 16 for
to obtain good performance.
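The role of the number of Hermite points can be explored with a small sketch of Gauss-Hermite quadrature for one panel's likelihood contribution. It assumes a probit model with a normally distributed random effect with standard deviation `sigma_nu`; the function names and the example values are hypothetical, and this is not the code used for the simulations. The nodes and weights come from NumPy's `hermgauss`.

```python
import math
from numpy.polynomial.hermite import hermgauss

def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def panel_likelihood(y, xb, sigma_nu, n_points):
    """Approximate E[prod_t Phi(q_t * (xb_t + nu))] for nu ~ N(0, sigma_nu^2),
    where q_t = 2*y_t - 1, using n_points Gauss-Hermite nodes."""
    nodes, weights = hermgauss(n_points)
    total = 0.0
    for a, w in zip(nodes, weights):
        nu = math.sqrt(2.0) * sigma_nu * a   # change of variable for weight exp(-a^2)
        contrib = 1.0
        for y_t, xb_t in zip(y, xb):
            q = 2 * y_t - 1
            contrib *= norm_cdf(q * (xb_t + nu))
        total += w * contrib
    return float(total / math.sqrt(math.pi))

# Hypothetical panel with four observations.
y = [1, 0, 1, 1]
xb = [0.3, -0.2, 0.5, 0.1]
approximations = {m: panel_likelihood(y, xb, sigma_nu=1.0, n_points=m)
                  for m in (4, 8, 16)}
```

Comparing the entries of `approximations` shows how much the value still moves as points are added for a given panel size and random effect variance.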
For Simulation 1, where for the cluster level variable, the
SS RE Probit estimator had lower than nominal coverage and
a much larger standard error than the PA Estimator. When
was small (4), the coverage was close to nominal though the
RMSE was larger than for the population averaged approach.
Estimated standard errors are too small when the intraclass correlation
$\rho$ or the cluster size $n_i$ is large, due to numerical problems in
evaluating the integral (not because the model is faulty).
Simulation Results
PA Random Effects Probit
Coefficients were smaller than for the SS model as theory dictates.
The standard errors were too small, but coverage is close to nominal
level for small cluster size even when , but not close
to nominal coverage when
Simulation Results
PA Random Effects Probit with robust standard errors
Coefficients were smaller than for the SS model, as theory dictates.
The standard errors are of the correct size and the coverage is close
to nominal for all sample sizes and values of $\rho$.
Difference in PA and SS models
The PA model is
$$h\{\Pr(y_{it} = 1 \mid x_{it})\} = x_{it}\beta_{PA}$$
with appropriate assumptions concerning the covariance of the $y_{it}$.
$\beta_{PA}$ measures the change in the proportion with Y=1 for a unit
increase in X. It does not take advantage of repeated measurements on each
study subject or of the fact that the effects of covariate changes within
subjects on the response are directly observable. This model is most
appropriate for cluster level variables.
$\beta_{SS}$ measures the change in the probability of response with
covariate X for individuals in each of the underlying risk groups described
by the random effects $\nu_i$. It is not appropriate for cluster level
variables since this effect is not directly observable.
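The size of the gap between the two coefficients can be illustrated for the probit case. The sketch below assumes an SS probit model with a normal random effect of standard deviation `sigma_nu`, for which averaging over the random effect gives a marginal probit model whose coefficient is the SS coefficient divided by $\sqrt{1 + \sigma_\nu^2}$. The numeric values and the helper `marginal_prob` are hypothetical; the code checks the closed form against a trapezoid-rule integration.

```python
import math

def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def marginal_prob(xb_ss, sigma_nu, half_width=10.0, steps=4000):
    """Trapezoid-rule value of E[Phi(xb_ss + sigma_nu * Z)], Z ~ N(0, 1)."""
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        z = -half_width + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * norm_cdf(xb_ss + sigma_nu * z) * norm_pdf(z)
    return total * h

beta_ss, sigma_nu = 1.0, 1.5                        # assumed SS coefficient and RE scale
beta_pa = beta_ss / math.sqrt(1.0 + sigma_nu ** 2)  # implied PA coefficient
x = 0.8                                             # assumed covariate value

numeric = marginal_prob(beta_ss * x, sigma_nu)      # population-averaged Pr(Y=1 | x)
closed_form = norm_cdf(beta_pa * x)
```

The two probabilities agree, and `beta_pa` is smaller in magnitude than `beta_ss`, with the shrinkage growing as the random effect variance grows.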
Problems with Conditional models