
Re: st: RE: Panel data and sparse data

From   "Austin Nichols" <>
Subject   Re: st: RE: Panel data and sparse data
Date   Wed, 16 Jul 2008 12:41:52 -0400

James Nachbaur--
Sounds like World Bank data or the data sources on which it is based,
e.g. national census/survey data collected at irregular heterogeneous
intervals.  Rather than interpolating or imputing, you may want to
fill forward, so each variable is measured as of the last time
observed, but limit this to one observation "filled in" in the final
data.  E.g. if you have data on India in 1975, 1980, 1985, 1990, and
1995, but you have data on Pakistan in 1977, 1982, 1989, and 1994,
maybe you want to use obs defined as of 1977, 1982, 1990, and 1995,
and use the most recent year of data for each of those.  You certainly
don't want to conduct a survival analysis as if you have 21 (or even
19) years of data on each country, which is what
interpolation/imputation would imply.  The first step in this process,
I think, is to determine the number of observations you can plausibly
use, given different choices over years to include.
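A minimal sketch of that fill-forward step, using the India/Pakistan years from the example above. The variable names (country, year, x, target, lag), the values of x, and the list of target years are all made up for illustration; this is one way to do it under those assumptions, not the only one. Each observation is assigned to the first target year at or after it, and only the most recent observation per country and target year is kept, so no observed value is used more than once.

```stata
clear
* toy data: years from the example above; x values are invented
input str8 country int year double x
"India"    1975 10
"India"    1980 12
"India"    1985 15
"India"    1990 19
"India"    1995 24
"Pakistan" 1977  8
"Pakistan" 1982  9
"Pakistan" 1989 13
"Pakistan" 1994 16
end

* assign each obs to the first target year at or after its own year
* (the target years 1977 1982 1990 1995 are an assumed choice)
gen int target = .
foreach t in 1977 1982 1990 1995 {
    replace target = `t' if missing(target) & year <= `t'
}
drop if missing(target)

* within country and target year, keep only the most recent obs
bysort country target (year): keep if _n == _N

* age of the information carried forward, in years
gen int lag = target - year

list country target year lag x, sepby(country)
tabulate target
```

Rerunning this with different lists of target years, and looking at the resulting counts and lags, is one way to see how many observations you can plausibly use under each choice.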

What are you planning to do about countries merging/splitting/being born?

If you ignore advice not to interpolate, at least do it in logs for
vars which are strictly positive (won't matter for all vars, but where
it matters, e.g. population or GDP, it is probably superior).  E.g.

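* keep life expectancy only every 5th year, then interpolate in logs
sysuse uslifeexp, clear
* y = life expectancy, observed only in years divisible by 5
g y=le if mod(year,5)==0
g lny=ln(y)
* linear interpolation of ln(y) over year, then back to levels
ipolate lny year, gen(iy)
g exp=exp(iy)
* true series (line) vs. interpolated and observed points (scatter)
line le year || sc exp y year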

Note in the example how poor the interpolated data can be.
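To put numbers on that rather than eyeballing the graph, here is a rough sketch that repeats the same steps and compares the interpolated values to the true series. The names ilny, le_hat and abs_err, and the 2-year cutoff in the last line, are arbitrary choices of mine.

```stata
sysuse uslifeexp, clear
* pretend life expectancy is observed only every 5th year
gen y = le if mod(year, 5) == 0
gen lny = ln(y)
* interpolate in logs, then transform back to levels
ipolate lny year, gen(ilny)
gen le_hat = exp(ilny)
* absolute error against the true series
gen abs_err = abs(le_hat - le)
* errors in the years that were "filled in"
summarize abs_err if missing(y), detail
* years where the interpolated value is off by more than 2 years
list year le le_hat abs_err if abs_err > 2 & !missing(abs_err)
```

Years after the last retained observation stay missing, since -ipolate- does not extrapolate unless you add the epolate option.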

On 7/16/08, Nick Cox <> wrote:
> In this context imputation is usually called interpolation, with a
> centuries-long history to boot. And you can do it in various ways from
> linear interpolation (-ipolate-) and cubic interpolation (-cipolate-
> from SSC) upwards.
> But my visceral reaction is, for your situation, Don't. Survival
> analysis is in a strong sense geared to make use of the information you
> have and interpolation would just be a way of kidding yourself you had
> more.
> Nick
> James Nachbaur
> I have a panel data set of 165 counties over 55 years with many
> variables observed every 10 years or every 4 to 5 years.  I am running
> a survival time model with unobserved heterogeneity.  My question for
> the list is, What is a good way to impute data for the years that lack
> observations?  In my research, I have seen a lot on variables missing
> at random, or on data sets where only one variable has missing data,
> but my situation is not like those.