RE: st: RE: St: Panel data imputation

From	Nick Cox <[email protected]>
To	"'[email protected]'" <[email protected]>
Subject	RE: st: RE: St: Panel data imputation
Date	Tue, 21 Sep 2010 15:39:23 +0100

I don't think you are necessarily wrong, but it sounds as if you are conflating two related but different issues: how best to predict revenue and how to impute missing values. Conversely, these issues may well be clear in your mind, but which of them is your problem is not clear from your postings.

In addition, you have panel data, so you likely have time dependence, possibly trends, seasonality, etc., etc. You would need to pay careful attention to whether -mi- is the right tool for your situation given all those complications.

Nick
[email protected]

David Bai

Thank you, Nick. You are correct that I haven't provided more detailed
info, and here it is:
I have many more variables that may be related to revenue.
Given what you and Maarten explained below, I guess using ipolate and
year info only might not be an accurate way to predict revenue. MI
might be a better approach. Correct me if I am wrong. Thank you.

Nick Cox

I don't see how your problem statement leads to that conclusion. If you
have just revenue and year, an answer will also depend on your economic
understanding of how revenue varies with year. Also, if you have just
revenue and year, it is as far as I can see very much an interpolation
problem.

Conversely, if you have variables other than revenue and year, then
quite what to do depends on the rest of your problem situation, and I
think Maarten will join me in feeling unable to advise definitively
without -- and probably even with -- further information.

Here as elsewhere there is an enormous jump between being able to make
specific comments on techniques or Stata and being able to advise
people what is the best thing to do on their projects, let alone what
is "correct".

Nick
[email protected]

David Bai

Thank you, Nick and Maarten, for the very detailed response. Very
helpful. Given the limitations of this command, it looks as though
multiple imputation would be the best approach to dealing with the
missing values. Am I understanding it correctly?

Nick Cox

The straight answer to this question is that -- as the help for
-ipolate- makes clear -- there is an -epolate- option which you can use
at your peril to fill in values at the ends of your series. This will
work with panel data too, in the sense that you will get what you ask
for.
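
As a minimal sketch of what that option does with panel data, here is a
toy example. The numbers are made up for illustration and the panel
identifier id is a hypothetical name; revenue and year are the variables
from the original question.

```stata
* Toy panel with made-up numbers; id is a hypothetical panel identifier
clear
input id year revenue
1 2001   .
1 2002 100
1 2003   .
1 2004 140
1 2005   .
2 2001  50
2 2002   .
2 2003  70
2 2004  80
2 2005  90
end

* Linear interpolation within panels: interior gaps are filled, but
* gaps before the first or after the last observed value stay missing
bysort id: ipolate revenue year, generate(rev_ip)

* With -epolate- the line is also extended beyond the observed range
bysort id: ipolate revenue year, generate(rev_ep) epolate

* Flag the values that exist only because of extrapolation
generate byte extrap = missing(rev_ip) & !missing(rev_ep)
list, sepby(id)
```

Keeping a flag such as extrap makes it easy to drop or inspect the
extrapolated values later.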

Note that -ipolate- is a command, not a function.

On the larger issue, raised by Maarten Buis, I hope we could all agree
that interpolation, which has a centuries-old history, is not quite a
kind of imputation, which is currently so fashionable as a species of
statistical white magic. (Naturally, your definition of imputation
might be so wide that interpolation is a special case; I would want to
suggest that such a wide definition will only lead to
misunderstanding.)

I can see various advantages and disadvantages:

1. Interpolation is usually relatively simple to define. The linear
interpolation offered by -ipolate- certainly qualifies.

2. Interpolation is in various senses unstatistical, as

a. it takes account of at most local structure and works with data one
response variable at a time.

b. it typically reduces variability, which distorts statistical
analysis to an unknown extent.

c. it is deterministic so is not accompanied by any estimate of error.

Clearly, this isn't a complete characterisation. Also it simplifies
some larger issues.

I am at an extreme position within this list, as I have never used
imputation, but I have often used interpolation for gappy time series
or spatial series with no covariates. Such work has had as side-effects
programs -cipolate- and -csipolate- on SSC.

If you are using interpolation I have some hackneyed pieces of advice:

* Get a feeling of how interpolation treats data like yours by
artificially introducing gaps in good quality data and seeing how
successful interpolation is at reproducing known values.

* Try different kinds of interpolation to get a sense of how far they
agree.

* Go very easy on the extrapolation.
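
The first two pieces of advice can be sketched roughly as follows. This
is only an illustration under stated assumptions: the series is
simulated, every variable name is hypothetical, and the SSC commands are
left commented out because they must be installed first (their syntax
here is assumed to mirror that of -ipolate-).

```stata
* Simulated complete series; all names and numbers are hypothetical
clear
set seed 12345
set obs 30
generate year = 1980 + _n
generate revenue = 100 + 3*_n + 10*sin(_n/2) + rnormal(0, 2)

* Hide about a third of the interior values, keeping both ends observed
generate byte hidden = runiform() < 1/3 & inrange(_n, 2, _N - 1)
generate rev_gappy = cond(hidden, ., revenue)

* Linear interpolation with official -ipolate-
ipolate rev_gappy year, generate(rev_lin)

* How well are the hidden but known values reproduced?
generate err_lin = rev_lin - revenue if hidden
summarize err_lin, detail

* Does interpolation change the variability of the series?
summarize revenue rev_lin

* Other kinds of interpolation, assuming the SSC programs are installed
* (ssc install cipolate; ssc install csipolate):
* cipolate rev_gappy year, generate(rev_cub)
* csipolate rev_gappy year, generate(rev_spl)
```

Repeating this with different seeds and different gap patterns gives a
rough feeling for how far the results can be trusted on data like yours.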

This commentary steals one cogent remark made by Patrick Royston in a
conversation at the recent London users' meeting.

Nick
[email protected]

Maarten Buis
============

-ipolate- is generally not a good imputation method. Look at -help mi-
and -findit ice- instead.

David Bai
=========

I have panel data (year and revenue) and would like to use the
ipolate function to impute the missing values for some years. What kind
of data will not be imputed if I use this method? It looks as though,
when values are missing for the first or the last years of a series,
this method will not impute them. Is there a way to deal with this
problem?
