
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: Issues with missing values


From   Halua Koko <[email protected]>
To   [email protected]
Subject   Re: st: Issues with missing values
Date   Mon, 10 Mar 2014 16:27:46 +0100

Hi Nick,
Thanks for the response. Sorry I didn't mention it before: my y is calorie
intake (cal_in), a continuous variable. I really didn't want to
go into the messy multiple imputation techniques, so I tried the
linear prediction technique, i.e.:
regress y x1 x2 ...
predict yhat
But I guess due to missing values in x1, x2, this isn't working. I've
been trying to figure out other workarounds, but without success. At
the moment, about 20% of the 5000 obs are missing. Would you
suggest going ahead without them? Would you have any other ways of
solving this particularly perturbing issue? Indeed, I'll refer to it as
a wide "structure" from now on!
Thanks again
Halua
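
A minimal sketch of why the regress/predict route leaves gaps, run on simulated toy data rather than the data discussed here. The simulated values and the names cal_hat, imputed and cal_fill are assumptions made for illustration only. -regress- fits on complete cases, and -predict- returns missing wherever x1 or x2 is missing, so some missing outcomes stay unfilled:

```stata
* Toy data, simulated purely for illustration (not the poster's data)
clear
set seed 12345
set obs 5000
generate x1 = rnormal()
generate x2 = rnormal()
generate cal_in = 2000 + 150*x1 - 100*x2 + rnormal(0, 200)

* Knock out about 20% of the outcome and about 10% of one predictor
replace cal_in = . if runiform() < 0.20
replace x1 = . if runiform() < 0.10

* regress silently drops observations with any missing value
regress cal_in x1 x2

* predict gives a fitted value only where both x1 and x2 are observed
predict double cal_hat, xb

* Missing outcomes that still have no prediction
count if missing(cal_in) & missing(cal_hat)

* Fill only where a prediction exists; keep the original variable intact
* and flag filled values so that they can be excluded later
generate byte imputed = missing(cal_in) & !missing(cal_hat)
generate double cal_fill = cond(missing(cal_in), cal_hat, cal_in)
```

Keeping cal_in untouched and carrying a flag makes it easy to compare results with and without the filled values.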

On Mon, Mar 10, 2014 at 3:59 PM, Nick Cox <[email protected]> wrote:
> The main issue here is what you are trying to do.
>
> 1. It might seem reasonable for your purposes to replace missings with
> the mean. Even though you might be unable or unwilling to apply
> imputation, some kind of interpolation (in time) is, however, a
> possible alternative.
>
> 2. But the missings replaced with means don't carry new information
> about the distribution. Classifying into quantile-based groups is
> spurious unless you use only the non-missings to determine quantiles.
> Unfortunately, applying those quantiles to the extra means is also
> likely to be spurious. -xtile- does the best it can, but necessarily often
> produces bizarre results because of its rule that identical values
> must be placed in the same group.
>
> 3. I don't understand the fudge you are imagining, but it sounds quite
> arbitrary and difficult to defend.
>
> 4. I didn't catch why you think you need to classify these values
> any way. I don't know what -cal_in- is, but using the panel means (or
> medians) of what you have seems a more defensible way to make use of
> what information there is. That, however, may miss the point if you
> want to catch impacts during the time panels were observed.
>
> 5. Panel data are almost always better off in a long shape or
> structure (my self-imposed Sisyphean task is to persuade people not to
> say "format" given its existing use in Stata).
>
>
> Nick
> [email protected]
>
>
> On 10 March 2014 14:31, Halua Koko <[email protected]> wrote:
>
>> I've been working with a panel dataset and while putting it together
>> have replaced a number of missing values in variable cal_in with the
>> mean for each of the years. But when trying to create quintiles of the
>> baseline values to assess heterogeneity of impact (using xtile
>> Q=cal_in, nq(5)), I noticed that doing so had clumped together about
>> 1000 obs around one value, i.e., the mean. So in essence my xtile groups
>> are distributed unevenly and the 4th quintile seems to be entirely
>> missing. FYI my panel is in the wide format.
>> Can anyone suggest a solution to this problem? I was thinking of
>> redistributing the clumped values by small increments so as to have
>> the same mean but differing values, but I am not sure how to do this.
>> Can anyone help me figure this out?
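
Points 2, 4 and 5 of the reply quoted above can be sketched as follows, again on simulated toy data. The identifier id, the wide variable names cal_in2010 to cal_in2012, the choice of 2010 as baseline, and the new names cal_mean, Q and Qbase are all hypothetical. The idea is to reshape to a long layout, compute panel means from observed values, and build quintiles from non-missing baseline values only, rather than from values padded with yearly means:

```stata
* Toy wide data, simulated for illustration only; all names are hypothetical
clear
set seed 2014
set obs 1000
generate id = _n
forvalues y = 2010/2012 {
    generate cal_in`y' = rnormal(2000, 300)
    replace cal_in`y' = . if runiform() < 0.20
}

* Long layout: one row per id-year
reshape long cal_in, i(id) j(year)

* Panel mean of the observed values (egen's mean() ignores missings)
bysort id: egen cal_mean = mean(cal_in)

* Quintiles of the baseline year from observed values only;
* observations with missing cal_in get missing Q rather than a clumped group
xtile Q = cal_in if year == 2010, nq(5)

* Carry the baseline group to every year of the same panel
* (year 2010 sorts first within id, so Q[1] is the baseline value)
bysort id (year): generate Qbase = Q[1]

tabulate Q, missing
```

Panels with a missing baseline end up with missing Qbase, which keeps them visible instead of forcing them into one quintile.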
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2018 StataCorp LLC