
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: correcting skewness of an indep variables


From   David Hoaglin <[email protected]>
To   [email protected]
Subject   Re: st: correcting skewness of an indep variables
Date   Sun, 21 Jul 2013 09:24:12 -0400

Dimitrie,

The skewness of a predictor variable is not necessarily a problem, and
neither is a spike at 0.  The first step should be to examine whether
the relation between the dependent variable and each of the predictors
(in the full regression model) departs systematically from being
linear.  Various plots of residuals can help you do this.
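As a rough illustration of what such a residual check looks for, here is a minimal sketch in plain Python (not Stata, and not taken from this thread). It fits a straight line to one predictor and averages the residuals within low, middle, and high bins of that predictor. The data and the helper names `fit_line` and `binned_residual_means` are made up for the example, and a real check would use residuals from the full regression model rather than a one-predictor fit.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def binned_residual_means(x, y, n_bins=3):
    """Mean residual from the straight-line fit within equal-count bins of x."""
    a, b = fit_line(x, y)
    # Sort (x, residual) pairs by x so that bins run from low x to high x.
    pairs = sorted((xi, yi - (a + b * xi)) for xi, yi in zip(x, y))
    size = len(pairs) // n_bins
    means = []
    for k in range(n_bins):
        stop = (k + 1) * size if k < n_bins - 1 else len(pairs)
        means.append(mean(r for _, r in pairs[k * size:stop]))
    return means

# Hypothetical data: y is quadratic in x, so a straight line has the wrong shape.
x = list(range(12))
y = [xi ** 2 for xi in x]
curved = binned_residual_means(x, y)

# Hypothetical data: y is exactly linear in x, so no pattern should remain.
straight = binned_residual_means(x, [2 + 3 * xi for xi in x])
```

Bin means that change sign systematically across the range of the predictor (as in `curved`) point to a departure from linearity; bin means near zero throughout (as in `straight`) do not. A plot of the residuals against the predictor shows the same thing visually.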

If the data on the dependent variable when the predictor is 0 behave
differently from the data on the dependent variable when the predictor
is > 0, you may need to model the two parts separately (as a sort of
mixture).  You can try omitting all the observations in which the
predictor is 0 and fitting a separate regression to the remaining
data.
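A minimal sketch of that split, in plain Python with made-up numbers (the variable names and data are hypothetical, and only one predictor is used): summarize the dependent variable where the predictor is 0, fit a line to the observations where the predictor is > 0, and compare the two.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

# Hypothetical data with a spike at x == 0.
x = [0, 0, 0, 0, 1, 2, 3, 4, 5]
y = [10.0, 12.0, 11.0, 11.0, 3.0, 5.0, 7.0, 9.0, 11.0]

# Part 1: the dependent variable when the predictor is 0.
y_at_zero = [yi for xi, yi in zip(x, y) if xi == 0]
mean_at_zero = mean(y_at_zero)

# Part 2: a separate regression on the observations with predictor > 0.
x_pos = [xi for xi in x if xi > 0]
y_pos = [yi for xi, yi in zip(x, y) if xi > 0]
intercept, slope = fit_line(x_pos, y_pos)

# How far the zero group sits from where the positive-part line
# would put it when extrapolated back to x == 0.
gap = mean_at_zero - intercept
```

A `gap` that is large relative to the residual scatter suggests that the zeros behave differently from the rest of the data. In this one-predictor setting, `gap` is also the coefficient that an indicator for "predictor equals 0" would receive if it were added to a regression on all the data alongside the linear term.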

In a response on Cross Validated, you mentioned that your data came
from natural disasters over a span of years.  Should time be a
predictor in your models?

Without more information on your data, I can only offer general suggestions.

David Hoaglin

On Sun, Jul 21, 2013 at 7:31 AM, Mihes, Dimitrie
<[email protected]> wrote:
> Apologies, allow me to correct myself. The issue I've mentioned has also been addressed in
> http://stats.stackexchange.com/questions/64714/count-data-as-an-independent-variable-in-ols-using-a-dummy-variable-the-variab?noredirect=1#comment124994_64714
>
> However, the proposed solution seems to be in contrast with that proposed in this thread (which I had mistakenly not mentioned)
> http://www.stata.com/statalist/archive/2010-03/msg01034.html
>
> From my understanding, the former suggests using a dummy variable to account for a spike at 0 (for a predictor based on count data) only when zero means unobserved or truncated data, whereas the latter suggests either looking for a non-linear relationship between the variables (in which case a log transformation is proposed) or adding a dummy variable plus the skewed variable linearly, even when the zeros represent the true value.
> I am conflicted between the two: the former suggests that the dummy variable is useless when zeros are the observed values, while the latter, which advocates this technique when 0 is the true value, lacks a more elaborate explanation of how to interpret the dummy alongside the linear variable and of how the dummy variable controls for the spike at 0.
>
> Moreover, using a log-transformation renders the 0 values as "missing values".
>
> Thanks for your consideration.
