Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


RE: st: correcting skewness of an indep variables


From   "Lachenbruch, Peter" <[email protected]>
To   "[email protected]" <[email protected]>
Subject   RE: st: correcting skewness of an indep variables
Date   Sun, 21 Jul 2013 18:40:51 +0000

Try writing a model for the data: let y be the response if it's not zero, and let d be an indicator for 0. Then
f(y,d) = p^d * {[1-p] h(y)}^(1-d)
where p is the probability of a zero and h(y) is the density of the nonzero responses. The likelihood is straightforward and you can develop appropriate estimates. You can make this a regression with covariates, or you can do a Wilcoxon test on the data if you have two groups (or a Kruskal-Wallis test for k groups).
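
As a rough illustration of why this likelihood is easy to work with: the log-likelihood separates into a Bernoulli term in p and a term involving only h, so each part can be maximised on its own. The sketch below is in Python rather than Stata and assumes, purely for illustration, that h is lognormal. The function names and the data are made up and do not come from this thread.

```python
import math


def fit_two_part(values):
    """MLE for f(y,d) = p^d * {(1-p) h(y)}^(1-d), ASSUMING h is lognormal.

    values: non-negative responses, with exact zeros forming the spike.
    Returns (p_hat, mu_hat, sigma_hat).
    """
    positives = [v for v in values if v > 0]
    n = len(values)
    n_pos = len(positives)
    # Bernoulli part: the MLE of p is the observed share of zeros.
    p_hat = (n - n_pos) / n
    # Lognormal part: MLEs are the mean and (divide-by-n) s.d. of log(y), y > 0.
    logs = [math.log(v) for v in positives]
    mu_hat = sum(logs) / n_pos
    sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in logs) / n_pos)
    return p_hat, mu_hat, sigma_hat


def log_likelihood(values, p, mu, sigma):
    """Log-likelihood of the two-part model under the lognormal assumption."""
    total = 0.0
    for v in values:
        if v == 0:
            total += math.log(p)  # contribution of a zero (d = 1)
        else:
            z = (math.log(v) - mu) / sigma
            log_h = -math.log(v * sigma * math.sqrt(2 * math.pi)) - 0.5 * z * z
            total += math.log(1 - p) + log_h  # contribution of a nonzero y (d = 0)
    return total


# Made-up illustrative data with a spike at zero (not from the thread).
y = [0, 0, 0, 0, 1, 2, 4, 8]
p_hat, mu_hat, sigma_hat = fit_two_part(y)
ll_at_mle = log_likelihood(y, p_hat, mu_hat, sigma_hat)
```

With covariates, the same separation would suggest a binary model (such as a logit) for d and a separate regression for the nonzero y. That is an extension of this sketch, not something it implements.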

Peter A. Lachenbruch,
Professor (retired)
________________________________________
From: [email protected] [[email protected]] on behalf of Mihes, Dimitrie [[email protected]]
Sent: Sunday, July 21, 2013 10:22 AM
To: [email protected]
Subject: RE: st: correcting skewness of an indep variables

David,

With regard to your question, time is not a predictor in my model, as natural disasters are triggered naturally and randomly. More precisely, the unit of analysis is each natural disaster to which the US contributed between 1992 and 2004.

Returning to the issue of a linear relationship between the predictor and the outcome: I regressed the amount of aid (logged) on the number of articles on each event (a count) and then ran -cprplot no_of_articles, lowess lsopts(bwidth(1))-, both with and without the zero values. The relationship seemed non-linear, as confirmed by -ovtest- (p = 0.0083). Even so, the bivariate relationship between aid and the number of articles was significant at p<0.001. However, after I removed some of the outliers in the predictor and ran the same tests, with and without the zero values, the relationship became linear, as confirmed by the graph and by -ovtest- (p = 0.9669).

Nevertheless, my primary concern was that the skewness would affect the validity of the p-value in the full regression model, as "no of articles" is almost always significant (p<0.001), even when clustering or using robust standard errors, and even after removing outliers as well as zero values.


________________________________________
From: [email protected] [[email protected]] on behalf of David Hoaglin [[email protected]]
Sent: 21 July 2013 14:24
To: [email protected]
Subject: Re: st: correcting skewness of an indep variables

Dimitrie,

The skewness of a predictor variable is not necessarily a problem, and
neither is a spike at 0.  The first step should be to examine whether
the relation between the dependent variable and each of the predictors
(in the full regression model) departs systematically from being
linear.  Various plots of residuals can help you do this.

If the data on the dependent variable when the predictor is 0 behave
differently from the data on the dependent variable when the predictor
is > 0, you may need to model the two parts separately (as a sort of
mixture).  You can try omitting all the observations in which the
predictor is 0 and fitting a separate regression to the remaining
data.
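
A minimal sketch of this suggestion, in Python with made-up numbers (the variable names `articles` and `log_aid` and the helper `ols_line` are hypothetical): fit a line to the observations with a positive predictor only, then compare the mean outcome at zero with what that line would predict at zero.

```python
def ols_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up illustrative data (not from the thread): a count predictor with a
# spike at zero and a logged outcome.
articles = [0, 0, 0, 1, 2, 3, 4]
log_aid = [5.0, 6.0, 7.0, 3.0, 5.0, 7.0, 9.0]

# Part 1: observations where the predictor is zero, summarised by their mean.
y_zero = [yi for xi, yi in zip(articles, log_aid) if xi == 0]
zero_mean = sum(y_zero) / len(y_zero)

# Part 2: a separate regression on the observations with a positive predictor.
x_pos = [xi for xi in articles if xi > 0]
y_pos = [yi for xi, yi in zip(articles, log_aid) if xi > 0]
intercept, slope = ols_line(x_pos, y_pos)

# How far the zero group sits from the line extrapolated back to zero.
gap_at_zero = zero_mean - intercept
```

The gap computed here is also what the coefficient on a zero-indicator would estimate if the indicator were added alongside the linear term in a single pooled least-squares fit: the indicator absorbs the zero observations, so the slope and intercept are determined by the positive observations alone.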

In a response on Cross Validated, you mentioned that your data came
from natural disasters over a span of years.  Should time be a
predictor in your models?

Without more information on your data, I can only offer general suggestions.

David Hoaglin

On Sun, Jul 21, 2013 at 7:31 AM, Mihes, Dimitrie
<[email protected]> wrote:
> Apologies, allow me to correct myself. The issue I've mentioned has also been addressed in
> http://stats.stackexchange.com/questions/64714/count-data-as-an-independent-variable-in-ols-using-a-dummy-variable-the-variab?noredirect=1#comment124994_64714
>
> However, the proposed solution seems to be in contrast with that proposed in this thread (which I had mistakenly not mentioned)
> http://www.stata.com/statalist/archive/2010-03/msg01034.html
>
> From my understanding, the former suggests using a dummy variable to account for a spike at 0 (for a predictor based on count data) only when zero means unobserved or truncated data, whereas the latter suggests either looking for a non-linear relationship between the variables (in which case a log transformation is proposed) or adding a dummy variable plus the skewed variable linearly, even when the zeros represent the true value.
> I am conflicted between the two: the former suggests that the dummy variable is useless when zeros are the observed values, while the latter, which advocates this technique when 0 is the true value, lacks a fuller explanation of how to interpret the dummy alongside the linear variable and of the process through which the dummy variable controls for the spike at 0.
>
> Moreover, using a log-transformation renders the 0 values as "missing values".
>
> Thanks for your consideration.

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2018 StataCorp LLC