Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

RE: st: Right skewed (positive) dependent variable

From: "Lachenbruch, Peter"
To: "'statalist@hsphsun2.harvard.edu'"
Subject: RE: st: Right skewed (positive) dependent variable
Date: Thu, 10 Jun 2010 09:05:20 -0700

```There is also the issue of the effect of outliers on ladder or boxcox.  I have just computed my class grades.  Here are some results:

                         totalscore
-------------------------------------------------------------
      Percentiles      Smallest
 1%        26.74          26.74
 5%       41.942          35.95
10%       49.298          38.95       Obs                  66
25%       53.424         41.942       Sum of Wgt.          66

50%       61.756                      Mean           60.36615
                        Largest       Std. Dev.        10.349
75%       68.222         73.938
90%       72.964         77.066       Variance       107.1017
95%       73.938         79.508       Skewness       -.590668
99%         80.4           80.4       Kurtosis       3.668101

* The 26.74 is from a student who did not take the final and is likely an outlier.

Transformation         formula               chi2(2)       P(chi2)
------------------------------------------------------------------
cubic                  totals~e^3              2.13        0.345
square                 totals~e^2              0.02        0.992
identity               totals~e                5.68        0.058
square root            sqrt(totals~e)         12.30        0.002
log                    log(totals~e)          21.22        0.000
1/(square root)        1/sqrt(totals~e)       31.54        0.000
inverse                1/totals~e             42.27        0.000
1/square               1/(totals~e^2)         61.95        0.000
1/cubic                1/(totals~e^3)             .        0.000

* This suggests that the best transformation is to square totalscore.  I don't regard this as a happy situation, so I exclude the low score.

Transformation         formula               chi2(2)       P(chi2)
------------------------------------------------------------------
cubic                  totals~e^3              2.96        0.228
square                 totals~e^2              0.70        0.705
identity               totals~e                0.77        0.681
square root            sqrt(totals~e)          2.95        0.228
log                    log(totals~e)           6.54        0.038
1/(square root)        1/sqrt(totals~e)       11.18        0.004
inverse                1/totals~e             16.76        0.000
1/square               1/(totals~e^2)         29.23        0.000
1/cubic                1/(totals~e^3)         41.40        0.000

* Now the square and identity are about the same - I'd go with the identity.  For grading purposes, the centile command would give me a simple way of finding cutoffs - in fact, I had gone through the grades manually and came up with a set of letter grades that seemed to match the centiles pretty well.   In my experience, students sort themselves into natural groups.

Tony

Peter A. Lachenbruch
Department of Public Health
Oregon State University
Corvallis, OR 97330
Phone: 541-737-3832
FAX: 541-737-4001

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Maarten buis
Sent: Thursday, June 10, 2010 8:51 AM
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Right skewed (positive) dependent variable

--- On Thu, 10/6/10, SURYADIPTA ROY wrote:
> However, as I look at my program now, I discover
> the source of the anomaly- my transformation
> was newvar=ln(1+oldvar).. that explains.

Are there 0s in your dependent variable (oldvar)?
If there are, then you really have no choice other
than to go the -glm- route. There are ways of getting
a meaningful interpretation out of a log-transformed
dependent variable, but no such way exists for the
transformation log(oldvar + some constant), and
leaving the constant out is no solution either, as
that means that the 0s will be recoded to missing
values. This may also explain your non-normality:
is there a spike at 0? If that is the case, then
there can be no transformation that will lead to
a normal distribution. In that case you could
consider modeling the zeros separately using -zip-.
It is usually used for counts, but can also be
used for continuous variables in a quasi-likelihood
kind of way, by specifying the -robust- option.

Hope this helps,
Maarten

--------------------------
Maarten L. Buis
Institut fuer Soziologie
Universitaet Tuebingen
Wilhelmstrasse 36
72074 Tuebingen
Germany

http://www.maartenbuis.nl
--------------------------

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

```
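
The two suggestions in this thread (checking how much a single low score drives the -ladder- results, and using -glm- rather than a log transform when the outcome contains zeros) could be tried along the following lines. This is only a minimal sketch. The variable totalscore comes from the post; y, x1, and x2 are hypothetical names for a nonnegative outcome and two covariates, and the Poisson family with a log link is one common choice for the -glm- route, not something the posts specify. The cutoff of 30 is chosen only because it excludes the single score of 26.74 shown in the output above, and the percentiles passed to -centile- are illustrative.

```stata
* Sketch only: totalscore is from the post; y, x1, x2 are hypothetical.

* 1. How sensitive is -ladder- to the one low score?
summarize totalscore, detail
ladder totalscore                       // all 66 scores
ladder totalscore if totalscore > 30    // excludes only the 26.74 score

* Percentiles as candidate grade cutoffs (these percentiles are illustrative)
centile totalscore, centile(10 25 50 75 90)

* 2. Outcome with zeros: model the mean on the log scale instead of
*    transforming the outcome, so the zeros stay in the estimation sample
count if y == 0
glm y x1 x2, family(poisson) link(log) vce(robust)
```

With the log link the coefficients can be read as effects on the log of the expected outcome, and the robust standard errors are what allow the Poisson family to be used in a quasi-likelihood way for a continuous outcome.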