RE: st: Right skewed (positive) dependent variable

From   "Lachenbruch, Peter" <>
To   "''" <>
Subject   RE: st: Right skewed (positive) dependent variable
Date   Thu, 10 Jun 2010 09:05:20 -0700

There is also the issue of the effect of outliers on -ladder- or -boxcox-.  I have just computed my class grades.  Here are some results:

      Percentiles      Smallest
 1%        26.74          26.74
 5%       41.942          35.95
10%       49.298          38.95       Obs                  66
25%       53.424         41.942       Sum of Wgt.          66

50%       61.756                      Mean           60.36615
                        Largest       Std. Dev.        10.349
75%       68.222         73.938
90%       72.964         77.066       Variance       107.1017
95%       73.938         79.508       Skewness       -.590668
99%         80.4           80.4       Kurtosis       3.668101

* The 26.74 is from a student who did not take the final and is likely an outlier.

. ladder totalscore

Transformation         formula               chi2(2)       P(chi2)
cubic                  totals~e^3              2.13        0.345
square                 totals~e^2              0.02        0.992
identity               totals~e                5.68        0.058
square root            sqrt(totals~e)         12.30        0.002
log                    log(totals~e)          21.22        0.000
1/(square root)        1/sqrt(totals~e)       31.54        0.000
inverse                1/totals~e             42.27        0.000
1/square               1/(totals~e^2)         61.95        0.000
1/cubic                1/(totals~e^3)             .        0.000

* This suggests that the best transformation is the square of totalscore.  I don't regard this as a happy situation, so I exclude the low score.

. ladder totalscore if totalscore>30

Transformation         formula               chi2(2)       P(chi2)
cubic                  totals~e^3              2.96        0.228
square                 totals~e^2              0.70        0.705
identity               totals~e                0.77        0.681
square root            sqrt(totals~e)          2.95        0.228
log                    log(totals~e)           6.54        0.038
1/(square root)        1/sqrt(totals~e)       11.18        0.004
inverse                1/totals~e             16.76        0.000
1/square               1/(totals~e^2)         29.23        0.000
1/cubic                1/(totals~e^3)         41.40        0.000

* Now the square and identity are about the same, so I'd go with the identity.  For grading purposes, the -centile- command would give me a simple way of finding cutoffs. In fact, I had gone through the grades manually and come up with a set of letter grades that seemed to match the centiles pretty well.  In my experience, students sort themselves into natural groups.
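
To make the outlier point concrete outside Stata, here is a minimal Python sketch. The scores are made up for illustration (they are not the class data above), and the skewness function is a plain moment-based estimate written here as an assumption, not a reproduction of what -ladder- computes. It only shows how much a single low score can move the skewness that a choice of transformation reacts to.

```python
from statistics import mean

def skewness(values):
    """Moment-based skewness: m3 / m2**1.5, using moments divided by n."""
    n = len(values)
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical scores, NOT the class data from the post.
scores = [52, 55, 58, 60, 62, 64, 66, 68, 70]
# Add one hypothetical student who skipped the final.
with_outlier = scores + [25]

skew_clean = skewness(scores)
skew_outlier = skewness(with_outlier)
print(f"skewness without the low score: {skew_clean:.3f}")
print(f"skewness with the low score:    {skew_outlier:.3f}")
```

With these made-up numbers the single low score makes the distribution much more left skewed, which is the kind of shift that would push a search over power transformations toward the square or the cube.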


Peter A. Lachenbruch
Department of Public Health
Oregon State University
Corvallis, OR 97330
Phone: 541-737-3832
FAX: 541-737-4001

-----Original Message-----
From: [] On Behalf Of Maarten Buis
Sent: Thursday, June 10, 2010 8:51 AM
Subject: Re: st: Right skewed (positive) dependent variable

--- On Thu, 10/6/10, SURYADIPTA ROY wrote:
> However, as I look at my program now, I discover
> the source of the anomaly: my transformation
> was newvar=ln(1+oldvar). That explains it.

Are there 0s in your dependent variable (oldvar)?
If there are, then you really have no choice other
than to go the -glm- route. There are ways of getting
a meaningful interpretation out of a log-transformed
dependent variable, but no such way exists for the
transformation log(oldvar + some constant), and
leaving the constant out is no solution either, as
that means that the 0s will be recoded to missing
values. This may also explain your non-normality:
is there a spike at 0? If that is the case, then
there can be no transformation that will lead to
a normal distribution. In that case you could
consider modeling the zeros separately using -zip-.
It is usually used for counts, but can also be
used for continuous variables in a quasi-likelihood
kind of way, by specifying the -robust- option.
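
The two points about zeros can be checked with a small Python sketch. The values below are hypothetical, and safe_log is a helper written for this illustration (it returns None to stand in for a missing value). The sketch shows that a plain log loses every zero, and that log(1 + oldvar) keeps the zeros but leaves the spike exactly as large as before, since a monotone transformation maps tied values to tied values.

```python
import math

# Hypothetical outcome with a spike at zero (made-up values).
oldvar = [0, 0, 0, 0, 1.5, 2.0, 3.2, 7.5, 12.0]

def safe_log(x):
    """Return log(x), or None (standing in for 'missing') when x <= 0."""
    return math.log(x) if x > 0 else None

plain_log = [safe_log(x) for x in oldvar]   # zeros become missing
log1p = [math.log1p(x) for x in oldvar]     # ln(1 + oldvar), zeros stay at 0

n_missing = sum(v is None for v in plain_log)
spike_before = oldvar.count(min(oldvar))
spike_after = log1p.count(min(log1p))

print("observations lost under log(oldvar):", n_missing)
print("size of spike before ln(1+oldvar):  ", spike_before)
print("size of spike after ln(1+oldvar):   ", spike_after)
```

The spike survives the transformation unchanged, so no choice of constant or power will turn such a variable into something normally distributed.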

Hope this helps,

Maarten L. Buis
Institut fuer Soziologie
Universitaet Tuebingen
Wilhelmstrasse 36
72074 Tuebingen

