



Re: st: ladder question for right-skewed variable


From   Nick Cox <[email protected]>
To   "[email protected]" <[email protected]>
Subject   Re: st: ladder question for right-skewed variable
Date   Fri, 26 Apr 2013 08:49:03 +0100

In addition to David's good advice --

everyone should read his classic exposition

Hoaglin, D.C. 1988. Transformations in everyday experience. Chance
1(4): 40--45  --

a rough analysis is possible just from the nine quantiles shown by
-summarize, detail-.

For something like this I fire up Mata as a friendly calculator:

: y = (1,2,3,6,15.5, 82, 436.5, 1251,5953)'

: strofreal((y , sqrt(y) , ln(y), -1:/y), "%3.2f")
             1         2         3         4
    +-----------------------------------------+
  1 |     1.00      1.00      0.00     -1.00  |
  2 |     2.00      1.41      0.69     -0.50  |
  3 |     3.00      1.73      1.10     -0.33  |
  4 |     6.00      2.45      1.79     -0.17  |
  5 |    15.50      3.94      2.74     -0.06  |
  6 |    82.00      9.06      4.41     -0.01  |
  7 |   436.50     20.89      6.08     -0.00  |
  8 |  1251.00     35.37      7.13     -0.00  |
  9 |  5953.00     77.16      8.69     -0.00  |
    +-----------------------------------------+

(Columns: 1 raw values, 2 square root, 3 natural logarithm, 4 negative
reciprocal. Rows: the 1, 5, 10, 25, 50, 75, 90, 95 and 99% points.)

The analysis could be extended by adding in the 4 smallest and 4
largest too, but this is enough to give a hint. The data are all
positive, so all the standard transformations are candidates.

The results underline what could be guessed just by looking at the
output of -summarize-.

1. Square root reduces skewness, but not by much.

2. (Negative) reciprocal just reverses the problem: it overshoots and
turns right skew into left skew.

3. Logarithmic transformation looks the best bet, even though the
distribution remains right skewed. The evidence of the 4 largest
values is that you have some outliers that are likely to remain
moderate outliers on any reasonable transformation.
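
Points 1 to 3 can be checked roughly in numbers. This is only a sketch,
not a formal test. For a symmetric distribution the distance from the
median up to the 75% point matches the distance down to the 25% point,
and likewise for the 10/90, 5/95 and 1/99 pairs. The Mata lines below
reuse the nine quantiles and divide upper spread by lower spread on
each scale. The names t, med, upper and lower are mine, purely for
illustration.

```stata
mata:
// nine quantiles from -summarize, detail-: 1, 5, 10, 25, 50, 75, 90, 95, 99%
y = (1, 2, 3, 6, 15.5, 82, 436.5, 1251, 5953)'

// columns: raw, square root, natural log, negative reciprocal
t = (y, sqrt(y), ln(y), -1:/y)

// row 5 holds the median on each scale
med = t[5, .]

// rows pair up as (25,75), (10,90), (5,95), (1,99)
upper = t[(6::9), .] :- med
lower = med :- t[(4::1), .]

// ratios near 1 suggest symmetry; above 1 right skew; below 1 left skew
upper :/ lower
end
```

On this evidence the log column should stay above 1 but much closer to
it than the raw or square root columns, while the negative reciprocal
column falls below 1.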

Caveats on various levels:

1. The assumption here is that transform of quantile = quantile of
transform, which is solid in principle for monotonic transforms, but
the small detail is that Stata averages adjacent order statistics to
estimate quantiles, so you might see some small discrepancies.

2. I've not shown you reciprocal square root, not a transformation I
find attractive, _unless_ there are dimensional grounds (from physics,
engineering, ...) for square rooting. The variable sounds like a
count, so that is ruled out if so.

3. Symmetry of marginal distribution is not a direct assumption for
much, but in practice you are likely to find analyses easier if you
transform a skew variable....

4. ... or analyse it using an appropriate -glm-. You don't say what
follows this, but -glm, link(log)- is what springs to mind.
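
Caveat 1 can be seen with a toy calculation. With an even number of
observations the reported median is the average of the two middle
order statistics, so the log of that median is not quite the median of
the logs. The values 15 and 16 below are hypothetical middle values,
chosen only because they would average to 15.5; I don't know the
actual data.

```stata
mata:
// hypothetical middle order statistics (an assumption, not from the data)
a = 15
b = 16

// left: log of the averaged median; right: average of the two logs
ln((a + b)/2), (ln(a) + ln(b))/2
end
```

The two results differ only slightly, which is why the discrepancy is
a small detail rather than a reason to distrust the rough analysis.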

There remains a mystery of why -ladder- didn't perform for you. You
don't show for -ladder- _exactly_ what you typed or _exactly_ what
Stata showed by way of results, but I can't see any reason for
-ladder- not to perform here.
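
For the record, this is the kind of exact input worth posting. It is a
minimal sketch, assuming the dataset with disp_2000 is in memory;
ln_disp is a new variable name I made up for illustration.

```stata
* numerical ladder of powers, then the matching histograms
ladder disp_2000
gladder disp_2000

* log scale as a follow-up check (ln_disp is a hypothetical name)
generate ln_disp = ln(disp_2000)
summarize ln_disp, detail
```

Copying the commands and the full results window into the post lets
others see exactly what Stata did.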

Nick
[email protected]


On 26 April 2013 01:44, David Hoaglin <[email protected]> wrote:
> Gabriel,
>
> The ratio of the largest value to the smallest value is quite large,
> so a transformation is likely to be useful.  As a first step ("first
> aid"), I suggest that you try the logarithm (base 10).
>
> Usually the context of the data plays a role in the choice of a
> transformation, so that the result is meaningful.  What is the nature
> of disp_2000?
>
> With 1010 observations you should check whether the data has some
> structure (e.g., two or more modes or groups), for example, by making
> a histogram with a sizable number of bins (say 25 or so).  If you find
> structure, you will need to deal with that also.
>
> David Hoaglin
>
> On Thu, Apr 25, 2013 at 8:11 PM, Gabriel Nelson
> <[email protected]> wrote:

>> I have a variable that is right-skewed.  I used the ladder command
>> to see suggested transformations. However, no transformations appeared
>> in the output. I'm guessing that this does not mean the raw form is
>> better, because there is an option for 'raw' on this list.
>>
>> Here is the output for the sum, detail command for the variable:
>>
>>
>>
>> sum disp_2000, detail
>>
>>       Number displaced 2000 (if data unavailable go up
>>                            to 2003
>> -------------------------------------------------------------
>>
>>       Percentiles      Smallest
>>  1%            1              1
>>  5%            2              1
>> 10%            3              1       Obs                1010
>> 25%            6              1       Sum of Wgt.        1010
>>
>>
>> 50%         15.5                      Mean           281.5297
>>                         Largest       Std. Dev.      1217.168
>> 75%           82           9421
>> 90%        436.5           9505       Variance        1481497
>>
>> 95%         1251          16255       Skewness       9.012044
>> 99%         5953          19569       Kurtosis       108.8061
>>

