Statalist



RE: st: RE: rowskew?


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   RE: st: RE: rowskew?
Date   Tue, 14 Oct 2008 17:47:43 +0100

This thread raises questions on several different levels. 

First, I've spent some time looking at the literature, partly in pursuit
of a longer-term project, and I'd underline that there are various
different formulae for moment measures of skewness and kurtosis. The
situation is not like that for s.d. and variance where in essence there
are only two defensible formulae. There are in this case several
co-existing, and I don't count algebraical equivalents as separate. I
don't think there is a clear-cut case for saying that one is right
rather than another. 

Two advantages of the formulae used by Matt are that they are the
simplest and that they are those used by [R] summarize, so that it can
more easily be checked whether results square with those of official
Stata. 

Second, I'd recommend using -double-s. In examples I've checked there is
no substantial difference, but using doubles is at least a nod to
numerical issues. 

Third, Matt's toy example threw up a kurtosis curiosum: 

With the formula used by Stata, the kurtosis of 3 values not all equal
to each other is always 1.5. 

Presumably the result is a direct consequence of the formula and obvious
when looked at the right way, but it was a surprise to me. (If all
values tie, then the variance is clearly zero, so that case is
indeterminate. But two tied values is not a problem.) 

This is not a proof, naturally, but an example: 

sysuse auto 
gen group = ceil(_n/3)

egen kurt = kurt(mpg) in 1/72 , by(group)
(8 missing values generated)

su kurt

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        kurt |        66         1.5           0        1.5        1.5

In practice, you shouldn't want to work with kurtosis of groups of 3, of
course. And, to be clear, I am sure Matt wasn't implying that at all. 
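
For what it is worth, here is a sketch of why 1.5 arises. It assumes the divisor-n moments m2 and m4 as used by [R] summarize; the symbols d_i, S_2 and S_4 are my own notation, not Stata's. Write d_1, d_2, d_3 for the deviations from the mean, so that they sum to zero.

```latex
% Deviations from the mean of three values: d_1 + d_2 + d_3 = 0.
% Notation (assumed): S_2 = \sum_i d_i^2, \quad S_4 = \sum_i d_i^4.
\begin{align*}
0 = (d_1 + d_2 + d_3)^2
  &= S_2 + 2\sum_{i<j} d_i d_j
  \;\Longrightarrow\; \sum_{i<j} d_i d_j = -\tfrac{1}{2} S_2, \\
\Bigl(\sum_{i<j} d_i d_j\Bigr)^2
  &= \sum_{i<j} d_i^2 d_j^2 + 2\, d_1 d_2 d_3\,(d_1 + d_2 + d_3)
   = \sum_{i<j} d_i^2 d_j^2
  \;\Longrightarrow\; \sum_{i<j} d_i^2 d_j^2 = \tfrac{1}{4} S_2^2, \\
S_2^2 &= S_4 + 2\sum_{i<j} d_i^2 d_j^2
  \;\Longrightarrow\; S_4 = \tfrac{1}{2} S_2^2, \\
\text{kurtosis} &= \frac{m_4}{m_2^2}
   = \frac{S_4/3}{(S_2/3)^2}
   = \frac{3\,S_4}{S_2^2}
   = \frac{3}{2}.
\end{align*}
```

The argument needs S_2 > 0, which is exactly the condition that the three values are not all equal.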

Fourth, an alternative solution is via a -reshape-, -egen- and -reshape-
back again. (This is likely to be much more practical than the -xpose-
suggested by Martin Weiss.) 
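
A minimal sketch of the -reshape- route, assuming the same four auto variables used elsewhere in this thread. The names id, which and x are made up for illustration. Note that -egen-'s skew() and kurt() use the same formulae as [R] summarize and ignore missing values within each group, so a row with one missing value gets statistics based on the remaining values.

```stata
sysuse auto, clear
local vlist "mpg rep78 trunk turn"

* identifier for each original observation (row)
gen long id = _n

* give the row variables a common stub so that -reshape- can stack them
local j = 0
foreach v of local vlist {
    local ++j
    rename `v' x`j'
}

* long layout: one observation per (id, variable)
reshape long x, i(id) j(which)

* moment-based measures within each original row, held as doubles
egen double rowskew = skew(x), by(id)
egen double rowkurt = kurt(x), by(id)

* back to the original layout, restoring the variable names
reshape wide x, i(id) j(which)
local j = 0
foreach v of local vlist {
    local ++j
    rename x`j' `v'
}

list `vlist' rowskew rowkurt in 1/5
```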

Fifth, and perhaps most important, who says that moment-based skewness
and kurtosis are the best measures, or indeed of much practical use? 

Why not (mean - median) / sd as a measure of skewness, which is based on
easy ingredients, bounded by [-1, 1] and in several ways easier to work
with? Why not L-moments (-findit lmoments-), known for almost 20 years
to behave much better? 
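
As a rough sketch of the first suggestion applied row-wise, assuming the same four auto variables as in the code quoted below: the names rmean, rmed, rsd and rowskew2 are arbitrary. -egen-'s rowsd() uses the n - 1 divisor, which can only make the ratio smaller in magnitude, so the [-1, 1] bound still holds. Rows with zero sd get missing.

```stata
sysuse auto, clear
local vlist "mpg rep78 trunk turn"

* row-wise ingredients; each of these functions ignores missing values
egen double rmean = rowmean(`vlist')
egen double rmed  = rowmedian(`vlist')
egen double rsd   = rowsd(`vlist')

* (mean - median) / sd for each observation
gen double rowskew2 = (rmean - rmed) / rsd

list `vlist' rowskew2 in 1/5
```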

Nick 
n.j.cox@durham.ac.uk 

Steven Samuels wrote:

This version adds a -qui- prefix to the replace statements and  
removes an unneeded -di- statement. This will remove uninformative   
output lines.  Note that statistics are not computed for observations  
with a missing value for any row variable.  To override this  
behavior, add "if !missing(`v')" to the end of the first -replace-  
statement.

> Here is Matt's code with a couple of macros to reduce typing.  I've
> subtracted 1 from rowN in the divisor and also subtracted 3 from the
> kurtosis formula Matt used.  See:
> http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
>


**************************CODE BEGINS**************************
sysuse auto,clear

****************************************************
* Create local macro vlist with variables to analyze
****************************************************
local vlist "mpg rep78 trunk turn"

egen rowmean = rowmean(`vlist')
egen rowN =rownonmiss(`vlist')

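forvalues i = 2/4 {
    * accumulate the sum of i-th powers of deviations from the row mean
    gen m`i' = 0
    foreach v of local vlist {
        qui replace m`i' = m`i' + (`v' - rowmean)^`i'
    }
    * divide by N - 1 to get the i-th moment
    qui replace m`i' = m`i'/(rowN - 1)
}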


gen rowskew = m3*m2^(-3/2)
gen rowkurt = m4*m2^(-2) -3

list `vlist' row* in 1/5
***************************CODE ENDS***************************
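
A sketch of the override described in the note above, for anyone who wants statistics based on the nonmissing values in each row. The only change to the code is the -if- qualifier on the first -replace-. It remains consistent because rowmean() and rownonmiss() already ignore missing values. The final -list- condition is just an illustration that picks out the rows affected.

```stata
sysuse auto, clear
local vlist "mpg rep78 trunk turn"

egen rowmean = rowmean(`vlist')
egen rowN = rownonmiss(`vlist')

forvalues i = 2/4 {
    gen m`i' = 0
    foreach v of local vlist {
        * skip missing values rather than letting them propagate
        qui replace m`i' = m`i' + (`v' - rowmean)^`i' if !missing(`v')
    }
    qui replace m`i' = m`i'/(rowN - 1)
}

gen rowskew = m3*m2^(-3/2)
gen rowkurt = m4*m2^(-2) - 3

* rows with a missing rep78 now get results from the other three variables
list `vlist' row* if missing(rep78)
```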

On Oct 13, 2008, at 11:21 PM, Matt Spittal wrote:

> You can use the moments about the mean to calculate skew and  
> kurtosis for a row of variables.  Imagine that you want to do this  
> for the variables weight, length and price from the auto dataset.
>
> 	sysuse auto, clear
>
> 	// get mean and N
> 	egen rowmean = rowmean(weight length price)
> 	egen rowN = rownonmiss(weight length price)
>
> 	// calculate 2nd, 3rd, and 4th moments about the mean
> 	gen m2 = 1/rowN * ((weight - rowmean)^2 + (length - rowmean)^2 + ///
> 		(price - rowmean)^2)
> 	gen m3 = 1/rowN * ((weight - rowmean)^3 + (length - rowmean)^3 + ///
> 		(price - rowmean)^3)
> 	gen m4 = 1/rowN * ((weight - rowmean)^4 + (length - rowmean)^4 + ///
> 		(price - rowmean)^4)
>
> 	// calculate skew and kurtosis
> 	gen rowskew = m3*m2^(-3/2)
> 	gen rowkurt = m4*m2^(-2)
>
> 	list weight length price rowskew rowkurt in 1/10

jeheyman wrote:

> Is it possible to calculate essentially a rowskew and rowkurtosis in
> the same way that egen calculates rowmean?
>
> For each observation I have 18 variables and I need, obviously, the
> three distribution measures.  Mean is trivial but the other two are
> proving elusive. 
