
From: "Matt Spittal" <Matt.Spittal@cancervic.org.au>
To: <statalist@hsphsun2.harvard.edu>
Subject: RE: st: RE: rowskew?
Date: Wed, 15 Oct 2008 10:50:04 +1100

Thanks to Steve and Nick for their additional thoughts and improvements on calculating skew and kurtosis for a row of observations, and for highlighting the broader issues.

I am sure jeheyman has good reasons for doing these calculations on a row of observations, and I wanted to give an answer that helped them do this. My preference, however, would be to follow Nick's advice and -reshape- the data, perform the calculations, then (if necessary) -reshape- it back again. Working with 3, 5 or even 10 variables in a row is probably okay, but it seems to me that things could become quite cumbersome as more variables are added. A major advantage of reshaping the data is that you can also easily graph the distributions and quickly get a feel for the issues you are trying to summarise with the skew and kurtosis statistics.

--
Matt
matt.spittal@cancervic.org.au

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Nick Cox
Sent: Wednesday, 15 October 2008 3:48 AM
To: statalist@hsphsun2.harvard.edu
Subject: RE: st: RE: rowskew?

This thread raises questions on several different levels.

First, I've spent some time looking at the literature, partly in pursuit of a longer-term project, and I'd underline that there are various different formulae for moment measures of skewness and kurtosis. The situation is not like that for s.d. and variance where in essence there are only two defensible formulae. There are in this case several co-existing, and I don't count algebraical equivalents as separate. I don't think there is a clear-cut case for saying that one is right rather than another. Two advantages of the formulae used by Matt are that they are the simplest and that they are those used by [R] summarize, so that it can more easily be checked whether results square with those of official Stata.

Second, I'd recommend using -double-s.
In examples I've checked there is no substantial difference, but using doubles is at least a nod to numerical issues.

Third, Matt's toy example threw up a kurtosis curiosum: With the formula used by Stata, the kurtosis of 3 values not all equal to each other is always 1.5. Presumably the result is a direct consequence of the formula and obvious when looked at the right way, but it was a surprise to me. (If all values tie, then the variance is clearly zero, so that case is indeterminate. But two tied values is not a problem.) This is not a proof, naturally, but an example:

sysuse auto
gen group = ceil(_n/3)
egen kurt = kurt(mpg) in 1/72 , by(group)
(8 missing values generated)

su kurt

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        kurt |        66         1.5           0        1.5        1.5

In practice, you shouldn't want to work with kurtosis of groups of 3, of course. And, to be clear, I am sure Matt wasn't implying that at all.

Fourth, an alternative solution is via a -reshape-, -egen- and -reshape- back again. (This is likely to be much more practical than the -xpose- suggested by Martin Weiss.)

Fifth, and perhaps most important, who says that moment-based skewness and kurtosis are the best measures, or indeed of much practical use? Why not (mean - median) / sd as a measure of skewness, which is based on easy ingredients, bounded by [-1, 1] and in several ways easier to work with? Why not L-moments (-findit lmoments-), known for almost 20 years to behave much better?

Nick
n.j.cox@durham.ac.uk

Steven Samuels

This version adds a -qui- prefix to the replace statements and removes an unneeded -di- statement. This will remove uninformative output lines. Note that statistics are not computed for observations with a missing value for any row variable. To override this behavior, add "if !missing(`v')" to the end of the first -replace- statement.

> Here is Matt's code with a couple of macros to reduce typing.
> I've subtracted 1 from rowN (the divisor) and also subtracted 3 from
> the kurtosis formula Matt used. See:
> http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
>

**************************CODE BEGINS**************************
sysuse auto,clear
****************************************************
* Create local macro vlist with variables to analyze
****************************************************
local vlist "mpg rep78 trunk turn"
egen rowmean = rowmean(`vlist')
egen rowN = rownonmiss(`vlist')
forvalues i=2/4 {
    gen m`i' = 0
    foreach v of local vlist {
        qui replace m`i' = m`i' + (`v'-rowmean)^`i'
    }
    qui replace m`i' = m`i'/(rowN-1)
}
gen rowskew = m3*m2^(-3/2)
gen rowkurt = m4*m2^(-2) - 3
list `vlist' row* in 1/5
***************************CODE ENDS***************************

On Oct 13, 2008, at 11:21 PM, Matt Spittal wrote:

> You can use the moments about the mean to calculate skew and
> kurtosis for a row of variables. Imagine that you want to do this
> for the variables weight, length and price from the auto dataset.
>
> sysuse auto, clear
>
> // get mean and N
> egen rowmean = rowmean(weight length price)
> egen rowN = rownonmiss(weight length price)
>
> // calculate 2nd, 3rd, and 4th moments about the mean
> gen m2 = 1/rowN * ((weight - rowmean)^2 + (length - rowmean)^2 + ///
>     (price - rowmean)^2)
> gen m3 = 1/rowN * ((weight - rowmean)^3 + (length - rowmean)^3 + ///
>     (price - rowmean)^3)
> gen m4 = 1/rowN * ((weight - rowmean)^4 + (length - rowmean)^4 + ///
>     (price - rowmean)^4)
>
> // calculate skew and kurtosis
> gen rowskew = m3*m2^(-3/2)
> gen rowkurt = m4*m2^(-2)
>
> list weight length price rowskew rowkurt in 1/10

jeheyman

> Is it possible to calculate essentially a rowskew and rowkurtosis in
> the same way that egen calculates rowmean?
>
> For each observation I have 18 variables and I need, obviously, the
> three distribution measures. Mean is trivial but the other two are
> proving elusive.
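
On the kurtosis curiosum in Nick's message (three values that are not all equal always give 1.5 under the m4/m2^2 formula): the following is a short algebraic sketch of why that might be expected. It is not taken from the thread. The symbols d_i (deviations from the row mean), e_2 (sum of pairwise products of the deviations) and m_k (k-th moment about the mean, with divisor 3) are notation introduced here for illustration.

```latex
\begin{align*}
d_i &= x_i - \bar{x}, \qquad d_1 + d_2 + d_3 = 0, \qquad
m_k = \tfrac{1}{3}\sum_{i=1}^{3} d_i^{k} \\[4pt]
e_2 &= d_1 d_2 + d_1 d_3 + d_2 d_3 \\[4pt]
0 &= \Bigl(\sum_i d_i\Bigr)^{2} = \sum_i d_i^{2} + 2 e_2
\quad\Longrightarrow\quad e_2 = -\tfrac{1}{2}\sum_i d_i^{2} \\[4pt]
e_2^{2} &= \sum_{i<j} d_i^{2} d_j^{2} + 2\, d_1 d_2 d_3\,(d_1 + d_2 + d_3)
         = \sum_{i<j} d_i^{2} d_j^{2} \\[4pt]
\Bigl(\sum_i d_i^{2}\Bigr)^{2} &= \sum_i d_i^{4} + 2\sum_{i<j} d_i^{2} d_j^{2}
         = \sum_i d_i^{4} + 2 e_2^{2}
         = \sum_i d_i^{4} + \tfrac{1}{2}\Bigl(\sum_i d_i^{2}\Bigr)^{2} \\[4pt]
\sum_i d_i^{4} &= \tfrac{1}{2}\Bigl(\sum_i d_i^{2}\Bigr)^{2} \\[4pt]
\frac{m_4}{m_2^{2}} &= \frac{\tfrac{1}{3}\sum_i d_i^{4}}{\tfrac{1}{9}\bigl(\sum_i d_i^{2}\bigr)^{2}}
         = 3 \cdot \tfrac{1}{2} = 1.5
\end{align*}
```

The last step needs m_2 > 0, which is why the case of three tied values stays indeterminate, as Nick notes. A quick check by hand with the values 1, 2, 3 gives deviations -1, 0, 1, so the sum of squares and the sum of fourth powers are both 2 and the ratio is 3 * 2 / 4 = 1.5.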
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

