
RE: Re: st: Principal Components Analysis with count data

From   "Verkuilen, Jay" <>
To   "''" <>
Subject   RE: Re: st: Principal Components Analysis with count data
Date   Fri, 14 Aug 2009 12:19:33 -0400

Nick Cox wrote:

>There are various unstated assumptions and criteria that need to be
>spelled out for a fruitful discussion. 

>1. Continuous versus discrete. I don't know any reason why PCA might not
>be as helpful, or as useless, on discrete data (e.g. counts) as compared
>with continuous data.

Agreed. The main thing is that discrete variables tend to be quite skewed, and skewed variables have strongly attenuated correlations. Much of the apparent dimensionality in a PCA of such data is an artifact of that attenuation. The temptation is to assume that

     dimension = substantively interesting variation, 

but sadly this is often wrong. Instead, 

     dimension = systematic variation, 

which is far from the same thing.
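
To make the attenuation point concrete, here is a rough simulation sketch, not anything from the thread itself. It assumes a made-up data-generating model: two latent normal variables with correlation 0.8 (the value of `rho` is my assumption), each driving a Poisson count through a log link. The counts come out skewed, and their Pearson correlation should land well below the latent one.

```python
import math
import random

def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def poisson(rng, lam):
    """One Poisson draw by Knuth's multiplication method (fine for small lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

rng = random.Random(2009)   # fixed seed so the run is repeatable
rho = 0.8                   # assumed latent correlation (illustrative only)
n = 5000

latent_x, latent_y, count_x, count_y = [], [], [], []
for _ in range(n):
    z1 = rng.gauss(0, 1)
    z2 = rho * z1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
    latent_x.append(z1)
    latent_y.append(z2)
    # Log link with a small mean, so the counts are strongly right-skewed.
    count_x.append(poisson(rng, math.exp(z1 - 1)))
    count_y.append(poisson(rng, math.exp(z2 - 1)))

r_latent = pearson(latent_x, latent_y)
r_count = pearson(count_x, count_y)
print("latent correlation:", round(r_latent, 3))
print("count correlation: ", round(r_count, 3))
```

The gap between the two printed correlations is the kind of attenuation that can make a correlation-based PCA of counts look higher-dimensional than the underlying structure is.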

>I wouldn't think it useful for categorical
>variables, which I take to be a quite different issue. <

Well, correspondence analysis is essentially principal components for categorical variables: CA depends on the singular value decomposition of the indicator matrix for categorical data in much the same way that PCA (or biplotting) uses the SVD of the data matrix for continuous variables. There's a large literature on it, and Stata already has some nice procedures for it built in. See -mca- and then expect to do some reading.
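
For anyone who wants to see the indicator-matrix idea in miniature, here is a small sketch. It is an illustration under my own assumptions, not a description of how -mca- is implemented. The eight observations in `data` are hypothetical, and `leading_singular_value` is a helper written for this example: it uses power iteration in place of a full SVD so that the code needs only the standard library.

```python
import math

# Hypothetical data: two categorical variables observed on eight units.
data = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "x"),
        ("b", "y"), ("b", "y"), ("c", "y"), ("c", "y")]

n, Q = len(data), len(data[0])
levels = [sorted({row[q] for row in data}) for q in range(Q)]

# Indicator matrix Z: one 0/1 column per category of each variable.
Z = [[1.0 if row[q] == lev else 0.0 for q in range(Q) for lev in levels[q]]
     for row in data]
J = len(Z[0])

total = float(n * Q)
r = [sum(row) / total for row in Z]                             # row masses
c = [sum(Z[i][j] for i in range(n)) / total for j in range(J)]  # column masses

# Standardized residuals; CA is defined through the SVD of this matrix,
# much as PCA is defined through the SVD of the centered data matrix.
S = [[(Z[i][j] / total - r[i] * c[j]) / math.sqrt(r[i] * c[j])
      for j in range(J)] for i in range(n)]

def leading_singular_value(M, iters=500):
    """Largest singular value of M by power iteration on M'M (illustrative helper)."""
    rows, cols = len(M), len(M[0])
    v = [1.0 + 0.1 * j for j in range(cols)]  # arbitrary starting vector
    for _ in range(iters):
        u = [sum(M[i][j] * v[j] for j in range(cols)) for i in range(rows)]  # u = M v
        w = [sum(M[i][j] * u[i] for i in range(rows)) for j in range(cols)]  # w = M'u
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(M[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in u))

# Total inertia is the sum of all squared singular values of S,
# which equals the sum of squared entries of S.
total_inertia = sum(x * x for row in S for x in row)
first_inertia = leading_singular_value(S) ** 2

print("total inertia:          ", round(total_inertia, 6))
print("first principal inertia:", round(first_inertia, 6))
print("share on first axis:    ", round(first_inertia / total_inertia, 3))
```

For an indicator matrix the total inertia works out to J/Q - 1 (number of categories over number of variables, minus one), whatever the data look like. That is one reason raw inertia percentages from an indicator-matrix analysis need careful reading.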

>2. Skewed versus symmetric. In principle, PCA might work very well even
>if some of the variables were highly skewed. In practice, skewness quite
>often goes together with nonlinearities, and a transformation might help
>in either case. <



