

From: Stas Kolenikov <skolenik@gmail.com>

To: statalist@hsphsun2.harvard.edu

Subject: Re: st: polychoric for huge data sets

Date: Wed, 5 Sep 2012 08:54:02 -0500

I believe that a couple of weeks ago, I posted code here on Statalist that used -polychoric- together with -ssd- to come up with an SEM based on polychoric correlations.

Without your specifying what exactly "does not work", there is little way to tell how to try to fix it. -polychoric- may run into some stupid limits on the length of a string expression (244 characters), and renaming your variables to -a1-, -a2-, ..., -a99- could fix that. -polychoric- can also find itself in situations where it cannot produce a decent estimate (which probably means that your data do not fit the polychoric assumption of an underlying normal distribution aggregated into ordinal categories). In such situations, it may fail to converge, or may return a missing value. Other than that, -polychoric- should work, but it will probably take a day or so on this data set.

I have reservations about the utility of any complex model with 40 variables, especially if they don't come from a carefully designed psychometric instrument, but rather are kitchen-sunk into whatever statistical model is of interest, be that an -xtabond- model, a support-vector machine, a neural network, or a factor analysis model.

Spearman correlations won't be very informative unless you have a dozen or so categories, at which point you can just as well consider your data continuous.

David Roodman's -cmp- is an alternative. I know that he timed it against -gllamm-, and, as far as I can recall, performance was roughly similar in terms of time vs. precision obtained. (-cmp- is much faster per likelihood computation, but -gllamm- uses the most efficient scheme of allocating the points, so the two computational advantages more or less balance one another.) His paper provides an example of how to use -cmp- in place of -polychoric- (although I remember providing some updates to his syntax to make it work better). My testing of -cmp- against -polychoric- showed that the results match up perfectly.
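The polychoric assumption mentioned above (an underlying normal distribution aggregated into ordinal categories) can be illustrated with a small simulation. This is only a sketch in Python rather than Stata, and it does not reproduce what -polychoric- or -cmp- compute: the latent correlation of 0.6, the cut points, the seed, and all function names are assumptions chosen for illustration. It shows that an ordinary Pearson correlation computed on the coarsened categories comes out weaker than the latent correlation that a polychoric estimate targets.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def categorize(z, cuts):
    """Map a latent value to an ordinal category 0..len(cuts)."""
    return sum(z > c for c in cuts)


def simulate(rho, n, cuts, seed=12345):
    """Draw n latent bivariate normal pairs with correlation rho,
    coarsen them at the given cut points, and return the Pearson
    correlation of the latent values and of the ordinal categories."""
    rng = random.Random(seed)
    latent_x, latent_y = [], []
    for _ in range(n):
        u, v = rng.gauss(0, 1), rng.gauss(0, 1)
        latent_x.append(u)
        latent_y.append(rho * u + math.sqrt(1 - rho ** 2) * v)
    ord_x = [categorize(z, cuts) for z in latent_x]
    ord_y = [categorize(z, cuts) for z in latent_y]
    return pearson(latent_x, latent_y), pearson(ord_x, ord_y)


# Hypothetical settings: three categories cut at -0.5 and 0.5.
latent_r, ordinal_r = simulate(rho=0.6, n=20000, cuts=(-0.5, 0.5))
print(f"latent correlation:           {latent_r:.3f}")
print(f"Pearson on ordinal categories: {ordinal_r:.3f}")
```

With only three categories the attenuation is clearly visible; adding more cut points narrows the gap, which is in line with the point about treating a dozen or so categories as continuous.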
40 variables is a significant stretch for -cmp- as well, and you need to know quite well how quasi-Monte Carlo methods work. David might be able to provide some hints; mine is that since the 40th prime is 173, there is little point in even trying this without scrambling.

On Wed, Sep 5, 2012 at 3:59 AM, Nick Cox <njcoxstata@gmail.com> wrote:

> Stas Kolenikov's -polychoric- package promises only principal component analysis. Depending on how you were brought up, that is distinct from factor analysis, or a limiting case of factor analysis, or a subset of factor analysis.
>
> The problem you report as "just can't handle it" with no details appears to be one of speed, rather than refusal or inability to perform.
>
> That aside, what is "appropriate" is difficult to answer. A recent thread indicated that many on this list are queasy about means or t-tests for ordinal data, so that would presumably put factor analysis or PCA of ordinal data beyond the pale. Nevertheless it remains popular.
>
> You presumably have the option of taking a random sample from your data and subjecting that to both (a) PCA of _ranked_ data (which is equivalent to PCA based on Spearman correlation) and (b) polychoric PCA. Then it would be good news for you if the substantive or scientific conclusions were the same, and a difference you need to think about otherwise. Here the random sample should be large enough to be substantial, but small enough to get results in reasonable time.
>
> Alternatively, you could be ruthless about which of your variables are most interesting or important. A preliminary correlation analysis would show which variables could be excluded because they are poorly correlated with anything else, and which could be excluded because they are very highly correlated with anything else. Even if you can get it, a PCA based on 40+ variables is often unwieldy to handle and even more difficult to interpret than one based on say 10 or so variables.
>
> Nick
>
> On Wed, Sep 5, 2012 at 3:37 AM, Timea Partos <Timea.Partos@cancervic.org.au> wrote:
>
>> I need to run a factor analysis on ordinal data. My dataset is huge (7000+ cases with 40+ variables) so I can't run the polychoric.do program written by Stas Kolenikov, because it just can't handle it.
>>
>> Does anyone know of a fast way to obtain the polychoric correlation matrix for very large data sets?
>>
>> Alternatively, I was thinking of running the factor analysis using the Spearman rho (rank-order correlations) matrix instead. Would this be appropriate?
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/

--
-- Stas Kolenikov, PhD, PStat (SSC) :: http://stas.kolenikov.name
-- Senior Survey Statistician, Abt SRBI :: work email kolenikovs at srbi dot com
-- Opinions stated in this email are mine only, and do not reflect the position of my employer
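The remark above that the 40th prime is 173 can be unpacked with a short sketch. It assumes that the quasi-Monte Carlo draws are Halton-type sequences that use the k-th prime as the base for dimension k; that is a common construction, but whether it matches the exact scheme used in -cmp- is an assumption here, and the function names are hypothetical. The code is Python, purely for illustration. It shows that unscrambled sequences in neighboring large bases (167 and 173) move almost in lockstep over the first 160 draws, which is the problem scrambling is meant to address.

```python
import math


def first_primes(count):
    """Return the first `count` primes by trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


def radical_inverse(index, base):
    """Van der Corput radical inverse: mirror the base-`base` digits
    of `index` around the radix point."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


primes = first_primes(40)
print("40th prime:", primes[-1])

n_draws = 160  # fewer draws than either large base


def halton_column(base):
    """First n_draws points of the unscrambled sequence in one base."""
    return [radical_inverse(i, base) for i in range(1, n_draws + 1)]


# Dimensions 1 and 2 (bases 2 and 3) versus dimensions 39 and 40.
low_r = pearson(halton_column(primes[0]), halton_column(primes[1]))
high_r = pearson(halton_column(primes[38]), halton_column(primes[39]))
print(f"correlation, bases {primes[0]} and {primes[1]}: {low_r:.3f}")
print(f"correlation, bases {primes[38]} and {primes[39]}: {high_r:.3f}")
```

For any index below the base, the radical inverse is simply index/base, so both high-dimensional columns are straight ramps until the draw count passes 167 and 173 respectively.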

**References**:

- **st: polychoric for huge data sets** *From:* Timea Partos <Timea.Partos@cancervic.org.au>
- **Re: st: polychoric for huge data sets** *From:* Nick Cox <njcoxstata@gmail.com>
