

From: Nick Cox <njcoxstata@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: polychoric for huge data sets
Date: Wed, 5 Sep 2012 15:15:44 +0100

I don't know what is obvious to anyone else, but clearly as author you know your code, which is based on calculating correlations one at a time. Nevertheless my very limited experiments show less than quadratic dependence on the number of variables.

Nick

On Wed, Sep 5, 2012 at 3:05 PM, Stas Kolenikov <skolenik@gmail.com> wrote:
> Obviously, -polychoric- computing time is quadratic in the number of
> variables, but linear (or may be even faster) in the number of
> observations. There's also the curse of large sample sizes: most of
> the time, the underlying bivariate normality will be considered
> violated by -polychoric-, and that may create computational
> difficulties, such as flat regions, ridges, and multiple local optima.
>
> On Wed, Sep 5, 2012 at 8:54 AM, Nick Cox <njcoxstata@gmail.com> wrote:
>> Experiment supports intuition in suggesting that the number of
>> variables is a bigger deal for -polychoric- than the number of
>> observations, and also that you can get results for 8000 obs and 40
>> variables in several minutes on a mundane computer. That's tedious
>> interactively but doesn't support the claim that Timea made. Best
>> just to write a do-file and let it run while you are doing something
>> else.
>>
>> Nick
>>
>> On Wed, Sep 5, 2012 at 9:59 AM, Nick Cox <njcoxstata@gmail.com> wrote:
>>> Stas Kolenikov's -polychoric- package promises only principal
>>> component analysis. Depending on how you were brought up, that is
>>> distinct from factor analysis, or a limiting case of factor analysis,
>>> or a subset of factor analysis.
>>>
>>> The problem you report as "just can't handle it" with no details
>>> appears to be one of speed, rather than refusal or inability to
>>> perform.
>>>
>>> That aside, what is "appropriate" is difficult to answer. A recent
>>> thread indicated that many on this list are queasy about means or
>>> t-tests for ordinal data, so that would presumably put factor analysis
>>> or PCA of ordinal data beyond the pale. Nevertheless it remains
>>> popular.
>>>
>>> You presumably have the option of taking a random sample from your
>>> data and subjecting that to both (a) PCA of _ranked_ data (which is
>>> equivalent to PCA based on Spearman correlation) and (b) polychoric
>>> PCA. Then it would be good news for you if the substantive or
>>> scientific conclusions were the same, and a difference you need to
>>> think about otherwise. Here the random sample should be large enough
>>> to be substantial, but small enough to get results in reasonable time.
>>>
>>> Alternatively, you could be ruthless about which of your variables are
>>> most interesting or important. A preliminary correlation analysis
>>> would show which variables could be excluded because they are poorly
>>> correlated with anything else, and which could be excluded because
>>> they are very highly correlated with anything else. Even if you can
>>> get it, a PCA based on 40+ variables is often unwieldy to handle and
>>> even more difficult to interpret than one based on say 10 or so
>>> variables.
>>>
>>> Nick
>>>
>>> On Wed, Sep 5, 2012 at 3:37 AM, Timea Partos
>>> <Timea.Partos@cancervic.org.au> wrote:
>>>
>>>> I need to run a factor analysis on ordinal data. My dataset is huge (7000+ cases with 40+ variables) so I can't run the polychoric.do program written by Stas Kolenikov, because it just can't handle it.
>>>>
>>>> Does anyone know of a fast way to obtain the polychoric correlation matrix for very large data sets?
>>>>
>>>> Alternatively, I was thinking of running the factor analysis using the Spearman rho (rank-order correlations) matrix instead. Would this be appropriate?
>
> --
> -- Stas Kolenikov, PhD, PStat (SSC) :: http://stas.kolenikov.name
> -- Senior Survey Statistician, Abt SRBI :: work email kolenikovs at
> srbi dot com
> -- Opinions stated in this email are mine only, and do not reflect the
> position of my employer

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
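
The two workarounds suggested in the thread (draw a random sample and work from rank correlations, then screen out variables that correlate with nothing or with almost everything) can be sketched outside Stata. What follows is a minimal illustration in plain Python, not the -polychoric- code and not a polychoric estimator. The function names (`ranks`, `spearman_matrix`, `subsample`, `screen`), the seed, and the `low`/`high` cutoffs are assumptions made for this example only. The loop over variable pairs also shows where the quadratic cost comes from: 40 variables give 40 × 39 / 2 = 780 pairwise correlations.

```python
import random
from itertools import combinations
from math import sqrt


def ranks(values):
    """Return average (mid) ranks, so tied values share one rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mid = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            out[order[k]] = mid
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if a column is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_matrix(columns):
    """Spearman matrix, computed as the Pearson correlation of the ranks."""
    ranked = [ranks(col) for col in columns]
    k = len(ranked)
    R = [[1.0] * k for _ in range(k)]
    for i, j in combinations(range(k), 2):  # k * (k - 1) / 2 pairs
        R[i][j] = R[j][i] = pearson(ranked[i], ranked[j])
    return R


def subsample(columns, size, seed=12345):
    """Keep the same randomly chosen observations in every column."""
    rng = random.Random(seed)
    keep = rng.sample(range(len(columns[0])), size)
    return [[col[i] for i in keep] for col in columns]


def screen(R, low=0.1, high=0.9):
    """Flag variables by their strongest absolute correlation with any other.

    'weak' means below `low`; 'redundant' means above `high`.
    Both cutoffs are illustrative, not recommendations.
    """
    flagged = {}
    for i, row in enumerate(R):
        strongest = max(abs(r) for j, r in enumerate(row) if j != i)
        if strongest < low:
            flagged[i] = "weak"
        elif strongest > high:
            flagged[i] = "redundant"
    return flagged
```

With the data held as a list of columns, `screen(spearman_matrix(subsample(columns, 1000)))` would list candidate variables to drop before a slower polychoric run. The rank-based matrix can then be compared with the polychoric one on the same subsample, as suggested in the thread.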

