

From: Stas Kolenikov <skolenik@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: polychoric for huge data sets
Date: Wed, 5 Sep 2012 12:19:06 -0500

Well, clearly there's some overhead that hardly depends on the number
of variables (parsing, populating the matrices, etc.), but that should
be much faster than the iterative optimization. It may well be that
with some setups, the time may be somewhat faster than quadratic, but
I'd be surprised if it were as fast as linear: -polychoric- literally
computes the correlations one by one, so I thought that the quadratic
is unavoidable.

--
-- Stas Kolenikov, PhD, PStat (SSC) :: http://stas.kolenikov.name
-- Senior Survey Statistician, Abt SRBI :: work email kolenikovs at srbi dot com
-- Opinions stated in this email are mine only, and do not reflect the position of my employer

On Wed, Sep 5, 2012 at 9:15 AM, Nick Cox <njcoxstata@gmail.com> wrote:
> I don't know what is obvious to anyone else, but clearly as author you
> know your code, which is based on calculating correlations one at a
> time. Nevertheless my very limited experiments show less than
> quadratic dependence on the number of variables.
>
> Nick
>
> On Wed, Sep 5, 2012 at 3:05 PM, Stas Kolenikov <skolenik@gmail.com> wrote:
>> Obviously, -polychoric- computing time is quadratic in the number of
>> variables, but linear (or may be even faster) in the number of
>> observations. There's also the curse of large sample sizes: most of
>> the time, the underlying bivariate normality will be considered
>> violated by -polychoric-, and that may create computational
>> difficulties, such as flat regions, ridges, and multiple local optima.
>>
>> On Wed, Sep 5, 2012 at 8:54 AM, Nick Cox <njcoxstata@gmail.com> wrote:
>>> Experiment supports intuition in suggesting that the number of
>>> variables is a bigger deal for -polychoric- than the number of
>>> observations, and also that you can get results for 8000 obs and 40
>>> variables in several minutes on a mundane computer. That's tedious
>>> interactively but doesn't support the claim that Timea made. Best
>>> just to write a do-file and let it run while you are doing something
>>> else.
>>>
>>> Nick
>>>
>>> On Wed, Sep 5, 2012 at 9:59 AM, Nick Cox <njcoxstata@gmail.com> wrote:
>>>> Stas Kolenikov's -polychoric- package promises only principal
>>>> component analysis. Depending on how you were brought up, that is
>>>> distinct from factor analysis, or a limiting case of factor analysis,
>>>> or a subset of factor analysis.
>>>>
>>>> The problem you report as "just can't handle it" with no details
>>>> appears to be one of speed, rather than refusal or inability to
>>>> perform.
>>>>
>>>> That aside, what is "appropriate" is difficult to answer. A recent
>>>> thread indicated that many on this list are queasy about means or
>>>> t-tests for ordinal data, so that would presumably put factor analysis
>>>> or PCA of ordinal data beyond the pale. Nevertheless it remains
>>>> popular.
>>>>
>>>> You presumably have the option of taking a random sample from your
>>>> data and subjecting that to both (a) PCA of _ranked_ data (which is
>>>> equivalent to PCA based on Spearman correlation) and (b) polychoric
>>>> PCA. Then it would be good news for you if the substantive or
>>>> scientific conclusions were the same, and a difference you need to
>>>> think about otherwise. Here the random sample should be large enough
>>>> to be substantial, but small enough to get results in reasonable time.
>>>>
>>>> Alternatively, you could be ruthless about which of your variables are
>>>> most interesting or important. A preliminary correlation analysis
>>>> would show which variables could be excluded because they are poorly
>>>> correlated with anything else, and which could be excluded because
>>>> they are very highly correlated with anything else. Even if you can
>>>> get it, a PCA based on 40+ variables is often unwieldy to handle and
>>>> even more difficult to interpret than one based on say 10 or so
>>>> variables.
>>>>
>>>> Nick
>>>>
>>>> On Wed, Sep 5, 2012 at 3:37 AM, Timea Partos
>>>> <Timea.Partos@cancervic.org.au> wrote:
>>>>
>>>>> I need to run a factor analysis on ordinal data. My dataset is huge (7000+ cases with 40+ variables) so I can't run the polychoric.do program written by Stas Kolenikov, because it just can't handle it.
>>>>>
>>>>> Does anyone know of a fast way to obtain the polychoric correlation matrix for very large data sets?
>>>>>
>>>>> Alternatively, I was thinking of running the factor analysis using the Spearman rho (rank-order correlations) matrix instead. Would this be appropriate?

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
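
Two points from the thread can be illustrated with a minimal sketch. It is written in Python rather than Stata, and every function name in it (`n_pairs`, `average_ranks`, `pearson`, `spearman_matrix`) is a hypothetical helper introduced here, not part of -polychoric- or of Stata. First, a routine that computes correlations one pair at a time has p(p-1)/2 pairs to work through, which is the source of the quadratic growth Stas describes. Second, the Spearman correlation that Nick mentions is the Pearson correlation of the ranked data, so a Spearman matrix needs no iterative optimization per pair. The sketch does not attempt polychoric estimation itself.

```python
from itertools import combinations
from math import sqrt


def n_pairs(p):
    """Number of distinct variable pairs among p variables: p(p-1)/2."""
    return p * (p - 1) // 2


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to the end of the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_matrix(columns):
    """Spearman correlation matrix: Pearson correlations of ranked columns.

    `columns` is a list of variables, each a list of observations.
    The loop visits each of the n_pairs(p) pairs exactly once.
    """
    ranked = [average_ranks(c) for c in columns]
    p = len(ranked)
    R = [[1.0] * p for _ in range(p)]
    for i, j in combinations(range(p), 2):
        R[i][j] = R[j][i] = pearson(ranked[i], ranked[j])
    return R


if __name__ == "__main__":
    for p in (10, 20, 40):
        print(p, "variables ->", n_pairs(p), "pairwise correlations")
```

Going from 10 to 40 variables multiplies the number of pairs from 45 to 780, which is one reason Nick's suggestion to prune the variable list pays off quickly. How long each pair takes in -polychoric- depends on the data and is not modelled here.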

