Re: st: polychoric for huge data sets

From	Nick Cox
Subject	Re: st: polychoric for huge data sets
Date	Wed, 5 Sep 2012 15:15:44 +0100

I don't know what is obvious to anyone else, but clearly as author you
know your code, which is based on calculating correlations one at a
time. Nevertheless my very limited experiments show less than
quadratic dependence on the number of variables.
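
For scale: computing correlations one at a time means p(p-1)/2 distinct
pairs for p variables, so 780 pairs for 40 variables. Below is a minimal
Python sketch of how one might check the scaling empirically, assuming
timings are available for runs with different numbers of variables. The
function names are hypothetical, and the timings in the usage lines are
synthetic, not measurements of -polychoric-.

```python
import math

def n_pairs(p):
    """Number of distinct pairwise correlations among p variables."""
    return p * (p - 1) // 2

def scaling_exponent(sizes, seconds):
    """Least-squares slope of log(seconds) on log(sizes).

    A slope near 2 suggests quadratic growth; a slope below 2 suggests
    less than quadratic growth.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in seconds]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Synthetic timings that grow exactly as p squared (illustration only).
sizes = [10, 20, 40]
synthetic_seconds = [0.01 * p ** 2 for p in sizes]
print(n_pairs(40))
print(round(scaling_exponent(sizes, synthetic_seconds), 3))
```

With real timings, a slope well below 2 would be consistent with fixed
overheads mattering at small p.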

Nick

On Wed, Sep 5, 2012 at 3:05 PM, Stas Kolenikov wrote:
> Obviously, -polychoric- computing time is quadratic in the number of
> variables, but linear (or maybe even faster) in the number of
> observations. There's also the curse of large sample sizes: most of
> the time, the underlying bivariate normality will be considered
> violated by -polychoric-, and that may create computational
> difficulties, such as flat regions, ridges, and multiple local optima.
>
> On Wed, Sep 5, 2012 at 8:54 AM, Nick Cox wrote:
>> Experiment supports intuition in suggesting that the number of
>> variables is a bigger deal for -polychoric- than the number of
>> observations, and also that you can get results for 8000 obs and 40
>> variables in several minutes on a mundane computer. That's tedious
>> interactively but doesn't support the claim that Timea made. Best
>> just to write a do-file and let it run while you are doing something
>> else.
>>
>> Nick
>>
>> On Wed, Sep 5, 2012 at 9:59 AM, Nick Cox wrote:
>>> Stas Kolenikov's -polychoric- package promises only principal
>>> component analysis. Depending on how you were brought up, that is
>>> distinct from factor analysis, or a limiting case of factor analysis,
>>> or a subset of factor analysis.
>>>
>>> The problem you report as "just can't handle it" with no details
>>> appears to be one of speed, rather than refusal or inability to
>>> perform.
>>>
>>> That aside, what is "appropriate" is difficult to answer.  A recent
>>> thread indicated that many on this list are queasy about means or
>>> t-tests for ordinal data, so that would presumably put factor analysis
>>> or PCA of ordinal data beyond the pale. Nevertheless it remains
>>> popular.
>>>
>>> You presumably have the option of taking a random sample from your
>>> data and subjecting that to both (a) PCA of _ranked_ data (which is
>>> equivalent to PCA based on Spearman correlation) and (b) polychoric
>>> PCA. Then it would be good news for you if the substantive or
>>> scientific conclusions were the same, and a difference you need to
>>> think about otherwise. Here the random sample should be large enough
>>> to be substantial, but small enough to get results in reasonable time.
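
The equivalence mentioned in (a) rests on Spearman correlation being
Pearson correlation of the ranks. A minimal Python sketch of that
ingredient, together with drawing a reproducible random sample of rows,
follows. It assumes ties get average ranks; the function names and the
seed are hypothetical, and the PCA step itself is not shown.

```python
import math
import random

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def sample_rows(rows, k, seed=12345):
    """Simple random sample of k rows without replacement, reproducible."""
    return random.Random(seed).sample(rows, k)
```

Running a PCA on the matrix of such correlations for the sampled rows
would then correspond to option (a).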
>>>
>>> Alternatively, you could be ruthless about which of your variables are
>>> most interesting or important. A preliminary correlation analysis
>>> would show which variables could be excluded because they are poorly
>>> correlated with anything else, and which could be excluded because
>>> they are very highly correlated with some other variable. Even if you can
>>> get it, a PCA based on 40+ variables is often unwieldy to handle and
>>> even more difficult to interpret than one based on say 10 or so
>>> variables.
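
One way to sketch that preliminary screening in Python, assuming a
correlation matrix has already been computed by whatever method. The
function name, the variable names, the cutoffs of 0.1 and 0.9, and the
example matrix are all made up for illustration and are not
recommendations.

```python
def screen_variables(names, corr, low=0.1, high=0.9):
    """Flag candidates for exclusion from a correlation matrix.

    names: variable names; corr: square matrix as a list of lists.
    Returns (weak, redundant): variables whose largest absolute
    correlation with any other variable is below `low`, and pairs whose
    absolute correlation exceeds `high`.
    """
    weak = []
    redundant = []
    p = len(names)
    for i in range(p):
        others = [abs(corr[i][j]) for j in range(p) if j != i]
        if others and max(others) < low:
            weak.append(names[i])
    for i in range(p):
        for j in range(i + 1, p):
            if abs(corr[i][j]) > high:
                redundant.append((names[i], names[j]))
    return weak, redundant

# Made-up correlation matrix for four hypothetical items.
names = ["q1", "q2", "q3", "q4"]
corr = [
    [1.00, 0.95, 0.40, 0.02],
    [0.95, 1.00, 0.50, 0.05],
    [0.40, 0.50, 1.00, 0.03],
    [0.02, 0.05, 0.03, 1.00],
]
weak, redundant = screen_variables(names, corr)
print(weak, redundant)
```

Which member of a flagged pair to drop remains a substantive decision.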
>>>
>>> Nick
>>>
>>> On Wed, Sep 5, 2012 at 3:37 AM, Timea Partos wrote:
>>>
>>>> I need to run a factor analysis on ordinal data.  My dataset is huge (7000+ cases with 40+ variables) so I can't run the polychoric.do program written by Stas Kolenikov, because it just can't handle it.
>>>>
>>>> Does anyone know of a fast way to obtain the polychoric correlation matrix for very large data sets?
>>>>
>>>> Alternatively, I was thinking of running the factor analysis using the Spearman rho (rank-order correlations) matrix instead.  Would this be appropriate?
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/statalist/faq
>> *   http://www.ats.ucla.edu/stat/stata/
>
>
>
> --
> -- Stas Kolenikov, PhD, PStat (SSC)  ::  http://stas.kolenikov.name
> -- Senior Survey Statistician, Abt SRBI  ::  work email kolenikovs at
> srbi dot com
> -- Opinions stated in this email are mine only, and do not reflect the
> position of my employer
