Subject: RE: st: polychoric for huge data sets
From: Timea Partos <Timea.Partos@cancervic.org.au>
Date: Thu, 6 Sep 2012 00:39:53 +0000
Thanks for the suggestions.
Just to clarify, my problem with using polychoric is that it is very slow. I waited over 2 hours and still no result.
So I ran some experiments. Running polychoric with just two variables takes around 35 seconds, regardless of whether I use 800 cases or 8000. Three variables take around 110 seconds and four take 220 or so, which suggests that the processing time increases roughly linearly with the number of pairwise correlations it needs to calculate (so probably quadratically with the number of variables), regardless of the number of cases.
So, with my data set this would be over 6 hours for each analysis. Given that this is exploratory and I would want to run numerous analyses, it's just not feasible for me.
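The arithmetic behind an estimate of this kind can be sketched as follows. This is only a rough illustration in Python, assuming the run time is proportional to the number of variable pairs, k(k-1)/2, at roughly 35 seconds per pair (the two-variable timing above), with no fixed overhead. The function names are made up for this sketch, and nothing here calls -polychoric- itself.

```python
# Rough extrapolation of -polychoric- run time from the timings reported above.
# Assumption: time is proportional to the number of variable pairs, with no
# fixed overhead and no dependence on the number of cases.

SECONDS_PER_PAIR = 35.0  # observed time for 2 variables, i.e. a single pair


def n_pairs(n_vars):
    """Number of distinct pairwise correlations among n_vars variables."""
    return n_vars * (n_vars - 1) // 2


def estimated_hours(n_vars, seconds_per_pair=SECONDS_PER_PAIR):
    """Estimated run time in hours under the proportional-to-pairs assumption."""
    return n_pairs(n_vars) * seconds_per_pair / 3600.0


if __name__ == "__main__":
    for k in (2, 3, 4, 40):
        print(k, "variables:", n_pairs(k), "pairs, about",
              round(estimated_hours(k) * 3600), "seconds")
```

Under these assumptions 3 and 4 variables come out at 105 and 210 seconds, close to the 110 and 220 observed, and 40 variables give 780 pairs, or roughly seven and a half hours.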
Nick - you suggested that it's possible to run polychoric on 8000 cases with 40 variables in a matter of minutes.
Can I ask how you did this? Is it possible that there is something wrong with my computer's settings, or that it's just too slow?
I am running Stata version 12.1 (32-bit) on Windows, with an Intel i5-2500S 2.7 GHz processor and 4 GB of RAM.
Re the variables - this is not some harebrained data-mining or data-cleaning exercise. It's a well-established and validated survey that is firmly based in theory and has been running internationally for over 8 years. The variables I am looking at should theoretically hang together and reduce to two or three factors, and this is what is suggested by the analyses that I have run just treating them as continuous. I would just like to confirm the results with a more appropriate statistical analysis if possible. I was hoping to use the matrix of correlations provided by polychoric and feed it into the factormat program rather than using polychoricpca (I definitely want to use factor analysis, not PCA, as I want to get at the broader theoretical factors rather than just explain the maximum variance in my data).
So, any suggestions for how to speed up polychoric would be much appreciated (as would a concrete "this is just not possible" comment, so that I don't waste too much time trying.)
From: firstname.lastname@example.org [mailto:email@example.com] On Behalf Of Stas Kolenikov
Sent: Thursday, 6 September 2012 3:19 AM
Subject: Re: st: polychoric for huge data sets
Well, clearly there's some overhead that hardly depends on the number of variables (parsing, populating the matrices, etc.), but that should be much faster than the iterative optimization. It may well be that with some setups the time grows somewhat more slowly than quadratically, but I'd be surprised if the growth were as slow as linear: -polychoric- literally computes the correlations one by one, so I thought that quadratic growth was unavoidable.
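One rough way to look at the overhead question is to fit time = overhead + cost per pair x number of pairs to the three timings Timea reported (about 35, 110 and 220 seconds for 2, 3 and 4 variables). The Python sketch below does an ordinary least-squares fit by hand. Three approximate timings are far too few to settle anything, and the function name is just for illustration.

```python
# Least-squares fit of time = overhead + per_pair * pairs to the reported timings.
# The timings are approximate values quoted in this thread, not new measurements.


def fit_line(xs, ys):
    """Return (intercept, slope) of the ordinary least-squares line."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


n_vars = [2, 3, 4]
seconds = [35.0, 110.0, 220.0]
pairs = [k * (k - 1) // 2 for k in n_vars]  # 1, 3 and 6 pairs

overhead, per_pair = fit_line(pairs, seconds)

if __name__ == "__main__":
    print("fitted overhead (s):", round(overhead, 1))
    print("fitted cost per pair (s):", round(per_pair, 1))
```

With these three points the fitted cost per pair is about 37 seconds and the fitted overhead is about -1.6 seconds, which is indistinguishable from zero. That is at least consistent with the time being dominated by the pair-by-pair estimation.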
-- Stas Kolenikov, PhD, PStat (SSC) :: http://stas.kolenikov.name
-- Senior Survey Statistician, Abt SRBI :: work email kolenikovs at srbi dot com
-- Opinions stated in this email are mine only, and do not reflect the position of my employer
On Wed, Sep 5, 2012 at 9:15 AM, Nick Cox <firstname.lastname@example.org> wrote:
> I don't know what is obvious to anyone else, but clearly as author you
> know your code, which is based on calculating correlations one at a
> time. Nevertheless my very limited experiments show less than
> quadratic dependence on the number of variables.
> On Wed, Sep 5, 2012 at 3:05 PM, Stas Kolenikov <email@example.com> wrote:
>> Obviously, -polychoric- computing time is quadratic in the number of
>> variables, but linear (or maybe even faster) in the number of
>> observations. There's also the curse of large sample sizes: most of
>> the time, the underlying bivariate normality will be considered
>> violated by -polychoric-, and that may create computational
>> difficulties, such as flat regions, ridges, and multiple local optima.
>> On Wed, Sep 5, 2012 at 8:54 AM, Nick Cox <firstname.lastname@example.org> wrote:
>>> Experiment supports intuition in suggesting that the number of
>>> variables is a bigger deal for -polychoric- than the number of
>>> observations, and also that you can get results for 8000 obs and 40
>>> variables in several minutes on a mundane computer. That's tedious
>>> interactively but doesn't support the claim that Timea made. Best
>>> just to write a do-file and let it run while you are doing something else.
>>> On Wed, Sep 5, 2012 at 9:59 AM, Nick Cox <email@example.com> wrote:
>>>> Stas Kolenikov's -polychoric- package promises only principal
>>>> component analysis. Depending on how you were brought up, that is
>>>> distinct from factor analysis, or a limiting case of factor
>>>> analysis, or a subset of factor analysis.
>>>> The problem you report as "just can't handle it" with no details
>>>> appears to be one of speed, rather than refusal or inability to
>>>> That aside, what is "appropriate" is difficult to answer. A recent
>>>> thread indicated that many on this list are queasy about means or
>>>> t-tests for ordinal data, so that would presumably put factor
>>>> analysis or PCA of ordinal data beyond the pale. Nevertheless it
>>>> remains popular.
>>>> You presumably have the option of taking a random sample from your
>>>> data and subjecting that to both (a) PCA of _ranked_ data (which is
>>>> equivalent to PCA based on Spearman correlation) and (b) polychoric
>>>> PCA. Then it would be good news for you if the substantive or
>>>> scientific conclusions were the same, and a difference you need to
>>>> think about otherwise. Here the random sample should be large
>>>> enough to be substantial, but small enough to get results in reasonable time.
>>>> Alternatively, you could be ruthless about which of your variables
>>>> are most interesting or important. A preliminary correlation
>>>> analysis would show which variables could be excluded because they
>>>> are poorly correlated with anything else, and which could be
>>>> excluded because they are very highly correlated with anything
>>>> else. Even if you can get it, a PCA based on 40+ variables is often
>>>> unwieldy to handle and even more difficult to interpret than one
>>>> based on say 10 or so variables.
>>>> On Wed, Sep 5, 2012 at 3:37 AM, Timea Partos
>>>> <Timea.Partos@cancervic.org.au> wrote:
>>>>> I need to run a factor analysis on ordinal data. My dataset is huge (7000+ cases with 40+ variables) so I can't run the polychoric.do program written by Stas Kolenikov, because it just can't handle it.
>>>>> Does anyone know of a fast way to obtain the polychoric correlation matrix for very large data sets?
>>>>> Alternatively, I was thinking of running the factor analysis using the Spearman rho (rank-order correlations) matrix instead. Would this be appropriate?
>>> * For searches and help try:
>>> * http://www.stata.com/help.cgi?search
>>> * http://www.stata.com/support/statalist/faq
>>> * http://www.ats.ucla.edu/stat/stata/