Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

Re: st: polychoric for huge data sets

From	Stas Kolenikov <[email protected]>
To	[email protected]
Subject	Re: st: polychoric for huge data sets
Date	Thu, 6 Sep 2012 21:37:10 -0500

The really slow part of -polychoric- is its use of the -ml- method
-lf-. It can be rewritten as a -d2- evaluator, at least for i.i.d.
data, because the derivatives of the bivariate normal cdf are known.
Debugging that will definitely take more than 6 hours, and will
likely give a speed-up by a factor of 3-5, going by the typical d2/lf
ratios that I know of.
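
For reference, the derivatives in question can be written down as
follows. This is a sketch under the standard latent bivariate normal
setup, not a transcription of the -polychoric- source: the thresholds
a_i and b_j, the cell counts n_ij, and the convention that the outermost
thresholds are infinite (so the corresponding cdf and density terms are
0 or reduce to marginals) are assumptions made for illustration.

```latex
\begin{align}
% Probability of cell (i, j) of the two-way table
\pi_{ij}(\rho) &= \Phi_2(a_i, b_j; \rho) - \Phi_2(a_{i-1}, b_j; \rho)
                 - \Phi_2(a_i, b_{j-1}; \rho) + \Phi_2(a_{i-1}, b_{j-1}; \rho) \\
% Log likelihood for one pair of variables
\ell(\rho) &= \sum_{i,j} n_{ij} \log \pi_{ij}(\rho) \\
% Derivative of the bivariate normal cdf with respect to rho (Plackett's identity)
\frac{\partial \Phi_2(a, b; \rho)}{\partial \rho} &= \phi_2(a, b; \rho)
  = \frac{1}{2\pi\sqrt{1-\rho^2}}
    \exp\!\left(-\frac{a^2 - 2\rho a b + b^2}{2(1-\rho^2)}\right) \\
% Score with respect to rho
\frac{\partial \ell}{\partial \rho} &= \sum_{i,j} \frac{n_{ij}}{\pi_{ij}(\rho)}
  \Bigl[\phi_2(a_i, b_j; \rho) - \phi_2(a_{i-1}, b_j; \rho)
      - \phi_2(a_i, b_{j-1}; \rho) + \phi_2(a_{i-1}, b_{j-1}; \rho)\Bigr]
\end{align}
```

A -d2- evaluator would also need the second derivatives, which follow
by differentiating the density once more.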

As I suggested, -cmp- can be used instead; it estimates all the
correlations simultaneously. Ceteris paribus, the computing time of
-cmp- should grow linearly rather than quadratically with the number
of dimensions, although I could be mistaken. It has its own
computational complications that may require more integration points
per observation for larger dimensions. I have not tried to time it
against -polychoric- in any serious way; I don't know if David Roodman
has done that, either.
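
To put numbers on the quadratic growth, here is a rough back-of-envelope
sketch in Python (not Stata) that extrapolates from the timings Timea
reports in the quoted message below (35, 110 and 220 seconds for 2, 3
and 4 variables). It assumes that the time is simply proportional to the
number of pairs k(k-1)/2, with no fixed overhead; the function names are
made up for illustration.

```python
def n_pairs(k):
    """Number of distinct correlations among k variables."""
    return k * (k - 1) // 2

# Timings in seconds reported in the thread: number of variables -> time
observed = {2: 35.0, 3: 110.0, 4: 220.0}

# Least-squares slope through the origin: time = sec_per_pair * pairs
num = sum(n_pairs(k) * t for k, t in observed.items())
den = sum(n_pairs(k) ** 2 for k in observed)
sec_per_pair = num / den

def predicted_hours(k):
    """Extrapolated run time in hours for k variables (assumption: pure per-pair cost)."""
    return n_pairs(k) * sec_per_pair / 3600.0

for k in (10, 20, 40):
    print(f"{k:>3} variables: {n_pairs(k):>4} pairs, "
          f"about {predicted_hours(k):.1f} hours")
```

On these figures the cost is roughly 37 seconds per pair, and 40
variables mean 780 pairs, or about 8 hours, which is in line with the
"over 6 hours" estimate.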

-- 
-- Stas Kolenikov, PhD, PStat (SSC)  ::  http://stas.kolenikov.name
-- Senior Survey Statistician, Abt SRBI  ::  work email kolenikovs at
srbi dot com
-- Opinions stated in this email are mine only, and do not reflect the
position of my employer

On Thu, Sep 6, 2012 at 2:40 AM, Nick Cox <[email protected]> wrote:
> I simulated some data, but evidently -polychoric- is sensitive to the
> data, and mine were "good" from its point of view.
>
> You could try cloning -polychoric- and rewrite slow bits in Mata. Odds
> are that would take longer than 6 hours.
>
> Nick
>
> On Thu, Sep 6, 2012 at 1:39 AM, Timea Partos
> <[email protected]> wrote:
>> Hi guys,
>>
>> Thanks for the suggestions.
>> Just to clarify, my problem with using polychoric is that it is very slow.  I waited over 2 hours and still no result.
>> So I ran some experiments:  when I run polychoric with just two variables, it takes around 35 seconds.  This is regardless of whether I do it with 800 cases or 8000.  3 variables is around 110 seconds, 4 variables is 220 or so, which suggests that the processing time increases about linearly with the pairs of correlations it needs to calculate (so probably quadratically with the number of variables), regardless of the number of cases.
>> So, with my data set this would be over 6 hours for each analysis, which, given that this is exploratory and I would want to run numerous analyses, is just not feasible for me.
>> Nick - you suggested that it's possible to run polychoric on 8000 cases with 40 variables in a matter of minutes.
>> Can I ask how you did this?   Is it possible that there is something wrong with my computer's settings then - or that it's just too slow?
>> I am running Windows, Stata versions 12.1, 32 bits with an Intel i5-2500S 2.7 GHz processor and 4 GB of ram.
>>
>> Re the variables - this is not some harebrained data-mining or data-cleaning exercise.  It's a well-established and validated survey that is firmly based in theory and has been running internationally for over 8 years.  The variables I am looking at should theoretically hang together and reduce to two or three factors - and this is what is suggested by the analyses that I have run just treating them as continuous.  I would just like to confirm the results with a more appropriate statistical analysis if possible.  I was hoping to use the matrix of correlations provided by polychoric and feed it into the factormat program rather than using polychoricpca (I definitely want to use factor analysis not pca - as I want to get at the broader theoretical factors rather than just explain the maximum variance in my data).
>>
>> So, any suggestions for how to speed up polychoric would be much appreciated (as would a concrete "this is just not possible" comment, so that I don't waste too much time trying.)
>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of Stas Kolenikov
>> Sent: Thursday, 6 September 2012 3:19 AM
>> To: [email protected]
>> Subject: Re: st: polychoric for huge data sets
>>
>> Well, clearly there's some overhead that hardly depends on the number of variables (parsing, populating the matrices, etc.), but that should be much faster than the iterative optimization. It may well be that with some setups, the time may be somewhat faster than quadratic, but I'd be surprised if it were as fast as linear: -polychoric- literally computes the correlations one by one, so I thought that the quadratic is unavoidable.
>>
>> --
>> -- Stas Kolenikov, PhD, PStat (SSC)  ::  http://stas.kolenikov.name
>> -- Senior Survey Statistician, Abt SRBI  ::  work email kolenikovs at srbi dot com
>> -- Opinions stated in this email are mine only, and do not reflect the position of my employer
>>
>> On Wed, Sep 5, 2012 at 9:15 AM, Nick Cox <[email protected]> wrote:
>>> I don't know what is obvious to anyone else, but clearly as author you
>>> know your code, which is based on calculating correlations one at a
>>> time. Nevertheless my very limited experiments show less than
>>> quadratic dependence on the number of variables.
>>>
>>> Nick
>>>
>>> On Wed, Sep 5, 2012 at 3:05 PM, Stas Kolenikov <[email protected]> wrote:
>>>> Obviously, -polychoric- computing time is quadratic in the number of
>>>> variables, but linear (or maybe even faster) in the number of
>>>> observations. There's also the curse of large sample sizes: most of
>>>> the time, the underlying bivariate normality will be considered
>>>> violated by -polychoric-, and that may create computational
>>>> difficulties, such as flat regions, ridges, and multiple local optima.
>>>>
>>>> On Wed, Sep 5, 2012 at 8:54 AM, Nick Cox <[email protected]> wrote:
>>>>> Experiment supports intuition in suggesting that the number of
>>>>> variables is a bigger deal for -polychoric- than the number of
>>>>> observations, and also that you can get results for 8000 obs and 40
>>>>> variables in several minutes on a mundane computer. That's tedious
>>>>> interactively but doesn't support the claim that Timea made. Best
>>>>> just to write a do-file and let it run while you are doing something
>>>>> else.
>>>>>
>>>>> Nick
>>>>>
>>>>> On Wed, Sep 5, 2012 at 9:59 AM, Nick Cox <[email protected]> wrote:
>>>>>> Stas Kolenikov's -polychoric- package promises only principal
>>>>>> component analysis. Depending on how you were brought up, that is
>>>>>> distinct from factor analysis, or a limiting case of factor
>>>>>> analysis, or a subset of factor analysis.
>>>>>>
>>>>>> The problem you report as "just can't handle it" with no details
>>>>>> appears to be one of speed, rather than refusal or inability to
>>>>>> perform.
>>>>>>
>>>>>> That aside, what is "appropriate" is difficult to answer.  A recent
>>>>>> thread indicated that many on this list are queasy about means or
>>>>>> t-tests for ordinal data, so that would presumably put factor
>>>>>> analysis or PCA of ordinal data beyond the pale. Nevertheless it
>>>>>> remains popular.
>>>>>>
>>>>>> You presumably have the option of taking a random sample from your
>>>>>> data and subjecting that to both (a) PCA of _ranked_ data (which is
>>>>>> equivalent to PCA based on Spearman correlation) and (b) polychoric
>>>>>> PCA. Then it would be good news for you if the substantive or
>>>>>> scientific conclusions were the same, and a difference you need to
>>>>>> think about otherwise. Here the random sample should be large
>>>>>> enough to be substantial, but small enough to get results in reasonable time.
>>>>>>
>>>>>> Alternatively, you could be ruthless about which of your variables
>>>>>> are most interesting or important. A preliminary correlation
>>>>>> analysis would show which variables could be excluded because they
>>>>>> are poorly correlated with anything else, and which could be
>>>>>> excluded because they are very highly correlated with anything
>>>>>> else. Even if you can get it, a PCA based on 40+ variables is often
>>>>>> unwieldy to handle and even more difficult to interpret than one
>>>>>> based on say 10 or so variables.
>>>>>>
>>>>>> Nick
>>>>>>
>>>>>> On Wed, Sep 5, 2012 at 3:37 AM, Timea Partos
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> I need to run a factor analysis on ordinal data.  My dataset is huge (7000+ cases with 40+ variables) so I can't run the polychoric.do program written by Stas Kolenikov, because it just can't handle it.
>>>>>>>
>>>>>>> Does anyone know of a fast way to obtain the polychoric correlation matrix for very large data sets?
>>>>>>>
>>>>>>> Alternatively, I was thinking of running the factor analysis using the Spearman rho (rank-order correlations) matrix instead.  Would this be appropriate?
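
On the Spearman alternative raised in the thread: Spearman's rho is the
Pearson correlation computed on ranks, which is why PCA of ranked data
and PCA based on the Spearman matrix coincide. A minimal pure-Python
sketch of that equivalence follows; the function names are illustrative
and ties are given average ranks, which is the usual convention for
ordinal items with few categories.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Ordinary product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Two hypothetical 5-point ordinal items, for illustration only
item1 = [1, 2, 2, 3, 5, 4, 4, 1]
item2 = [1, 1, 2, 4, 5, 5, 3, 2]
print(round(spearman(item1, item2), 3))
```

Because only ranks enter the calculation, any monotone recoding of the
categories leaves the result unchanged.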