Re: st: polychoric for huge data sets
Stas Kolenikov <firstname.lastname@example.org>
Wed, 5 Sep 2012 08:54:02 -0500
I believe that a couple of weeks ago, I posted the code here on
statalist that used -polychoric- together with -ssd- to come up with
an SEM based on polychoric correlations. Without your specifying what
exactly "does not work", there is little way to tell how to try to fix
it. -polychoric- may run into the limit on the length of a string
expression (244 characters), and renaming your variables to
-a1-, -a2-, ..., -a99- could fix it. -polychoric- can find itself in
situations where it cannot produce a decent estimate (which probably
means that your data does not fit the polychoric assumptions of an
underlying normal distribution aggregated into ordinal categories). In
such situations, it may fail to converge, or may return a missing
value. Other than that, -polychoric- should work, but it will probably
take a day or so on this data set.
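
To put a rough number on that, here is a small Python sketch of the arithmetic. It assumes the correlation matrix is built from separate pairwise estimates, and it treats the "day or so" above as a guess, not a measurement.

```python
def n_pairs(n_vars):
    """Number of distinct variable pairs, i.e. off-diagonal correlations to estimate."""
    return n_vars * (n_vars - 1) // 2


pairs = n_pairs(40)

# If the whole matrix took one day (an assumed figure), this is the
# implied average time budget per pairwise estimate, in seconds.
seconds_per_pair = 24 * 60 * 60 / pairs

print(pairs, round(seconds_per_pair))
```

The count grows quadratically with the number of variables, so dropping from 40 to 20 variables cuts the number of pairs by roughly a factor of four.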
I have reservations about the utility of any complex model with 40
variables, especially if they don't come from a carefully designed
psychometric instrument, but rather are kitchen-sunk into whatever
statistical model is of interest, be that an -xtabond- model, a
support-vector machine, a neural network, or a factor analysis model.
Spearman correlations won't be very informative unless you have a
dozen or so categories, at which point you can just as well consider
your data continuous.
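
A minimal simulation sketch of the point about Spearman correlations, in Python. It draws bivariate normal data, cuts both variables into k ordinal categories, and computes the Spearman correlation (Pearson correlation of average ranks). The underlying correlation of 0.6, the sample size, the seed, and the equal-probability cut points are all assumptions made for illustration.

```python
import bisect
import math
import random
from statistics import NormalDist


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


def categorize(z, k):
    """Cut standard normal draws into k equal-probability ordinal categories."""
    cuts = [NormalDist().inv_cdf(i / k) for i in range(1, k)]
    return [bisect.bisect(cuts, v) for v in z]


def simulate(rho, n, seed):
    """Bivariate standard normal draws with correlation rho."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [rho * a + math.sqrt(1 - rho * rho) * rng.gauss(0, 1) for a in x]
    return x, y


x, y = simulate(rho=0.6, n=4000, seed=12345)
spearman_by_k = {k: spearman(categorize(x, k), categorize(y, k)) for k in (3, 5, 12)}
for k, r in spearman_by_k.items():
    print(f"{k:2d} categories: Spearman rho = {r:.3f}")
```

With only a few categories the rank correlation should fall noticeably below the underlying 0.6, and the gap should shrink as the number of categories grows.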
David Roodman's -cmp- is an alternative. I know that he timed it
against -gllamm-, and, as far as I can recall, performance was sort of
similar in terms of time vs. precision obtained. (-cmp- is much faster
per likelihood computation, but -gllamm- uses the most efficient
scheme of allocating the points, so the two computational advantages
sort of balance one another.) His paper provides an example of how to
use -cmp- in place of -polychoric- (although I remember providing some
updates to his syntax to make it work better). My testing of -cmp-
against -polychoric- showed that the results match up perfectly. 40
variables is a significant stretch for -cmp-, as well, and you need to
know quite well how the quasi-Monte Carlo methods work. David might be
able to provide some hints; mine is that since the 40th prime is 173,
there is little point in even trying this without scrambling.
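
A minimal Python sketch of why the size of the 40th prime matters. It assumes a plain (unscrambled) Halton sequence, where dimension d uses the d-th prime as its base; whether this matches exactly what -cmp- does internally is an assumption, and the 100-point sample size is arbitrary.

```python
import math


def first_primes(count):
    """First `count` primes by trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


def radical_inverse(index, base):
    """Van der Corput radical inverse: mirror the base-`base` digits of index."""
    value, scale = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        value += digit * scale
        scale /= base
    return value


def halton_column(n_points, base):
    """First n_points of the one-dimensional Halton sequence in the given base."""
    return [radical_inverse(i, base) for i in range(1, n_points + 1)]


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


primes = first_primes(40)

# Dimensions 1 and 2 (bases 2 and 3) versus dimensions 39 and 40 (bases 167 and 173).
low_dims = pearson(halton_column(100, primes[0]), halton_column(100, primes[1]))
high_dims = pearson(halton_column(100, primes[38]), halton_column(100, primes[39]))

print(primes[39], round(low_dims, 3), round(high_dims, 3))
```

For the first 100 points, the columns in bases 167 and 173 are just i/167 and i/173, so they lie on a straight line instead of filling the unit square. Scrambling is meant to break up this kind of dependence between high dimensions.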
On Wed, Sep 5, 2012 at 3:59 AM, Nick Cox <email@example.com> wrote:
> Stas Kolenikov's -polychoric- package promises only principal
> component analysis. Depending on how you were brought up, that is
> distinct from factor analysis, or a limiting case of factor analysis,
> or a subset of factor analysis.
> The problem you report as "just can't handle it" with no details
> appears to be one of speed, rather than refusal or inability to
> That aside, what is "appropriate" is difficult to answer. A recent
> thread indicated that many on this list are queasy about means or
> t-tests for ordinal data, so that would presumably put factor analysis
> or PCA of ordinal data beyond the pale. Nevertheless it remains
> You presumably have the option of taking a random sample from your
> data and subjecting that to both (a) PCA of _ranked_ data (which is
> equivalent to PCA based on Spearman correlation) and (b) polychoric
> PCA. Then it would be good news for you if the substantive or
> scientific conclusions were the same, and a difference you need to
> think about otherwise. Here the random sample should be large enough
> to be substantial, but small enough to get results in reasonable time.
> Alternatively, you could be ruthless about which of your variables are
> most interesting or important. A preliminary correlation analysis
> would show which variables could be excluded because they are poorly
> correlated with anything else, and which could be excluded because
> they are very highly correlated with anything else. Even if you can
> get it, a PCA based on 40+ variables is often unwieldy to handle and
> even more difficult to interpret than one based on say 10 or so
> On Wed, Sep 5, 2012 at 3:37 AM, Timea Partos
> <Timea.Partos@cancervic.org.au> wrote:
>> I need to run a factor analysis on ordinal data. My dataset is huge (7000+ cases with 40+ variables) so I can't run the polychoric.do program written by Stas Kolenikov, because it just can't handle it.
>> Does anyone know of a fast way to obtain the polychoric correlation matrix for very large data sets?
>> Alternatively, I was thinking of running the factor analysis using the Spearman rho (rank-order correlations) matrix instead. Would this be appropriate?
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
-- Stas Kolenikov, PhD, PStat (SSC) :: http://stas.kolenikov.name
-- Senior Survey Statistician, Abt SRBI :: work email kolenikovs at
srbi dot com
-- Opinions stated in this email are mine only, and do not reflect the
position of my employer