
# Re: st: wealth score using principal component analysis (PCA)

From: Nick Cox
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: wealth score using principal component analysis (PCA)
Date: Thu, 27 Sep 2012 01:49:20 +0100

```
You are confusing two different questions. Throughout I focus on the
case you are looking at where PCA is based on the correlation matrix.

If the aim is to use the most important PC, then that is labelled 1,
but even if it weren't we could identify it by its having the largest
eigenvalue attached and no extra considerations arise.

If the aim is to identify which PCs are "important" or "worthy of use"
(typically one or more) and should be used in later analyses, then
this is necessarily a looser, more open question and the best art is a
darker matter. There can't be an answer independent of what you are
trying to do. Some people do stress a rule of thumb such as
eigenvalues > 1 and some people look for a break in the eigenvalues
using a scree plot. In some projects PCs that are used later are good
if interpretable as having high correlations with particular
variables; in other projects the PCs are just composite variables with
the properties assigned to them and interpretability is less material.

Every book I know on PCA stresses this open aspect of the method. The
books by Jolliffe and Jackson referenced in the -pca- documentation
certainly do.

It's not clear exactly why you feel committed in advance to using PCA
like this. I sympathise with the advice given earlier by Stas
Kolenikov to consider something more like an SEM.

Nick

On Wed, Sep 26, 2012 at 9:33 PM, Shikha Sinha <shikha.sinha414@gmail.com> wrote:
> OK, I understand now that if I want to use one score, then PC1 is the most
> relevant one, and that for a further distinction between financial and
> social autonomy, we need to look at the loadings on PC2, PC3, and so on to
> figure out whether PC2 is better than PC1 if the focus is on social or
> financial autonomy.
>
> Then I am struggling to understand the use of selecting components
> based on eigenvalues. What is the use of selecting PCs based either on
> eigenvalues or a scree plot, if we are always (most of the time) going to
> use the 1st component? An example of the importance of eigenvalues in
> selecting components would be very helpful (or any reference).
>
> Thanks,
> Shikha
>
> On Wed, Sep 26, 2012 at 6:39 AM, Stas Kolenikov <skolenik@gmail.com> wrote:
>> Often, the 1st PC works as a measure of "overall size", while the
>> subsequent components, as measures of "structure". So the 1st
>> component might be the degree of overall autonomy, while the 2nd
>> component might distinguish say between financial autonomy and social
>> interactions autonomy.
>>
>> --
>> -- Stas Kolenikov, PhD, PStat (SSC)  ::  http://stas.kolenikov.name
>> -- Senior Survey Statistician, Abt SRBI  ::  work email kolenikovs at
>> srbi dot com
>> -- Opinions stated in this email are mine only, and do not reflect the
>> position of my employer
>>
>>
>>
>> On Tue, Sep 25, 2012 at 6:34 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>>> If you want just one index, you can't improve on the first PC if you
>>> are using the criteria of PCA. That's a central idea of PCA.
>>>
>>> Nick
>>>
>>> On Wed, Sep 26, 2012 at 12:22 AM, Shikha Sinha
>>> <shikha.sinha414@gmail.com> wrote:
>>>> Thanks for your responses, Nick and Stas!
>>>>
>>>> I think I am struggling with how to create one score from two
>>>> components. Let me pose my question again.
>>>>
>>>> Suppose I want to create one index out of six variables. For example,
>>>> I want to create a "women's autonomy index". The index would be one
>>>> number for every household. The Demographic and Health Survey (DHS)
>>>> asks 10 different questions related to women's autonomy and, instead of
>>>> using the information in all 10 questions, I just want to use an
>>>> index that summarises the information in all 10
>>>> questions/variables. I can use -pca- to create the index. Once I run
>>>> -pca x1-x10-, I can choose the number of principal components (PCs) to
>>>> retain based on eigenvalues or a scree plot. Let us assume that there are
>>>> three PCs that have eigenvalues > 1 and I want to retain all of these
>>>> components, though the first component has the highest variance.
>>>>
>>>> Now, I want to create a "women's autonomy index" based on these three
>>>> PCs. How can I do that? If I use -predict p1 p2 p3, score-, it gives
>>>> three different scores, all unrelated. However, I want just one index.
>>>> Kindly suggest how to do this.
>>>>
>>>> Thanks,
>>>> Shikha
>>>>
>>>>
>>>>
>>>> On Tue, Sep 25, 2012 at 9:05 AM, Stas Kolenikov <skolenik@gmail.com> wrote:
>>>>> Regarding (c), you would be best off with a structural equations model
>>>>> (-sem- module), and forgo the PCA altogether.
>>>>>
>>>>> On Mon, Sep 24, 2012 at 7:07 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>>>>>> You seem to be misunderstanding both PCA and the syntax of -predict-
>>>>>> after -pca-.
>>>>>>
>>>>>> To take the second first, -predict- just gives you as many components
>>>>>> as you ask for. Ask for one by giving one variable name and you get
>>>>>> scores for the first PC, regardless of what name you give. Stata's
>>>>>> indifferent to what name you give (so long as it is new and legal) and
>>>>>> indeed
>>>>>>
>>>>>> predict p3
>>>>>> predict p777
>>>>>>
>>>>>> would give you further identical copies of the first PC.
>>>>>>
>>>>>> predict P1 P2
>>>>>>
>>>>>> would give you scores for the first two PCs.
>>>>>>
>>>>>> As for PCA, there are potentially as many PCs as variables: although
>>>>>> the -components()- option puts a self-defined limit on how many you
>>>>>> can calculate, the main purpose of this option appears to be to let
>>>>>> -pca- behave more like -factor-.
>>>>>>
>>>>>> Even if your purpose is to use just one PC, it usually makes sense to
>>>>>> look at several and the relationships of those PCs to your original
>>>>>> variables. Sometimes the second, third, ... PC pick up important parts
>>>>>> of the variation and it is a good idea to look at those too to see
>>>>>> what the first PC is missing. In the case of wealth variables it might
>>>>>> be a good idea to think about using PCA on logarithmic transformations
>>>>>> of the variables too (assuming all values are strictly positive).
>>>>>>
>>>>>> Note that the audience of Statalist is very international and
>>>>>> interdisciplinary, so that assuming that "DHS" is self-evident is
>>>>>> likely to be wrong in many cases.
>>>>>>
>>>>>> Your last question (c) is unanswerable. Many people do it, but how far
>>>>>> we can't see.
>>>>>>
>>>>>> Nick
>>>>>>
>>>>>> On Mon, Sep 24, 2012 at 9:20 PM, Shikha Sinha <shikha.sinha414@gmail.com> wrote:
>>>>>>
>>>>>>> I am trying to create a wealth score using the ownership of different
>>>>>>> assets in the DHS survey. I am using -pca- but I am not sure how to
>>>>>>> estimate the score, as I want to use the wealth score as one of the
>>>>>>> independent variables.
>>>>>>>
>>>>>>> pca x1-x4
>>>>>>> predict p1, score
>>>>>>>
>>>>>>> but -predict- only generates the score from the first component.
>>>>>>>
>>>>>>> I also tried the following:
>>>>>>>
>>>>>>> pca x1-x4, components(2)
>>>>>>> predict p2, score
>>>>>>>
>>>>>>> However, p1 and p2 are the same.
>>>>>>>
>>>>>>> My questions are: (a) Why is there no difference between p1 and p2?
>>>>>>> (b) How can I generate a score using the first 2 components only?
>>>>>>> (c) Is it OK to use a continuous PCA score as an independent variable?
```
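
The -predict- behaviour described in the thread can be tried out with a short sketch like the one below. It is only an illustration under stated assumptions: the thread's asset variables x1-x4 are not available, so the `auto` dataset shipped with Stata and four of its variables stand in for them, and the variable names `p1`, `p777`, `P1` and `P2` are simply the ones used as examples in the thread.

```stata
* Illustration only: auto.dta and its variables stand in for x1-x4
sysuse auto, clear

* PCA on the correlation matrix (the -pca- default)
pca price mpg weight length

* Eigenvalues against component number; -mean- adds a line at the mean
* eigenvalue, which is 1 when the correlation matrix is analysed
screeplot, mean

* One new variable name: scores for the first PC, whatever the name
predict p1, score
predict p777, score

* Two new variable names: scores for the first and second PCs
predict P1 P2, score

* p1, p777 and P1 should agree; P1 and P2 should be uncorrelated
compare p1 p777
compare p1 P1
correlate P1 P2
```

If the sketch behaves as expected, -compare- reports no differences and the correlation between `P1` and `P2` is zero up to rounding, which is why adding more names to -predict- yields further components rather than a single combined index.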