Re: st: Constructing socio-economic status scale using Principal Components Analysis


From   Nick Cox <njcoxstata@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Constructing socio-economic status scale using Principal Components Analysis
Date   Wed, 28 Nov 2012 17:16:00 +0000

It's your project, but I am not at all clear about what you want to do
on the right-hand side of your model.

Sometimes a factor analysis or PCA is a roundabout way of looking at
correlation structure. Otherwise there is little point in doing either
unless you are going to use the factors or PCs somehow. I am not
advocating that, in any case.
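
A minimal sketch of looking at correlation structure directly and then
through PCA, assuming the candidate variables are stored consecutively
under the hypothetical names var1 to var37 (the thread does not give the
real names):

```stata
* Hypothetical names: var1-var37 stand in for the SES candidates.
* Pairwise correlations, the direct view of the structure
correlate var1-var37

* PCA of the same variables (uses the correlation matrix by default)
pca var1-var37

* Eigenvalues by component, to see how much each one captures
screeplot
```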

Reducing a factor that accounts for a small fraction of variation to a
categorical or binary variable is going to throw away even more
information.
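
To see what such a reduction costs, one option is to fit the same model
with the score left continuous and with it cut at the median, then
compare the fits. This is a sketch only: malnourished (a 0/1 outcome)
and ses_high are hypothetical names, and factor1 is assumed to hold the
first-factor score discussed in the quoted message.

```stata
* Hypothetical names: malnourished (0/1), ses_high; factor1 = first-factor score
summarize factor1, detail
generate byte ses_high = factor1 > r(p50) if !missing(factor1)

* Score kept continuous
logit malnourished factor1
estimates store m_cont

* Score reduced to a median split
logit malnourished i.ses_high
estimates store m_bin

* Compare log likelihood, AIC and BIC side by side
estimates stats m_cont m_bin
```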

You have 6000 households, which leaves plenty of scope for models with
multiple predictors. It is those predictors that you need to choose
from those you have available, and it's not for me to say how you
should do it.
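
Purely to illustrate the form such a model could take, here is a sketch
with a small set of interpretable predictors entered directly rather
than through a composite score. All variable names are hypothetical
placeholders, not variables named in the thread, and the choice of
predictors is an assumption:

```stata
* Hypothetical names throughout; substitute the variables actually chosen.
* i. marks categorical predictors, c. marks continuous ones
logit malnourished i.water_source i.toilet_type i.floor_type ///
    c.land_owned i.owns_livestock i.caste

* Average marginal effects on the probability scale
margins, dydx(*)
```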

Others have commented (and likely will comment) on the social
measurement aspects of this. But from what you've said, constructing a
single index of SES sounds neither necessary nor sufficient for
developing a decent predictive model.

Nick

On Wed, Nov 28, 2012 at 2:48 PM, Ameya Bondre
<ameyabondre.jhsph@gmail.com> wrote:
> Thanks, Nick and Maarten, for the input.
>
> The aim here is to assess the effect of SES on the probability that a
> child would be malnourished, or a mother would feed a more diversified
> diet to her children or attend a growth monitoring session. So, I want
> to use SES as a binary or categorical variable in logistic
> regressions. In some regressions, I would use it to control for SES,
> caste and other such "background variables". I am sorry, the data set
> has many more variables; 37 of those can potentially measure SES.
>
> So, can factor1 as a continuous variable (ranging from -2 to 1.8) be
> used in the regression? I am finding that a bit difficult to
> interpret, so I thought I would instead have an SES scale
> constructed from factor1.
>
> Regarding "selecting a group of variables" which can predict SES, do
> you mean I can do that by just looking at the factor1 variable
> weights/scores? Is there a criterion for choosing weights, for example
> that variables with weights above 0.10 point to high SES, which
> would make SES a binary variable?
>
> Thank you,
> Ameya
>
>
> On Wed, Nov 28, 2012 at 2:03 AM, Nick Cox <njcoxstata@gmail.com> wrote:
>> For once I disagree partially with Maarten.
>>
>> On reading this again I have further comments:
>>
>> 1. The difference between -factor, pcf- and -pca- is small and
>> arguably immaterial as far as the results here are concerned. In
>> practice, however, the techniques are associated with very different
>> attitudes: -factor- often with a theology of latent variables and
>> -pca- often with a mechanistic aim of data reduction.
>>
>> 2. However, it doesn't seem much of a gain for interpretation to
>> discard interpretable variables and replace them with a very fuzzy
>> concept of socio-economic status (SES), even if numbers are attached.
>>
>> 3. This is not just an attitude, as the factor analysis results show
>> that the technique has not been especially successful (18% of variance
>> captured by first factor).
>>
>> 4. If Ameya's variables are typical of data like this that I have
>> seen, most marginal distributions will be skewed and clumpy and the
>> correlation structure extremely sensitive to whether data are left as
>> they come or transformed in some suitable way(s).
>>
>> 5. Ameya's main concern is presumably to do the best job with the
>> dataset in hand, but this kind of procedure is not highly reproducible
>> by others working in similar territory, except naturally with the same
>> dataset of "about 37 variables". It is usually better to try to
>> identify say 5-10 socio-economic variables and use those as predictors
>> in a regression-like model.
>>
>> That said, much depends on the main aim of this project, which is not
>> clear. (Presumably, the measure of SES is not an end in itself.)
>>
>> Nick
>>
>> On Wed, Nov 28, 2012 at 9:18 AM, Maarten Buis <maartenlbuis@gmail.com> wrote:
>>> On Wed, Nov 28, 2012 at 3:59 AM, Ameya Bondre wrote:
>>>> I have a data-set with about 37 variables that can assess household
>>>> socio-economic status in a sample of about 6000 households. These
>>>> include variables measuring household wealth, access to water and
>>>> sanitation, rural households owning animals, etc.
>>>>
>>>> I used factor analysis (factor var1 var2 ..., pcf)
>>>
>>> I would say that factor analysis is incorrect for this problem. Factor
>>> analysis assumes that the latent concepts influence the observed
>>> variables. This makes sense for something like an intelligence test:
>>> someone is more or less smart (the latent variable) and that
>>> influences the probability of answering a set of questions correctly
>>> (the observed variables). Conceptually, socio-economic status is just
>>> a pool of resources available to a person, family, or household: so it
>>> is the number and kind of animals, the wealth, a house with a concrete
>>> floor, etc. (the observed variables) that influence, or add up to, the
>>> socio-economic status (the latent variable).
>>>
>>> Some of the possible solutions available in Stata are discussed here:
>>> <http://www.maartenbuis.nl/wp/prop.html>.