
Re: st: RE: Cluster analysis on survey data

From   "Tullar, Jessica M" <[email protected]>
To   <[email protected]>
Subject   Re: st: RE: Cluster analysis on survey data
Date   Fri, 29 Aug 2008 10:58:41 -0500

First thank you both for responding.

To answer your questions... 

On the method for describing the kinds of people who report medical debt and medical bankruptcy: it was explained to me that regression (logit with survey weights was my first choice for answering this question; mlogit is an even better idea) looks at means, while cluster analysis (the alternative method) looks at "medians". The implied "compared to..." did not come up in our discussion, but it is a good point that I will bring back to my group.

As to the other suggestion, that cluster analysis does not need survey weights: you are correct that if all we are concerned with is describing how close observations are to one another, then whether each observation represents more or fewer individuals does not seem particularly concerning. However, my concern arose from reading the chapter from Reading and Understanding More Multivariate Statistics (Grimm and Yarnold 2000). They discuss the importance of the representativeness of your sample (survey data would not be representative unless weighted) and note that some cluster analysis methods aim for equal-sized groups (which again depends on analyzing a truly representative sample). However, if we avoid methods that look at the size of groups and focus on the distance between observations, then describing those groups using cluster analysis without weights does not seem too inappropriate, as long as the description is clear about the non-representativeness.
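To make the distinction between distances and group summaries concrete, here is a minimal sketch in Python rather than Stata. The data, the variable names, and the helper functions `euclidean` and `centroid` are all hypothetical and are not taken from BRFSS or from any Stata command. It only illustrates that a pairwise distance never touches the survey weight, whereas a group summary such as a centroid does change once the weights are applied.

```python
import math

# Hypothetical toy data: (age, income in $1000s, survey weight). Not BRFSS values.
respondents = [
    (25.0, 30.0, 100.0),
    (30.0, 35.0, 100.0),
    (60.0, 80.0, 900.0),
]

def euclidean(a, b):
    """Distance between two respondents; the weight (index 2) is never used."""
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

def centroid(rows, use_weights):
    """Mean of (age, income), optionally weighting each row by its survey weight."""
    w = [r[2] if use_weights else 1.0 for r in rows]
    total = sum(w)
    return (
        sum(wi * r[0] for wi, r in zip(w, rows)) / total,
        sum(wi * r[1] for wi, r in zip(w, rows)) / total,
    )

# The distance between the first two respondents is the same with or without weights.
d01 = euclidean(respondents[0], respondents[1])

# The centroid moves toward the heavily weighted third respondent once weights are used.
unweighted = centroid(respondents, use_weights=False)
weighted = centroid(respondents, use_weights=True)
```

Under these assumptions, a method that relies only on distances between observations would be unaffected by the weights, while anything that summarizes or sizes the resulting groups would not be.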

As an aside, sorry about the unexplained reference in the original request. BRFSS is the Behavioral Risk Factor Surveillance System, a large ongoing telephone survey run through the U.S. Centers for Disease Control.

Thanks again for your help and explanations.


On Aug 28, 2008, at 3:01 PM, Steven Samuels wrote:

It appears to me that a cluster analysis will not serve Jessica's purpose: "to describe who are the kinds of people that report medical debt and medical bankruptcy". Implied in this is "compared to people who do not report these events". (If Jessica does not think so, I hope that she will show an example of what a cluster-analysis might find.)

Better, I think, would be a discriminant analysis to describe the differences between the two groups (perhaps three, if Jessica considers medical debt without bankruptcy and medical bankruptcy to be different). This could be done with -logit- and -probit- (-mlogit- and -mprobit- for three groups), all survey-enabled. (Stata has other kinds of discriminant analysis; see help for -discrim- and -candisc-, but these take no survey features except pweights.) Such analyses could include interactions and might show, for example, that the odds of being older and male are greater for the debt/bankruptcy group than for the comparison group.
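As a rough illustration of the kind of odds comparison described above, the following Python sketch computes a weighted odds ratio from a two-by-two breakdown. It is not the survey-enabled -logit- itself: the rows, the variable names, and the helper `weighted_odds` are hypothetical, and the sketch gives only a point estimate with no design-based standard errors.

```python
# Hypothetical rows: (has_debt, older_male, survey_weight). Not real BRFSS data.
rows = [
    (1, 1, 200.0), (1, 1, 100.0), (1, 0, 100.0),
    (0, 1, 150.0), (0, 0, 300.0), (0, 0, 150.0),
]

def weighted_odds(rows, debt):
    """Weighted odds of older_male == 1 within one debt group."""
    yes = sum(w for d, x, w in rows if d == debt and x == 1)
    no = sum(w for d, x, w in rows if d == debt and x == 0)
    return yes / no

odds_debt = weighted_odds(rows, debt=1)      # weighted 300 to 100
odds_no_debt = weighted_odds(rows, debt=0)   # weighted 150 to 450
odds_ratio = odds_debt / odds_no_debt
```

With a single binary predictor, this ratio matches the exponentiated coefficient that a weighted logit would report; the survey-enabled commands add the variance estimation that this sketch leaves out.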

The most flexible way of describing group differences, to my mind, is Classification and Regression Trees (CART). The only implementation in Stata that I know of is the user-contributed -cart- command, but it applies only to Cox regression and does not take weights.


On Aug 28, 2008, at 11:55 AM, Nick Cox wrote:

What is BRFSS? 

On the main question, it is evident that -cluster- does not support any
kind of weights, so that is one short answer. 

I am unclear on how in principle any kind of weights could inform
cluster analysis. Although there are different recipes, cluster analysis
as implemented in Stata is in essence a more or less elaborate way of
quantifying information on similarity or differences between
observations in a multivariate space. 

Suppose for example that I am in a survey, you are too, and several
other people are as well. Cluster analysis offers methods for plotting
me, you and the others in a space. How are those differences affected by
the sampling design behind who is and who isn't in the dataset,
particularly as no parameter estimation or hypothesis testing is


On Aug 28, 2008, at 10:22 AM, Jessica Tullar wrote:

I am using a BRFSS dataset and therefore it has a complex sampling
design. I would like to describe who are the kinds of people that report
medical debt and medical bankruptcy and therefore thought cluster
analysis might be appropriate. 

I've looked through all the manuals and even searched on survey analysis
and on cluster analysis and can't find the answer. Is there a way to
perform a cluster analysis and account for the survey weights?

A second possibility would involve creating a new representative dataset
on which to perform the standard cluster analysis. How best would one
create a new dataset using the survey weighting which would approximate
the population?
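One possible reading of "a new dataset using the survey weighting" is to resample the respondents with probability proportional to their weights. The Python sketch below shows only that mechanical step. The records and the function name `pseudo_population` are hypothetical, and nothing in this thread establishes that such a resampled dataset is a sound basis for cluster analysis.

```python
import random

# Hypothetical records: (respondent id, survey weight). Not real BRFSS data.
sample = [("a", 50.0), ("b", 150.0), ("c", 800.0)]

def pseudo_population(rows, size, seed=0):
    """Draw `size` rows with replacement, with probability proportional to weight."""
    rng = random.Random(seed)
    weights = [w for _, w in rows]
    return rng.choices(rows, weights=weights, k=size)

resampled = pseudo_population(sample, size=10_000)

# Share of draws that are respondent "c"; this should land near the
# weighted share 800 / 1000, up to sampling noise.
share_c = sum(1 for rid, _ in resampled if rid == "c") / len(resampled)
```

The resampled rows approximate the weighted distribution, but they carry no information about strata or primary sampling units, so the complex design itself is not reproduced.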

Thanks for your consideration.

