
From: Gabi Huiber <ghuiber@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: st: three questions about cluster analysis ties(), cutv() and _hgt
Date: Wed, 22 Jul 2009 18:23:30 -0400

Hello,

I have a massive data set that contains census-block-level demographic data on a bunch of people: things like median income, age, the number of people with a given schooling level, the number of people within various age brackets, etc. in each person's census block. The job is to look for any kind of clusters based on a variety of criteria that these variables might suggest. I am at a very early exploratory stage, and I have only a rudimentary understanding of how the cluster set of commands works. I use Stata 10.

My problem is this: I took a stratified random sample of the original data set to get a manageable number of observations, reasonably well scattered across all of the subsets of interest. I then ran hierarchical clustering on this sample to get an initial idea of k, the number of clusters I will want to request next, when I try k-means clustering on the entire data set.

This is a simplification. I have several data sets, and I want to try several types of linkage. So I wrote a wrapper where I can set these options, but the core of it does this:

    cluster `linkage' `groupby', name(_`linkage'_groupby)
    cluster tree _`linkage'_groupby, cutn(`howmany')

My three questions:

1. This sometimes produces the error message "cannot cut exactly `howmany' groups due to ties in the dendrogram". I tried varying `howmany' (50, 30 and 10) with no luck. I also tried varying `linkage', but complete and ward both produced the same error message. I am not sure how to fix it. The ties() option is not available for cluster tree; it is only available for cluster gen. So how do you resolve ties in the cluster tree command?

2. Is there a way to back into the dissimilarity coefficient value that corresponds to a given number of stems? Say I want to use cutv() instead of cutn(), and set the dissimilarity coefficient to the value that corresponds to about 10 stems. How do I go about it?

3. The command cluster `linkage' `groupby' produces three new variables, with names starting in _`linkage'_`groupby' and ending in _id, _ord and _hgt. Is _`linkage'_`groupby'_hgt equal to the dissimilarity coefficient value computed by this command and shown on the y-axis of the dendrogram? And what does _ord contain?

Thank you,
Gabi

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
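To make questions 1 and 2 concrete, here is a minimal, language-neutral sketch in Python (not Stata) of how a cut value relates to a number of groups in a generic dendrogram. It assumes a monotone hierarchy, as with complete or Ward linkage, in which n observations are joined by n-1 merges, each at some height. The function names and the example heights are hypothetical, and this is not a description of how Stata computes or stores _hgt. Under those assumptions, cutting at value v leaves one group plus one more for every merge above v, so exactly k groups are reachable only when the (k-1)th and kth largest merge heights differ.

```python
def groups_at(heights, v):
    """Number of groups left when the dendrogram is cut at value v.

    Every merge whose height exceeds v is undone, and each undone
    merge adds one group to the single all-inclusive group.
    """
    return 1 + sum(1 for h in heights if h > v)


def cut_value_for_groups(heights, k):
    """Return a cut value that yields exactly k groups, or None on a tie.

    heights: the n-1 merge heights of a dendrogram on n observations
             (hypothetical input, assumed to come from a monotone linkage).
    k:       requested number of groups, with 2 <= k <= n-1.
    """
    h = sorted(heights, reverse=True)
    if not 2 <= k <= len(h):
        raise ValueError("k must be between 2 and n-1")
    upper, lower = h[k - 2], h[k - 1]  # (k-1)th and kth largest heights
    if upper == lower:
        return None  # tied merges: no cut separates exactly k groups
    return (upper + lower) / 2  # any value in [lower, upper) would do


# Made-up merge heights for six observations; two merges tie at 1.0.
heights = [0.5, 1.0, 1.0, 2.0, 3.5]
for k in (2, 3, 4, 5):
    v = cut_value_for_groups(heights, k)
    if v is None:
        print(k, "groups: not reachable because of ties")
    else:
        print(k, "groups: cut at", v, "gives", groups_at(heights, v))
```

In this toy example, four groups cannot be requested exactly because two merges share the height 1.0, while three or five groups can. This mirrors the situation behind the "due to ties" message, and suggests that a request for a nearby number of groups may succeed where the original one fails.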


© Copyright 1996–2015 StataCorp LP