Statalist



st: three questions about cluster analysis ties(), cutv() and _hgt


From   Gabi Huiber <ghuiber@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   st: three questions about cluster analysis ties(), cutv() and _hgt
Date   Wed, 22 Jul 2009 18:23:30 -0400

Hello,

I have a massive data set that contains demographic data at the census
block level on a bunch of people -- things like the median income,
age, number of people with a given schooling level, number of people
within various age brackets, etc. in everybody's census block.

The job is to look for any kind of clusters based on a variety of
criteria that these variables might suggest. I am at a very early
exploratory stage here, and I have a very rudimentary understanding of
how the cluster set of commands works. I use Stata 10.

My problem is this: I took a stratified random sample of the original
data set so I'd get a manageable number of observations reasonably
well scattered across all of the subsets of interest. I then did
hierarchical clustering over this sample, to get an initial idea of k,
the number of clusters that I will want to request next, when I try
k-means clustering over the entire data set.

This is a simplification. I have several data sets, and I want to try
several types of linkage. So I wrote a wrapper where I can set these
options, but the core of it does this:

cluster `linkage' `groupby', name(_`linkage'_groupby)
cluster tree _`linkage'_groupby, cutn(`howmany')
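
To make that concrete, here is a minimal sketch of what the core expands to
once the macros are filled in. The auto data, the three variables, the
cluster name demo and the value 5 are all placeholders chosen for
illustration; they are not my actual data or settings.

```stata
* Illustrative stand-ins only: the auto data and these variables are
* placeholders, not the census-block data described above.
sysuse auto, clear
local linkage wardslinkage
local groupby price mpg weight
local howmany 5

* Hierarchical clustering; name() is a fixed placeholder here.
cluster `linkage' `groupby', name(demo)

* Dendrogram cut back to `howmany' branches
* (cluster tree is a synonym for cluster dendrogram).
cluster tree demo, cutn(`howmany')
```

Whether the last line runs cleanly or stops with the ties message depends on
the data, which is the problem described in question 1 below.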

My three questions:

1. This sometimes produces the error message "cannot cut exactly
`howmany' groups due to ties in the dendrogram". I tried varying the
`howmany'. Went through 50, 30 and 10 -- no luck. I also tried varying
`linkage', but complete and ward both produced the same error message.
I am not sure how to fix it. The ties() option is not available for
cluster tree -- it's only available for cluster gen. So, how do you go
about resolving ties in the cluster tree command?

2. Is there a way to back into the dissimilarity coefficient value
that corresponds to a given number of stems? Say I want to use cutv()
instead of cutn(), and set the value for the dissimilarity coefficient
that corresponds to about 10 stems. How do I go about it?

3. The command cluster `linkage' `groupby' produces three new
variables, with names starting with _`linkage'_`groupby' and ending in
_id, _ord and _hgt. Is _`linkage'_`groupby'_hgt the dissimilarity
coefficient value computed by this command and shown on the y-axis of
the dendrogram? And what does _ord hold?
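
For reference on question 1, the cluster gen form that does accept ties()
looks roughly like this. It is again only a sketch on the auto data with
placeholder names (demo, g5), not my actual wrapper, and ties(more) is just
one of the documented choices (error, skip, fewer, more).

```stata
* Placeholder data and names, for illustration only.
sysuse auto, clear
cluster wardslinkage price mpg weight, name(demo)

* Ask for 5 groups; if ties prevent exactly 5, ties(more) produces
* the next larger number of groups instead of stopping with an error.
cluster generate g5 = groups(5), name(demo) ties(more)

* Inspect the resulting group sizes.
tabulate g5
```

What I cannot find is the equivalent switch for cluster tree.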

Thank you,

Gabi


