

From: Nick Cox <njcoxstata@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Clustering help
Date: Tue, 19 Mar 2013 12:21:28 +0000

This is good advice in general, but the original problem as posed is to cluster on two variables, latitude and longitude, so a scatter plot, also known as a map, is readily to hand. Researchers could waste a lot of time trying to solve this problem by cluster analysis when there is a much easier way. (Also, to be frank, if you choose cluster analysis, some fraction of your readership will be irritated because you used a method they dislike, and some fraction will be laughing quietly at technical overkill. I make that lose-lose.)

Also, in practice schools tend to be where the pupils live, so there is likely to be clustering that follows the population, and the clusters chosen by inspection should make sense on other grounds. It's true that private schools in some systems are often in remote properties with extensive grounds (e.g. Hogwarts), but I will assume that the real problem is not too complicated unless told otherwise.

Nick

On Tue, Mar 19, 2013 at 11:15 AM, Simon Falck <simon.falck@abe.kth.se> wrote:
> I think Nick's suggestion is reasonable.
>
> However, you could also consult theory on how clusters can be defined and how the number of clusters can be determined. In principle, there is no optimal number of clusters. According to Mardia et al. (see reference below) the number of clusters k can be estimated as k = (n/2)^(1/2). Thus if you have 35 schools the number of clusters is (35/2)^(1/2), which is approximately 4.
>
> How many schools each cluster should contain can be determined using a range of (statistical) methods. For instance, you could use the Ward method, which minimizes the variance within each cluster and thus maximizes the (empirical) homogeneity within each cluster of schools. With this method, the schools within each cluster will be relatively "similar", and you do not interfere in the "selection procedure" and thus in choosing how many schools there "should be" in each cluster.
>
> For more information, see e.g.
>
> Mardia, Kent, and Bibby (1979) Multivariate Analysis. Academic Press. London. Pages 360-384.
> Romesburg (2004) Cluster Analysis for Researchers. Lulu Press. North Carolina. Pages 31-34.
>
> Simon
>
>
> On 18 mar 2013, at 18:37, Nick Cox <njcoxstata@gmail.com> wrote:
>
>> I'd plot a map and identify clusters by eye. (Seriously.)
>>
>> Nick
>>
>> On Mon, Mar 18, 2013 at 7:15 AM, Ron Wendt <rnldwendt@gmail.com> wrote:
>>
>>> I'm looking to cluster some geocoded data into a specific
>>> number of clusters all of the same size. For example, I want to make
>>> 7 clusters of 5 schools each.
>>> The best I've found so far is: cluster kmeans lat long, k(7).
>>> However, this doesn't let me specify the number of schools that should
>>> be in each cluster. Is there another/better way to do this?
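
For readers who want to try the approaches mentioned in this thread, here is a minimal Stata sketch, not a recommendation from any of the posters. It assumes a dataset with one observation per school and coordinate variables named lat and long, as in the original question; the names ward and ward7 are made up for illustration. It also treats latitude and longitude as plain planar coordinates, which is only a rough approximation over larger areas. Neither k-means nor Ward's linkage guarantees clusters of equal size, so this does not by itself deliver 7 clusters of exactly 5 schools each.

```stata
* Assumption: one observation per school, with coordinates in lat and long.

* The suggestion to plot a map: look for groups by eye.
scatter lat long

* Rule of thumb quoted in the thread: k = sqrt(n/2), about 4.18 for n = 35.
display sqrt(35/2)

* Ward's linkage on the coordinates, then cut the tree into 7 groups.
* "ward" and "ward7" are hypothetical names chosen for this sketch.
cluster wardslinkage lat long, name(ward)
cluster generate ward7 = groups(7), name(ward)

* Check how many schools fall in each group; the sizes will usually differ.
tabulate ward7
```

Tabulating the group variable is the quick check on whether a given solution happens to come close to the equal sizes asked for; if it does not, adjusting group membership by hand on the scatter plot is in the spirit of the advice above.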

