
From: khigbee@stata.com
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Commands for cluster analysis
Date: Tue, 13 Sep 2005 11:38:07 -0500

A couple of days ago Ping Zheng <P.Zheng@leeds.ac.uk> asked:

>We are conducting inward FDI locational determinants by using a panel
>data set with 28 home countries. we'd like to try a country cluster
>analysis to cluster the home countries into different groups naturally
>by Stata. What are the commands for this and what are the commands for
>regressing the different groups obtained from clustering by using OLS
>and Random Effects GLS?

And Rose Medeiros <rosem@cisunix.unh.edu> gave some good advice concerning the many things you have to consider when doing a cluster analysis.

I would like to add a cautionary note. It sounds like you are going to use the groups produced by cluster analysis as groupings in later statistical analyses. More often than not, this is a statistically dangerous thing to do. Let me illustrate.

First I generate 300 random uniform observations (over the range -0.5 to 0.5) for 6 variables

    clear
    set obs 300
    set seed 11313
    forvalues i = 1/6 {
        gen x`i' = uniform() - 0.5
    }

And I create a grouping variable by blindly dividing the data into thirds.

    gen rand3 = 1 in 1/100
    replace rand3 = 2 in 101/200
    replace rand3 = 3 in 201/300

I look at the means of the 6 variables over these 3 random groups

    tabstat x* , by(rand3)

and do a -manova- to see if these 3 groups are significantly different.

    manova x1 x2 x3 x4 x5 x6 = rand3

They are not, which is what we all expect.

Now, what happens if I use a cluster analysis routine to go searching for 3 groups in the data? Here I picked K-means clustering, but the concept is the same for the other clustering methods.

    cluster kmeans x* , k(3) name(g3)

What do the data show for these 3 groups?

    tab g3
    tabstat x* , by(g3)

Compare the output of -tabstat- here with that from the random groupings. What does -manova- say about the grouping created by -cluster-?

    manova x1 x2 x3 x4 x5 x6 = g3

We have found that the groups are statistically different.
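One rough way to check that this is no accident of a particular seed is to repeat the exercise many times. The sketch below is an illustration added to the steps above, not part of the original session: the program name -clustsim-, the grouping variable -grp-, the choice of 200 replications, and the use of a simple -oneway- test on x1 alone (in place of -manova-) are all assumptions made to keep the example short.

```stata
* Hypothetical helper (not from the original post): one replication of
* "generate pure noise, cluster it, then test the clusters".
capture program drop clustsim
program define clustsim, rclass
    drop _all
    set obs 300
    forvalues i = 1/6 {
        gen x`i' = uniform() - 0.5
    }
    * Let k-means search for 3 groups in data that has no real groups
    quietly cluster kmeans x*, k(3) generate(grp)
    * Naive follow-on test: does x1 differ across the "found" groups?
    quietly oneway x1 grp
    return scalar p = Ftail(r(df_m), r(df_r), r(F))
end

* Repeat the exercise and collect the naive p-values
simulate p = r(p), reps(200) seed(11313): clustsim

* Share of replications that look "significant" at the 5% level
count if p < 0.05
display "share with p < 0.05: " r(N)/_N
```

If the naive test were valid, about 5% of replications would fall below 0.05. Because the groups are chosen to separate the data, the share would be expected to come out well above that, although the exact figure depends on the seed and on the variable tested.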
But if you tried to publish results like these, a knowledgeable journal reviewer would reject your paper. The result is significant only because the cluster analysis went searching (as hard as it could) for groupings that best separate the data. In random data there are bound to be groupings that will separate the data enough to cause follow-on statistical tests to show significance.

By the way, you could also run

    cluster stop

and get a Pseudo-F statistic indicating how well the 3 groups split the data. Notice the word "Pseudo" and that no p-values are provided in the output of -cluster stop-.

You can read more about cluster stopping rules in [MV] cluster stop. In particular, read the first Technical Note on page 186 concerning why the stopping rule statistics have the word "Pseudo" in them.

Also read the Technical Note on page 74 of [MV] cluster. It warns against using the groups produced by the -cluster- command in the -cluster()- option of an estimation command.

Ken Higbee    khigbee@stata.com
StataCorp     1-800-STATAPC

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

Follow-Ups: Re: st: Commands for cluster analysis (From: Ed Blackburne <blackburne@shsu.edu>)
