The Stata listserver

Re: st: Commands for cluster analysis


From   Ed Blackburne <blackburne@shsu.edu>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Commands for cluster analysis
Date   Tue, 13 Sep 2005 12:40:19 -0500

Ahh, Ken makes an excellent point, but...

What if (and given Ping's panel dataset this is not a big if) a
researcher finds that classic poolability tests (i.e., tests of
parameter homogeneity) reject? Baltagi (and others) argue that the bias
introduced by pooling the data under these circumstances is often more
than offset by efficiency gains, so that, on an MSE basis, pooling the
data is acceptable.
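
For concreteness, a Chow-type test of slope homogeneity might look like
the sketch below. The variable names (fdi, x1, x2, country) are
hypothetical stand-ins, not Ping's actual variables, and the
factor-variable notation assumes Stata 11 or later.

```stata
* Hypothetical names: fdi (outcome), x1 x2 (regressors),
* country (numeric country identifier).
* Fit country-specific intercepts and slopes in one regression.
regress fdi c.x1##i.country c.x2##i.country

* Joint test that the slopes on x1 and x2 are the same in every country.
testparm i.country#c.x1 i.country#c.x2
```

A small p-value from -testparm- is the rejection of slope homogeneity
referred to above; this simple version assumes a common error variance
across countries.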

My guess is Ping has rejected slope homogeneity across all n=28
countries, but believes (or hopes?) there are subsets of countries that
are homogeneous.
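
Once a grouping of countries is in hand, however it was obtained, the
by-group OLS and random-effects GLS regressions Ping asked about could
be run roughly as follows. This is only a sketch: fdi, x1, x2, country,
year, and grp (a country-level group code taking values 1 to 3) are
hypothetical names.

```stata
* Hypothetical names: fdi (outcome), x1 x2 (regressors),
* country and year (panel identifiers), grp (group code 1, 2, or 3).
* Declare the panel structure (use -tsset country year- in older releases).
xtset country year

forvalues g = 1/3 {
    display "Group `g'"
    * Pooled OLS within the group
    regress fdi x1 x2 if grp == `g'
    * Random-effects GLS within the group
    xtreg fdi x1 x2 if grp == `g', re
}
```

Ken's caution below still applies if grp comes out of a cluster
analysis of the same data.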

In that case, I suspect Ping is looking for the method described in
"Partial Pooling: a Possible Answer to 'To Pool or not to Pool'" by
Farshid Vahid.

This article can be found in the Oxford University Press book:
Cointegration, Causality, and Forecasting
A Festschrift in Honour of Clive W.J. Granger
Edited by Robert F. Engle and Halbert White


-Ed


Edward F. Blackburne III
Associate Professor
Economics and International Business
Sam Houston State University
blackburne@shsu.edu
 




On Tue, 2005-09-13 at 11:38 -0500, khigbee@stata.com wrote:
> A couple of days ago Ping Zheng <P.Zheng@leeds.ac.uk> asked:
> 
> >We are conducting inward FDI locational determinants by using a panel
> >data set with 28 home countries. we'd like to try a country cluster
> >analysis to cluster the home countries into different groups naturally
> >by Stata. What are the commands for this and what are the commands for
> >regressing the different groups obtained from clustering by using OLS
> >and Random Effects GLS?
> 
> And Rose Medeiros <rosem@cisunix.unh.edu> gave some good advice
> concerning the many things you have to consider when doing a
> cluster analysis.
> 
> I would like to add a cautionary note.  It sounds like you are
> going to use the groups produced by cluster analysis as groupings
> in later statistical analyses.  More often than not, this is a
> statistically dangerous thing to do.  Let me illustrate.
> 
> First I generate 300 random uniform observations (over the range
> -0.5 to 0.5) for 6 variables
> 
>     clear
>     set obs 300
>     set seed 11313
>     forvalues i = 1/6 {
>         gen x`i' = uniform() - 0.5
>     }
> 
> And I create a grouping variable by blindly dividing the data
> into thirds.
> 
>     gen rand3 = 1 in 1/100
>     replace rand3 = 2 in 101/200
>     replace rand3 = 3 in 201/300
> 
> I look at the means of the 6 variables over these random 3 groups
> 
>     tabstat x* , by(rand3)
> 
> and do a -manova- to see if these 3 groups are significantly
> different.
> 
>     manova x1 x2 x3 x4 x5 x6 = rand3
> 
> They are not.  Which is what we all expect.
> 
> Now, what happens if I use a cluster analysis routine to go
> searching for 3 groups in the data?  Here I picked K-means
> clustering, but the concept is the same for the other clustering
> methods.
> 
>     cluster kmeans x* , k(3) name(g3)
> 
> What do the data show for these 3 groups?
> 
>     tab g3
>     tabstat x* , by(g3)
> 
> Compare the output of -tabstat- here with that from the random
> groupings.
> 
> What does -manova- say about the grouping created by -cluster-?
> 
>     manova x1 x2 x3 x4 x5 x6 = g3
> 
> We have found that the groups are statistically different.  But
> if you tried to publish results like these, a knowledgeable
> journal reviewer would reject your paper.  The result is
> significant only because the cluster analysis went searching (as
> hard as it could) for groupings that best separate the data.  In
> random data there are bound to be groupings that will separate
> the data enough to cause follow-on statistical tests to show
> significance.
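> 
> To check that this is not an accident of one particular seed, the
> whole experiment can be repeated.  A minimal sketch, using only the
> commands shown above (the seed values 1 to 5 are arbitrary choices):
> 
> ```stata
> * Repeat the random-data experiment for several arbitrary seeds.
> forvalues s = 1/5 {
>     clear
>     set obs 300
>     set seed `s'
>     * Six independent uniform variables on (-0.5, 0.5)
>     forvalues i = 1/6 {
>         gen x`i' = uniform() - 0.5
>     }
>     * Search for 3 groups, then test whether they differ
>     cluster kmeans x* , k(3) name(g3)
>     manova x1 x2 x3 x4 x5 x6 = g3
> }
> ```
> 
> Each pass prints the -manova- table for groups found by -cluster
> kmeans- in freshly generated random data.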
> 
> By the way, you could also run
> 
>    cluster stop
> 
> and get a Pseudo-F statistic indicating how well the 3 groups
> split the data.  Notice the word "Pseudo" and that no p-values
> are provided in the output of -cluster stop-.  You can read more
> about cluster stopping rules in [MV] cluster stop.  In particular
> read the first Technical Note on page 186 concerning why the
> stopping rule statistics have the word "Pseudo" in them.
> 
> Also read the Technical Note on page 74 of [MV] cluster.  It
> warns against using the groups produced by the -cluster- command
> in the -cluster()- option of an estimation command.
> 
> Ken Higbee    khigbee@stata.com
> StataCorp     1-800-STATAPC
> 
> *
> *   For searches and help try:
> *   http://www.stata.com/support/faqs/res/findit.html
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
