Statalist The Stata Listserver


[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

RE: st: Cluster analysis - cluster kmeans-


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   RE: st: Cluster analysis - cluster kmeans-
Date   Sun, 13 May 2007 17:53:14 +0100

Cluster analysis with a single variable makes perfect sense
whenever there is some dimension along which values can be
arranged. This could be a measurement scale, time or space. Thus a
time series could be divided into spells, epochs, periods,
whatever, ideally with relatively small differences within
subseries and relative large differences between subseries.  The
same problem arises whenever a single spatial dimension
(horizontal or vertical) is to be subdivided.  In geological and
other sciences, this is often studied under the heading of
zonation. 

In this thread, the point seems to be to look for breaks within a
frequency distribution (antimodes, in one terminology). With just
21 values there may be little to be gained by an overly formal
approach. Just plot the data (-dotplot-, -beamplot- (SSC), -qplot-
(SJ), etc.) and look for any major gaps. As Ken Higbee emphasised,
a graph will also help make sense of any formal approach. 

However, there is a literature on more formal solutions going back
almost 50 years, although it is evidently unfamiliar to Statalist
members. In particular, see

W.D. Fisher. 1958. On grouping for maximum homogeneity. Journal,
American Statistical Association 53: 789-98 [will be accessible to
some via www.jstor.org]

J.A. Hartigan. 1975. Clustering algorithms.  New York: John Wiley.
Ch.6. 

Consider a toy example: 

14 15 16 23 24 25 56 57 58 

Here it is evident that a three-group clustering 

14 15 16 | 23 24 25 | 56 57 58 

is sensible. Whether the ordering is on the values themselves, or
in time or in space, the data can always be laid out in one
dimension, which gives special structure to the problem. Thus,
although more general clustering methods can be used, that special
structure ideally should be exploited. k groups devised for n
values are defined by placing (k - 1) markers (in the example
above, k - 1 = 2); there are (n - 1) possible places to put them.
The combinatorics of this kind of problem are thus less scary than
for clustering in general.  However, if k is free to vary, then
the number of possible partitions is 2^(n - 1), as each value can
be in the same group as each neighbour, or not. For even modest n,
that is a large number. 

The problem can be made simpler by placing markers to minimise the 

(*) sum over groups of variability around group centres

A sum of squared deviations from group means will spring to mind
as the most obvious possibility, although sum of absolute
deviations from group medians, and other measures, might well be
entertained. 

Hartigan shows how a dynamic programming approach makes such
computation pretty straightforward. This Stata code is based on my
translation into Basic of my modification in Fortran of Hartigan's
original Fortran. A Mata-based version should follow. 

In principle, and indeed in the toy example, there is the
possibility that two or more clusterings into the same number k
of groups could tie on the minimand (*). 

-------------------------------------------- group1d.ado 
*! 2.0.0 NJC 13 May 2007 
*! 1.0.0 NJC 16 July 1997
* translation of Basic program NJC 1 March 1990
program group1d, sortpreserve       
	version 8    
	syntax varname(numeric) [if] [in] , groups(int) 

	quietly { 
		marksample touse 
		count if `touse' 
		if r(N) == 0 error 2000 
		local N = r(N) 
		replace `touse' = -`touse' 
		sort `touse', stable 
		local v "`varlist'" 
	}	

	quietly forval j = 1/`groups' {
		tempvar mg`j' sg`j'
		gen `mg`j'' = .
		gen `sg`j'' = .
		replace `mg`j'' = 1 in 1
		replace `sg`j'' = 0 in 1
	}

	tempname ss s var
	quietly forval i = 2/`N' {
		scalar `s' = 0
		scalar `ss' = 0
		forval ii = 1/`i'{
			local iii = `i' - `ii' + 1
			scalar `s' = `s' + `v'[`iii']
			scalar `ss' = `ss' + `v'[`iii']^2
			scalar `var' = `ss' - `s'^2 / `ii'
			local ik = `iii' - 1
			if `ik' != 0 {
				forval j = 2/`groups' {
					local jm1 = `j' - 1
		if `sg`j''[`i'] >= `var' + `sg`jm1''[`ik'] {
			replace `mg`j'' = `iii' in `i'
			replace `sg`j'' = `var' + `sg`jm1''[`ik'] in `i'
			}
				}
		        }
		}
		replace `sg1' = `var' in `i'
		replace `mg1' = 1 in `i'
	}

	di _n as txt "Partition of " as res `N' as txt " data up to " ///
	as res `groups' as txt " groups"

	forval j = `groups'(-1)1 {
		local jj = `groups' - `j' + 1
		di _n as txt "The " as res `jj' ///
		as txt " group partition with sum of squares " ///
     		as res %3.2f `sg`jj''[`N']
		di as txt ///
		"Group    Size   First    Last          Mean        SD"
		local il = `N' + 1
		forval l = 1/`jj' {
			local ll = `jj' - `l' + 1
			scalar `s' = 0
			scalar `ss' = 0
			local iu = `il' - 1
			local il = `mg`ll''[`iu']
			if missing(`il') continue 
			forval ii = `il'/`iu' {
				scalar `s' = `s' + `v'[`ii']
				scalar `ss' = `ss' + `v'[`ii']^2
        		}
			local sn = `iu' - `il' + 1
			scalar `s' = `s' / `sn'
			scalar `ss' = sqrt(`ss' / `sn' - `s'^2)
			di as res %2.0f `ll' %11.0f `sn' %8.0f ///
			`v'[`il'] %8.0f `v'[`iu'] %14.2f `s' %10.2f `ss'
    		}

		if `sg`jj''[`N'] == 0 continue, break 
	}
end
-------------------------------

Here is the output for that toy example. 
Given a (maximum) possible number of groups, 
the program prints out details of a partition that minimises
the total within-group sum of squares for each possible
number of groups. As mentioned, such a partition need
not be unique. 

. group1d var1, groups(9)

Partition of 9 data up to 9 groups

The 1 group partition with sum of squares 3044.00
Group    Size   First    Last          Mean        SD
 1          9      14      58         31.33     18.39

The 2 group partition with sum of squares 79.50
Group    Size   First    Last          Mean        SD
 2          3      56      58         57.00      0.82
 1          6      14      23         18.50      3.59

The 3 group partition with sum of squares 6.00
Group    Size   First    Last          Mean        SD
 3          3      56      58         57.00      0.82
 2          3      21      23         22.00      0.82
 1          3      14      16         15.00      0.82

The 4 group partition with sum of squares 4.50
Group    Size   First    Last          Mean        SD
 4          3      56      58         57.00      0.82
 3          3      21      23         22.00      0.82
 2          2      15      16         15.50      0.50
 1          1      14      14         14.00      0.00

The 5 group partition with sum of squares 3.00
Group    Size   First    Last          Mean        SD
 5          3      56      58         57.00      0.82
 4          2      22      23         22.50      0.50
 3          1      21      21         21.00      0.00
 2          2      15      16         15.50      0.50
 1          1      14      14         14.00      0.00

The 6 group partition with sum of squares 1.50
Group    Size   First    Last          Mean        SD
 6          2      57      58         57.50      0.50
 5          1      56      56         56.00      0.00
 4          2      22      23         22.50      0.50
 3          1      21      21         21.00      0.00
 2          2      15      16         15.50      0.50
 1          1      14      14         14.00      0.00

The 7 group partition with sum of squares 1.00
Group    Size   First    Last          Mean        SD
 7          2      57      58         57.50      0.50
 6          1      56      56         56.00      0.00
 5          2      22      23         22.50      0.50
 4          1      21      21         21.00      0.00
 3          1      16      16         16.00      0.00
 2          1      15      15         15.00      0.00
 1          1      14      14         14.00      0.00

The 8 group partition with sum of squares 0.50
Group    Size   First    Last          Mean        SD
 8          2      57      58         57.50      0.50
 7          1      56      56         56.00      0.00
 6          1      23      23         23.00      0.00
 5          1      22      22         22.00      0.00
 4          1      21      21         21.00      0.00
 3          1      16      16         16.00      0.00
 2          1      15      15         15.00      0.00
 1          1      14      14         14.00      0.00

The 9 group partition with sum of squares 0.00
Group    Size   First    Last          Mean        SD
 9          1      58      58         58.00      0.00
 8          1      57      57         57.00      0.00
 7          1      56      56         56.00      0.00
 6          1      23      23         23.00      0.00
 5          1      22      22         22.00      0.00
 4          1      21      21         21.00      0.00
 3          1      16      16         16.00      0.00
 2          1      15      15         15.00      0.00
 1          1      14      14         14.00      0.00

. 

Nick 
n.j.cox@durham.ac.uk 

SR Millis

> Joseph,
> 
> I've never encountered cluster analysis performed with
> a single variable, so I'm interested in seeing how it
> was done in the context of a paper/study making it
> into the literature. Parlor tricks aside, just because
> you can do something with a statistical procedure,
> doesn't mean that it's advisable to do so.  Witness
> stepwise variable selection techniques.
> 
> As an alternative to cluster analysis, in terms of
> discerning categories from continua, the simplest of
> the taxometric procedures, MAXSLOPE, requires 2
> indicators of the target construct--but it won't work
> with 1.
> 
> Regarding sample size, Meehl (1995) recommends a
> minimum of 300 cases when doing taxometric analyses on
> the basis of Monte Carlo research.
> 

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index