Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Nick Cox <njcoxstata@gmail.com> |
To | "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu> |
Subject | Re: st: cluster analysis: differences between clusters |
Date | Fri, 7 Mar 2014 12:02:40 +0000 |
I should also remind you of the request to explain the provenance of
user-written commands you refer to, in this case -sqom-.

Nick
njcoxstata@gmail.com

On 7 March 2014 11:47, Nick Cox <njcoxstata@gmail.com> wrote:
> I don't think this would mean very much. Once clusters are identified
> from the data you are not then really in a sound position taking them
> back into a significance test.
>
> Here's an analogue. Suppose I split -mpg- from the auto data at the
> mean and then compare means for higher and lower values. (This isn't
> an especially good clustering method, but let that pass.)
>
> . sysuse auto
> (1978 Automobile Data)
>
> . su mpg, meanonly
>
> . gen highlow = mpg > r(mean)
>
> . ttest mpg, by(highlow)
>
> Two-sample t test with equal variances
> ------------------------------------------------------------------------------
>    Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
> ---------+--------------------------------------------------------------------
>        0 |      43     17.4186    .3768385    2.471095    16.65811     18.1791
>        1 |      31    26.67742    .8313574    4.628802    24.97956    28.37528
> ---------+--------------------------------------------------------------------
> combined |      74     21.2973    .6725511    5.785503     19.9569    22.63769
> ---------+--------------------------------------------------------------------
>     diff |           -9.258815    .8326686               -10.91871    -7.59892
> ------------------------------------------------------------------------------
>     diff = mean(0) - mean(1)                                      t = -11.1194
> Ho: diff = 0                                     degrees of freedom =       72
>
>     Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
>  Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000
>
> Researchers are usually happy with that kind of t-test and P-value,
> but it's worthless. I just showed that "higher mpg" cars are
> systematically different from "lower mpg" cars, but that's inevitable.
> I am just seeing the consequences of what I identified on purpose. I
> can do that with random numbers too.
>
> . set seed 2803
>
> . gen y = runiform()
>
> . gen high = y > 0.5
>
> . ttest y, by(high)
>
> Two-sample t test with equal variances
> ------------------------------------------------------------------------------
>    Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
> ---------+--------------------------------------------------------------------
>        0 |      39     .229432     .022727    .1419298    .1834237    .2754404
>        1 |      35    .7746163    .0253076    .1497215    .7231851    .8260474
> ---------+--------------------------------------------------------------------
> combined |      74    .4872894    .0360238    .3098884    .4154941    .5590848
> ---------+--------------------------------------------------------------------
>     diff |           -.5451842    .0339151               -.6127928   -.4775757
> ------------------------------------------------------------------------------
>     diff = mean(0) - mean(1)                                      t = -16.0750
> Ho: diff = 0                                     degrees of freedom =       72
>
>     Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
>  Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000
>
> Now this is not what you're imagining, but I'd like to hear how your
> procedure is justified otherwise. Naturally, I am not saying that
> clusters from a cluster analysis might not be interesting or useful,
> just that inference is not best done this way.
>
> Nor am I saying that there is no way of identifying how far clusters
> can be trusted, but I think you need some simulations from a plausible
> stochastic process to provide a benchmark. As shown, you can chop
> random noise into clusters and the clusters will be different.
>
> Nick
> njcoxstata@gmail.com
>
>
> On 7 March 2014 11:32, Andrea Jaberg <andreauzh@gmail.com> wrote:
>> Dear statalist-users
>>
>> I performed optimal matching for all sequences in the dataset against
>> all others using -sqom- and the option -full-. Afterwards I grouped
>> them using cluster analysis. Now I'd like to test whether the clusters
>> are reliably different. My first thought was using ANOVA. However,
>> this seems not possible since I compared all sequences against each
>> other which results in a distance matrix.
>> What do you suggest in order to test whether the differences between
>> clusters are significant?
>>
>> Thank you for your help
>> Andrea
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/
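
The benchmark idea in the reply above ("simulations from a plausible
stochastic process") could be sketched roughly as follows. This is only a
minimal illustration under loud assumptions: uniform random numbers stand in
for a plausible stochastic process, -cluster kmeans- with 3 groups stands in
for whatever clustering was actually used, and the program name
-noisecluster-, the variable names, the sample size of 74, and the 500
replications are all hypothetical choices. It does not use -sqom- or a
distance matrix.

```stata
* Sketch: how large does the one-way ANOVA F statistic get when
* k-means is asked to cluster pure noise? All names are hypothetical.
clear all
set seed 2803

program define noisecluster, rclass
    * Assumption: uniform noise is the null process; replace with
    * something plausible for the data at hand.
    drop _all
    set obs 74
    generate y = runiform()

    * Cluster the noise into 3 groups; group membership goes in -grp-.
    cluster kmeans y, k(3) generate(grp)

    * Compare the clusters on the very variable used to form them.
    oneway y grp
    return scalar F = r(F)
end

* Repeat 500 times and keep the F statistic from each replication.
simulate F = r(F), reps(500) nodots: noisecluster

* Reference distribution of F under "no real structure".
summarize F, detail
```

The F statistics collected here come from data with no cluster structure at
all, yet they should be very large by conventional standards. An F statistic
from real clusters would arguably be notable only if it sat well out in the
tail of a reference distribution like this one, built from a null process
that is credible for the data in question.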