Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: cluster analysis: differences between clusters
From: Nick Cox <[email protected]>
To: "[email protected]" <[email protected]>
Subject: Re: st: cluster analysis: differences between clusters
Date: Fri, 7 Mar 2014 12:02:40 +0000
I should also remind you of the request to explain the provenance of
user-written commands you refer to, in this case -sqom-.
Nick
[email protected]
On 7 March 2014 11:47, Nick Cox <[email protected]> wrote:
> I don't think this would mean very much. Once clusters have been
> identified from the data, you are not really in a sound position to
> take them back into a significance test on the same data.
>
> Here's an analogue. Suppose I split -mpg- from the auto data at the
> mean and then compare means for higher and lower values. (This isn't
> an especially good clustering method, but let that pass.)
>
> . sysuse auto
> (1978 Automobile Data)
>
> . su mpg, meanonly
>
> . gen highlow = mpg > r(mean)
>
> . ttest mpg, by(highlow)
>
> Two-sample t test with equal variances
> ------------------------------------------------------------------------------
>    Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
> ---------+--------------------------------------------------------------------
>        0 |      43     17.4186    .3768385    2.471095    16.65811     18.1791
>        1 |      31    26.67742    .8313574    4.628802    24.97956    28.37528
> ---------+--------------------------------------------------------------------
> combined |      74     21.2973    .6725511    5.785503     19.9569    22.63769
> ---------+--------------------------------------------------------------------
>     diff |           -9.258815    .8326686               -10.91871    -7.59892
> ------------------------------------------------------------------------------
>     diff = mean(0) - mean(1)                                      t = -11.1194
> Ho: diff = 0                                     degrees of freedom =       72
>
>     Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
>  Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000
>
> Researchers are usually happy with that kind of t-test and P-value,
> but it's worthless. I just showed that "higher mpg" cars are
> systematically different from "lower mpg" cars, but that's inevitable.
> I am just seeing the consequences of what I identified on purpose. I
> can do that with random numbers too.
>
> . set seed 2803
>
> . gen y = runiform()
>
> . gen high = y > 0.5
>
> . ttest y, by(high)
>
> Two-sample t test with equal variances
> ------------------------------------------------------------------------------
>    Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
> ---------+--------------------------------------------------------------------
>        0 |      39     .229432     .022727    .1419298    .1834237    .2754404
>        1 |      35    .7746163    .0253076    .1497215    .7231851    .8260474
> ---------+--------------------------------------------------------------------
> combined |      74    .4872894    .0360238    .3098884    .4154941    .5590848
> ---------+--------------------------------------------------------------------
>     diff |           -.5451842    .0339151               -.6127928   -.4775757
> ------------------------------------------------------------------------------
>     diff = mean(0) - mean(1)                                      t = -16.0750
> Ho: diff = 0                                     degrees of freedom =       72
>
>     Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
>  Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000
>
> Now this is not what you're imagining, but I'd like to hear how your
> procedure is justified otherwise. Naturally, I am not saying that
> clusters from a cluster analysis might not be interesting or useful,
> just that inference is not best done this way.
>
> Nor am I saying that there is no way of identifying how far clusters
> can be trusted, but I think you need some simulations from a plausible
> stochastic process to provide a benchmark. As shown, you can chop
> random noise into clusters and the clusters will be different.
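
The benchmark idea can be sketched as follows. This is only a minimal
illustration, not a recommended procedure: it assumes a single normal
population as the "plausible stochastic process" (so there are no real
clusters), uses k-means with two groups as a stand-in for whatever
clustering method was actually applied, and borrows the sample size of 74
from the auto data. The program name -clustert- and the variable names are
hypothetical.

```stata
* Sketch: distribution of |t| when the groups are produced by clustering
* pure noise. All names here are illustrative assumptions.
clear all
set seed 2803

capture program drop clustert
program define clustert, rclass
    capture cluster drop _all           // discard any earlier cluster definitions
    drop _all
    set obs 74
    generate y = rnormal()              // one homogeneous population, no true clusters
    cluster kmeans y, k(2) generate(g)  // g holds the k-means group labels (1 or 2)
    ttest y, by(g)                      // the same test as in the examples above
    return scalar t = abs(r(t))
end

* Repeat the whole cluster-then-test procedure 500 times
simulate t = r(t), reps(500) nodots: clustert

summarize t, detail
* How often does |t| exceed the nominal 5% two-sided critical value?
count if t > invttail(72, 0.025)
```

If the t-test were valid after clustering, roughly 5% of replications
should exceed the critical value. Here the count should be close to all of
them, which is the point: the nominal P-value says nothing about whether
the clusters are real. A more useful benchmark would replace -rnormal()-
with a process that resembles the data actually being clustered.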
>
> Nick
> [email protected]
>
>
> On 7 March 2014 11:32, Andrea Jaberg <[email protected]> wrote:
>> Dear statalist-users
>>
>> I performed optimal matching for all sequences in the dataset against
>> all others using -sqom- and the option -full-. Afterwards I grouped
>> them using cluster analysis. Now I'd like to test whether the clusters
>> are reliably different. My first thought was using ANOVA. However,
>> this does not seem possible, since I compared all sequences against
>> each other, which results in a distance matrix.
>> What do you suggest in order to test whether the differences between
>> clusters are significant?
>>
>> Thank you for your help
>> Andrea
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/