# Re: st: somersd resampling question

From: Roger Newson
To: "statalist@hsphsun2.harvard.edu"
Subject: Re: st: somersd resampling question
Date: Mon, 1 Nov 2010 11:33:03 +0000

Hi Al. As I understand it (correct me if I'm wrong), you have 2 multinomial lists of frequencies of an ordinal multinomial variable for 2 groups of independent observations, and aim to measure ordinal correlation between membership of Group A (instead of Group B) and the ordinal variable. I will call the Group A membership indicator -groupa-, the ordinal variable -y-, and the cell frequency variable -cfreq-, and assume that you start with a dataset with 1 observation per table cell, sorted (and keyed uniquely) by -groupa- and -y-.
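
A minimal sketch of that starting layout may help. It assumes -groupa- is coded 1 for Group A and 0 for Group B, and it uses 3 ordinal categories with made-up cell frequencies purely for illustration (these are not Al's data, which have 23 categories).

```stata
* Made-up cell frequencies, for illustration only.
* Assumed coding: groupa = 1 for Group A, 0 for Group B; y = ordinal category.
clear
input byte groupa byte y int cfreq
0 1  5
0 2 12
0 3 13
1 1 20
1 2 25
1 3 15
end

* Put the data in the layout assumed in the text:
* 1 observation per table cell, sorted and uniquely keyed by groupa and y.
sort groupa y
isid groupa y

* Frequency-weighted cross-tabulation shows the two multinomial lists side by side.
tabulate groupa y [fweight=cfreq]
```

Here -isid- exits with an error if -groupa- and -y- do not uniquely identify the observations, so it doubles as a check on the assumed layout.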
Normally, I would estimate Somers' D of -y- with respect to -groupa- by typing
```
somersd groupa y [fweight=cfreq], tdist transf(z)
```
which calculates a standard delta-jackknife asymmetric confidence interval, using the t-distribution and the Fisher z-transform. However, if you want to use the bootstrap or some other resampling method, then the -expgen- package, downloadable from SSC, can expand your dataset to have 1 observation per unit (whatever kind of unit -groupa- and -y- were measured on). As in:
```
expgen =cfreq, sortedby(group) copyseq(unit)
```
where -unit- is the sequence number of the unit within its cell. After -expgen- has run, the dataset in memory will have 1 observation per unit, and will be sorted (and keyed uniquely) by -groupa-, -y- and -unit-. You can then use the bootstrap, or any other resampling method. As in:
```
bootstrap, reps(1000): somersd groupa y
```

I hope this helps.

Best wishes

Roger

On 29/10/2010 20:41, Feiveson, Alan H. (JSC-SK311) wrote:

Hi Roger, Thanks for the idea of setting up artificial clusters, but I don't see how this can be done with two multinomial lists. Anyway, for anyone who might be interested, I've done a small simulation with 23 categories (because that's what I have) and various combinations of sample sizes in each list. It turns out that the ratio of the empirical SE to the somersd-calculated SE depends almost completely on the minimum of the two sample sizes and is closer to 1 when the minimum sample size is small.

Each row in the data below corresponds to 1000 simulated multinomial data sets with randomly generated independent cell probabilities - fixed over all 1000 data sets within a row, but varying from row to row.

Try plotting rat (= se_emp/se_calc) against nmin [= min(n1,n2)].

By the way, the purpose of all this is to come up with a quantifiable measure of how similar the distributions are with respect to their general patterns as opposed to actual values, such as might be reflected by a chi-squared statistic.

Al Feiveson

n1    n2    se_calc     se_emp   nmin        rat   set
60    30   .1439008   .1321683     30   .9184684     1
120    30   .1519339   .1160367     30   .7637313     1
120    60   .1367752   .1034096     60   .7560548     1
240    30   .1501265    .120686     30   .8038954     1
240    60   .1672979   .0987834     60   .5904641     1
240   120   .1612942   .1094221    120   .6784011     1
480    30   .1448482    .121629     30   .8396998     1
480    60   .1544797   .1151996     60   .7457264     1
480   120    .157679   .1038079    120   .6583494     1
480   240   .1655068   .0882562    240   .5332483     1
960    30   .1471903   .1238696     30   .8415608     1
960    60   .1492855   .1071405     60   .7176883     1
960   120   .1490777   .1053668    120   .7067916     1
960   240    .144429   .0809639    240   .5605789     1
960   480   .1958908   .0645837    480   .3296922     1
60    30   .1457042   .1229061     30   .8435318     2
120    30   .1521924   .1159594     30   .7619262     2
120    60   .1486831   .1267989     60   .8528129     2
240    30   .1444352   .1168832     30   .8092432     2
240    60   .1460266   .1109937     60   .7600925     2
240   120   .1626369   .0910218    120   .5596629     2
480    30   .1431084    .127222     30   .8889909     2
480    60   .1533591     .10581     60   .6899495     2
480   120   .1673665   .0932405    120   .5571038     2
480   240   .1370986   .0833428    240   .6079037     2
960    30   .1434537   .1124708     30   .7840216     2
960    60   .1532602   .1213565     60   .7918329     2
960   120   .1626063   .0967448    120   .5949637     2
960   240   .1578968   .0861469    240   .5455902     2
960   480   .1544878   .0632528    480   .4094355     2

Resampling is valid with -somersd-, as long as the units resampled are
clusters rather than non-independent observations within clusters. To use a
resampling method, you will presumably have to expand the dataset
(using -expgen-, -reshape- or some similar command) to get the units to
be resampled.

I hope this helps.

Best wishes

Roger

On 29/10/2010 17:04, Feiveson, Alan H. (JSC-SK311) wrote:

Hi - I want to use Kendall's Tau-a to characterize similarity between two multinomial samples. My question is whether the resampling in -somersd- to get standard errors is valid when comparing two multinomial samples, since technically the "observations" (i.e. frequency counts) are not mutually independent. Anyone have an opinion on this?

Thanks

Al Feiveson

