Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

# RE: st: somersd resampling question

From: "Feiveson, Alan H. (JSC-SK311)"
To: "statalist@hsphsun2.harvard.edu"
Subject: RE: st: somersd resampling question
Date: Mon, 1 Nov 2010 08:31:26 -0500

```Hi Roger - I do have the two lists as you have outlined below, but my objective is not to do statistical inference on whether group A is different from group B, but instead to use Kendall's Tau-a with confidence limits as a measure of similarity based on rank ordering (instead of the actual frequency values), treating the frequencies as observations. So in the example below, A and B are my two multinomial lists

+---------------+
|   y    A    B |
|---------------|
|   1    0    0 |
|   2    1    1 |
|   3   17   20 |
|   4    2    8 |
|   5    1    3 |
|---------------|
|   6    1   10 |
|   7    2    1 |
|   8    1   14 |
|   9    3    7 |
|  10    2    3 |
|---------------|
|  11    4    4 |
|  12    4    6 |
|  13    0    4 |
|  14    1    1 |
|  15    0    4 |
|---------------|
|  16    1    5 |
|  17    3    1 |
|  18    2    2 |
|  19    3    2 |
|  20   14    7 |
|---------------|
|  21    3    2 |
|  22    1    4 |
|  23    0    1 |
+---------------+

and I want to use Tau-a as an index of similarity between A and B, with appropriate confidence limits, taking into consideration that, in a multinomial list, the "observations" in A or B (i.e. the frequencies) are not independent because they sum to a fixed total.

So if I do something like
somersd A B, transf(z) taua

the SE and hence the confidence limits I get will not be correct.

Al

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Roger Newson
Sent: Monday, November 01, 2010 6:33 AM
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: somersd resampling question

Hi Al. As I understand it (correct me if I'm wrong), you have 2
multinomial lists of frequencies of an ordinal multinomial variable for
2 groups of independent observations, and aim to measure ordinal
correlation between membership of Group A (instead of Group B) and the
ordinal variable. I will call the Group A membership indicator -groupa-,
the ordinal variable -y-, and the cell frequency variable -cfreq-, and
assume that you start with a dataset with 1 observation per table cell,
sorted (and keyed uniquely) by -groupa- and -y-.

Normally, I would estimate Somers' D of -y- with respect to -groupa- by
typing

somersd groupa y [fwei=cfreq], tdist transf(z)

which calculates a standard delta-jackknife asymmetric confidence
interval, using the t-distribution and the Fisher z-transform. However,
if you want to use the bootstrap or some other resampling method, then
have 1 observation per unit (whatever kind of unit -groupa- and -y- were
measured on). As in:

expgen =cfreq, sortedby(group) copyseq(unit)

where -unit- is the sequence number of the unit within its cell. After
-expgen- has run, the dataset in memory will have 1 observation per
unit, and will be sorted (and keyed uniquely) by -groupa-, -y- and
-unit-. You can then use the bootstrap, or any other resampling method.
As in:

bootstrap, reps(1000): somersd groupa y

I hope this helps.

Best wishes

Roger

Roger B Newson BSc MSc DPhil
Lecturer in Medical Statistics
Respiratory Epidemiology and Public Health Group
National Heart and Lung Institute
Imperial College London
Royal Brompton Campus
Room 33, Emmanuel Kaye Building
London SW3 6LR
UNITED KINGDOM
Tel: +44 (0)20 7352 8121 ext 3381
Fax: +44 (0)20 7351 8322
Email: r.newson@imperial.ac.uk
Web page: http://www.imperial.ac.uk/nhli/r.newson/
Departmental Web page:

Opinions expressed are those of the author, not of the institution.

On 29/10/2010 20:41, Feiveson, Alan H. (JSC-SK311) wrote:
> Hi Roger, Thanks for the idea of setting up artificial clusters, but I don't see how this can be done with two multinomial lists. Anyway, for anyone who might be interested, I've done a small simulation with 23 categories (because that's what I have) and various combinations of sample sizes in each list. It turns out that the ratio of the empirical se to the somersd-calculated SE depends almost completely on the minimum of the two sample sizes and is closer to 1 when the minimum sample size is small.
>
> Each row in the data below corresponds to 1000 simulated multinomial data sets with randomly generated independent cell probabilities - fixed over all 1000 data sets within a row, but varying from row to row.
>
> Try plotting rat (= se_emp/se_calc) against nmin [= min(n1,n2)].
>
> By the way, the purpose of all this is to come up with a quantifiable measure of how similar the distributions are with respect to their general patterns as opposed to actual values, such as might be reflected by a chi-squared statistic.
>
>
>
> Al Feiveson
>
>
>       n1    n2    se_calc     se_emp   nmin        rat   set
>       60    30   .1439008   .1321683     30   .9184684     1
>      120    30   .1519339   .1160367     30   .7637313     1
>      120    60   .1367752   .1034096     60   .7560548     1
>      240    30   .1501265    .120686     30   .8038954     1
>      240    60   .1672979   .0987834     60   .5904641     1
>      240   120   .1612942   .1094221    120   .6784011     1
>      480    30   .1448482    .121629     30   .8396998     1
>      480    60   .1544797   .1151996     60   .7457264     1
>      480   120    .157679   .1038079    120   .6583494     1
>      480   240   .1655068   .0882562    240   .5332483     1
>      960    30   .1471903   .1238696     30   .8415608     1
>      960    60   .1492855   .1071405     60   .7176883     1
>      960   120   .1490777   .1053668    120   .7067916     1
>      960   240    .144429   .0809639    240   .5605789     1
>      960   480   .1958908   .0645837    480   .3296922     1
>       60    30   .1457042   .1229061     30   .8435318     2
>      120    30   .1521924   .1159594     30   .7619262     2
>      120    60   .1486831   .1267989     60   .8528129     2
>      240    30   .1444352   .1168832     30   .8092432     2
>      240    60   .1460266   .1109937     60   .7600925     2
>      240   120   .1626369   .0910218    120   .5596629     2
>      480    30   .1431084    .127222     30   .8889909     2
>      480    60   .1533591     .10581     60   .6899495     2
>      480   120   .1673665   .0932405    120   .5571038     2
>      480   240   .1370986   .0833428    240   .6079037     2
>      960    30   .1434537   .1124708     30   .7840216     2
>      960    60   .1532602   .1213565     60   .7918329     2
>      960   120   .1626063   .0967448    120   .5949637     2
>      960   240   .1578968   .0861469    240   .5455902     2
>      960   480   .1544878   .0632528    480   .4094355     2
>
> -----Original Message-----
> From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Roger Newson
> Sent: Friday, October 29, 2010 12:01 PM
> To: statalist@hsphsun2.harvard.edu
> Subject: Re: st: somersd resampling question
>
> Resampling is valid with -somersd-, as long as the units resampled are
> clusters rather than non-independent observations within clusters. If you
> use a resampling method, then you will presumably have to expand the dataset
> (using -expgen-, -reshape- or some similar command) to get the units to
> be resampled.
>
> I hope this helps.
>
> Best wishes
>
> Roger
>
>
> Roger B Newson BSc MSc DPhil
> Lecturer in Medical Statistics
> Respiratory Epidemiology and Public Health Group
> National Heart and Lung Institute
> Imperial College London
> Royal Brompton Campus
> Room 33, Emmanuel Kaye Building
> London SW3 6LR
> UNITED KINGDOM
> Tel: +44 (0)20 7352 8121 ext 3381
> Fax: +44 (0)20 7351 8322
> Email: r.newson@imperial.ac.uk
> Web page: http://www.imperial.ac.uk/nhli/r.newson/
> Departmental Web page:
>
> Opinions expressed are those of the author, not of the institution.
>
> On 29/10/2010 17:04, Feiveson, Alan H. (JSC-SK311) wrote:
>> Hi - I want to use Kendall's Tau-a to characterize similarity between two multinomial samples. My question is whether the resampling in -somersd- to get standard errors is valid when comparing two multinomial samples, since technically the "observations" (i.e. frequency counts) are not mutually independent. Anyone have an opinion on this?
>>
>> Thanks
>>
>> Al Feiveson
>>
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/statalist/faq
>> *   http://www.ats.ucla.edu/stat/stata/
```
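The question in the thread is how to get a standard error for Tau-a between the A and B frequency columns when each column is constrained to a fixed total. One possible approach, sketched below in Python rather than Stata, is a parametric multinomial resample: redraw each column from a multinomial distribution with the observed total and the observed proportions, recompute Tau-a each time, and summarize the spread. This is only an illustration under stated assumptions, not the -somersd- calculation and not a method proposed by either poster. The function names (`tau_a`, `resample_counts`, `bootstrap_tau_a`), the seed, and the number of replicates are all hypothetical choices; the frequencies are copied from the table in the first message.

```python
import random
from itertools import combinations
from statistics import stdev

# Frequencies for y = 1..23, copied from the table in the thread.
A = [0, 1, 17, 2, 1, 1, 2, 1, 3, 2, 4, 4, 0, 1, 0, 1, 3, 2, 3, 14, 3, 1, 0]
B = [0, 1, 20, 8, 3, 10, 1, 14, 7, 3, 4, 6, 4, 1, 4, 5, 1, 2, 2, 7, 2, 4, 1]


def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / all pairs; tied pairs add 0."""
    n = len(x)
    s = sum(sign(x[i] - x[j]) * sign(y[i] - y[j])
            for i, j in combinations(range(n), 2))
    return s / (n * (n - 1) / 2)


def resample_counts(counts, rng):
    """Draw one multinomial sample with the same total as `counts`,
    using the observed proportions as cell probabilities (an assumption)."""
    draws = rng.choices(range(len(counts)), weights=counts, k=sum(counts))
    new = [0] * len(counts)
    for cell in draws:
        new[cell] += 1
    return new


def bootstrap_tau_a(a, b, reps=2000, seed=12345):
    """Return (SD of resampled tau-a, 2.5th percentile, 97.5th percentile)."""
    rng = random.Random(seed)
    taus = sorted(
        tau_a(resample_counts(a, rng), resample_counts(b, rng))
        for _ in range(reps)
    )
    lo = taus[reps * 25 // 1000]
    hi = taus[reps * 975 // 1000 - 1]
    return stdev(taus), lo, hi


if __name__ == "__main__":
    est = tau_a(A, B)
    se, lo, hi = bootstrap_tau_a(A, B)
    print(f"tau-a = {est:.3f}, resampled SD = {se:.3f}, "
          f"95% percentile interval = ({lo:.3f}, {hi:.3f})")
```

Two caveats apply to this sketch. Cells with an observed frequency of zero stay at zero in every resample, because the observed proportions are plugged in as the cell probabilities. The percentile interval is also only one of several ways to turn the replicates into confidence limits.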