# Re: st: Non-parametric tests for survey data? (e.g., Kruskal-Wallace)

 From Stas Kolenikov To statalist@hsphsun2.harvard.edu Subject Re: st: Non-parametric tests for survey data? (e.g., Kruskal-Wallace) Date Tue, 10 Feb 2009 22:48:49 -0600

```Oh, one of those wonderful how-do-I-stick-weights-into questions...
Survey statistics is way more complicated than figuring out where the
weights go.

In that world, we need to work with population-based quantities and
their estimates. It can be something that might look model-free
(distribution function) or heavily model-based (regression
coefficient), but there must be something in the population that the
procedure should be consistent for (population distribution function,
census regression). That is, when you can access the complete
population, your procedure must give you an exact answer. If you are
estimating the mean, and can use the census, your estimator should
give you the true population mean, for instance. Ranks per se are
sample-based quantities, and if you get the full population as your
sample, your ranks run from 1 to however many millions your population
is -- not to the few dozens or hundreds or thousands your sample size
may be.

The nonparametric procedures, despite being distribution-free, are
not at all model-free: you still assume that your data are i.i.d. from
a distribution, with group differences described by a simple shift.
That's not going to work in the survey statistics world, at least not
in the design-based statistics world.

Asking whether two or more distribution functions are the same in two
or more domains in the population might be meaningful, or might not
be: you have a fixed population, so there is no reason to expect that
even two measurements from these distinct domains will be the same (if
we talk about a continuous variable), let alone that two distribution
functions could coincide completely. On the other hand, here we do
talk about the distribution functions, which are population-based
quantities that are estimable with survey data, and asking whether we
can see the difference between the two or more distribution functions
using the sample data is something that should be answerable.

You might even be able to get something like a Kruskal-Wallis statistic
and pretend that your sample value is an estimate of the
population-based quantity. But then you need to figure out (i) what to
do with that population Kruskal-Wallis -- if it is non-zero, how do
you interpret it? and (ii) how to describe the distribution of the
sample-based Kruskal-Wallis with respect to the sample design -- that
is the relevant probability space out there. Obviously such a
distribution exists in the finite population sampling world -- for one
thing, that probability space is discrete and finite, so you can
enumerate all samples and get your distribution in closed form. At
least that's the conceptual thinking. In large samples though, you
should be getting convergence to the population value, rather than
the O_p(1) chi-square distribution of the regular asymptotics.

Besides, in this particular problem, I would guess you could only have
any hope of describing that sampling distribution if your sample
sizes are fixed by design, and that is rarely guaranteed in most
practical situations.

On 2/10/09, Michael I. Lichter <mlichter@buffalo.edu> wrote:
> I don't see any procedures for doing non-parametric tests (aside from
> chi-square in svy: tab) with complex survey data (stratified, unequal
> probabilities of selection). I am particularly looking for tests of
> difference in ordinal dependent variables across k groups (k > 2).
> Kruskal-Wallis is the most obvious test, but is only available for non-survey
> data.
>  I assume that these procedures are not available because (a) it's not clear
> what to do with weights in nonparametric analyses anyway (which I infer
> partly from the fact that none of Stata's nonparametric procedures take
> weights), (b) because there's no theory about whether/how they should work,
> and/or (c) because nobody has gotten around to it yet.
>
>  I'm looking for suggestions.
>
>  One possibility that comes to mind is to generate ranks using -egen- and
> analyze using -svy: mean- or -svy: reg- (I'd use one-way ANOVA if somebody
> could explain how to do it with -svy- commands). I could also do -svy:
> intreg- for the variables that represent ranges underlying continuous
> variables (since most of my ordinal variables do represent well-defined but
> unequal-sized ranges of underlying continuous variables, e.g., 1 = "> 1", 2
> = "2-4" 3 = "5 or more"), but that would require -intreg- to be robust to
> floor effects, and I doubt that it is (since the method assumes an
> underlying Normal distribution). (I guess -mlogit-, -ologit- and -gologit2-
> are also possibilities.)
>
>  Thanks.
>
>  --
>  Michael I. Lichter, Ph.D.
>  Research Assistant Professor & NRSA Fellow
>  UB Department of Family Medicine / Primary Care Research Institute
>  UB Clinical Center, 462 Grider Street, Buffalo, NY 14215
>  Office: CC 125 / Phone: 716-898-4751 / E-Mail: mlichter@buffalo.edu
>
>  *
>  *   For searches and help try:
>  *   http://www.stata.com/help.cgi?search
>  *   http://www.stata.com/support/statalist/faq
>  *   http://www.ats.ucla.edu/stat/stata/
>

--
Stas Kolenikov, also found at http://stas.kolenikov.name
Small print: I use this email account for mailing lists only.
```
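The census-consistency requirement described in the reply can be illustrated with a small sketch. It is written in Python rather than Stata purely for illustration, and everything in it is an assumption made for the example: the eight-unit population, the equal-probability sample, and the helper names `weighted_mean`, `weighted_cdf` and `ranks` are hypothetical and do not come from the thread. The point it tries to show is that a weighted mean or a weighted distribution function reproduces the population value exactly when the "sample" is the whole population, while the rank of the same unit depends on how many units happened to be sampled.

```python
def weighted_mean(values, weights):
    """Weighted (ratio-type) estimate of the population mean."""
    return sum(w * y for y, w in zip(values, weights)) / sum(weights)


def weighted_cdf(values, weights, t):
    """Weighted estimate of the population distribution function F(t):
    the weighted share of units whose value is <= t."""
    below = sum(w for y, w in zip(values, weights) if y <= t)
    return below / sum(weights)


def ranks(values):
    """Ranks 1..n within the data at hand (no ties in this example)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, i in enumerate(order, start=1):
        result[i] = position
    return result


# Hypothetical finite population of N = 8 units (illustrative values only).
population = [3, 7, 1, 9, 4, 6, 2, 8]
N = len(population)

# True population quantities.
pop_mean = sum(population) / N
pop_cdf_at_4 = sum(y <= 4 for y in population) / N

# Census: every unit is observed, so every weight is 1.
census_weights = [1.0] * N
census_mean = weighted_mean(population, census_weights)
census_cdf_at_4 = weighted_cdf(population, census_weights, 4)

# A sample of n = 4 units with equal inclusion probabilities n/N,
# hence sampling weights N/n. The chosen indices are arbitrary.
sample_index = (0, 3, 4, 6)
sample = [population[i] for i in sample_index]
sample_weights = [N / len(sample)] * len(sample)
sample_mean = weighted_mean(sample, sample_weights)

# The unit with value 4 is in both data sets, but its rank is not a
# fixed population quantity: it changes with the number of units observed.
rank_of_4_in_sample = ranks(sample)[sample.index(4)]
rank_of_4_in_census = ranks(population)[population.index(4)]

print("population mean:", pop_mean, "census estimate:", census_mean)
print("population F(4):", pop_cdf_at_4, "census estimate:", census_cdf_at_4)
print("sample estimate of the mean:", sample_mean)
print("rank of value 4: sample", rank_of_4_in_sample,
      "vs census", rank_of_4_in_census)
```

The sample estimate of the mean differs from the population mean, as any single sample will, but the estimator targets a well-defined population quantity and hits it exactly under a census. The ranks have no such target: they run up to 4 in the sample and up to 8 in the census.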