
Re: st: Dealing with survey data when the entire population is also in the dataset

From   "Michael I. Lichter" <MLichter@Buffalo.EDU>
Subject   Re: st: Dealing with survey data when the entire population is also in the dataset
Date   Sun, 26 Jul 2009 18:25:38 -0400

I guess Margo's real question is: If my null hypothesis is that there is no difference between a *sample* and a *population* with respect to the distributions of two or more *categorical* variables, what is the most appropriate way to test that hypothesis?

Austin proposed Hotelling's t-square, which is a global test of equality of *means* for independent *samples*. This takes care of the multiple comparisons problem, but doesn't fit Margo's needs because of level of measurement (except, possibly, if the categorical variables are dichotomous or ordinal and can be arguably treated as continuous-ish) and because it is a two-sample test instead of a single-sample test.

Margo's problem is the same (I think) as the problem of comparing the characteristics of a realized survey sample against the known characteristics of the sampling frame to detect bias. This is a common-enough procedure, and frequently done using chi-square tests without adjustment (correctly or not) for multiple comparisons. These are typically done as chi-square tests of independence, but since the characteristics of the sampling frame are *not* sample data, they should really be goodness of fit tests. (Right?)
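
To make the goodness-of-fit idea concrete, here is a minimal sketch, not a definitive method. It assumes a simple (unstratified, unweighted) random subset of the auto data stands in for the sample, treats the full dataset's category shares as known population values, and ignores any finite-population correction. The names insample, popN, obsN, expN, contrib, and x2 are made up for illustration.

```stata
sysuse auto, clear
set seed 2009
* Flag a random subset of roughly 50% as the "sample" (an assumption)
gen byte insample = runiform() < 0.5
* One row per category: population count and sample count
collapse (count) popN = price (sum) obsN = insample, by(foreign)
* Expected sample counts if the sample mirrored the population shares
egen double popTot = total(popN)
egen double obsTot = total(obsN)
gen double expN = obsTot * popN / popTot
* Pearson goodness-of-fit statistic, df = (number of categories) - 1
gen double contrib = (obsN - expN)^2 / expN
egen double x2 = total(contrib)
local df = _N - 1
display "chi2(`df') = " %6.3f x2[1] "   p = " %5.3f chi2tail(`df', x2[1])
```

Any other categorical variable can replace foreign in by(), provided the expected counts are not too small. A stratified sample would need weighted sample counts, which this sketch does not handle.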

I don't claim to be a real statistician, and I don't claim to have a real answer, but I think that results from multiple chi-square tests, interpreted jointly (so that, e.g., a single significant result with a relatively large p-value would not be considered strong evidence of difference), would be convincing enough for most audiences.
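
One simple way to formalize "interpreted jointly" is a Bonferroni adjustment, which is my illustration rather than something proposed earlier in this thread. The sketch below uses three made-up p-values (hypothetical, not results from any data) and multiplies each by the number of tests, capping at 1.

```stata
* Hypothetical p-values from k = 3 separate chi-square tests (made up)
local pvals 0.04 0.30 0.62
local k : word count `pvals'
foreach p of local pvals {
    * Bonferroni: multiply each p-value by the number of tests, cap at 1
    display "raw p = " %5.3f `p' "   Bonferroni p = " %5.3f min(1, `k'*`p')
}
```

Here the one nominally significant result (0.04) is no longer below 0.05 after adjustment, which matches the informal reading above.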

By the way, for clarification, here is what I was suggesting with respect to sampling and recombining the sample and population data:

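* Draw the stratified sample: 50% of domestic cars, 75% of foreign cars
sysuse auto, clear
sample 50 if (foreign == 0)
sample 75 if (foreign == 1)
* pweight = 1/(sampling fraction); wt must be generated before it can be replaced
gen wt = 1/.75 if (foreign == 1)
replace wt = 1/.5 if (foreign == 0)
gen sample = 1
gen stratum = foreign
tempfile sample
save `sample'
* Reload the full population and append the sample to it
sysuse auto, clear
append using `sample'
* Population rows get weight 1, their own stratum, and sample = 0
replace wt = 1 if missing(sample)
replace stratum = 2 if missing(sample)
replace sample = 0 if missing(sample)
svyset [pw=wt], strata(stratum)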

Austin Nichols wrote:
Margo Schlanger:
I think Michael I. Lichter means for you to -append- your sample and
population in step 2 below.  Then you can run -hotelling- or the
equivalent linear discriminant model (with robust SEs) to compare
means for a bunch of variables observed in both.  I.e.
.  reg sample x* [pw=wt]
in step 2b, not tabulate, with or without svy: and chi2.
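
A self-contained sketch of that regression on the auto data might look like the following. It is only an illustration under stated assumptions: the stacked dataset is built with -expand- instead of -append-, the sample is drawn with independent random draws at rates 0.50 and 0.75 (so realized fractions vary), and price, mpg, and weight stand in for x*. The variable u is a made-up helper.

```stata
sysuse auto, clear
set seed 2009
* Stack two copies: sample==0 is the full population, sample==1 gets thinned
expand 2, generate(sample)
gen double u = runiform()
drop if sample == 1 & foreign == 0 & u > 0.50
drop if sample == 1 & foreign == 1 & u > 0.75
* Weight 1 in the population copy, 1/(sampling fraction) in the sample copy
gen double wt = 1
replace wt = 1/0.50 if sample == 1 & foreign == 0
replace wt = 1/0.75 if sample == 1 & foreign == 1
* Linear probability model of sample membership; pweights give robust SEs
regress sample price mpg weight [pw=wt]
* Joint test that none of the covariates predicts membership
test price mpg weight
```

A large p-value from -test- would be read as no detectable difference in means between the weighted sample and the population on these variables.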

On Fri, Jul 24, 2009 at 11:24 PM, Michael I. Lichter wrote:

1. select your sample and save it in a new dataset, and then in the new dataset:
   a. define your stratum variable -stratavar- as you described
   b. define your pweight as you described, wt = 1/(sampling fraction) for each stratum
2. combine the full original dataset with the new one, but with stratavar =
   1 for the new dataset and wt = 1 and with a new variable sample = 0 for the
   original and =1 for the sample, and then
   a. -svyset [pw=wt], strata(stratavar)-
   b. do your chi square test or whatever using svy commands, e.g., -svy: tab
      var1 sample-


Margo Schlanger wrote:
Hi --

I have a dataset in which the observation is a "case".  I started with
a complete census of the ~4000 relevant cases; each of them gets a
line in my dataset.  I have data filling a few variables about each of
them.  (When they were filed, where they were filed, the type of
outcome, etc.)

I randomly sampled them using 3 strata (for one stratum, the sampling
probability was 1, for another about .5, and for a third, about .75).
I end up with a sample of about 2000.  I know much more about this

Ok, my question:

1) How do I use the svyset command to describe this dataset?  It would
be easy if I just dropped all the non-sampled observations, but I
don't want to do that, because of question 2:

2) How do I compare something about the sample to the entire
population, just to demonstrate that my sample isn't very different
from that entire population on any of the few variables I actually
have comprehensive data about. I could do this simply, if I didn't
have to worry about weighting:

tabulate year sample, chi2

But I need the weights.  In addition, I can't simply use weighting
commands, because in the population (when sample == 0), everything
should be weighted the same; the weights apply only to my sample (when
sample == 1).  And I can't (so far) use survey commands, because I
don't know the answer to (1), above.

NOTE: Nearly all the variables I care about are categorical:  year of
filing, type of case.  But it's easy enough to turn them into dummies,
if that's useful.

Thanks for any help with this.

Margo Schlanger

Michael I. Lichter, Ph.D.
Research Assistant Professor & NRSA Fellow
UB Department of Family Medicine / Primary Care Research Institute
UB Clinical Center, 462 Grider Street, Buffalo, NY 14215
Office: CC 126 / Phone: 716-898-4751 / FAX: 716-898-3536
