
Re: st: Dealing with survey data when the entire population is also in the dataset

From   Margo Schlanger <>
Subject   Re: st: Dealing with survey data when the entire population is also in the dataset
Date   Mon, 27 Jul 2009 19:10:29 -0400

Thanks, all -- this is very helpful.
Margo Schlanger
Professor of Law
University of Michigan Law School
Director, Civil Rights Litigation Clearinghouse

On Mon, Jul 27, 2009 at 1:53 PM, Austin Nichols<> wrote:
> Margo Schlanger<> :
> Ángel Rodríguez Laso suggests testing survey estimators' confidence
> intervals, and Michael Lichter suggests tests of goodness-of-fit or
> independence, but the purpose of the test may call for another
> interpretation.  We know the sample was selected randomly, we know the
> sample means differ from the population means, etc.  The purpose of
> the comparison is to see how unlucky the draw might have been, and I
> am suggesting one way to do that is with -hotelling- or equivalent,
> since that F or p value is an omnibus measure that most people know
> well, and suffers less from overrejection as sample sizes get large.
> I suppose a goodness-of-fit test to see if the sample rejects the true
> distribution (from the population) serves a similar purpose; see also
> -mgof- on SSC.  However, there is no real hypothesis being tested here
> (would we reject that the true distribution is the true distribution
> given our sample?  who cares?  this is not the relevant question for
> assessing bias in some subsequent analysis using only the sample) so
> any ad hoc test or summary stat is probably OK.  I suggest the linear
> discriminant regression on the grounds of expediency.  One approach I
> have seen is a table of means, with stars for tests of equality of
> means/proportions, which is a variable-by-variable equivalent of
> -hotelling- (one looks down the rows of the table looking for small
> differences and no stars, in the ideal case, but again, this has no
> real probative value for assessing bias in some subsequent analysis).
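
A minimal sketch of the two omnibus comparisons described above, assuming auto.dta as a stand-in census (as elsewhere in this thread). The seed and the choice of price, mpg, and weight are hypothetical, and the sampling fractions are borrowed from the example further down the thread.

```stata
* Sketch only: build a toy sample-plus-census file from auto.dta
sysuse auto, clear
gen byte sample = 0                  // census rows
gen double wt = 1                    // every census row counts once
tempfile census
save `census'

set seed 2009                        // hypothetical seed, for reproducibility
sample 50 if foreign == 0            // keep 50% of domestic cars
sample 75 if foreign == 1            // keep 75% of foreign cars
replace sample = 1
replace wt = cond(foreign == 1, 1/.75, 1/.5)   // pweight = 1/(sampling fraction)
append using `census'

* Unweighted two-group Hotelling T-squared on a few variables
hotelling price mpg weight, by(sample)

* Weighted linear discriminant regression; pweights imply robust SEs
regress sample price mpg weight [pw=wt]
testparm price mpg weight            // joint test that no variable predicts sample
```

The joint test from -testparm- should agree with the overall F that -regress- reports, since both test all slope coefficients at once.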
> 2009/7/27 Ángel Rodríguez Laso <>:
>> Wouldn't it be enough to calculate the means or proportions (with their
>> corresponding confidence intervals) in the sample dataset with -svy-
>> commands and then see if the estimators of the means and proportions
>> in the census dataset are included in the confidence intervals?
>> If multiple comparisons are made, then I would correct the width of
>> the confidence intervals by the Bonferroni adjustment:
>> Comparisons    Per-comparison p value    CI level
>> 1              0.05                      95%
>> 10             0.005                     99.5%
>> Best regards,
>> Angel Rodriguez-Laso
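
A minimal sketch of the Bonferroni-widened interval check described above, again assuming auto.dta as a stand-in census. The seed, the three variables, and k = 3 are assumptions for illustration, not part of the original suggestion.

```stata
* Sketch only: census means first, then survey CIs from a stratified draw
sysuse auto, clear
summarize price mpg weight           // census (population) means to compare against

set seed 2009                        // hypothetical seed, for reproducibility
sample 50 if foreign == 0            // keep 50% of domestic cars
sample 75 if foreign == 1            // keep 75% of foreign cars
gen double wt = cond(foreign == 1, 1/.75, 1/.5)   // pweight = 1/(sampling fraction)
svyset [pw=wt], strata(foreign)

* Bonferroni: with k comparisons, use a 100*(1 - .05/k)% interval for each
local k = 3
local lev : display %5.2f 100*(1 - .05/`k')   // level() allows at most two decimals
svy, level(`lev'): mean price mpg weight
```

One then checks whether each census mean from -summarize- falls inside the corresponding interval. No fpc() is set here, so with sampling fractions this large the intervals are likely wider than necessary.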
>> 2009/7/27 Michael I. Lichter <>:
>>> I guess Margo's real question is: If my null hypothesis is that there is no
>>> difference between a *sample* and a *population* with respect to the
>>> distributions of two or more *categorical* variables, what is the most
>>> appropriate way to test that hypothesis?
>>> Austin proposed Hotelling's T-squared, which is a global test of equality of
>>> *means* for independent *samples*. This takes care of the multiple
>>> comparisons problem, but doesn't fit Margo's needs because of level of
>>> measurement (except, possibly, if the categorical variables are dichotomous
>>> or ordinal and can be arguably treated as continuous-ish) and because it is
>>> a two-sample test instead of a single-sample test.
>>> Margo's problem is the same (I think) as the problem of comparing the
>>> characteristics of a realized survey sample against the known
>>> characteristics of the sampling frame to detect bias. This is a
>>> common-enough procedure, and frequently done using chi-square tests without
>>> adjustment (correctly or not) for multiple comparisons. These are typically
>>> done as chi-square tests of independence, but since the characteristics of
>>> the sampling frame are *not* sample data, they should really be goodness of
>>> fit tests. (Right?)
>>> I don't claim to be a real statistician, and I don't claim to have a real
>>> answer, but I think that results from multiple chi-square tests, interpreted
>>> jointly (so that, e.g., a single significant result with a relatively large
>>> p-value would not be considered strong evidence of difference), would be
>>> convincing enough for most audiences.
>>> By the way, for clarification, here is what I was suggesting with respect to
>>> sampling and recombining the sample and population data:
>>> -----
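>>> sysuse auto, clear
>>> * draw the stratified sample: 50% of domestic cars, 75% of foreign cars
>>> sample 50 if (foreign == 0)
>>> sample 75 if (foreign == 1)
>>> * pweight = 1/(sampling fraction); wt must be generated before it can be replaced
>>> gen wt = 1/.75 if (foreign == 1)
>>> replace wt = 1/.5 if (foreign == 0)
>>> gen sample = 1
>>> gen stratum = foreign
>>> tempfile sample
>>> save `sample'
>>> * append the sample to the full population, which gets its own stratum and wt = 1
>>> sysuse auto, clear
>>> append using `sample'
>>> replace wt = 1 if missing(sample)
>>> replace stratum = 2 if missing(sample)
>>> replace sample = 0 if missing(sample)
>>> svyset [pw=wt], strata(stratum)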
>>> ----
>>> Austin Nichols wrote:
>>>> Margo Schlanger<> :
>>>> I think Michael I. Lichter means for you to -append- your sample and
>>>> population in step 2 below.  Then you can run -hotelling- or the
>>>> equivalent linear discriminant model (with robust SEs) to compare
>>>> means for a bunch of variables observed in both.  I.e.
>>>> .  reg sample x* [pw=wt]
>>>> in step 2b, not tabulate, with or without svy: and chi2.
>>>> On Fri, Jul 24, 2009 at 11:24 PM, Michael I. Lichter<> wrote:
>>>>> Margo,
>>>>> 1. select your sample and save it in a new dataset, and then in the
>>>>> new dataset:
>>>>> a. define your stratum variable -stratavar- as you described
>>>>> b. define your pweight as you described, wt = 1/(sampling fraction)
>>>>> for each stratum
>>>>> 2. combine the full original dataset with the new one, but with
>>>>> stratavar = 1 for the new dataset and wt = 1 and with a new variable
>>>>> sample = 0 for the original and =1 for the sample, and then
>>>>> a. -svyset [pw=wt], strata(stratavar)-
>>>>> b. do your chi square test or whatever using svy commands, e.g.,
>>>>> -svy: tab var1 sample-
>>>>> Michael
>>>>> Margo Schlanger wrote:
>>>>>> Hi --
>>>>>> I have a dataset in which the observation is a "case".  I started with
>>>>>> a complete census of the ~4000 relevant cases; each of them gets a
>>>>>> line in my dataset.  I have data filling a few variables about each of
>>>>>> them.  (When they were filed, where they were filed, the type of
>>>>>> outcome, etc.)
>>>>>> I randomly sampled them using 3 strata (for one stratum, the sampling
>>>>>> probability was 1, for another about .5, and for a third, about .75).
>>>>>> I end up with a sample of about 2000.  I know much more about this
>>>>>> sample.
>>>>>> Ok, my question:
>>>>>> 1) How do I use the svyset command to describe this dataset?  It would
>>>>>> be easy if I just dropped all the non-sampled observations, but I
>>>>>> don't want to do that, because of question 2:
>>>>>> 2) How do I compare something about the sample to the entire
>>>>>> population, just to demonstrate that my sample isn't very different
>>>>>> from that entire population on any of the few variables I actually
>>>>>> have comprehensive data about. I could do this simply, if I didn't
>>>>>> have to worry about weighting:
>>>>>> tabulate year sample, chi2
>>>>>> But I need the weights.  In addition, I can't simply use weighting
>>>>>> commands, because in the population (when sample == 0), everything
>>>>>> should be weighted the same; the weights apply only to my sample (when
>>>>>> sample == 1).  And I can't (so far) use survey commands, because I
>>>>>> don't know the answer to (1), above.
>>>>>> NOTE: Nearly all the variables I care about are categorical:  year of
>>>>>> filing, type of case.  But it's easy enough to turn them into dummies,
>>>>>> if that's useful.
>>>>>> Thanks for any help with this.
>>>>>> Margo Schlanger