# Re: st: Dealing with survey data when the entire population is also in the dataset

From: Austin Nichols
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Dealing with survey data when the entire population is also in the dataset
Date: Mon, 27 Jul 2009 13:53:24 -0400

```
Margo Schlanger <margo.schlanger@gmail.com>:
Ángel Rodríguez Laso suggests testing survey estimators' confidence
intervals, and Michael Lichter suggests tests of goodness-of-fit or
independence, but the purpose of the test may ask for another
interpretation.  We know the sample was selected randomly, we know the
sample means differ from the population means, etc.  The purpose of
the comparison is to see how unlucky the draw might have been, and I
am suggesting one way to do that is with -hotelling- or equivalent,
since that F or p value is an omnibus measure that most people know
well, and suffers less from overrejection as sample sizes get large.
I suppose a goodness-of-fit test to see if the sample rejects the true
distribution is another option, e.g. -mgof- on SSC.  However, there is
no real hypothesis being tested here
(would we reject that the true distribution is the true distribution
given our sample?  who cares?  this is not the relevant question for
assessing bias in some subsequent analysis using only the sample) so
any ad hoc test or summary stat is probably OK.  I suggest the linear
discriminant regression on the grounds of expediency.  One approach I
have seen is a table of means, with stars for tests of equality of
means/proportions, which is a variable-by-variable equivalent of
-hotelling- (one looks down the rows of the table looking for small
differences and no stars, in the ideal case, but again, this has no
real probative value for assessing bias in some subsequent analysis).

2009/7/27 Ángel Rodríguez Laso <angelrlaso@gmail.com>:
> Wouldn't it be enough to calculate the means or proportions (with their
> corresponding confidence intervals) in the sample dataset with -svy-
> commands and then see if the estimators of the means and proportions
> in the census dataset are included in the confidence intervals?
>
> If multiple comparisons are asked, then I would correct the width of
> the confidence intervals by the Bonferroni adjustment:
>
> 1 comparison--------------------p value=0.05---------------------CI 95%
> 10 comparisons-----------------p value=0.005-------------------CI 99.5%
>
> Best regards,
>
> Angel Rodriguez-Laso
>
> 2009/7/27 Michael I. Lichter <MLichter@buffalo.edu>:
>> I guess Margo's real question is: If my null hypothesis is that there is no
>> difference between a *sample* and a *population* with respect to the
>> distributions of two or more *categorical* variables, what is the most
>> appropriate way to test that hypothesis?
>>
>> Austin proposed Hotelling's t-square, which is a global test of equality of
>> *means* for independent *samples*. This takes care of the multiple
>> comparisons problem, but doesn't fit Margo's needs because of level of
>> measurement (except, possibly, if the categorical variables are dichotomous
>> or ordinal and can be arguably treated as continuous-ish) and because it is
>> a two-sample test instead of a single-sample test.
>>
>> Margo's problem is the same (I think) as the problem of comparing the
>> characteristics of a realized survey sample against the known
>> characteristics of the sampling frame to detect bias. This is a
>> common-enough procedure, and frequently done using chi-square tests without
>> adjustment (correctly or not) for multiple comparisons. These are typically
>> done as chi-square tests of independence, but since the characteristics of
>> the sampling frame are *not* sample data, they should really be goodness of
>> fit tests. (Right?)
>>
>> I don't claim to be a real statistician, and I don't claim to have a real
>> answer, but I think that results from multiple chi-square tests, interpreted
>> jointly (so that, e.g., a single significant result with a relatively large
>> p-value would not be considered strong evidence of difference), would be
>> convincing enough for most audiences.
>>
>> By the way, for clarification, here is what I was suggesting with respect to
>> sampling and recombining the sample and population data:
>>
>> -----
>> sysuse auto, clear
>> * draw the stratified sample: 50% of domestic, 75% of foreign cars
>> sample 50 if (foreign == 0)
>> sample 75 if (foreign == 1)
>> * pweight = 1/(sampling fraction); wt must be created before -replace-
>> gen wt = 1/.75 if (foreign == 1)
>> replace wt = 1/.5 if (foreign == 0)
>> gen sample = 1
>> gen stratum = foreign
>> tempfile sample
>> save `sample'
>> * stack the sample on top of the full population
>> sysuse auto, clear
>> append using `sample'
>> replace wt = 1 if missing(sample)
>> replace stratum = 2 if missing(sample)
>> replace sample = 0 if missing(sample)
>> svyset [pw=wt], strata(stratum)
>> ----
>>
>>
>> Austin Nichols wrote:
>>>
>>> Margo Schlanger<margo.schlanger@gmail.com> :
>>> I think Michael I. Lichter means for you to -append- your sample and
>>> population in step 2 below.  Then you can run -hotelling- or the
>>> equivalent linear discriminant model (with robust SEs) to compare
>>> means for a bunch of variables observed in both.  I.e.
>>> .  reg sample x* [pw=wt]
>>> in step 2b, not tabulate, with or without svy: and chi2.
>>>
>>> On Fri, Jul 24, 2009 at 11:24 PM, Michael I.
>>> Lichter<MLichter@buffalo.edu> wrote:
>>>
>>>>
>>>> Margo,
>>>>
>>>> 1. select your sample and save it in a new dataset, and then in the new
>>>> dataset:
>>>> a. define your stratum variable -stratavar- as you described
>>>> b. define your pweight as you described, wt = 1/(sampling fraction) for
>>>> each
>>>> stratum
>>>> 2. combine the full original dataset with the new one, but with stratavar
>>>> =
>>>> 1 for the new dataset and wt = 1 and with a new variable sample = 0 for
>>>> the
>>>> original and =1 for the sample, and then
>>>> a. -svyset [pw=wt], strata(stratavar)-
>>>> b. do your chi square test or whatever using svy commands, e.g., -svy:
>>>> tab
>>>> var1 sample-
>>>>
>>>> Michael
>>>>
>>>> Margo Schlanger wrote:
>>>>
>>>>>
>>>>> Hi --
>>>>>
>>>>> I have a dataset in which the observation is a "case".  I started with
>>>>> a complete census of the ~4000 relevant cases; each of them gets a
>>>>> line in my dataset.  I have data filling a few variables about each of
>>>>> them.  (When they were filed, where they were filed, the type of
>>>>> outcome, etc.)
>>>>>
>>>>> I randomly sampled them using 3 strata (for one stratum, the sampling
>>>>> probability was 1, for another about .5, and for a third, about .75).
>>>>> sample.
>>>>>
>>>>> Ok, my question:
>>>>>
>>>>> 1) How do I use the svyset command to describe this dataset?  It would
>>>>> be easy if I just dropped all the non-sampled observations, but I
>>>>> don't want to do that, because of question 2:
>>>>>
>>>>> 2) How do I compare something about the sample to the entire
>>>>> population, just to demonstrate that my sample isn't very different
>>>>> from that entire population on any of the few variables I actually
>>>>> have comprehensive data about. I could do this simply, if I didn't
>>>>> have to worry about weighting:
>>>>>
>>>>> tabulate year sample, chi2
>>>>>
>>>>> But I need the weights.  In addition, I can't simply use weighting
>>>>> commands, because in the population (when sample == 0), everything
>>>>> should be weighted the same; the weights apply only to my sample (when
>>>>> sample == 1).  And I can't (so far) use survey commands, because I
>>>>> don't know the answer to (1), above.
>>>>>
>>>>> NOTE: Nearly all the variables I care about are categorical:  year of
>>>>> filing, type of case.  But it's easy enough to turn them into dummies,
>>>>> if that's useful.
>>>>>
>>>>>
>>>>> Thanks for any help with this.
>>>>>
>>>>> Margo Schlanger

```
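
The approach discussed in the thread (append the weighted sample to the full population, then regress the sample indicator on the covariates) can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the auto data as a stand-in for the real case data, the seed, the tempfile name `smp`, and the choice of covariates are arbitrary and not from the thread, and because the sampled cases also appear in the population rows, the resulting F statistic is best read as an ad hoc summary, not a formal test.

```stata
* Sketch only: auto data stands in for the real census of cases
clear all
set seed 12345                       // arbitrary seed (assumption), for reproducibility
sysuse auto, clear

* Stratified draw: about 50% of domestic and 75% of foreign cars
sample 50 if foreign == 0            // observations not meeting -if- are kept
sample 75 if foreign == 1

* pweight = 1/(sampling fraction) within each stratum
gen double wt = cond(foreign == 1, 1/.75, 1/.5)
gen byte sample = 1
tempfile smp                         // hypothetical tempfile name
save `smp'

* Stack the sample on top of the full population (weight 1, sample = 0)
sysuse auto, clear
append using `smp'
replace wt = 1 if missing(sample)
replace sample = 0 if missing(sample)

* Linear discriminant regression; pweights imply robust standard errors
regress sample price mpg weight foreign [pw=wt]

* Omnibus F test that all slopes are jointly zero
testparm price mpg weight foreign
```

A large p-value from `testparm` would suggest the weighted sample means are close to the population means on these variables; for categorical variables one would first expand them into dummies, as mentioned in the original question.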