Statalist



Re: st: Dealing with survey data when the entire population is also in the dataset


From   sjsamuels@gmail.com
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Dealing with survey data when the entire population is also in the dataset
Date   Mon, 27 Jul 2009 21:44:39 -0400

I agree with Austin, but have a different perspective.  Rather than a
discriminant analysis, it's probably easier to do the omnibus
Hotelling test by regressing the sample indicator (1 =in, 0 = out) on
the variables and referring to the model F statistic.

I suggest using -svy: reg- for this purpose, because it can
accommodate strata and will use design-based inference. This is _not_
a survey analysis, where the object is to estimate population
quantities or test hypotheses about them. The weights are all one,
because each sample and non-sample case  represents itself.

The null hypothesis is that randomization worked, within strata; the
alternative hypothesis is that randomization failed; the metric is a
difference between means in the sampled and unsampled groups.  This
would be ideally assessed by a randomization test such as the
user-written -tsrtest-  ("search randomization test, all") but this
won't accommodate the separate random selection in two strata.
Happily, the design-based survey analysis accomplishes the same thing.
Margo hasn't told us her sampling design, but if sampling was
systematic, I would ignore that fact in the analysis.  (A full
analysis would require assigning each case to its systematic sample,
and designating the sample as a cluster variable.)

Within a stratum, the difference in means between sample and population
is equal to the difference in means between the sampled and
non-sampled cases multiplied by (1 - f), where f is the sampling
fraction in the stratum.
This is also true for the overall difference between sample and
population, where the difference would be multiplied by 1 - f*, where
f* is the overall sampling fraction.
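
The (1 - f) scaling follows from writing the population mean as a
weighted average of the sampled and non-sampled means. A short
derivation, where the symbols \bar{y}_s (sampled mean), \bar{y}_{ns}
(non-sampled mean) and \bar{Y} (population mean) are notation
introduced here for illustration and do not come from the thread:

```latex
\bar{Y} = f\,\bar{y}_s + (1-f)\,\bar{y}_{ns}
\quad\Longrightarrow\quad
\bar{y}_s - \bar{Y} = (1-f)\,\bigl(\bar{y}_s - \bar{y}_{ns}\bigr)
```

For example, with f = 0.5, a sampled mean of 10 and a non-sampled mean
of 12, the population mean is 11, and the sample-population difference
is -1 = 0.5 x (-2).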

So, to do the omnibus or t-test in Stata, I would

svyset _n [pweight=1], strata(stratum)

The omission of the fpc is deliberate.
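
A minimal sketch of how the omnibus test might be run end to end. It
is an illustration under stated assumptions, not Margo's actual setup:
the auto data stand in for the census, the names stratum, insample and
one are hypothetical, and the sample is drawn by independent coin
flips rather than as a fixed-size draw. It also uses a generated
unit-weight variable, on the assumption that -svyset- wants a variable
name rather than a constant in the weight expression.

```stata
* Sketch only: the auto data stand in for the census, and the
* variable names (stratum, insample, one) are hypothetical.
sysuse auto, clear
set seed 2009

* Two strata with sampling probabilities of about .5 and .75
generate byte stratum = foreign
generate double u = runiform()
generate byte insample = cond(stratum == 1, u < 0.75, u < 0.50)

* Unit weights: every case, sampled or not, represents itself
generate byte one = 1

* Observations are the PSUs; no fpc() is declared, deliberately
svyset _n [pweight = one], strata(stratum)

* The model F statistic in the output header is the omnibus test
svy: regress insample price mpg weight
```

The F statistic and its p-value reported above the coefficient table
are the quantities to read off; the individual coefficients are of
secondary interest here.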

For the omnibus test, no scaling is necessary.   If  variables are
analyzed one at a time,  the sample-non-sample differences in means
can be converted to sample-population differences by multiplying them
by (1 - f*).  With this approach, the stratum with f = 1 will drop out
of the comparison, I believe, because it contains only sampled
observations.  If it doesn't drop out, it can be deleted before the
analysis.
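
A sketch of the one-variable-at-a-time version, again under
assumptions: the auto data stand in for the census, the variable and
scalar names (insample, one, fstar, s_diff, and so on) are
hypothetical, and mpg is just an example variable. Regressing the
variable on the sample indicator gives the sampled minus non-sampled
difference in means, which is then rescaled by (1 - f*).

```stata
* Sketch only: hypothetical census-plus-sample-flag setup.
sysuse auto, clear
set seed 2009
generate byte stratum = foreign
generate double u = runiform()
generate byte insample = cond(stratum == 1, u < 0.75, u < 0.50)
generate byte one = 1
svyset _n [pweight = one], strata(stratum)

* Overall sampling fraction f*
quietly summarize insample
scalar fstar = r(mean)

* Coefficient on insample = sampled mean minus non-sampled mean of mpg
svy: regress mpg insample

* Rescale the estimate and its 95% limits to sample minus population
scalar s_diff = _b[insample] * (1 - fstar)
scalar s_half = invttail(e(df_r), 0.025) * _se[insample] * (1 - fstar)
scalar s_lo   = s_diff - s_half
scalar s_hi   = s_diff + s_half
display "sample - population difference: " s_diff
display "95% CI: " s_lo " to " s_hi
```

Because (1 - f*) is a known constant, the confidence limits scale by
the same factor as the point estimate.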

Confidence intervals of individual differences are apt to be more
informative than tests. As Austin states, there are no strong
hypotheses here. A big difference in means could be controlled by a
ratio or regression adjustment. But given the large overall sampling
fraction, I doubt that any large differences will appear.

-Steve

On Mon, Jul 27, 2009 at 1:53 PM, Austin Nichols<austinnichols@gmail.com> wrote:
> Margo Schlanger<margo.schlanger@gmail.com> :
> Ángel Rodríguez Laso suggests testing survey estimators' confidence
> intervals, and Michael Lichter suggests tests of goodness-of-fit or
> independence, but the purpose of the test may ask for another
> interpretation.  We know the sample was selected randomly, we know the
> sample means differ from the population means, etc.  The purpose of
> the comparison is to see how unlucky the draw might have been, and I
> am suggesting one way to do that is with -hotelling- or equivalent,
> since that F or p value is an omnibus measure that most people know
> well, and suffers less from overrejection as sample sizes get large.
> I suppose a goodness-of-fit test to see if the sample rejects the true
> distribution (from the population) serves a similar purpose; see also
> -mgof- on SSC.  However, there is no real hypothesis being tested here
> (would we reject that the true distribution is the true distribution
> given our sample?  who cares?  this is not the relevant question for
> assessing bias in some subsequent analysis using only the sample) so
> any ad hoc test or summary stat is probably OK.  I suggest the linear
> discriminant regression on the grounds of expediency.  One approach I
> have seen is a table of means, with stars for tests of equality of
> means/proportions, which is a variable-by-variable equivalent of
> -hotelling- (one looks down the rows of the table looking for small
> differences and no stars, in the ideal case, but again, this has no
> real probative value for assessing bias in some subsequent analysis).
>
> 2009/7/27 Ángel Rodríguez Laso <angelrlaso@gmail.com>:
>> Wouldn't it be enough to calculate the means or proportions (with their
>> corresponding confidence intervals) in the sample dataset with -svy-
>> commands and then see if the estimators of the means and proportions
>> in the census dataset are included in the confidence intervals?
>>
>> If multiple comparisons are asked, then I would correct the width of
>> the confidence intervals by the Bonferroni adjustment:
>>
>> Comparisons    p value    CI
>> 1              0.05       95%
>> 10             0.005      99.5%
>>
>> Best regards,
>>
>> Angel Rodriguez-Laso
>>
>> 2009/7/27 Michael I. Lichter <MLichter@buffalo.edu>:
>>> I guess Margo's real question is: If my null hypothesis is that there is no
>>> difference between a *sample* and a *population* with respect to the
>>> distributions of two or more *categorical* variables, what is the most
>>> appropriate way to test that hypothesis?
>>>
>>> Austin proposed Hotelling's t-square, which is a global test of equality of
>>> *means* for independent *samples*. This takes care of the multiple
>>> comparisons problem, but doesn't fit Margo's needs because of level of
>>> measurement (except, possibly, if the categorical variables are dichotomous
>>> or ordinal and can be arguably treated as continuous-ish) and because it is
>>> a two-sample test instead of a single-sample test.
>>>
>>> Margo's problem is the same (I think) as the problem of comparing the
>>> characteristics of a realized survey sample against the known
>>> characteristics of the sampling frame to detect bias. This is a
>>> common-enough procedure, and frequently done using chi-square tests without
>>> adjustment (correctly or not) for multiple comparisons. These are typically
>>> done as chi-square tests of independence, but since the characteristics of
>>> the sampling frame are *not* sample data, they should really be goodness of
>>> fit tests. (Right?)
>>>
>>> I don't claim to be a real statistician, and I don't claim to have a real
>>> answer, but I think that results from multiple chi-square tests, interpreted
>>> jointly (so that, e.g., a single significant result with a relatively large
>>> p-value would not be considered strong evidence of difference), would be
>>> convincing enough for most audiences.
>>>
>>> By the way, for clarification, here is what I was suggesting with respect to
>>> sampling and recombining the sample and population data:
>>>
>>> -----
>>> sysuse auto,clear
>>> sample 50 if (foreign == 0)
>>> sample 75 if (foreign == 1)
>>> * wt does not exist yet, so create it with -gen- before any -replace-
>>> gen wt = 1/.75 if (foreign == 1)
>>> replace wt = 1/.5 if (foreign == 0)
>>> gen sample = 1
>>> gen stratum = foreign
>>> tempfile sample
>>> save `sample'
>>> sysuse auto,clear
>>> append using `sample'
>>> replace wt = 1 if missing(sample)
>>> replace stratum = 2 if missing(sample)
>>> replace sample = 0 if missing(sample)
>>> svyset [pw=wt], strata(stratum)
>>> ----
>>>
>>>
>>> Austin Nichols wrote:
>>>>
>>>> Margo Schlanger<margo.schlanger@gmail.com> :
>>>> I think Michael I. Lichter means for you to -append- your sample and
>>>> population in step 2 below.  Then you can run -hotelling- or the
>>>> equivalent linear discriminant model (with robust SEs) to compare
>>>> means for a bunch of variables observed in both.  I.e.
>>>> .  reg sample x* [pw=wt]
>>>> in step 2b, not tabulate, with or without svy: and chi2.
>>>>
>>>> On Fri, Jul 24, 2009 at 11:24 PM, Michael I.
>>>> Lichter<MLichter@buffalo.edu> wrote:
>>>>
>>>>>
>>>>> Margo,
>>>>>
>>>>> 1. select your sample and save it in a new dataset, and then in the new
>>>>> dataset:
>>>>> a. define your stratum variable -stratavar- as you described
>>>>> b. define your pweight as you described, wt = 1/(sampling fraction) for
>>>>> each stratum
>>>>> 2. combine the full original dataset with the new one, but with
>>>>> stratavar = 1 for the new dataset and wt = 1 and with a new variable
>>>>> sample = 0 for the original and =1 for the sample, and then
>>>>> a. -svyset [pw=wt], strata(stratavar)-
>>>>> b. do your chi square test or whatever using svy commands, e.g.,
>>>>> -svy: tab var1 sample-
>>>>>
>>>>> Michael
>>>>>
>>>>> Margo Schlanger wrote:
>>>>>
>>>>>>
>>>>>> Hi --
>>>>>>
>>>>>> I have a dataset in which the observation is a "case".  I started with
>>>>>> a complete census of the ~4000 relevant cases; each of them gets a
>>>>>> line in my dataset.  I have data filling a few variables about each of
>>>>>> them.  (When they were filed, where they were filed, the type of
>>>>>> outcome, etc.)
>>>>>>
>>>>>> I randomly sampled them using 3 strata (for one stratum, the sampling
>>>>>> probability was 1, for another about .5, and for a third, about .75).
>>>>>> I end up with a sample of about 2000.  I know much more about this
>>>>>> sample.
>>>>>>
>>>>>> Ok, my question:
>>>>>>
>>>>>> 1) How do I use the svyset command to describe this dataset?  It would
>>>>>> be easy if I just dropped all the non-sampled observations, but I
>>>>>> don't want to do that, because of question 2:
>>>>>>
>>>>>> 2) How do I compare something about the sample to the entire
>>>>>> population, just to demonstrate that my sample isn't very different
>>>>>> from that entire population on any of the few variables I actually
>>>>>> have comprehensive data about. I could do this simply, if I didn't
>>>>>> have to worry about weighting:
>>>>>>
>>>>>> tabulate year sample, chi2
>>>>>>
>>>>>> But I need the weights.  In addition, I can't simply use weighting
>>>>>> commands, because in the population (when sample == 0), everything
>>>>>> should be weighted the same; the weights apply only to my sample (when
>>>>>> sample == 1).  And I can't (so far) use survey commands, because I
>>>>>> don't know the answer to (1), above.
>>>>>>
>>>>>> NOTE: Nearly all the variables I care about are categorical:  year of
>>>>>> filing, type of case.  But it's easy enough to turn them into dummies,
>>>>>> if that's useful.

-- 
Steven Samuels
sjsamuels@gmail.com
18 Cantine's Island
Saugerties NY 12477
USA
845-246-0774
