Re: st: Dealing with survey data when the entire population is also in the dataset

From	Ángel Rodríguez Laso <[email protected]>
To	[email protected]
Subject	Re: st: Dealing with survey data when the entire population is also in the dataset
Date	Mon, 27 Jul 2009 10:49:55 +0200

Wouldn't it be enough to calculate the means or proportions (with their
corresponding confidence intervals) in the sample dataset with -svy-
commands and then check whether the corresponding means and proportions
in the census dataset fall within those confidence intervals?
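
A minimal sketch of that check, assuming a do-file, Stata 12 or later
(for r(table)), and auto.dta standing in for the census. The 50%
sampling fraction, the seed, and the names census_mean, ci_lo and ci_hi
are illustrative assumptions, not anything from Margo's data:

```stata
* Illustrative only: auto.dta plays the role of the census.
sysuse auto, clear
quietly summarize mpg
scalar census_mean = r(mean)

* Draw a 50% simple random sample; pweight = 1/(sampling fraction).
set seed 20090727
sample 50
generate wt = 1/0.5
svyset [pw=wt]

* Design-based mean with its default 95% confidence interval.
svy: mean mpg
matrix T = r(table)
scalar ci_lo = el(T, rownumb(T, "ll"), 1)
scalar ci_hi = el(T, rownumb(T, "ul"), 1)

* Does the census value fall inside the sample-based interval?
display "census mean = " census_mean
display "95% CI      = [" ci_lo ", " ci_hi "]"
display cond(inrange(census_mean, ci_lo, ci_hi), "inside CI", "outside CI")
```

With a stratified design the same steps would apply after adding
strata() to -svyset-.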

If multiple comparisons are required, then I would adjust the width of
the confidence intervals with the Bonferroni correction:

Comparisons    Significance level per comparison    CI level
1              0.05                                 95%
10             0.005                                99.5%
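
The adjusted levels can be computed rather than looked up. This is a
sketch under assumptions: it is meant to be run as a do-file, auto.dta
with a constant weight of 1 is only a placeholder for a real
-svyset- dataset, and mpg and price are placeholder variables:

```stata
* Bonferroni: per-comparison alpha = 0.05/k, CI level = 100*(1 - 0.05/k).
local alpha = 0.05
foreach k in 1 10 {
    local lvl : display %5.2f 100*(1 - `alpha'/`k')
    display "k = `k': alpha per comparison = " `alpha'/`k' ", CI level = `lvl'%"
}

* After the loop `lvl' holds the level for k = 10 (99.50).
* Placeholder data: every observation gets weight 1.
sysuse auto, clear
generate wt = 1
svyset [pw=wt]
svy, level(`lvl'): mean mpg price
```

Stata accepts level() values between 10 and 99.99, so very large
numbers of comparisons would need a different approach.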

Best regards,

Angel Rodriguez-Laso

2009/7/27 Michael I. Lichter <[email protected]>:
> I guess Margo's real question is: If my null hypothesis is that there is no
> difference between a *sample* and a *population* with respect to the
> distributions of two or more *categorical* variables, what is the most
> appropriate way to test that hypothesis?
>
> Austin proposed Hotelling's t-square, which is a global test of equality of
> *means* for independent *samples*. This takes care of the multiple
> comparisons problem, but doesn't fit Margo's needs because of level of
> measurement (except, possibly, if the categorical variables are dichotomous
> or ordinal and can be arguably treated as continuous-ish) and because it is
> a two-sample test instead of a single-sample test.
>
> Margo's problem is the same (I think) as the problem of comparing the
> characteristics of a realized survey sample against the known
> characteristics of the sampling frame to detect bias. This is a
> common-enough procedure, and frequently done using chi-square tests without
> adjustment (correctly or not) for multiple comparisons. These are typically
> done as chi-square tests of independence, but since the characteristics of
> the sampling frame are *not* sample data, they should really be goodness of
> fit tests. (Right?)
>
> I don't claim to be a real statistician, and I don't claim to have a real
> answer, but I think that results from multiple chi-square tests, interpreted
> jointly (so that, e.g., a single significant result with a relatively large
> p-value would not be considered strong evidence of difference), would be
> convincing enough for most audiences.
>
> By the way, for clarification, here is what I was suggesting with respect
> to sampling and recombining the sample and population data:
>
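> -----
> sysuse auto, clear
> * draw 50% of domestic and 75% of foreign cars; -sample- keeps
> * all observations that fail the -if- condition
> sample 50 if (foreign == 0)
> sample 75 if (foreign == 1)
> * pweight = 1/(sampling fraction); wt must exist before -replace-
> gen wt = 1/.75 if (foreign == 1)
> replace wt = 1/.5 if (foreign == 0)
> gen sample = 1
> gen stratum = foreign
> tempfile sample
> save `sample'
> * append the sample to the full population, which becomes stratum 2
> sysuse auto, clear
> append using `sample'
> replace wt = 1 if missing(sample)
> replace stratum = 2 if missing(sample)
> replace sample = 0 if missing(sample)
> svyset [pw=wt], strata(stratum)
> -----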
>
>
> Austin Nichols wrote:
>>
>> Margo Schlanger<[email protected]> :
>> I think Michael I. Lichter means for you to -append- your sample and
>> population in step 2 below.  Then you can run -hotelling- or the
>> equivalent linear discriminant model (with robust SEs) to compare
>> means for a bunch of variables observed in both.  I.e.
>> .  reg sample x* [pw=wt]
>> in step 2b, not tabulate, with or without svy: and chi2.
>>
>> On Fri, Jul 24, 2009 at 11:24 PM, Michael I.
>> Lichter<[email protected]> wrote:
>>
>>>
>>> Margo,
>>>
>>> 1. select your sample and save it in a new dataset, and then in the new
>>> dataset:
>>> a. define your stratum variable -stratavar- as you described
>>> b. define your pweight as you described, wt = 1/(sampling fraction)
>>> for each stratum
>>> 2. combine the full original dataset with the new one, but with
>>> stratavar = 1 for the new dataset and wt = 1 and with a new variable
>>> sample = 0 for the original and =1 for the sample, and then
>>> a. -svyset [pw=wt], strata(stratavar)-
>>> b. do your chi square test or whatever using svy commands, e.g.,
>>> -svy: tab var1 sample-
>>>
>>> Michael
>>>
>>> Margo Schlanger wrote:
>>>
>>>>
>>>> Hi --
>>>>
>>>> I have a dataset in which the observation is a "case".  I started with
>>>> a complete census of the ~4000 relevant cases; each of them gets a
>>>> line in my dataset.  I have data filling a few variables about each of
>>>> them.  (When they were filed, where they were filed, the type of
>>>> outcome, etc.)
>>>>
>>>> I randomly sampled them using 3 strata (for one stratum, the sampling
>>>> probability was 1, for another about .5, and for a third, about .75).
>>>> I end up with a sample of about 2000.  I know much more about this
>>>> sample.
>>>>
>>>> Ok, my question:
>>>>
>>>> 1) How do I use the svyset command to describe this dataset?  It would
>>>> be easy if I just dropped all the non-sampled observations, but I
>>>> don't want to do that, because of question 2:
>>>>
>>>> 2) How do I compare something about the sample to the entire
>>>> population, just to demonstrate that my sample isn't very different
>>>> from that entire population on any of the few variables I actually
>>>> have comprehensive data about. I could do this simply, if I didn't
>>>> have to worry about weighting:
>>>>
>>>> tabulate year sample, chi2
>>>>
>>>> But I need the weights.  In addition, I can't simply use weighting
>>>> commands, because in the population (when sample == 0), everything
>>>> should be weighted the same; the weights apply only to my sample (when
>>>> sample == 1).  And I can't (so far) use survey commands, because I
>>>> don't know the answer to (1), above.
>>>>
>>>> NOTE: Nearly all the variables I care about are categorical:  year of
>>>> filing, type of case.  But it's easy enough to turn them into dummies,
>>>> if that's useful.
>>>>
>>>>
>>>> Thanks for any help with this.
>>>>
>>>> Margo Schlanger
>>>>
>>>>
>>
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/statalist/faq
>> *   http://www.ats.ucla.edu/stat/stata/
>>
>
> --
> Michael I. Lichter, Ph.D. <[email protected]>
> Research Assistant Professor & NRSA Fellow
> UB Department of Family Medicine / Primary Care Research Institute
> UB Clinical Center, 462 Grider Street, Buffalo, NY 14215
> Office: CC 126 / Phone: 716-898-4751 / FAX: 716-898-3536
>

