
From: Margo Schlanger <margo.schlanger@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Dealing with survey data when the entire population is also in the dataset
Date: Mon, 27 Jul 2009 19:10:29 -0400

Thanks, all -- this is very helpful.

______________________
Margo Schlanger
Professor of Law
University of Michigan Law School
Director, Civil Rights Litigation Clearinghouse (http://clearinghouse.wustl.edu)
314-255-3179

On Mon, Jul 27, 2009 at 1:53 PM, Austin Nichols <austinnichols@gmail.com> wrote:
> Margo Schlanger <margo.schlanger@gmail.com>:
> Ángel Rodríguez Laso suggests testing survey estimators' confidence
> intervals, and Michael Lichter suggests tests of goodness-of-fit or
> independence, but the purpose of the test may call for another
> interpretation. We know the sample was selected randomly, we know the
> sample means differ from the population means, etc. The purpose of
> the comparison is to see how unlucky the draw might have been, and I
> am suggesting one way to do that is with -hotelling- or equivalent,
> since that F or p value is an omnibus measure that most people know
> well, and it suffers less from overrejection as sample sizes get large.
> I suppose a goodness-of-fit test to see if the sample rejects the true
> distribution (from the population) serves a similar purpose; see also
> -mgof- on SSC. However, there is no real hypothesis being tested here
> (would we reject that the true distribution is the true distribution
> given our sample? who cares? this is not the relevant question for
> assessing bias in some subsequent analysis using only the sample), so
> any ad hoc test or summary statistic is probably OK. I suggest the linear
> discriminant regression on the grounds of expediency. One approach I
> have seen is a table of means, with stars for tests of equality of
> means/proportions, which is a variable-by-variable equivalent of
> -hotelling- (one looks down the rows of the table for small
> differences and no stars, in the ideal case, but again, this has no
> real probative value for assessing bias in some subsequent analysis).
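Austin's -hotelling-/linear-discriminant comparison can be sketched along these lines (a minimal sketch on the auto data; the sample indicator and weights here are illustrative stand-ins, not Margo's actual variables):

-----
sysuse auto, clear
gen byte insample = runiform() < .5     // stand-in sample indicator
gen wt = cond(insample, 2, 1)           // stand-in pweights
* linear discriminant regression: regress the sample indicator on the
* comparison variables; the joint test of the covariates is the
* omnibus sample-vs-population comparison
reg insample price mpg weight [pw=wt]
testparm price mpg weight
* or, unweighted, the omnibus test directly:
hotelling price mpg weight, by(insample)
-----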
>
> 2009/7/27 Ángel Rodríguez Laso <angelrlaso@gmail.com>:
>> Wouldn't it be enough to calculate the means or proportions (with their
>> corresponding confidence intervals) in the sample dataset with -svy-
>> commands and then see if the estimates of the means and proportions
>> in the census dataset fall inside the confidence intervals?
>>
>> If multiple comparisons are made, then I would widen the
>> confidence intervals with a Bonferroni adjustment:
>>
>>  1 comparison     p value = 0.05     CI 95%
>> 10 comparisons    p value = 0.005    CI 99.5%
>>
>> Best regards,
>>
>> Angel Rodriguez-Laso
>>
>> 2009/7/27 Michael I. Lichter <MLichter@buffalo.edu>:
>>> I guess Margo's real question is: if my null hypothesis is that there is no
>>> difference between a *sample* and a *population* with respect to the
>>> distributions of two or more *categorical* variables, what is the most
>>> appropriate way to test that hypothesis?
>>>
>>> Austin proposed Hotelling's T-squared, which is a global test of equality of
>>> *means* for independent *samples*. This takes care of the multiple
>>> comparisons problem, but it doesn't fit Margo's needs because of the level of
>>> measurement (except, possibly, if the categorical variables are dichotomous
>>> or ordinal and can arguably be treated as continuous-ish) and because it is
>>> a two-sample test instead of a single-sample test.
>>>
>>> Margo's problem is the same (I think) as the problem of comparing the
>>> characteristics of a realized survey sample against the known
>>> characteristics of the sampling frame to detect bias. This is a
>>> common-enough procedure, frequently done using chi-square tests without
>>> adjustment (correctly or not) for multiple comparisons. These are typically
>>> done as chi-square tests of independence, but since the characteristics of
>>> the sampling frame are *not* sample data, they should really be
>>> goodness-of-fit tests.
>>> (Right?)
>>>
>>> I don't claim to be a real statistician, and I don't claim to have a real
>>> answer, but I think that results from multiple chi-square tests, interpreted
>>> jointly (so that, e.g., a single significant result with a relatively large
>>> p-value would not be considered strong evidence of difference), would be
>>> convincing enough for most audiences.
>>>
>>> By the way, for clarification, here is what I was suggesting with respect to
>>> sampling and recombining the sample and population data:
>>>
>>> -----
>>> sysuse auto, clear
>>> sample 50 if (foreign == 0)
>>> sample 75 if (foreign == 1)
>>> gen wt = 1/.75 if (foreign == 1)
>>> replace wt = 1/.5 if (foreign == 0)
>>> gen sample = 1
>>> gen stratum = foreign
>>> tempfile sample
>>> save `sample'
>>> sysuse auto, clear
>>> append using `sample'
>>> replace wt = 1 if missing(sample)
>>> replace stratum = 2 if missing(sample)
>>> replace sample = 0 if missing(sample)
>>> svyset [pw=wt], strata(stratum)
>>> -----
>>>
>>> Austin Nichols wrote:
>>>>
>>>> Margo Schlanger <margo.schlanger@gmail.com>:
>>>> I think Michael I. Lichter means for you to -append- your sample and
>>>> population in step 2 below. Then you can run -hotelling- or the
>>>> equivalent linear discriminant model (with robust SEs) to compare
>>>> means for a bunch of variables observed in both. I.e.,
>>>> . reg sample x* [pw=wt]
>>>> in step 2b, not tabulate, with or without svy: and chi2.
>>>>
>>>> On Fri, Jul 24, 2009 at 11:24 PM, Michael I. Lichter <MLichter@buffalo.edu> wrote:
>>>>>
>>>>> Margo,
>>>>>
>>>>> 1. Select your sample and save it in a new dataset, and then in the new
>>>>> dataset:
>>>>>  a. define your stratum variable -stratavar- as you described
>>>>>  b. define your pweight as you described, wt = 1/(sampling fraction)
>>>>>     for each stratum
>>>>> 2.
>>>>> Combine the full original dataset with the new one, but with stratavar = 1
>>>>> for the new dataset and wt = 1 and with a new variable sample = 0 for the
>>>>> original and = 1 for the sample, and then
>>>>>  a. -svyset [pw=wt], strata(stratavar)-
>>>>>  b. do your chi-square test or whatever using svy commands, e.g.,
>>>>>     -svy: tab var1 sample-
>>>>>
>>>>> Michael
>>>>>
>>>>> Margo Schlanger wrote:
>>>>>>
>>>>>> Hi --
>>>>>>
>>>>>> I have a dataset in which the observation is a "case". I started with
>>>>>> a complete census of the ~4000 relevant cases; each of them gets a
>>>>>> line in my dataset. I have data filling a few variables about each of
>>>>>> them. (When they were filed, where they were filed, the type of
>>>>>> outcome, etc.)
>>>>>>
>>>>>> I randomly sampled them using 3 strata (for one stratum, the sampling
>>>>>> probability was 1, for another about .5, and for a third, about .75).
>>>>>> I end up with a sample of about 2000. I know much more about this
>>>>>> sample.
>>>>>>
>>>>>> OK, my question:
>>>>>>
>>>>>> 1) How do I use the -svyset- command to describe this dataset? It would
>>>>>> be easy if I just dropped all the non-sampled observations, but I
>>>>>> don't want to do that, because of question 2:
>>>>>>
>>>>>> 2) How do I compare something about the sample to the entire
>>>>>> population, just to demonstrate that my sample isn't very different
>>>>>> from that entire population on any of the few variables I actually
>>>>>> have comprehensive data about? I could do this simply, if I didn't
>>>>>> have to worry about weighting:
>>>>>>
>>>>>> tabulate year sample, chi2
>>>>>>
>>>>>> But I need the weights. In addition, I can't simply use weighting
>>>>>> commands, because in the population (when sample == 0), everything
>>>>>> should be weighted the same; the weights apply only to my sample (when
>>>>>> sample == 1). And I can't (so far) use survey commands, because I
>>>>>> don't know the answer to (1), above.
>>>>>>
>>>>>> NOTE: Nearly all the variables I care about are categorical: year of
>>>>>> filing, type of case. But it's easy enough to turn them into dummies,
>>>>>> if that's useful.
>>>>>>
>>>>>> Thanks for any help with this.
>>>>>>
>>>>>> Margo Schlanger

*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
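Putting the thread's advice together, the end-to-end comparison Margo asked about might run roughly as follows (a sketch using the auto data in place of the case census; the sampling fractions follow Michael's example, the final -svy: tab- follows his step 2b, and the 99.5% level follows Ángel's ten-comparison Bonferroni adjustment):

-----
* build the sample (two strata, fractions .5 and .75) from the "census"
sysuse auto, clear
sample 50 if foreign == 0
sample 75 if foreign == 1
gen wt = cond(foreign, 1/.75, 1/.5)   // pweight = 1/(sampling fraction)
gen byte sample = 1
gen stratum = foreign
tempfile s
save `s'

* append the full census as its own stratum with weight 1
sysuse auto, clear
append using `s'
replace wt = 1 if missing(sample)
replace stratum = 2 if missing(sample)
replace sample = 0 if missing(sample)

* declare the design, then compare sample vs. census
svyset [pw=wt], strata(stratum)
svy: tab rep78 sample
* or per-variable means with Bonferroni-widened 99.5% CIs
svy: mean price mpg weight, over(sample) level(99.5)
-----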

References:
* st: Dealing with survey data when the entire population is also in the dataset
  From: Margo Schlanger <margo.schlanger@gmail.com>
* Re: st: Dealing with survey data when the entire population is also in the dataset
  From: "Michael I. Lichter" <MLichter@Buffalo.EDU>
* Re: st: Dealing with survey data when the entire population is also in the dataset
  From: Austin Nichols <austinnichols@gmail.com>
* Re: st: Dealing with survey data when the entire population is also in the dataset
  From: "Michael I. Lichter" <MLichter@Buffalo.EDU>
* Re: st: Dealing with survey data when the entire population is also in the dataset
  From: Ángel Rodríguez Laso <angelrlaso@gmail.com>
* Re: st: Dealing with survey data when the entire population is also in the dataset
  From: Austin Nichols <austinnichols@gmail.com>

