
From: Ángel Rodríguez Laso <angelrlaso@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Dealing with survey data when the entire population is also in the dataset
Date: Mon, 27 Jul 2009 10:49:55 +0200

Wouldn't it be enough to calculate the means or proportions (with their corresponding confidence intervals) in the sample dataset with -svy- commands, and then check whether the means and proportions from the census dataset fall inside those confidence intervals?

If multiple comparisons are required, I would widen the confidence intervals using the Bonferroni adjustment:

Comparisons    p value    CI
1              0.05       95%
10             0.005      99.5%

Best regards,

Angel Rodriguez-Laso

2009/7/27 Michael I. Lichter <MLichter@buffalo.edu>:
> I guess Margo's real question is: If my null hypothesis is that there is no difference between a *sample* and a *population* with respect to the distributions of two or more *categorical* variables, what is the most appropriate way to test that hypothesis?
>
> Austin proposed Hotelling's t-square, which is a global test of equality of *means* for independent *samples*. This takes care of the multiple comparisons problem, but doesn't fit Margo's needs because of level of measurement (except, possibly, if the categorical variables are dichotomous or ordinal and can arguably be treated as continuous-ish) and because it is a two-sample test instead of a single-sample test.
>
> Margo's problem is the same (I think) as the problem of comparing the characteristics of a realized survey sample against the known characteristics of the sampling frame to detect bias. This is a common-enough procedure, and frequently done using chi-square tests without adjustment (correctly or not) for multiple comparisons. These are typically done as chi-square tests of independence, but since the characteristics of the sampling frame are *not* sample data, they should really be goodness of fit tests. (Right?)
>
> I don't claim to be a real statistician, and I don't claim to have a real answer, but I think that results from multiple chi-square tests, interpreted jointly (so that, e.g., a single significant result with a relatively large p-value would not be considered strong evidence of difference), would be convincing enough for most audiences.
>
> By the way, for clarification, here is what I was suggesting with respect to sampling and recombining the sample and population data:
>
> -----
> sysuse auto,clear
> * draw the stratified sample: 50% of domestic cars, 75% of foreign cars
> sample 50 if (foreign == 0)
> sample 75 if (foreign == 1)
> * wt must be created with -gen- before it can be -replace-d
> gen wt = 1/.75 if (foreign == 1)
> replace wt = 1/.5 if (foreign == 0)
> gen sample = 1
> gen stratum = foreign
> tempfile sample
> save `sample'
> * reload the full population and stack the sample onto it
> sysuse auto,clear
> append using `sample'
> * population rows get weight 1 and their own stratum
> replace wt = 1 if missing(sample)
> replace stratum = 2 if missing(sample)
> replace sample = 0 if missing(sample)
> svyset [pw=wt], strata(stratum)
> ----
>
>
> Austin Nichols wrote:
>>
>> Margo Schlanger<margo.schlanger@gmail.com> :
>> I think Michael I. Lichter means for you to -append- your sample and population in step 2 below. Then you can run -hotelling- or the equivalent linear discriminant model (with robust SEs) to compare means for a bunch of variables observed in both. I.e.
>> . reg sample x* [pw=wt]
>> in step 2b, not tabulate, with or without svy: and chi2.
>>
>> On Fri, Jul 24, 2009 at 11:24 PM, Michael I. Lichter<MLichter@buffalo.edu> wrote:
>>
>>>
>>> Margo,
>>>
>>> 1. select your sample and save it in a new dataset, and then in the new dataset:
>>>    a. define your stratum variable -stratavar- as you described
>>>    b. define your pweight as you described, wt = 1/(sampling fraction) for each stratum
>>> 2. combine the full original dataset with the new one, but with stratavar = 1 for the new dataset and wt = 1 and with a new variable sample = 0 for the original and =1 for the sample, and then
>>>    a. -svyset [pw=wt], strata(stratavar)-
>>>    b. do your chi square test or whatever using svy commands, e.g., -svy: tab var1 sample-
>>>
>>> Michael
>>>
>>> Margo Schlanger wrote:
>>>
>>>>
>>>> Hi --
>>>>
>>>> I have a dataset in which the observation is a "case". I started with a complete census of the ~4000 relevant cases; each of them gets a line in my dataset. I have data filling a few variables about each of them. (When they were filed, where they were filed, the type of outcome, etc.)
>>>>
>>>> I randomly sampled them using 3 strata (for one stratum, the sampling probability was 1, for another about .5, and for a third, about .75). I end up with a sample of about 2000. I know much more about this sample.
>>>>
>>>> Ok, my question:
>>>>
>>>> 1) How do I use the svyset command to describe this dataset? It would be easy if I just dropped all the non-sampled observations, but I don't want to do that, because of question 2:
>>>>
>>>> 2) How do I compare something about the sample to the entire population, just to demonstrate that my sample isn't very different from that entire population on any of the few variables I actually have comprehensive data about? I could do this simply, if I didn't have to worry about weighting:
>>>>
>>>> tabulate year sample, chi2
>>>>
>>>> But I need the weights. In addition, I can't simply use weighting commands, because in the population (when sample == 0), everything should be weighted the same; the weights apply only to my sample (when sample == 1). And I can't (so far) use survey commands, because I don't know the answer to (1), above.
>>>>
>>>> NOTE: Nearly all the variables I care about are categorical: year of filing, type of case. But it's easy enough to turn them into dummies, if that's useful.
>>>>
>>>>
>>>> Thanks for any help with this.
>>>>
>>>> Margo Schlanger
>>>>
>>>>
>>
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/statalist/faq
>> *   http://www.ats.ucla.edu/stat/stata/
>>
>
> --
> Michael I. Lichter, Ph.D. <mlichter@buffalo.edu>
> Research Assistant Professor & NRSA Fellow
> UB Department of Family Medicine / Primary Care Research Institute
> UB Clinical Center, 462 Grider Street, Buffalo, NY 14215
> Office: CC 126 / Phone: 716-898-4751 / FAX: 716-898-3536
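
The check proposed at the top of this message (a Bonferroni-widened -svy- confidence interval compared with the census value) might look roughly like the sketch below. It is only an illustration under stated assumptions, not anyone's posted solution: auto.dta stands in for the census, mpg stands in for a variable observed on every case, k = 10 is a hypothetical number of comparisons, and the fpc() option and the names Nh, wt, pop_mean, lb and ub are additions made for the example.

```stata
* Sketch only: auto.dta plays the role of the census; mpg is a stand-in
* for a variable known for every case; k = 10 is a hypothetical count.
sysuse auto, clear
set seed 20090727

* Census value that the sample estimate will be checked against
quietly summarize mpg
scalar pop_mean = r(mean)

* Stratum population sizes, recorded before any cases are dropped
bysort foreign: gen Nh = _N

* Stratified sample: about 50% of domestic and 75% of foreign cars
sample 50 if foreign == 0
sample 75 if foreign == 1

* Weight = stratum population size / realized stratum sample size
bysort foreign: gen wt = Nh / _N

* fpc() is an assumption of this sketch (large sampling fractions)
svyset [pweight = wt], strata(foreign) fpc(Nh)

* Bonferroni: each of the k intervals uses alpha/k (99.5% when k = 10)
local k = 10
local alpha = 0.05 / `k'
local lvl : display %5.2f 100 * (1 - `alpha')
svy: mean mpg, level(`lvl')

* Rebuild the interval from stored results and compare with the census value
scalar tcrit = invttail(e(df_r), `alpha' / 2)
scalar lb = _b[mpg] - tcrit * _se[mpg]
scalar ub = _b[mpg] + tcrit * _se[mpg]
display "census mean = " pop_mean "   CI = [" lb ", " ub "]"
display "census mean inside CI (1 = yes): " (pop_mean >= lb & pop_mean <= ub)
```

For a categorical variable, the same steps could be repeated on each dummy (the mean of a 0/1 dummy is a proportion), with k set to the total number of intervals examined.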

**Follow-Ups**:
- **Re: st: Dealing with survey data when the entire population is also in the dataset** *From:* Austin Nichols <austinnichols@gmail.com>

**References**:
- **st: Dealing with survey data when the entire population is also in the dataset** *From:* Margo Schlanger <margo.schlanger@gmail.com>
- **Re: st: Dealing with survey data when the entire population is also in the dataset** *From:* "Michael I. Lichter" <MLichter@Buffalo.EDU>
- **Re: st: Dealing with survey data when the entire population is also in the dataset** *From:* Austin Nichols <austinnichols@gmail.com>
- **Re: st: Dealing with survey data when the entire population is also in the dataset** *From:* "Michael I. Lichter" <MLichter@Buffalo.EDU>
