
From: Stas Kolenikov <skolenik@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Combining multiple survey data sets
Date: Tue, 16 Feb 2010 10:17:49 -0600

On Sun, Feb 14, 2010 at 4:02 PM, James Swartz <jaswartz@uic.edu> wrote:

> 1) In a simple bivariate analysis, I want to compare the prevalences of
> chronic medical conditions in each data set. But how can I tell Stata to
> use one set of survey parameters for cases in the NCS-R and another for
> cases in my local data set? Also, how important is it to control for a
> finite population correction factor? I have not done this in any analyses
> previously.

A good suggestion was given already:

```stata
use <the first survey>
* offset the design variables so strata and PSU codes
* from the two surveys cannot collide after appending
replace strata = strata + 1000
replace PSU   = PSU + 10000
append using <the second survey>
svyset PSU [pw = <weight>], strata(strata)
```

assuming that the variable names are all the same, and that both surveys generalize to the same population. If they don't, God only knows what your results will mean in the end.

> 2) In a second step, I used the PSMATCH2 add-on to create a matched sample
> of 450 cases from the NCS-R data set based on a selected set of
> demographics and other characteristics. I then want to run logistic
> regressions on the odds of having a chronic medical condition while
> controlling for the matching variables (the matches were not perfect) and
> other unmatched characteristics. I assume that at this point, the survey
> parameters are not applicable because there is no way (that I can figure)
> to apply the subpopulation option. Is that correct? Is this analytic model
> reasonable given the data sets available, or would there be a better way
> to approach this problem?

The matching estimators look fascinating, but I don't believe any single standard error published for them. If anybody knows a good reference (JASA or Econometrica or the Journal of Econometrics will do; Communications in Statistics or Statistics in Medicine would be far less convincing) that proves that a certain variance estimator is consistent, I'd be partially relieved.

If I were desperate to implement something like this, this is what I would do.
1. Write an -eclass- estimator that would, at the very least, run -psmatch2- and -svy: logistic-. It would need to support weights as part of its syntax; see the help on programming -svy, vce(jackknife)- and -svy, vce(brr)- estimators.

2. Run this estimator under the encompassing -svy, vce(jackknife)-. Or run a survey bootstrap using a combination of my -bsweights- and Jeff Pitblado's -bs4rw-. Both modules are available through -findit- somewhere on the Net, and I have a working paper describing my part that may appear some day in the SJ.

I cannot stress enough that this is only an algorithmic answer to the problem. Without a rigorous proof with specific assumptions about the design and the allocation of treatments, there is no way of telling in which situations the above procedure will make sufficiently good sense. In theory, the standard errors should account for potential variability both in the random sampling of subjects into your survey data set and in the (random or not so random) treatment assignment. I don't really know which should come first, though, and depending on the order in which you want to treat them, you may get different results as to whether the variance estimates are solid or crappy.

Think conceptually about what exactly you want your -subpop- option to do. You have at least three parts in your matching model (if there aren't more that I am blanking on): (i) estimate the propensity score model; (ii) pick the matches; (iii) run the final regression model. Which parts do you need -subpop- to apply to? By the way, your standard error calculation must account for sampling uncertainty at all three of these estimation stages. That's why you need a single program that takes a data set with weights and if/in conditions as input and produces, at the very least, the point estimates as output for each resampled data set.
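As a rough sketch of steps 1-2 (every name here -- the wrapper -mysvymatch-, the variables, the -bsweights-/-bs4rw- options -- is an illustrative assumption, not tested code, and the weighting choices inside the wrapper are exactly the open methodological questions discussed above):

```stata
* Hypothetical wrapper: re-runs the whole matching pipeline on each
* replicate, so the replication variance reflects all three stages.
capture program drop mysvymatch
program define mysvymatch, eclass properties(svyb svyj)
    version 10
    syntax varlist(min=3) [if] [in] [pweight iweight]
    gettoken depvar rhs   : varlist
    gettoken treat  xvars : rhs
    marksample touse
    * (i)+(ii): propensity score model and matching; note that
    * -psmatch2- does not use the survey weights here -- whether
    * it should is part of the unresolved theory
    psmatch2 `treat' `xvars' if `touse'
    * (iii): outcome model on the matched observations (_weight is
    * the matching weight -psmatch2- leaves behind); -logistic-
    * posts e(b) for the replication engine to collect
    logistic `depvar' `treat' `xvars' [`weight'`exp'] ///
        if `touse' & _weight < .
end

* Delete-one-PSU jackknife over the full pipeline:
svyset PSU [pweight = <weight>], strata(strata) vce(jackknife)
svy jackknife _b: mysvymatch chronic treat age female educ

* ... or a survey bootstrap with replicate weights:
bsweights bw, reps(200) n(-1)
bs4rw, rw(bw*): mysvymatch chronic treat age female educ
```

Whether this produces defensible standard errors is precisely the question left open in the text; the sketch only shows the mechanics of packaging the three stages into one replicable program.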
A combination of logistic regression (used twice, mind you: in the propensity score estimation and as the main regression of interest) with relatively rare subpopulations and resampling estimators easily breaks down when you get empty cells and/or perfect prediction in some resamples, and there is no easy way of fixing this.

--
Stas Kolenikov, also found at http://stas.kolenikov.name
Small print: I use this email account for mailing lists only.

*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/

**References**:

- **st: Number of characters in a variable label**, *From:* Mosi Ifatunji <ifatunji@gmail.com>
- **Re: st: Number of characters in a variable label**, *From:* Eric Booth <ebooth@ppri.tamu.edu>
- **st: Combining multiple survey data sets**, *From:* James Swartz <jaswartz@uic.edu>

