
Re: st: Combining multiple survey data sets

From   Stas Kolenikov <>
Subject   Re: st: Combining multiple survey data sets
Date   Tue, 16 Feb 2010 10:17:49 -0600

On Sun, Feb 14, 2010 at 4:02 PM, James Swartz <> wrote:

> 1) In a simple bivariate analysis, I want to compare the prevalences of
> chronic medical conditions in each data set. But how can I tell Stata to use
> one set of survey parameters for cases in the NCS-R and another for cases in
> my local data set? Also, how important is it to control for a finite
> population correction factor?  I have not done this in any analyses
> previously.

A good suggestion was given already:

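use <the first survey>
* shift the design identifiers so they cannot collide with the second survey's;
* the offsets must exceed the largest strata and PSU codes in that survey
replace strata = strata + 1000
replace PSU = PSU + 10000
append using <the second survey>
* <weight> stands for the name of your sampling weight variable
svyset PSU [pw=<weight>], strata(strata)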

assuming that the variable names are all the same, and that both surveys
generalize to the same population. If they don't, God only knows what your
results will mean in the end.
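
For the bivariate comparison in question 1, the combined file also needs an indicator of which survey each row came from. A minimal sketch, assuming hypothetical file names (ncsr.dta, localsurvey.dta), a sampling weight called wt in both files, and a 0/1 outcome called chronic; none of these names come from the original posts, and -append, generate()- needs Stata 11 or later:

```stata
* hypothetical file and variable names; adjust to your data
use ncsr, clear
* offsets must exceed the largest strata/PSU codes in the second file
replace strata = strata + 1000
replace PSU    = PSU + 10000
* source = 0 for the NCS-R rows, 1 for the appended local rows
append using localsurvey, generate(source)
svyset PSU [pw=wt], strata(strata)

* prevalence of the condition in each survey, with design-based SEs
svy: proportion chronic, over(source)
* design-based test of equal prevalence across the two surveys
svy: tabulate chronic source, column se
```

Because the strata codes of the two files no longer overlap, Stata treats the two samples as independent designs, each contributing its own strata and PSUs to the variance estimate.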

> 2) In a second step, I used the PSMATCH2 add-on to create a matched sample
> of 450 cases from the NCS-R data set based on a selected set of demographics
> and other characteristics. I then want to run logistic regressions on the
> odds of having a chronic medical condition while controlling for the
> matching variables (the matches were not perfect) and other unmatched
> characteristics. I assume that at this point, the survey parameters are not
> applicable because there is no way (that I can figure) to apply the
> subpopulation option. Is that correct?  Is this analytic model reasonable
> given the data sets available or would there be a better way to approach
> this problem?

The matching estimators look fascinating, but I don't believe a single
standard error published for them. If anybody knows a good reference (JASA
or Econometrica or J of Econometrics will do; Communications in Statistics or
Statistics in Medicine would be far less convincing) that proves that a
certain variance estimator is consistent, I'd be partially relieved. If I
were desperate to implement something like this, this is what I would do.

1. write an -eclass- estimator that would at the very least run -psmatch2-
and -svy: logistic-. It would need to support weights as part of its syntax;
see the help on programming for -svy, vce(jackknife)- and -svy, vce(brr)-

2. run this using the encompassing -svy, vce(jackknife)-. Or run a
survey bootstrap using a combination of my -bsweights- and Jeff Pitblado's
-bs4rw-. Both modules are available through -findit- somewhere on the Net,
and I have a working paper description of my part that may appear some day
in SJ.
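
The two steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not a validated estimator: the program name psm_logit and the variable names (chronic, source, age, female, wt) are hypothetical, -psmatch2- must be installed, the propensity score is fitted with a weighted -logit- and handed to -psmatch2- through its pscore() option, and multiplying the sampling weight by the match count that -psmatch2- leaves in _weight is only one possible choice for the final regression.

```stata
* Hypothetical wrapper: psm_logit outcome treatment matching_vars [pw iw]
* Assumes -psmatch2- is installed (ssc install psmatch2).
program define psm_logit, eclass properties(svyj svyb)
    version 11
    syntax varlist(min=3 numeric) [if] [in] [pweight iweight]
    marksample touse
    gettoken y rest : varlist
    gettoken treat xvars : rest
    * default to unit weights when the caller supplies none
    if "`weight'" == "" local exp "= 1"
    tempvar w ps fw
    quietly generate double `w' `exp'
    * (i) propensity score model under the current set of weights
    quietly logit `treat' `xvars' [iw=`w'] if `touse'
    quietly predict double `ps' if `touse', pr
    * (ii) matching on that score; psmatch2 leaves match counts in _weight
    quietly psmatch2 `treat' if `touse', pscore(`ps')
    * (iii) outcome model on the matched sample; the product of the two
    * weights is an assumption of this sketch, not an established rule
    quietly generate double `fw' = `w' * _weight if `touse'
    logit `y' `treat' `xvars' [iw=`fw'] if `touse' & `fw' < .
end

* the replication method reruns all three stages for every replicate
svyset PSU [pw=wt], strata(strata)
svy, vce(jackknife): psm_logit chronic source age female
```

The point of the design is that the jackknife perturbs the weights and the wrapper redoes the score model, the matching, and the outcome model each time, so the replicate-to-replicate spread reflects all three stages rather than the last one alone.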

I cannot stress enough that this is only an algorithmic answer to the
problem. Without a rigorous proof with specific assumptions about the design
and the allocation of treatments, there is no way of telling in what
situations the above procedure will make sufficiently good sense. In theory,
the standard errors should account for potential variability in both random
sampling of subjects into your survey data set, and in (random or not so
random) treatment assignment. I don't really know which should go first
though, and depending on the order in which you want to treat them, you may
get different results as to whether the variance estimates are solid or not.

Think conceptually about what exactly you want your -subpop- option to do.
You have at least three parts in your matching model (if there aren't more
that I am blanking on): (i) estimate the propensity score model; (ii) pick
the matches; (iii) run the final regression model. Which parts do you need
the -subpop- to apply to?
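
As a point of reference, here is what applying -subpop- to part (iii) only would look like. This is a sketch with hypothetical variable names (chronic, source, age, female), and it assumes -psmatch2- has already been run so that its _weight variable is nonmissing exactly for the observations retained in the matched sample:

```stata
* flag the observations that ended up in the matched sample
generate byte matched = (_weight < .)
* full design is kept for variance estimation; only flagged rows are analyzed
svy, subpop(matched): logistic chronic source age female
```

Unlike an -if- restriction, subpop() leaves the unmatched observations in the design, so the strata and PSU counts used for the standard errors are those of the whole sample. It does nothing, however, about the uncertainty coming from parts (i) and (ii).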

By the way, your standard error calculation must account for sampling
uncertainties at all three of these estimation stages. That's why you need a
single program that takes a data set with weights and if/in conditions as
input and produces the point estimates, at the very least, as the output,
for each resampled data set. A combination of logistic regression (used
twice, mind you: in the propensity score estimation and as the main
regression of interest) with relatively rare subpopulations and resampling
estimators easily breaks down when you get empty cells and/or perfect
prediction in some resamples, and there is no easy way of fixing this.

Stas Kolenikov, also found at
Small print: I use this email account for mailing lists only.
