Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


From: Stas Kolenikov <skolenik@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: -svy- commands with a pps sample vs. a simple random sample
Date: Sat, 14 May 2011 02:04:11 -0500

-gsample- relies on -moremata- which is a black box. Doing PPS properly is highly non-trivial; if Ben Jann did not utilize something from Brewer's book, then God only knows what kind of properties his procedure might have.

Nobody guarantees that you will have any gains in analytical models. What your informative sampling does is change the design measure for your regression... if you know what I mean. In other words, you make your explanatory variables heavier or lighter. The gains in precision, however, can only come from making small residuals heavier and large residuals lighter, which does not happen in your simulation (but would happen if you had heteroskedasticity with variance increasing with x). Instead, what you observe is a typical efficiency loss due to unequal weights, with approximately DEFF = 1 + CV^2, where CV is the coefficient of variation of the weights.

Brewer & Hanif (1983). Sampling with unequal probabilities. Lecture Notes in Statistics, Springer.

On Fri, May 13, 2011 at 4:58 PM, Mike Lacy <Michael.Lacy@colostate.edu> wrote:
>
> Greetings,
>
> I'm getting standard errors for means and regression coefficients using the
> -svy- commands that surprise me enough to make me wonder if I am using them
> correctly. What I'm finding is that the SE(mean) and SE(b) are smaller with
> a simple random sample than with probability proportional to size,
> even though the pps sample is constructed using a variable correlated about
> 0.9 with the outcome of interest. Below, I have some code with simulated
> data that shows what I am doing.
>
>
> Background: I'm simulating data for an electrical utility usage reduction
> experiment. I've made the simulated distribution of kwh usage look like the
> real distribution. I assume that the percent of kwh usage saved (savepct)
> following an experiment with the users is of the
> form y = b0 + b1*x + b2*sqrt(x), with that being the function of interest
> to be estimated.
>
> // Create the simulated data
> clear
> set obs 25000
> local sampleN = 500
> set seed 83573
> gen kwh = exp(rnormal(6.4, 0.65)) // kwh usage
> gen savepct = -0.61 - 0.00014*kwh + 0.14 * sqrt(kwh) // looks realistic to me
> replace savepct = savepct + rnormal(0,0.5) // gives r = 0.9 with kwh
> // Population regression relationship
> gen sqrtk = sqrt(kwh)
> regress savepct kwh sqrtk // The true population relationship
> //
> // Sample the data, pps, and run a regression model
> quiet summ kwh, detail
> gen pps = `sampleN' * kwh/r(sum) // sampling prob to get pps and n = 500
> // User written -gsample- , see -findit gsample-
> gsample `sampleN' [aw = pps], gen(picked_pps) wor
> gen pwt = 1/pps
> svyset _n [pweight = pwt]
> svy: mean savepct if picked_pps
> svy: regress savepct kwh sqrtk if picked_pps
> //
> // Repeat analysis with simple random sampling
> svyset, clear
> gsample `sampleN', gen(picked_psrs) wor
> gen psrs = `sampleN'/`=_N' // sampling prob
> replace pwt = 1/psrs
> svyset _n [pweight = pwt]
> svy: mean savepct if picked_psrs
> svy: regress savepct kwh sqrtk if picked_psrs
>
>
> Thanks,
>
>
> =-=-=-=-=-=-=-=-=-=-=-=-=
> Mike Lacy, Assoc. Prof.
> Soc. Dept., Colo. State. Univ.
> Fort Collins CO 80523 USA
> (970)-491-6721
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
>

--
Stas Kolenikov, also found at http://stas.kolenikov.name
Small print: I use this email account for mailing lists only.
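The efficiency loss from unequal weights mentioned in the reply can be checked directly on the simulated data. The following is a minimal sketch, not part of the original thread: it uses the common approximation DEFF = 1 + CV^2 (usually attributed to Kish), with CV the coefficient of variation of the sampling weights among the selected units. It assumes the variables pps and picked_pps exist as created in the quoted code; w_pps is a hypothetical helper variable introduced here so that the result does not depend on pwt, which the quoted code later overwrites for the simple random sample.

```stata
* Sketch only: approximate design effect due to unequal weights,
* DEFF ~ 1 + CV^2, computed over the units selected into the PPS sample.
* Assumes pps and picked_pps exist as created in the quoted code;
* w_pps is a hypothetical helper variable, not part of the original post.
generate double w_pps = 1/pps

* summarize leaves r(mean) and r(Var) behind for the selected units
quietly summarize w_pps if picked_pps

* r(Var) uses an n-1 denominator; with n = 500 the difference is negligible
display "approx. DEFF from unequal weights: " %6.3f (1 + r(Var)/r(mean)^2)
```

A value well above 1 would be consistent with the larger standard errors seen under PPS here, since in this simulation the weights vary widely while the residual variance does not depend on kwh. Running -estat effects- after -svy: mean- should give a comparable, estimator-specific design effect.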

**Follow-Ups**: Re: st: -svy- commands with a pps sample vs. a simple random sample (From: Nick Cox <njcoxstata@gmail.com>)

**References**: st: -svy- commands with a pps sample vs. a simple random sample (From: Mike Lacy <Michael.Lacy@colostate.edu>)
