Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

# Re: st: -svy- commands with a pps sample vs. a simple random sample

From: Stas Kolenikov; To: statalist@hsphsun2.harvard.edu; Subject: Re: st: -svy- commands with a pps sample vs. a simple random sample; Date: Sat, 14 May 2011 02:04:11 -0500

```
-gsample- relies on -moremata- which is a black box. Doing PPS
properly is highly non-trivial; if Ben Jann did not utilize something
from Brewer's book, then God only knows what kind of properties his
procedure might have.

Nobody guarantees that you will have any gains in analytical models.
What your informative sampling does is change the design measure for
your regression... if you know what I mean. In other words, you make
your explanatory variables heavier or lighter. The gains in precision,
however, can only come from making small residuals heavier and large
residuals lighter, which does not happen in your simulation (but would
happen if you had heteroskedasticity with variance increasing with x).
Instead, what you observe is a typical efficiency loss due to unequal
weights, with DEFF = 1 + CV^2 of the weights (Kish's approximation).

Brewer & Hanif (1983). Sampling with unequal probabilities. Lecture
Notes in Statistics, Springer.

On Fri, May 13, 2011 at 4:58 PM, Mike Lacy <Michael.Lacy@colostate.edu> wrote:
>
> Greetings,
>
> I'm getting standard errors for means and regression coefficients using the
> -svy- commands that surprise me enough to make me wonder if I am using them
> correctly.  What I'm finding is that the SE(mean) and SE(b) are smaller with
> a simple random sample than with probability proportional to size,
> even though the pps sample  is constructed using a variable correlated about
> 0.9 with the outcome of interest.  Below, I have some code with simulated
> data that shows what I am doing.
>
>
> Background: I'm simulating data for an electrical utility usage reduction
> experiment. I've made the simulated distribution of kwh usage look like the
> real distribution.  I assume that the percent of kwh usage saved (savepct)
> following an experiment with the users is of the
> form y = b0 + b1*x + b2*sqrt(x), with that being the function of interest
> to be estimated.
>
> // Create the simulated data
> clear
> set obs 25000
> local sampleN = 500
> set seed 83573
> gen kwh = exp(rnormal(6.4, 0.65))  // kwh usage
> gen savepct = -0.61 - 0.00014*kwh + 0.14 * sqrt(kwh)  // looks realistic to me
> replace savepct = savepct + rnormal(0,0.5)  // gives r = 0.9 with kwh
> // Population regression relationship
> gen sqrtk = sqrt(kwh)
> regress savepct kwh sqrtk   // The true population relationship
> //
> // Sample the data, pps, and run a regression model
> quiet summ kwh, detail
> gen pps = `sampleN' * kwh/r(sum)  // sampling prob to get pps and n = 500
> // User written -gsample- , see -findit gsample-
> gsample `sampleN' [aw = pps],  gen(picked_pps) wor
> gen pwt = 1/pps
> svyset _n [pweight = pwt]
> svy: mean savepct if picked_pps
> svy: regress savepct kwh sqrtk if picked_pps
> //
> // Repeat analysis with simple random sampling
> svyset, clear
> gsample `sampleN',  gen(picked_psrs) wor
> gen psrs = `sampleN'/`=_N' // sampling prob
> replace pwt = 1/psrs
> svyset _n [pweight = pwt]
> svy: mean savepct if picked_psrs
> svy: regress savepct kwh sqrtk if picked_psrs
>
>
> Thanks,
>
>
> =-=-=-=-=-=-=-=-=-=-=-=-=
> Mike Lacy, Assoc. Prof.
> Soc. Dept., Colo. State. Univ.
> Fort Collins CO 80523 USA
> (970)-491-6721
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>

--
Stas Kolenikov, also found at http://stas.kolenikov.name
Small print: I use this email account for mailing lists only.

```
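The efficiency loss from unequal weights mentioned in the reply can be checked directly on the simulated sample. The sketch below computes Kish's approximate design effect, n * sum(w^2) / (sum(w))^2, which equals 1 + CV^2 of the weights. It assumes it is run right after the PPS part of the quoted code, while `pwt` still holds the PPS weights and `picked_pps` flags the sampled units; `pwt_sq` is a hypothetical helper variable introduced only for this illustration.

```stata
// Kish's approximate design effect due to unequal weighting:
//   DEFF = n * sum(w^2) / (sum(w))^2 = 1 + CV^2 of the weights
// Assumes pwt and picked_pps exist as created in the quoted PPS code.
quietly summarize pwt if picked_pps
local n    = r(N)
local sumw = r(sum)

// pwt_sq is a hypothetical helper name, dropped again below
generate double pwt_sq = pwt^2 if picked_pps
quietly summarize pwt_sq if picked_pps
local deff = `n' * r(sum) / (`sumw')^2

display "Kish DEFF (PPS sample) = " `deff'
drop pwt_sq
```

For the simple random sample all weights are equal, so the same calculation gives 1. The value is only a rough guide to the variance inflation for a mean, and it need not match the ratio of the -svy- standard errors from the two designs.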