Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down on April 23, and its replacement, **statalist.org** is already up and running.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
Nick Cox <njcoxstata@gmail.com> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: -svy- commands with a pps sample vs. a simple random sample |

Date |
Sat, 14 May 2011 09:33:40 +0100 |

Stas' initial statement is incorrect if the sense of "black box" is that you can't see the detailed code. The source code for -mm_sample()- is accessible once you have installed the -moremata- package. Example code below. There are calls to other functions. The user-written commands referred to, -moremata- and -gsample-, are from SSC. Nick *! mm_sample.mata *! version 1.0.1, Ben Jann, 14apr2006 version 9.1 mata: real colvector mm_sample( real colvector n, real matrix strata, | real colvector cluster, real colvector w, real scalar wor, real scalar count, real scalar fast) { real scalar i, b, e, n0, n1, N, cb, ce real colvector s, si, ci, wi, nn, R pointer scalar f if (args()<3) cluster=. if (args()<4) w=1 if (args()<5) wor=0 if (args()<6) count=0 if (args()<7) fast=0 if (cols(strata)<1) _error(3200) if (fast==0) { if (rows(strata)>1 | (cluster==. & rows(w)==1)) { if (missing(strata)) _error(3351, "'strata' has missing values") } } if (rows(n)<1) _error(3200) //choose sampling function and check w if (rows(w)==1) { if (wor) f = &mm_srswor() else f = &mm_srswr() } else { if (cluster!=. & rows(w)!=rows(cluster)) _error(3200, "rows('w') unequal number of clusters") if (fast==0) { if (missing(w)) _error(3351, "'w' has missing values") if (colsum(w:<0)) _error(3498, "'w' has negative values") if (colsum(w)<=0) _error(3498, "sum('w') is zero") } if (wor) f = &mm_upswor() else f = &mm_upswr() } //simple sample, unstratified if (rows(strata)==1) { N = strata[1,1] if (cluster==.) { if (N>=.) N = rows(w) else if (rows(w)!=1 & N!=rows(w)) _error(3200, "rows('w') unequal population size") return((*f)(n, (rows(w)==1 ? N : w), count)) } //cluster sample, unstratified if (rows(cluster)<1) _error(3200) if (N>=.) N = colsum(cluster) else if (fast==0) { if (N!=colsum(cluster)) _error(3200, "sum of cluster sizes unequal population size") } s = (*f)(n, (rows(w)==1 ? rows(cluster) : w), count) return(_mm_expandclusters(s, cluster, count, N)) } //stratified sample if (rows(strata)<1) _error(3200) if (cluster!=. & cols(strata)<2) _error(3200) //-sample sizes if (rows(n)==1) { if (n>=.) { if (cluster==.) nn = strata[.,1] else nn = strata[.,2] } else nn = J(rows(strata),1,n) } else { if (rows(n)!=rows(strata)) _error(3200) nn = n for (i=1;i<=rows(nn);i++) { if (nn[i]>=.) { if (cluster==.) nn[i] = strata[i,1] else nn[i] = strata[i,2] } } } //-prepare sample vector if (count) s = J(colsum(strata[.,1]),1,.) else s = J(colsum(nn),1,.) //-draw samples within strata if (count==0) n0 = 1 e = 0 ce = 0 for (i=1;i<=rows(strata);i++) { b = e + 1 e = e + strata[i,1] //--simple sample if (cluster==.) { si = (*f)(nn[i], (rows(w)==1 ? strata[i,1] : w[|b \ e|]), count) } //--cluster sample else { cb = ce + 1 ce = ce + strata[i,2] ci = cluster[|cb \ ce|] wi = (rows(w)==1 ? rows(ci) : w[|cb \ ce|]) if (fast==0) { if (strata[i,1]!=colsum(ci)) _error(3200, "sum of cluster sizes unequal size of stratum") } si = (*f)(nn[i], wi, count) if (count) si = _mm_expandclusters(si, ci, 1, strata[i,1]) } //--add subsample to sample vector if (count) R = (b \ e) else { n1 = n0 + rows(si) - 1 R = (n0 \ n1) if (cluster==.) si = si :+ (b-1) else si = si :+ (cb-1) n0 = n1 + 1 } if (R[1]<=R[2]) s[|R|] = si } if (count==0&cluster!=.) s = _mm_expandclusters(s, cluster) return(s) } On Sat, May 14, 2011 at 8:04 AM, Stas Kolenikov <skolenik@gmail.com> wrote: > -gsample- relies on -moremata- which is a black box. Doing PPS > properly is highly non-trivial; if Ben Jann did not utilize something > from Brewer's book, then God only knows what kind of properties his > procedure might have. > > Nobody guarantees that you will have any gains in analytical models. > What your informative sampling does is changes the design measure for > your regression... if you know what I mean. In other words, you make > your explanatory variables heavier or lighter. The gains in precision, > however, can only come from making small residuals heavier and large > residuals lighter, which does not happen in your simulation (but would > happen if you had heteroskedasticity with variance increasing with x). > Instead, what you observe is a typical efficiency loss due to unequal > weights, with DEFF = 1 + CV of weights. > > Brewer & Hanif (1983). Sampling with unequal probabilities. Lecture > Notes in Statistics, Springer. > > On Fri, May 13, 2011 at 4:58 PM, Mike Lacy <Michael.Lacy@colostate.edu> wrote: >> >> Greetings, >> >> I'm getting standard errors for means and regression coefficients using the >> -svy- commands that surprise me enough to make me wonder if I am using them >> correctly. What I'm finding is that the SE(mean) and SE(b) are smaller with >> a simple random sample than with probability proportional to size, >> even though the pps sample is constructed using a variable correlated about >> 0.9 with the outcome of interest. Below, I have some code with simulated >> data that shows what I am doing. >> >> >> Background: I'm simulating data for an electrical utility usage reduction >> experiment. I've made the simulated distribution of kwh usage look like the >> real distribution. I assume that the percent of kwh usage saved (savepct) >> following an experiment with the users is of the >> form y = b0 + b1X + b2*sqrt(x), with that being the function of interested >> to be estimated. >> >> // Create the simulated data >> clear >> set obs 25000 >> local sampleN = 500 >> set seed 83573 >> gen kwh = exp(rnormal(6.4, 0.65)) // kwh usage >> gen savepct = -0.61 - 0.00014*kwh + 0.14 * sqrt(kwh) // looks realistic to >> me >> replace savepct = savepct + rnormal(0,0.5) // gives r = 0.9 with kwh >> // Population regression relationship >> gen sqrtk = sqrt(kwh) >> regress savepct kwh sqrtk // The true populatioh relationship >> // >> // Sample the data, pps, and run a regression model >> quiet summ kwh, detail >> gen pps = `sampleN' * kwh/r(sum) // sampling prob to get pps and n = 500 >> // User written -gsample- , see -findit gsample- >> gsample `sampleN' [aw = pps], gen(picked_pps) wor >> gen pwt = 1/pps >> svyset _n [pweight = pwt] >> svy: mean savepct if picked_pps >> svy: regress savepct kwh sqrtk if picked_pps >> // >> // Repeat analysis with simple random sampling >> svyset, clear >> gsample `sampleN', gen(picked_psrs) wor >> gen psrs = `sampleN'/`=_N' // sampling prob >> replace pwt = 1/psrs >> svyset _n [pweight = pwt] >> svy: mean savepct if picked_psrs >> svy: regress savepct kwh sqrtk if picked_psrs * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**Follow-Ups**:**Re: st: -svy- commands with a pps sample vs. a simple random sample***From:*Stas Kolenikov <skolenik@gmail.com>

**References**:**st: -svy- commands with a pps sample vs. a simple random sample***From:*Mike Lacy <Michael.Lacy@colostate.edu>

**Re: st: -svy- commands with a pps sample vs. a simple random sample***From:*Stas Kolenikov <skolenik@gmail.com>

- Prev by Date:
**Re: st: -svy- commands with a pps sample vs. a simple random sample** - Next by Date:
**Re: st: too good to be true : lr test in mlogit?** - Previous by thread:
**Re: st: -svy- commands with a pps sample vs. a simple random sample** - Next by thread:
**Re: st: -svy- commands with a pps sample vs. a simple random sample** - Index(es):