Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: st: -svy- commands with a pps sample vs. a simple random sample
From
Nick Cox <[email protected]>
To
[email protected]
Subject
Re: st: -svy- commands with a pps sample vs. a simple random sample
Date
Sat, 14 May 2011 09:33:40 +0100
Stas' initial statement is incorrect if the sense of "black box" is
that you can't see the detailed code. The source code for
-mm_sample()- is accessible once you have installed the -moremata-
package. Example code below. There are calls to other functions.
The user-written commands referred to, -moremata- and -gsample-, are from SSC.
Nick
*! mm_sample.mata
*! version 1.0.1, Ben Jann, 14apr2006
version 9.1
mata:
real colvector mm_sample(
real colvector n,
real matrix strata,
| real colvector cluster,
real colvector w,
real scalar wor,
real scalar count,
real scalar fast)
{
real scalar i, b, e, n0, n1, N, cb, ce
real colvector s, si, ci, wi, nn, R
pointer scalar f
if (args()<3) cluster=.
if (args()<4) w=1
if (args()<5) wor=0
if (args()<6) count=0
if (args()<7) fast=0
if (cols(strata)<1) _error(3200)
if (fast==0) {
if (rows(strata)>1 | (cluster==. & rows(w)==1)) {
if (missing(strata)) _error(3351, "'strata'
has missing values")
}
}
if (rows(n)<1) _error(3200)
//choose sampling function and check w
if (rows(w)==1) {
if (wor) f = &mm_srswor()
else f = &mm_srswr()
}
else {
if (cluster!=. & rows(w)!=rows(cluster))
_error(3200, "rows('w') unequal number of clusters")
if (fast==0) {
if (missing(w)) _error(3351, "'w' has missing values")
if (colsum(w:<0)) _error(3498, "'w' has
negative values")
if (colsum(w)<=0) _error(3498, "sum('w') is zero")
}
if (wor) f = &mm_upswor()
else f = &mm_upswr()
}
//simple sample, unstratified
if (rows(strata)==1) {
N = strata[1,1]
if (cluster==.) {
if (N>=.) N = rows(w)
else if (rows(w)!=1 & N!=rows(w))
_error(3200, "rows('w') unequal population size")
return((*f)(n, (rows(w)==1 ? N : w), count))
}
//cluster sample, unstratified
if (rows(cluster)<1) _error(3200)
if (N>=.) N = colsum(cluster)
else if (fast==0) {
if (N!=colsum(cluster))
_error(3200, "sum of cluster sizes unequal
population size")
}
s = (*f)(n, (rows(w)==1 ? rows(cluster) : w), count)
return(_mm_expandclusters(s, cluster, count, N))
}
//stratified sample
if (rows(strata)<1) _error(3200)
if (cluster!=. & cols(strata)<2) _error(3200)
//-sample sizes
if (rows(n)==1) {
if (n>=.) {
if (cluster==.) nn = strata[.,1]
else nn = strata[.,2]
}
else nn = J(rows(strata),1,n)
}
else {
if (rows(n)!=rows(strata)) _error(3200)
nn = n
for (i=1;i<=rows(nn);i++) {
if (nn[i]>=.) {
if (cluster==.) nn[i] = strata[i,1]
else nn[i] = strata[i,2]
}
}
}
//-prepare sample vector
if (count) s = J(colsum(strata[.,1]),1,.)
else s = J(colsum(nn),1,.)
//-draw samples within strata
if (count==0) n0 = 1
e = 0
ce = 0
for (i=1;i<=rows(strata);i++) {
b = e + 1
e = e + strata[i,1]
//--simple sample
if (cluster==.) {
si = (*f)(nn[i], (rows(w)==1 ? strata[i,1] :
w[|b \ e|]), count)
}
//--cluster sample
else {
cb = ce + 1
ce = ce + strata[i,2]
ci = cluster[|cb \ ce|]
wi = (rows(w)==1 ? rows(ci) : w[|cb \ ce|])
if (fast==0) {
if (strata[i,1]!=colsum(ci))
_error(3200, "sum of cluster sizes
unequal size of stratum")
}
si = (*f)(nn[i], wi, count)
if (count) si = _mm_expandclusters(si, ci, 1,
strata[i,1])
}
//--add subsample to sample vector
if (count) R = (b \ e)
else {
n1 = n0 + rows(si) - 1
R = (n0 \ n1)
if (cluster==.) si = si :+ (b-1)
else si = si :+ (cb-1)
n0 = n1 + 1
}
if (R[1]<=R[2]) s[|R|] = si
}
if (count==0&cluster!=.) s = _mm_expandclusters(s, cluster)
return(s)
}
On Sat, May 14, 2011 at 8:04 AM, Stas Kolenikov <[email protected]> wrote:
> -gsample- relies on -moremata- which is a black box. Doing PPS
> properly is highly non-trivial; if Ben Jann did not utilize something
> from Brewer's book, then God only knows what kind of properties his
> procedure might have.
>
> Nobody guarantees that you will have any gains in analytical models.
> What your informative sampling does is changes the design measure for
> your regression... if you know what I mean. In other words, you make
> your explanatory variables heavier or lighter. The gains in precision,
> however, can only come from making small residuals heavier and large
> residuals lighter, which does not happen in your simulation (but would
> happen if you had heteroskedasticity with variance increasing with x).
> Instead, what you observe is a typical efficiency loss due to unequal
> weights, with DEFF = 1 + CV of weights.
>
> Brewer & Hanif (1983). Sampling with unequal probabilities. Lecture
> Notes in Statistics, Springer.
>
> On Fri, May 13, 2011 at 4:58 PM, Mike Lacy <[email protected]> wrote:
>>
>> Greetings,
>>
>> I'm getting standard errors for means and regression coefficients using the
>> -svy- commands that surprise me enough to make me wonder if I am using them
>> correctly. What I'm finding is that the SE(mean) and SE(b) are smaller with
>> a simple random sample than with probability proportional to size,
>> even though the pps sample is constructed using a variable correlated about
>> 0.9 with the outcome of interest. Below, I have some code with simulated
>> data that shows what I am doing.
>>
>>
>> Background: I'm simulating data for an electrical utility usage reduction
>> experiment. I've made the simulated distribution of kwh usage look like the
>> real distribution. I assume that the percent of kwh usage saved (savepct)
>> following an experiment with the users is of the
>> form y = b0 + b1X + b2*sqrt(x), with that being the function of interested
>> to be estimated.
>>
>> // Create the simulated data
>> clear
>> set obs 25000
>> local sampleN = 500
>> set seed 83573
>> gen kwh = exp(rnormal(6.4, 0.65)) // kwh usage
>> gen savepct = -0.61 - 0.00014*kwh + 0.14 * sqrt(kwh) // looks realistic to
>> me
>> replace savepct = savepct + rnormal(0,0.5) // gives r = 0.9 with kwh
>> // Population regression relationship
>> gen sqrtk = sqrt(kwh)
>> regress savepct kwh sqrtk // The true populatioh relationship
>> //
>> // Sample the data, pps, and run a regression model
>> quiet summ kwh, detail
>> gen pps = `sampleN' * kwh/r(sum) // sampling prob to get pps and n = 500
>> // User written -gsample- , see -findit gsample-
>> gsample `sampleN' [aw = pps], gen(picked_pps) wor
>> gen pwt = 1/pps
>> svyset _n [pweight = pwt]
>> svy: mean savepct if picked_pps
>> svy: regress savepct kwh sqrtk if picked_pps
>> //
>> // Repeat analysis with simple random sampling
>> svyset, clear
>> gsample `sampleN', gen(picked_psrs) wor
>> gen psrs = `sampleN'/`=_N' // sampling prob
>> replace pwt = 1/psrs
>> svyset _n [pweight = pwt]
>> svy: mean savepct if picked_psrs
>> svy: regress savepct kwh sqrtk if picked_psrs
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/