Bookmark and Share

Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down at the end of May, and its replacement, statalist.org is already up and running.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: -svy- commands with a pps sample vs. a simple random sample


From   Nick Cox <njcoxstata@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: -svy- commands with a pps sample vs. a simple random sample
Date   Sat, 14 May 2011 09:33:40 +0100

Stas' initial statement is incorrect if the sense of "black box" is
that you can't see the detailed code. The source code for
-mm_sample()- is accessible once you have installed the -moremata-
package. Example code below. There are calls to other functions.

The user-written commands referred to, -moremata- and -gsample-, are from SSC.

Nick

*! mm_sample.mata
*! version 1.0.1, Ben Jann, 14apr2006
version 9.1
mata:

real colvector mm_sample(
 real colvector n,
 real matrix strata,
 | real colvector cluster,
   real colvector w,
   real scalar wor,
   real scalar count,
   real scalar fast)
{
        real scalar i, b, e, n0, n1, N, cb, ce
        real colvector s, si, ci, wi, nn, R
        pointer scalar f

        if (args()<3) cluster=.
        if (args()<4) w=1
        if (args()<5) wor=0
        if (args()<6) count=0
        if (args()<7) fast=0
        if (cols(strata)<1) _error(3200)
        if (fast==0) {
                if (rows(strata)>1 | (cluster==. & rows(w)==1)) {
                        if (missing(strata)) _error(3351, "'strata'
has missing values")
                }
        }
        if (rows(n)<1) _error(3200)

//choose sampling function and check w
        if (rows(w)==1) {
                if (wor) f = &mm_srswor()
                else     f = &mm_srswr()
        }
        else {
                if (cluster!=. & rows(w)!=rows(cluster))
                  _error(3200, "rows('w') unequal number of clusters")
                if (fast==0) {
                        if (missing(w)) _error(3351, "'w' has missing values")
                        if (colsum(w:<0)) _error(3498, "'w' has
negative values")
                        if (colsum(w)<=0) _error(3498, "sum('w') is zero")
                }
                if (wor) f = &mm_upswor()
                else     f = &mm_upswr()
        }

//simple sample, unstratified
        if (rows(strata)==1) {
                N = strata[1,1]
                if (cluster==.) {
                        if (N>=.) N = rows(w)
                        else if (rows(w)!=1 & N!=rows(w))
                          _error(3200, "rows('w') unequal population size")
                        return((*f)(n, (rows(w)==1 ? N : w), count))
                }

//cluster sample, unstratified
                if (rows(cluster)<1) _error(3200)
                if (N>=.) N = colsum(cluster)
                else if (fast==0) {
                        if (N!=colsum(cluster))
                          _error(3200, "sum of cluster sizes unequal
population size")
                }
                s = (*f)(n, (rows(w)==1 ? rows(cluster) : w), count)
                return(_mm_expandclusters(s, cluster, count, N))
        }

//stratified sample
        if (rows(strata)<1) _error(3200)
        if (cluster!=. & cols(strata)<2) _error(3200)
//-sample sizes
        if (rows(n)==1) {
                if (n>=.) {
                        if (cluster==.) nn = strata[.,1]
                        else            nn = strata[.,2]
                }
                else nn = J(rows(strata),1,n)
        }
        else {
                if (rows(n)!=rows(strata)) _error(3200)
                nn = n
                for (i=1;i<=rows(nn);i++) {
                        if (nn[i]>=.) {
                                if (cluster==.) nn[i] = strata[i,1]
                                else            nn[i] = strata[i,2]
                        }
                }
        }
//-prepare sample vector
        if (count) s = J(colsum(strata[.,1]),1,.)
        else s = J(colsum(nn),1,.)
//-draw samples within strata
        if (count==0) n0 = 1
        e = 0
        ce = 0
        for (i=1;i<=rows(strata);i++) {
                b = e + 1
                e = e + strata[i,1]
//--simple sample
                if (cluster==.) {
                        si = (*f)(nn[i], (rows(w)==1 ? strata[i,1] :
w[|b \ e|]), count)
                }
//--cluster sample
                else {
                        cb = ce + 1
                        ce = ce + strata[i,2]
                        ci = cluster[|cb \ ce|]
                        wi = (rows(w)==1 ? rows(ci) : w[|cb \ ce|])
                        if (fast==0) {
                                if (strata[i,1]!=colsum(ci))
                                  _error(3200, "sum of cluster sizes
unequal size of stratum")
                        }
                        si = (*f)(nn[i], wi, count)
                        if (count) si = _mm_expandclusters(si, ci, 1,
strata[i,1])
                }
//--add subsample to sample vector
                if (count) R = (b \ e)
                else {
                        n1 = n0 + rows(si) - 1
                        R = (n0 \ n1)
                        if (cluster==.) si = si :+ (b-1)
                        else si = si :+ (cb-1)
                        n0 = n1 + 1
                }
                if (R[1]<=R[2]) s[|R|] = si
        }
        if (count==0&cluster!=.) s = _mm_expandclusters(s, cluster)
        return(s)
}

On Sat, May 14, 2011 at 8:04 AM, Stas Kolenikov <skolenik@gmail.com> wrote:
> -gsample- relies on -moremata- which is a black box. Doing PPS
> properly is highly non-trivial; if Ben Jann did not utilize something
> from Brewer's book, then God only knows what kind of properties his
> procedure might have.
>
> Nobody guarantees that you will have any gains in analytical models.
> What your informative sampling does is changes the design measure for
> your regression... if you know what I mean. In other words, you make
> your explanatory variables heavier or lighter. The gains in precision,
> however, can only come from making small residuals heavier and large
> residuals lighter, which does not happen in your simulation (but would
> happen if you had heteroskedasticity with variance increasing with x).
> Instead, what you observe is a typical efficiency loss due to unequal
> weights, with DEFF = 1 + CV of weights.
>
> Brewer & Hanif (1983). Sampling with unequal probabilities. Lecture
> Notes in Statistics, Springer.
>
> On Fri, May 13, 2011 at 4:58 PM, Mike Lacy <Michael.Lacy@colostate.edu> wrote:
>>
>> Greetings,
>>
>> I'm getting standard errors for means and regression coefficients using the
>> -svy- commands that surprise me enough to make me wonder if I am using them
>> correctly.  What I'm finding is that the SE(mean) and SE(b) are smaller with
>> a simple random sample than with probability proportional to size,
>> even though the pps sample  is constructed using a variable correlated about
>> 0.9 with the outcome of interest.  Below, I have some code with simulated
>> data that shows what I am doing.
>>
>>
>> Background: I'm simulating data for an electrical utility usage reduction
>> experiment. I've made the simulated distribution of kwh usage look like the
>> real distribution.  I assume that the percent of kwh usage saved (savepct)
>> following an experiment with the users is of the
>> form y = b0 + b1X + b2*sqrt(x), with that being the function of interested
>> to be estimated.
>>
>> // Create the simulated data
>> clear
>> set obs 25000
>> local sampleN = 500
>> set seed 83573
>> gen kwh = exp(rnormal(6.4, 0.65))  // kwh usage
>> gen savepct = -0.61 - 0.00014*kwh + 0.14 * sqrt(kwh)  // looks realistic to
>> me
>> replace savepct = savepct + rnormal(0,0.5)  // gives r = 0.9 with kwh
>> // Population regression relationship
>> gen sqrtk = sqrt(kwh)
>> regress savepct kwh sqrtk   // The true populatioh relationship
>> //
>> // Sample the data, pps, and run a regression model
>> quiet summ kwh, detail
>> gen pps = `sampleN' * kwh/r(sum)  // sampling prob to get pps and n = 500
>> // User written -gsample- , see -findit gsample-
>> gsample `sampleN' [aw = pps],  gen(picked_pps) wor
>> gen pwt = 1/pps
>> svyset _n [pweight = pwt]
>> svy: mean savepct if picked_pps
>> svy: regress savepct kwh sqrtk if picked_pps
>> //
>> // Repeat analysis with simple random sampling
>> svyset, clear
>> gsample `sampleN',  gen(picked_psrs) wor
>> gen psrs = `sampleN'/`=_N' // sampling prob
>> replace pwt = 1/psrs
>> svyset _n [pweight = pwt]
>> svy: mean savepct if picked_psrs
>> svy: regress savepct kwh sqrtk if picked_psrs

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   Site index