# Re: st: variance when using svy: mean

 From "Austin Nichols" <[email protected]> To [email protected] Subject Re: st: variance when using svy: mean Date Tue, 4 Dec 2007 16:45:00 -0500

```David Merriman:
Your example amounts to varying the local n in the following thought experiment.

clear
loc n=2
input cell paid psu wt
5    1    1   1
5    0    1   1
2    1    2   5
8    0    2   5
end
g x=round(cell*10^`n')
expand x
bys psu: g obswt=wt/_N
svyset psu [pw=obswt]
svy: mean paid
svy: tab paid, se ci
reg paid [pw=obs]
reg paid [pw=obs], cl(psu)

Note that the CI from -svy: tab- makes sense in a way that other CIs
do not (a CI for a proportion should not include negative numbers),
and that the -svy- prefix command corrects for clustering on PSU in
the same way as the -cluster- option on -regress-.  Given how your
example is set up, the clustering is perfect, and no matter how you
what integer n you pick above, the SE will be the same; this would
presumably not be the case in a real example.

I.e. perfect clustering is the limiting case, and in general you would
expect some improvement in efficiency given larger samples within PSU.
Also note that the asymptotic justifications depend on the number of
PSUs (#clusters) approaching infinity and 2 is not very close to
infinity.  It is likely that 100 is close enough, but you may want to
design a simulation that can address that question (unlike the above).

Note: I simplified to one n, but two n parameters changes nothing, and
extending to 3 or 4 clusters would also change nothing.  The issue is
that all the extra observations are just duplications of existing
data.

clear
loc n1=2
loc n2=3
input cell paid psu wt
5    1    1   1
5    0    1   1
2    1    2   5
8    0    2   5
end
g x=round(cell*10^`n1') if psu==1
replace x=round(cell*10^`n2') if psu==2
expand x
bys psu: g obswt=wt/_N
svyset psu [pw=obswt]
svy: mean paid
svy: tab paid, se ci
reg paid [pw=obs]
reg paid [pw=obs], cl(psu)

On 12/3/07, Steven Samuels <[email protected]> wrote:
> You are very welcome. One observation and 10 observations per psu
> will not give the -same- precision; increasing the number of
> observations will decrease the standard error, but there is a limit,
> depending on the relative variation within-and between- psu's.
> Theory can help determine the optimal balance between the number of
> psu's and the number of observations to take from each. The formulas
> take into account the relative cost of working in a new psu versus
> the cost of taking an additional observation once you are in the psu.
>
> I'm still not sure that you took a simple random sample, as this
> requires a finite population to begin with. Still, it sounds like you
> are on the right track.
>
> Good luck!
> Steve
>
> On Dec 3, 2007, at 1:26 PM, David Merriman wrote:
>
> > Thanks, I am understanding better.
> > Within each area I took a random (iid) sample from an infinite
> > population.  The sample sizes varied for reasons that are contextual
> > and difficult to explain without telling you all the details of the
> > project. (Can you help me w'ith the computer code here?  How do I tell
> > stata that I took a random sample within psus?)
> >
> > So, if I understand you correctly the variance of my estimate of the
> > population mean is roughly independent of how many observations I
> > have.  In other words, if I had gone to 100 different areas and
> > collected a single sample I would estimate the population mean with
> > about the same precision as if I had collected 10 samples within each
> > area?  I still have trouble getting my mind around that but your
> > example with a Census of the two areas does help.  Thanks again.
> >
> >
> > On Dec 3, 2007 12:08 PM, Steven Joel Hirsch Samuels
> > <[email protected]> wrote:
> >> Thanks, David that is clearer.
> >>
> >> Was there a sampling procedure within areas? If not, you can still
> >> apply the -svy- commands, but without a sampling procedure of some
> >> kind, the observations within a single area may not 'represent' that
> >> area in any statistical sense.
> >>
> >> In any case,  the major portion of the variance of sample estimates
> >> will still arise from PSU-to-PSU variation. The effective sample size
> >> is 100, the number of PSU's.  Thought experiment:  you take a sample
> >> of n= 2 areas and collect information about everybody (a census).
> >> Even if the number of observations is 200,000, your effective sample
> >> size is still n=2 for making inferences about the original
> >> population. As long as the two area means are different, the standard
> >> error will be nonzero.
> >>
> >> Your simulations are therefore correct in producing the same standard
> >> error for the two PSU's.  If you had simulated sampling within the
> >> PSU's, then the you would the SE's to change, as the sample
> >> proportions would vary from simulation to simulation.
> >>
> >> -Steven
> >>
> >>
> >> 4. The sampling weight for a PSU
> >>
> >> On Dec 3, 2007, at 12:21 PM, David Merriman wrote:
> >>
> >>> Thanks.  I do have a lot of trouble with the terminology.
> >>> I selected a weighted random sample of 100 geographic areas from 930
> >>> such areas.  The weights were designed so that my weighted random
> >>> sample would be representative of the population (e.g. I oversampled
> >>> areas with a high number of people).  I do not think I have any
> >>> strata
> >>> (I did NOT for example oversample high poverty areas).  My psus are
> >>> the 100 geographic areas.
> >>>
> >>> I am afraid I still do not know what to do next.  It sounds like you
> >>> are saying that the variance should not differ between my two cases
> >>> but this does not make intuitive sense to me.  Any help you can
> >>> provide would be appreciated.
> >>>
> >>> On Dec 3, 2007 11:11 AM, Steven Joel Hirsch Samuels
> >>> <[email protected]> wrote:
> >>>> David, it doesn't sound like your study is a probability sample; if
> >>>> not, you  don't need -svy- commands.  Instead, use non-survey
> >>>> commands and assign an -iweight-  or other weight variable to
> >>>> properly represent your population.
> >>>>
> >>>> If your data do arise from a probability sample, your 'areas'
> >>>> appear
> >>>> to be strata, not primary sampling units (psu's). Strata are units
> >>>> which partition a population. A psu is the highest stage unit
> >>>> selected by random numbers within a stratum.  Standard errors for
> >>>> survey data depend mainly on the number of psu's, not on the number
> >>>> of observations within them.
> >>>>
> >>>> -Steven
> >>>>
> >>>> On Dec 3, 2007, at 11:19 AM, David Merriman wrote:
> >>>>
> >>>>> Dear Statalisters:
> >>>>> I am a long time Statauser but new to svy: commands and am quite
> >>>>> confused.
> >>>>> I apologize if this is long-winded I am trying to say it as
> >>>>> concisely
> >>>>> as possible.
> >>>>>
> >>>>> I have collected primary data in several geographic areas.
> >>>>> Each of
> >>>>> the geographic areas has a different weight so that my entire
> >>>>> sample
> >>>>> should be representative of the population.  In each geographic
> >>>>> area I
> >>>>> have collected a number of observations but the number of
> >>>>> observations
> >>>>> in the area tells me nothing about the density of the activit
> >>>>
> >>>>> area.  I want to estimate the population mean (for all geographic
> >>>>> areas) and the variance of that estimate.  The problem is that
> >>>>> while I
> >>>>> get sensible means the variances do not seem to be a function
> >>>>> of the
> >>>>> number of observations I have.  Intuitively I think that the
> >>>>> variance
> >>>>> ought to change (fall) as the number of observations increases.
> >>>>>
> >>>>> I tried using
> >>>>> svyset psuedo_psu [pweight=obs_weight]
> >>>>> svy:  mean psuedo_chicago_tax_paid
> >>>>>
> >>>>> where psuedo_psu is the variable indicating the primary sampling
> >>>>> unit,
> >>>>> obs_weight is the psu_weight divided by the number of
> >>>>> observations in
> >>>>> that psu and psuedo_chicago_tax_paid is the (zero-one) variable
> >>>>> for
> >>>>> which I want to estimate the mean and variance.
> >>>>>
> >>>>> I created a simulated data set (the real one is more complex)
> >>>>> with 2
> >>>>> psus.  In the first trial, each psu had 50 observations.  psu 1
> >>>>> weight of 1 and a 50 percent chance of a 1.  psu 2 had a weight
> >>>>> of 5
> >>>>> and a 20 percent chance of a 1.  I get a sensible mean of .25
> >>>>> and a
> >>>>> standard error of .0833333.
> >>>>>
> >>>>> In the second trial, I also had two psu.  Psu 1 has 900
> >>>>> observations
> >>>>> and psu 2 has 100 observations.  psu 1 had a weight of 1 and a 50
> >>>>> percent chance of a 1.  psu 2 had a weight of 5 and a 20 percent
> >>>>> chance of a 1.  I get a sensible mean of .25 but the same standard
> >>>>> error of .0833333 as in case 1.   This does not make sense to
> >>>>> me.  I
> >>>>> have more observations in case 2 so I think I should get a smaller
> >>>>> variance.
> >>>>>
> >>>>> I imagine I am not using the correct design.  Can anyone help?
> >>>>> Below,
> >>>>> I show the computer code for my simulation (fake data set) but you
> >>>>> don't need to read this if you understand the comments above.
> >>>>> Thanks
> >>>>> so much.
> >>>>>
> >>>>>
> >>>>> #delimit ;
> >>>>> ****************************************************************
> >>>>> * created the simulated data
> >>>>> ***********************************************************;
> >>>>> set obs 100;
> >>>>> ****************************************************************
> >>>>> * generate psu
> >>>>> ***********************************************************;
> >>>>> gen psuedo_psu=1 if _n<51;
> >>>>> replace psuedo_psu=2 if _n>=51;
> >>>>> ****************************************************************
> >>>>> * generate chicago_tax_paid
> >>>>> ***********************************************************;
> >>>>> gen psuedo_chicago_tax_paid=1 if _n<=25;
> >>>>> replace psuedo_chicago_tax_paid=0 if _n>25 & _n<=50;
> >>>>> replace psuedo_chicago_tax_paid=1 if _n>50 & _n<61;
> >>>>> replace psuedo_chicago_tax_paid=0 if _n>=61;
> >>>>> ****************************************************************
> >>>>> * generate psu weights
> >>>>> ***********************************************************;
> >>>>> gen sample_weight=1 if psuedo_psu==1;
> >>>>> replace sample_weight=5 if psuedo_psu==2;
> >>>>> summarize;
> >>>>> ****************************************************************
> >>>>> * generate OBSERVATION weights
> >>>>> ***********************************************************;
> >>>>> sort psuedo_psu;
> >>>>> by psuedo_psu: gen obs_weight= sample_weight/_N;
> >>>>> summarize;
> >>>>> svyset psuedo_psu [pweight=obs_weight];
> >>>>> **********************************************************
> >>>>> * psu1 has a mean of .5 and a weight of 1
> >>>>> * psu2 has a mean of .2 and a weight of 5
> >>>>> * (5*.2)+(1*.5)=1.5
> >>>>> * 1.5/6=.25
> >>>>> *
> >>>>> * so the mean estimate makes sense to me
> >>>>> *******************************************************;
> >>>>> svy : mean psuedo_chicago_tax_paid;
> >>>>> mean psuedo_chicago_tax_paid;
> >>>>> *********************************************************
> >>>>> * do a second round with unequal size groups
> >>>>> *****************************************************;
> >>>>> clear;
> >>>>> #delimit ;
> >>>>> ****************************************************************
> >>>>> * created the simulated data
> >>>>> ***********************************************************;
> >>>>> set obs 1000;
> >>>>> ****************************************************************
> >>>>> * generate psu
> >>>>> ***********************************************************;
> >>>>> gen psuedo_psu=1 if _n<901;
> >>>>> replace psuedo_psu=2 if _n>=901;
> >>>>> ****************************************************************
> >>>>> * generate chicago_tax_paid
> >>>>> ***********************************************************;
> >>>>> gen psuedo_chicago_tax_paid=1 if _n<=450;
> >>>>> replace psuedo_chicago_tax_paid=0 if _n>450 & _n<=900;
> >>>>> replace psuedo_chicago_tax_paid=1 if _n>900 & _n<921;
> >>>>> replace psuedo_chicago_tax_paid=0 if _n>=921;
> >>>>> ****************************************************************
> >>>>> * generate PSU weights
> >>>>> ***********************************************************;
> >>>>> gen sample_weight=1 if psuedo_psu==1;
> >>>>> replace sample_weight=5 if psuedo_psu==2;
> >>>>> ****************************************************************
> >>>>> * generate OBSERVATION weights
> >>>>> ***********************************************************;
> >>>>> sort psuedo_psu;
> >>>>> by psuedo_psu: gen obs_weight= sample_weight/_N;
> >>>>> summarize;
> >>>>> svyset psuedo_psu [pweight=obs_weight];
> >>>>> **********************************************************
> >>>>> * psu1 has a mean of .5 and a weight of 1
> >>>>> * psu2 has a mean of .2 and a weight of 5
> >>>>> *
> >>>>> * I get the same answer for the mean in case 1 and case 2
> >>>>> * which I think is correct but
> >>>>> * I also get the same answer for the variance which I think is not
> >>>>> correct
> >>>>> *
> >>>>> * I think I should have a lower variance in case 2
> >>>>> *******************************************************;
> >>>>> svy : mean psuedo_chicago_tax_paid;
> >>>>> mean psuedo_chicago_tax_paid;
> >>>>>
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
```