# st: variance when using svy: mean

 From "David Merriman" <[email protected]> To [email protected] Subject st: variance when using svy: mean Date Mon, 3 Dec 2007 10:19:12 -0600

```
Dear Statalisters:
I am a long-time Stata user but new to the svy: commands and am quite
confused.  I apologize if this is long-winded; I am trying to say it as
concisely as possible.

I have collected primary data in several geographic areas.  Each of
the geographic areas has a different weight so that my entire sample
should be representative of the population.  In each geographic area I
have collected a number of observations but the number of observations
in the area tells me nothing about the density of the activity in the
area.  I want to estimate the population mean (for all geographic
areas) and the variance of that estimate.  The problem is that, while I
get sensible means, the variances do not seem to be a function of the
number of observations I have.  Intuitively, I think that the variance
ought to fall as the number of observations increases.

I tried using
svyset psuedo_psu [pweight=obs_weight]
svy:  mean psuedo_chicago_tax_paid

where psuedo_psu is the variable indicating the primary sampling unit,
obs_weight is the psu weight (sample_weight in the code below) divided by
the number of observations in that psu, and psuedo_chicago_tax_paid is
the (zero-one) variable for which I want to estimate the mean and variance.

I created a simulated data set (the real one is more complex) with 2
psus.  In the first trial, each psu had 50 observations.  psu 1 had a
weight of 1 and a 50 percent chance of a 1.  psu 2 had a weight of 5
and a 20 percent chance of a 1.  I get a sensible mean of .25 and a
standard error of .0833333.

In the second trial, I again had two psus.  psu 1 had 900 observations
and psu 2 had 100 observations.  psu 1 had a weight of 1 and a 50
percent chance of a 1.  psu 2 had a weight of 5 and a 20 percent
chance of a 1.  I get a sensible mean of .25 but the same standard
error of .0833333 as in the first trial.  This does not make sense to
me.  I have more observations in the second trial, so I think I should
get a smaller variance.

I imagine I am not using the correct design.  Can anyone help?  Below,
I show the computer code for my simulation (fake data set) but you
don't need to read this if you understand the comments above.  Thanks
so much.

#delimit ;
****************************************************************
* create the simulated data
***********************************************************;
set obs 100;
****************************************************************
* generate psu
***********************************************************;
gen psuedo_psu=1 if _n<51;
replace psuedo_psu=2 if _n>=51;
****************************************************************
* generate chicago_tax_paid
***********************************************************;
gen psuedo_chicago_tax_paid=1 if _n<=25;
replace psuedo_chicago_tax_paid=0 if _n>25 & _n<=50;
replace psuedo_chicago_tax_paid=1 if _n>50 & _n<61;
replace psuedo_chicago_tax_paid=0 if _n>=61;
****************************************************************
* generate psu weights
***********************************************************;
gen sample_weight=1 if psuedo_psu==1;
replace sample_weight=5 if psuedo_psu==2;
summarize;
****************************************************************
* generate OBSERVATION weights
***********************************************************;
sort psuedo_psu;
by psuedo_psu: gen obs_weight= sample_weight/_N;
summarize;
svyset psuedo_psu [pweight=obs_weight];
**********************************************************
* psu1 has a mean of .5 and a weight of 1
* psu2 has a mean of .2 and a weight of 5
* (5*.2)+(1*.5)=1.5
* 1.5/6=.25
*
* so the mean estimate makes sense to me
*******************************************************;
svy : mean psuedo_chicago_tax_paid;
mean psuedo_chicago_tax_paid;
*********************************************************
* do a second round with unequal size groups
*****************************************************;
clear;
#delimit ;
****************************************************************
* create the simulated data
***********************************************************;
set obs 1000;
****************************************************************
* generate psu
***********************************************************;
gen psuedo_psu=1 if _n<901;
replace psuedo_psu=2 if _n>=901;
****************************************************************
* generate chicago_tax_paid
***********************************************************;
gen psuedo_chicago_tax_paid=1 if _n<=450;
replace psuedo_chicago_tax_paid=0 if _n>450 & _n<=900;
replace psuedo_chicago_tax_paid=1 if _n>900 & _n<921;
replace psuedo_chicago_tax_paid=0 if _n>=921;
****************************************************************
* generate PSU weights
***********************************************************;
gen sample_weight=1 if psuedo_psu==1;
replace sample_weight=5 if psuedo_psu==2;
****************************************************************
* generate OBSERVATION weights
***********************************************************;
sort psuedo_psu;
by psuedo_psu: gen obs_weight= sample_weight/_N;
summarize;
svyset psuedo_psu [pweight=obs_weight];
**********************************************************
* psu1 has a mean of .5 and a weight of 1
* psu2 has a mean of .2 and a weight of 5
*
* I get the same answer for the mean in case 1 and case 2
* which I think is correct but
* I also get the same answer for the variance which I think is not correct
*
* I think I should have a lower variance in case 2
*******************************************************;
svy : mean psuedo_chicago_tax_paid;
mean psuedo_chicago_tax_paid;

--
David Merriman
[email protected]
```
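One way to see why both trials report the same standard error is to redo the calculation by hand at the psu level. The sketch below is not from the original post. It assumes that `svy: mean` with this `svyset` applies the usual with-replacement linearized variance estimator for a ratio, treating the psus as draws from a single stratum with no finite population correction. The names `Y`, `X`, `R`, `z`, `n_psu`, and `se` are illustrative. Under that assumption, only the weighted psu totals enter the variance, and those totals are identical in the two trials.

```stata
* Hand computation of the psu-level linearized standard error (a sketch).
* Assumption: one stratum, psus sampled with replacement, no FPC.
clear
set obs 1000

* Rebuild the second trial: 900 observations in psu 1, 100 in psu 2
gen psuedo_psu = cond(_n <= 900, 1, 2)
gen psuedo_chicago_tax_paid = (_n <= 450) | (_n > 900 & _n <= 920)
gen sample_weight = cond(psuedo_psu == 1, 1, 5)
bysort psuedo_psu: gen double obs_weight = sample_weight / _N

* Weighted totals per psu: numerator (Y) and denominator (X) of the mean
gen double wy = obs_weight * psuedo_chicago_tax_paid
collapse (sum) Y = wy X = obs_weight, by(psuedo_psu)

* Estimated mean R = sum(Y) / sum(X)
egen double Ytot = total(Y)
egen double Xtot = total(X)
gen double R = Ytot / Xtot

* Linearized score for each psu
gen double z = (Y - R * X) / Xtot

* Variance = n/(n-1) * sum of (z - zbar)^2, which equals n * r(Var)
quietly summarize z
scalar n_psu = r(N)
scalar se = sqrt(n_psu * r(Var))

display "mean = " R[1] "   se = " se
```

After the `collapse`, the data contain one row per psu with totals of (.5, 1) for psu 1 and (1, 5) for psu 2, so the result should match the mean of .25 and the standard error of .0833333 reported in the post. Changing `set obs` to 100 and the cutoffs to those of the first trial yields the same two rows, and therefore the same standard error, regardless of how many observations sit inside each psu.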