Re: st: variance when using svy: mean

From     Steven Joel Hirsch Samuels <[email protected]>
To       [email protected]
Subject  Re: st: variance when using svy: mean
Date     Mon, 3 Dec 2007 12:11:10 -0500

David, it doesn't sound like your study is a probability sample; if not, you don't need -svy- commands. Instead, use non-survey commands and assign an -iweight- or other weight variable to properly represent your population.

If your data do arise from a probability sample, your 'areas' appear to be strata, not primary sampling units (psu's). Strata are units which partition a population. A psu is the highest stage unit selected by random numbers within a stratum. Standard errors for survey data depend mainly on the number of psu's, not on the number of observations within them.
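
As a minimal sketch of the strata interpretation, the following rebuilds the first simulated trial from the quoted message and declares the areas as strata, with each observation as its own sampling unit. It assumes that observations were drawn at random within each area, which the original post does not confirm; the variable names are those of David's simulation.

```stata
* Sketch only. Assumes the two areas are strata and that observations were
* drawn at random within each area; variable names follow David's simulation.
clear
set obs 100
gen psuedo_psu = cond(_n <= 50, 1, 2)
gen psuedo_chicago_tax_paid = (_n <= 25) | inrange(_n, 51, 60)
gen sample_weight = cond(psuedo_psu == 1, 1, 5)
bysort psuedo_psu: gen obs_weight = sample_weight/_N

* Areas declared as strata; each observation is its own sampling unit (_n).
svyset _n [pweight=obs_weight], strata(psuedo_psu)
svy: mean psuedo_chicago_tax_paid
```

Under this design the variance is built from the variation among observations within each stratum, so rerunning the sketch with 900 and 100 observations per area should give a smaller standard error than with 50 and 50.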

-Steven

On Dec 3, 2007, at 11:19 AM, David Merriman wrote:

Dear Statalisters:
I am a long-time Stata user but new to the svy: commands, and I am quite
confused. I apologize if this is long-winded; I am trying to say it as
concisely as possible.

I have collected primary data in several geographic areas. Each of
the geographic areas has a different weight so that my entire sample
should be representative of the population. In each geographic area I
have collected a number of observations but the number of observations
in the area tells me nothing about the density of the activity in that
area. I want to estimate the population mean (for all geographic
areas) and the variance of that estimate. The problem is that while I
get sensible means the variances do not seem to be a function of the
number of observations I have. Intuitively I think that the variance
ought to change (fall) as the number of observations increases.

I tried using
svyset psuedo_psu [pweight=obs_weight]
svy: mean psuedo_chicago_tax_paid

where psuedo_psu is the variable indicating the primary sampling unit,
obs_weight is the psu_weight divided by the number of observations in
that psu and psuedo_chicago_tax_paid is the (zero-one) variable for
which I want to estimate the mean and variance.

I created a simulated data set (the real one is more complex) with 2
psus. In the first trial, each psu had 50 observations. psu 1 had a
weight of 1 and a 50 percent chance of a 1. psu 2 had a weight of 5
and a 20 percent chance of a 1. I get a sensible mean of .25 and a
standard error of .0833333.

In the second trial, I also had two psus. Psu 1 had 900 observations
and psu 2 had 100 observations. psu 1 had a weight of 1 and a 50
percent chance of a 1. psu 2 had a weight of 5 and a 20 percent
chance of a 1. I get a sensible mean of .25 but the same standard
error of .0833333 as in case 1. This does not make sense to me. I
have more observations in case 2 so I think I should get a smaller
variance.
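
The identical standard errors can be reproduced by hand. The sketch below assumes that -svy: mean- applies the usual with-replacement linearized (ratio) estimator with no strata and no finite population correction; the symbols \hat{R}, \hat{W}, z_j and n (the number of psus) are notation introduced here, not taken from the original posts.

```latex
\begin{align*}
\hat{R} &= \frac{\sum_i w_i y_i}{\sum_i w_i} = \frac{1.5}{6} = 0.25,
\qquad \hat{W} = \sum_i w_i = 6 \\
z_j &= \frac{1}{\hat{W}} \sum_{i \in \text{psu } j} w_i \,(y_i - \hat{R}),
\qquad
\hat{V}(\hat{R}) = \frac{n}{n-1} \sum_{j=1}^{n} (z_j - \bar{z})^2 \\
z_1 &= \frac{0.5 - 0.25 \times 1}{6} = 0.0416667,
\qquad
z_2 = \frac{1.0 - 0.25 \times 5}{6} = -0.0416667,
\qquad \bar{z} = 0 \\
\hat{V}(\hat{R}) &= \frac{2}{1}\left(0.0416667^2 + 0.0416667^2\right) = 0.0069444,
\qquad
\sqrt{\hat{V}(\hat{R})} = 0.0833333
\end{align*}
```

Because obs_weight is the psu weight divided by the number of observations in the psu, the weighted totals within each psu (0.5 and 1 for psu 1, 1 and 5 for psu 2) are the same in both trials. The z_j, and hence the standard error, therefore depend only on the two psu totals and not on how many observations each psu contains.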

I imagine I am not using the correct design. Can anyone help? Below,
I show the computer code for my simulation (fake data set) but you
don't need to read this if you understand the comments above. Thanks
so much.

#delimit ;
****************************************************************
* create the simulated data
***********************************************************;
set obs 100;
****************************************************************
* generate psu
***********************************************************;
gen psuedo_psu=1 if _n<51;
replace psuedo_psu=2 if _n>=51;
****************************************************************
* generate chicago_tax_paid
***********************************************************;
gen psuedo_chicago_tax_paid=1 if _n<=25;
replace psuedo_chicago_tax_paid=0 if _n>25 & _n<=50;
replace psuedo_chicago_tax_paid=1 if _n>50 & _n<61;
replace psuedo_chicago_tax_paid=0 if _n>=61;
****************************************************************
* generate psu weights
***********************************************************;
gen sample_weight=1 if psuedo_psu==1;
replace sample_weight=5 if psuedo_psu==2;
summarize;
****************************************************************
* generate OBSERVATION weights
***********************************************************;
sort psuedo_psu;
by psuedo_psu: gen obs_weight= sample_weight/_N;
summarize;
svyset psuedo_psu [pweight=obs_weight];
**********************************************************
* psu1 has a mean of .5 and a weight of 1
* psu2 has a mean of .2 and a weight of 5
* (5*.2)+(1*.5)=1.5
* 1.5/6=.25
*
* so the mean estimate makes sense to me
*******************************************************;
svy : mean psuedo_chicago_tax_paid;
mean psuedo_chicago_tax_paid;
*********************************************************
* do a second round with unequal size groups
*****************************************************;
clear;
#delimit ;
****************************************************************
* create the simulated data
***********************************************************;
set obs 1000;
****************************************************************
* generate psu
***********************************************************;
gen psuedo_psu=1 if _n<901;
replace psuedo_psu=2 if _n>=901;
****************************************************************
* generate chicago_tax_paid
***********************************************************;
gen psuedo_chicago_tax_paid=1 if _n<=450;
replace psuedo_chicago_tax_paid=0 if _n>450 & _n<=900;
replace psuedo_chicago_tax_paid=1 if _n>900 & _n<921;
replace psuedo_chicago_tax_paid=0 if _n>=921;
****************************************************************
* generate PSU weights
***********************************************************;
gen sample_weight=1 if psuedo_psu==1;
replace sample_weight=5 if psuedo_psu==2;
****************************************************************
* generate OBSERVATION weights
***********************************************************;
sort psuedo_psu;
by psuedo_psu: gen obs_weight= sample_weight/_N;
summarize;
svyset psuedo_psu [pweight=obs_weight];
**********************************************************
* psu1 has a mean of .5 and a weight of 1
* psu2 has a mean of .2 and a weight of 5
*
* I get the same answer for the mean in case 1 and case 2
* which I think is correct but
* I also get the same answer for the variance which I think is not correct
*
* I think I should have a lower variance in case 2
*******************************************************;
svy : mean psuedo_chicago_tax_paid;
mean psuedo_chicago_tax_paid;

--
David Merriman
[email protected]
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
Steven Samuels

[email protected]
18 Cantine's Island
Saugerties, NY 12477
Phone: 845-246-0774
EFax: 208-498-7441