Statalist



Re: st: variance when using svy: mean


From   "David Merriman" <dmerrim@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: variance when using svy: mean
Date   Mon, 3 Dec 2007 11:21:28 -0600

Thanks.  I do have a lot of trouble with the terminology.
I selected a weighted random sample of 100 geographic areas from 930
such areas.  The weights were designed so that my weighted random
sample would be representative of the population (e.g. I oversampled
areas with a high number of people).  I do not think I have any strata
(I did NOT for example oversample high poverty areas).  My psus are
the 100 geographic areas.

I am afraid I still do not know what to do next.  It sounds like you
are saying that the variance should not differ between my two cases
but this does not make intuitive sense to me.  Any help you can
provide would be appreciated.

On Dec 3, 2007 11:11 AM, Steven Joel Hirsch Samuels
<sjhsamuels@earthlink.net> wrote:
> David, it doesn't sound like your study is a probability sample; if
> not, you  don't need -svy- commands.  Instead, use non-survey
> commands and assign an -iweight-  or other weight variable to
> properly represent your population.
>
> If your data do arise from a probability sample, your 'areas' appear
> to be strata, not primary sampling units (psu's). Strata are units
> which partition a population. A psu is the highest stage unit
> selected by random numbers within a stratum.  Standard errors for
> survey data depend mainly on the number of psu's, not on the number
> of observations within them.
>
> -Steven
>
> On Dec 3, 2007, at 11:19 AM, David Merriman wrote:
>
> > Dear Statalisters:
> > I am a long-time Stata user but new to svy: commands and am quite
> > confused.
> > I apologize if this is long-winded; I am trying to say it as concisely
> > as possible.
> >
> > I have collected primary data in several geographic areas.  Each of
> > the geographic areas has a different weight so that my entire sample
> > should be representative of the population.  In each geographic area I
> > have collected a number of observations but the number of observations
> > in the area tells me nothing about the density of the activit
>
> > area.  I want to estimate the population mean (for all geographic
> > areas) and the variance of that estimate.  The problem is that while I
> > get sensible means the variances do not seem to be a function of the
> > number of observations I have.  Intuitively I think that the variance
> > ought to change (fall) as the number of observations increases.
> >
> > I tried using
> > svyset psuedo_psu [pweight=obs_weight]
> > svy:  mean psuedo_chicago_tax_paid
> >
> > where psuedo_psu is the variable indicating the primary sampling unit,
> > obs_weight is the psu weight (sample_weight in the code below) divided
> > by the number of observations in that psu, and psuedo_chicago_tax_paid
> > is the (zero-one) variable for which I want to estimate the mean and
> > variance.
> >
> > I created a simulated data set (the real one is more complex) with 2
> > psus.  In the first trial, each psu had 50 observations.  psu 1 had a
> > weight of 1 and a 50 percent chance of a 1.  psu 2 had a weight of 5
> > and a 20 percent chance of a 1.  I get a sensible mean of .25 and a
> > standard error of .0833333.
> >
> > In the second trial, I also had two psus.  Psu 1 has 900 observations
> > and psu 2 has 100 observations.  psu 1 had a weight of 1 and a 50
> > percent chance of a 1.  psu 2 had a weight of 5 and a 20 percent
> > chance of a 1.  I get a sensible mean of .25 but the same standard
> > error of .0833333 as in case 1.   This does not make sense to me.  I
> > have more observations in case 2 so I think I should get a smaller
> > variance.
> >
> > I imagine I am not using the correct design.  Can anyone help?  Below,
> > I show the computer code for my simulation (fake data set) but you
> > don't need to read this if you understand the comments above.  Thanks
> > so much.
> >
> >
> > #delimit ;
> > ****************************************************************
> > * created the simulated data
> > ***********************************************************;
> > set obs 100;
> > ****************************************************************
> > * generate psu
> > ***********************************************************;
> > gen psuedo_psu=1 if _n<51;
> > replace psuedo_psu=2 if _n>=51;
> > ****************************************************************
> > * generate chicago_tax_paid
> > ***********************************************************;
> > gen psuedo_chicago_tax_paid=1 if _n<=25;
> > replace psuedo_chicago_tax_paid=0 if _n>25 & _n<=50;
> > replace psuedo_chicago_tax_paid=1 if _n>50 & _n<61;
> > replace psuedo_chicago_tax_paid=0 if _n>=61;
> > ****************************************************************
> > * generate psu weights
> > ***********************************************************;
> > gen sample_weight=1 if psuedo_psu==1;
> > replace sample_weight=5 if psuedo_psu==2;
> > summarize;
> > ****************************************************************
> > * generate OBSERVATION weights
> > ***********************************************************;
> > sort psuedo_psu;
> > by psuedo_psu: gen obs_weight= sample_weight/_N;
> > summarize;
> > svyset psuedo_psu [pweight=obs_weight];
> > **********************************************************
> > * psu1 has a mean of .5 and a weight of 1
> > * psu2 has a mean of .2 and a weight of 5
> > * (5*.2)+(1*.5)=1.5
> > * 1.5/6=.25
> > *
> > * so the mean estimate makes sense to me
> > *******************************************************;
> > svy : mean psuedo_chicago_tax_paid;
> > mean psuedo_chicago_tax_paid;
> > *********************************************************
> > * do a second round with unequal size groups
> > *****************************************************;
> > clear;
> > #delimit ;
> > ****************************************************************
> > * created the simulated data
> > ***********************************************************;
> > set obs 1000;
> > ****************************************************************
> > * generate psu
> > ***********************************************************;
> > gen psuedo_psu=1 if _n<901;
> > replace psuedo_psu=2 if _n>=901;
> > ****************************************************************
> > * generate chicago_tax_paid
> > ***********************************************************;
> > gen psuedo_chicago_tax_paid=1 if _n<=450;
> > replace psuedo_chicago_tax_paid=0 if _n>450 & _n<=900;
> > replace psuedo_chicago_tax_paid=1 if _n>900 & _n<921;
> > replace psuedo_chicago_tax_paid=0 if _n>=921;
> > ****************************************************************
> > * generate PSU weights
> > ***********************************************************;
> > gen sample_weight=1 if psuedo_psu==1;
> > replace sample_weight=5 if psuedo_psu==2;
> > ****************************************************************
> > * generate OBSERVATION weights
> > ***********************************************************;
> > sort psuedo_psu;
> > by psuedo_psu: gen obs_weight= sample_weight/_N;
> > summarize;
> > svyset psuedo_psu [pweight=obs_weight];
> > **********************************************************
> > * psu1 has a mean of .5 and a weight of 1
> > * psu2 has a mean of .2 and a weight of 5
> > *
> > * I get the same answer for the mean in case 1 and case 2
> > * which I think is correct but
> > * I also get the same answer for the variance which I think is
> > * not correct
> > *
> > * I think I should have a lower variance in case 2
> > *******************************************************;
> > svy : mean psuedo_chicago_tax_paid;
> > mean psuedo_chicago_tax_paid;
> >
> >
> >  --
> > David Merriman
> > dmerrim@gmail.com
> > *
> > *   For searches and help try:
> > *   http://www.stata.com/support/faqs/res/findit.html
> > *   http://www.stata.com/support/statalist/faq
> > *   http://www.ats.ucla.edu/stat/stata/
>
> Steven  Samuels
>
> sjhsamuels@earthlink.net
> 18 Cantine's Island
> Saugerties, NY 12477
> Phone: 845-246-0774
> EFax: 208-498-7441
>
>



-- 
David Merriman
dmerrim@gmail.com
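
The identical standard errors in the two simulated trials can be checked by hand. The following is a minimal sketch, assuming -svy: mean- is using its default linearized (Taylor-series) variance estimator with a single stratum and no finite population correction, which is what the -svyset- line in the do-file implies. Under that assumption only the PSU-level totals of the weighted, centered values enter the variance. The scalar names ybar and W and the variable z are introduced here for illustration and do not appear in the original code.

```stata
* Run as a do-file after either trial's data step, so that
* psuedo_psu, psuedo_chicago_tax_paid and obs_weight are in memory.
#delimit cr
preserve

* weighted mean (.25 in both trials)
quietly summarize psuedo_chicago_tax_paid [aweight=obs_weight]
scalar ybar = r(mean)

* total weight (6 in both trials)
quietly summarize obs_weight
scalar W = r(sum)

* linearized score for the weighted mean, which is a ratio estimator
generate double z = obs_weight*(psuedo_chicago_tax_paid - ybar)/W

* only the PSU totals of the scores are used from here on
collapse (sum) z, by(psuedo_psu)
list

* with n PSUs in one stratum: V = n/(n-1) * sum_j (z_j - zbar)^2 = n * r(Var)
quietly summarize z
display "by-hand SE = " sqrt(r(N)*r(Var))

restore
```

In both trials the two PSU totals come out as +.0416667 and -.0416667, because obs_weight rescales each PSU to the same total weight (1 and 5) and the same within-PSU proportion (.5 and .2) regardless of how many observations it holds. The by-hand result is therefore sqrt(2 * 2 * .0416667^2) = .0833333 each time, matching the reported standard error. With a one-stage -svyset- like this, adding observations inside a PSU changes the estimate of that PSU's total but not the number of PSUs, which is what drives the variance.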


