Do the svy commands handle zero weights differently than non-svy commands do?
Title: The svy command’s handling of zero weights
Author: Bill Sribney, StataCorp
Date: April 1998; updated April 2005; minor revisions July 2009
Yes, the svy
commands treat zero weights differently than do non-svy commands that allow
pweights. The svy commands are the ones that dot all the i’s
and cross all the t’s—meaning they get all the details
right for complex survey data. Although one can use the non-svy commands
with survey data and get essentially correct results in almost all cases, it
is better to use the svy commands if you have data from a complex survey
design.
Non-svy commands ignore any observations with zero weights, so the number of
observations they report is smaller than the number the svy commands report.
Here’s an example in which two observations have zero weights (logit reports
68 observations, whereas svy: logit reports 70):
. webuse nhanes2d
. keep in 1/70
(10281 observations deleted)
. replace finalwgt = 0 in 1/2
(2 real changes made)
. logit highbp height weight [pw=finalwgt], nolog
Logistic regression                               Number of obs   =         68
                                                  Wald chi2(2)    =      10.11
                                                  Prob > chi2     =     0.0064
Log pseudolikelihood = -229472.34                 Pseudo R2       =     0.1735
------------------------------------------------------------------------------
             |               Robust
      highbp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |   -.119504   .0428143    -2.79   0.005    -.2034184   -.0355896
      weight |   .0622002    .023644     2.63   0.009     .0158588    .1085416
       _cons |   12.86556   6.303096     2.04   0.041      .511714     25.2194
------------------------------------------------------------------------------
. svyset [pw=finalwgt]
      pweight: finalwgt
          VCE: linearized
  Single unit: missing
     Strata 1: <one>
         SU 1: <observations>
        FPC 1: <zero>
. svy: logit highbp height weight
(running logit on estimation sample)
Survey: Logistic regression
Number of strata   =         1                  Number of obs      =        70
Number of PSUs     =        70                  Population size    =    811930
                                                Design df          =        69
                                                F(   2,     68)    =      4.99
                                                Prob > F           =    0.0095
------------------------------------------------------------------------------
             |             Linearized
      highbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |   -.119504   .0428051    -2.79   0.007    -.2048979   -.0341101
      weight |   .0622002    .023639     2.63   0.010     .0150417    .1093587
       _cons |   12.86556   6.301753     2.04   0.045     .2939029    25.43721
------------------------------------------------------------------------------
First, note that the point estimates are exactly the same. This is always
true: only the elements with nonzero weights are used to compute the point
estimates.
Zero weights affect only the variance computation. In the above example,
the standard errors differ in the fifth decimal place (for height, .0428143
versus .0428051).
So how are zero weights handled?
Actually, nothing special is done with them; they are treated just like
nonzero weights. If you look at the formulas in [SVY] variance
estimation, you see the variance formula involves the sum:
(Sum over clusters i) (z_i − zbar)^2
where i indexes clusters (PSUs) and
z_i = (Sum over elements in the i-th cluster) weight*something
and zbar is the mean of the z_i. (I’m assuming
there is only one stratum.)
If all weights in a cluster are zero, then z_i is zero.
Thus there is a term (0 − zbar)^2 in the sum in the variance
formula. Clearly, this result is different from the result one would get if
one ignored observations with zero weights.
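To make the arithmetic concrete, here is a minimal sketch with three made-up cluster totals, 4, 6, and 0, where the 0 stands for a cluster whose weights are all zero. The numbers are purely illustrative and do not come from the example data. Ignoring the zero cluster gives a mean of 5 and a sum of squared deviations of 2; keeping it gives a mean of 10/3 and a sum of 56/3 (about 18.67).

```stata
* Illustrative cluster totals (assumed values, not from nhanes2d): 4, 6, 0

* Zero-weight cluster ignored: mean is (4 + 6)/2 = 5
display (4 - 5)^2 + (6 - 5)^2

* Zero-weight cluster kept, as the svy commands do: mean is (4 + 6 + 0)/3 = 10/3
display (4 - 10/3)^2 + (6 - 10/3)^2 + (0 - 10/3)^2
```

The extra term contributed by the all-zero cluster, and the shift in the mean it causes, are what separate the two sums.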
Hence, the rule is “Zero weights give different results in the svy
commands from the non-svy commands when all the weights in one or more
clusters are zero.”
In the example above, “clusters” are observations, so the above
rule implies there will be a difference in this case.
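One way to probe the rule is to rerun the example with real clusters, none of which consists entirely of zero weights. The sketch below groups the same 70 observations into 7 clusters of 10 using a made-up identifier, mycluster, which is an assumption for illustration and not a variable in the dataset. Observations 1 and 2 both fall in the first cluster, which still contains 8 nonzero weights, so under the rule the two sets of standard errors should agree. The p-values and confidence intervals can still differ, because the svy command uses a t distribution with the design degrees of freedom.

```stata
* Same setup as the example above
webuse nhanes2d, clear
keep in 1/70
replace finalwgt = 0 in 1/2

* Hypothetical cluster id: 7 clusters of 10 observations each
generate mycluster = ceil(_n/10)

* Non-svy command: pweights with cluster-robust standard errors
logit highbp height weight [pw=finalwgt], vce(cluster mycluster) nolog

* svy command: the same clusters declared as PSUs, one stratum
svyset mycluster [pw=finalwgt]
svy: logit highbp height weight
```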
How are zero weights interpreted?
Theoretically, zero sampling weights should not be possible. Sampling
weights are supposed to be the inverse of the probability of being sampled,
so, if this is the case, they cannot be zero. But often weights are adjusted
through various procedures, and they can be set to zero or even a negative
value. (Aside: only svyset with iweights will handle negative weights; all
other commands will exit with an error mentioning negative weights.)
Zero weights can also be created when one is modeling a subpopulation. For
instance, suppose you have males and females in your sample, and you want to
model only the males. You can do this by setting all the weights for
females to zero. It would be incorrect to model males (gender==1) by
doing
. svy: logit y x ... if gender==1
When you do the above, you are ignoring the variation due to sampling
different numbers of males. That is, if you redid the sampling, you would
get different numbers of males each time.
To model males properly, you can set the weights of females to zero. With
the svy commands, however, this is unnecessary: you can simply use the
subpop() option, which effectively does the same thing by making the weights
zero for everyone not in the subpopulation.
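As a rough sketch of the two equivalent approaches, the do-file fragment below models males in the full nhanes2d data, first by zeroing the weights by hand and then with subpop(). It assumes the dataset's sex variable is coded 1 for males; the names male and wgt_male are introduced here only for illustration. The design is kept as simple as in the example above (weights only, no strata or PSUs).

```stata
webuse nhanes2d, clear

* Indicator for the subpopulation (assumes sex == 1 identifies males)
generate male = (sex == 1)

* Approach 1: set the weights to zero for everyone outside the subpopulation
generate wgt_male = finalwgt * male
svyset [pw=wgt_male]
svy: logit highbp height weight

* Approach 2: keep the original weights and let subpop() do the work
svyset [pw=finalwgt]
svy, subpop(male): logit highbp height weight
```

The coefficients and standard errors from the two svy: logit calls should match; the second form has the advantage of leaving the original weight variable untouched.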
In the earlier example with two zero weights, we can use the subpop() option
and get the same results. We will create a subpopulation indicator variable
called sub that is 1 when the weights are nonzero and 0 when they are zero:
. generate sub = (finalwgt != 0)
. svy, subpop(sub): logit highbp height weight
(running logit on estimation sample)
Survey: Logistic regression
Number of strata   =         1                  Number of obs      =        70
Number of PSUs     =        70                  Population size    =    811930
                                                Subpop. no. of obs =        68
                                                Subpop. size       =    811930
                                                Design df          =        69
                                                F(   2,     68)    =      4.99
                                                Prob > F           =    0.0095
------------------------------------------------------------------------------
             |             Linearized
      highbp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |   -.119504   .0428051    -2.79   0.007    -.2048979   -.0341101
      weight |   .0622002    .023639     2.63   0.010     .0150417    .1093587
       _cons |   12.86556   6.301753     2.04   0.045     .2939029    25.43721
------------------------------------------------------------------------------
The interpretation of zero weights is that the svy commands pick up the
component of variance that arises because the number of elements with zero
versus nonzero weights would itself vary from sample to sample.
Hence, when there are only a few zero weights, the difference in standard
errors will be very, very small—as it is in this example. Only when
there are substantial numbers of zero weights will the standard errors
differ appreciably.