Why doesn’t summarize accept pweights? What
does summarize calculate when you use aweights?
|
Title
|
|
Probability weights, analytic weights, and summary statistics
|
|
Author
|
William Sribney, StataCorp
|
|
Date
|
March 1997; updated July 2009; minor revisions July 2011
|
Question
My data come with probability weights (the inverse of the probability of
an observation being selected into the sample). I am trying to compute
various summary statistics, including the mean, standard deviation, and
various percentiles of the data. Is there any way to compute the mean,
standard deviation, and percentiles of a variable with probability weights?
Short answer
It is important to distinguish among
- an estimate of the population mean (mu),
- an estimate of the population standard deviation (sigma), and
- the standard error of the estimate of the population mean.
The command
svy: mean
provides an estimate of the population mean and an estimate of its
standard error. When computing the standard error, consider the effect
of clustering and stratification as well as the effect of sampling weights.
An estimate of the population standard deviation can be obtained from
estat sd after svy: mean. An estimate of the population
standard deviation (sigma) is given by
estimate of sigma = sqrt( n * V_srs )
where V_srs is an estimate of the variance of the mean assuming a
simple random sample (of the same size as your survey dataset) and n
is the number of observations. svy: mean computes the above variance
and saves it in a matrix called e(V_srs). svy: mean also
saves the number of observations in e(N). estat sd uses these
values to estimate the population standard deviation.
Long answer
For survey sampling data (i.e., for data that are not from a simple random
sample), one has to go back to the basics and carefully think about the
terms “mean” and “standard deviation”.
Let me describe the simple case of estimates for the mean and variance for a
simple random sample. I want to go over it because
we are so used to it that we forget how nicely everything works out.
Simple random samples
Consider a variable X with population mean mu and population
variance sigma2.
For a simple random sample:
- An estimate of the population mean (mu) is the sample mean (xbar).
- An estimate of the population standard deviation (sigma) is the
sample standard deviation (s).
- An estimate of the standard error of xbar is s/sqrt(n).
The relationship between (2) and (3) is so simple that we often don’t
distinguish between them properly. Indeed, (2) and (3) are estimates for
two different things.
The summarize
command doesn’t provide (3); it gives only (2). One must use
ci or
mean to get (3).
Weighted/clustered/stratified survey sample
When we say we want “the mean and standard deviation of a
variable with probability weights”, what we most likely want is an estimate
of the population mean and the standard error of this estimator for the
population mean. We probably don’t care about an estimate of the
standard deviation of the population.
The svy: mean command provides
- muhat: an estimate of the population mean (mu)
- V_db: an estimate of the variance of muhat accounting for
the survey design used to collect the data
- V_srs: an estimate of the variance of muhat assuming
a simple random sample of the same number of observations
muhat is the obvious weighted sample mean, and V_db is pretty
complicated; see [SVY] variance estimation for details.
Should we also want an estimate of the population standard deviation, we can
work backward using the formula that produced V_srs. Recall that
V_srs = s2/n
thus the estimator for sigma that estat sd reports is
sigma = sqrt(n * V_srs)
Clustering and stratification
The svy: mean command handles clustering and stratification.
Sampling weights, clustering, and stratification can all have a big effect
on the standard error of muhat. Thus, if you want to get the right standard
error of the “mean” (i.e., muhat), you must
consider clustering and stratification as well as sampling weights. See
[SVY] variance estimation for more details.
The clustering and stratification do not affect the point estimate of the
mean, and thus if you are interested only in the point estimate, you could
use summarize with aweights since it gives the same weighted
mean as svy: mean.
For quantiles, summarize with aweights and pctile with
aweights or pweights all give the same answers.
The summarize command
It was intentional that summarize does not allow pweights.
summarize’s purpose, as I see it, is to provide descriptive
statistics for the sample, not to provide inferential statistics for
the population. By this criterion, I argue that pweights
do not belong here since pweights are used to provide estimates
of the population parameter mu.
Furthermore, the “standard deviation” that summarize
displays presents a quandary for pweights. What should we put here?
An estimate of the population sigma? The estimate of population
sigma is only indirectly related to the standard error of the mean.
I fear that displaying anything here would lead to misinterpretation.
Thus I believe it is better that summarize not allow
pweights so that people are properly steered to the inferential
commands svy: mean and estat sd.
summarize with aweights
The formula used by summarize with aweights for what it labels
“Std. Dev.” is the correct formula for estimating the
population standard deviation with pweighted data.
The problem is this formula does not give the population
standard deviation for aweights.
First, let me show that summarize with aweights gives the same
result as estat sd or manually calculating the standard deviation
using n and V_srs.
. webuse nhanes2
. svy: mean loglead
(running mean on estimation sample)
Survey: Mean estimation
Number of strata = 31 Number of obs = 4948
Number of PSUs = 62 Population size = 56405414
Design df = 31
--------------------------------------------------------------
| Linearized
| Mean Std. Err. [95% Conf. Interval]
-------------+------------------------------------------------
loglead | 2.578102 .0196583 2.538008 2.618195
--------------------------------------------------------------
. estat sd
-------------------------------------
| Mean Std. Dev.
-------------+-----------------------
loglead | 2.578102 .4166916
-------------------------------------
. di sqrt(e(N) * el(e(V_srs),1,1))
.41669164
. summarize loglead [aw=finalwgt]
Variable | Obs Weight Mean Std. Dev. Min Max
-------------+-----------------------------------------------------------------
loglead | 4948 56405414 2.578102 .4166916 .6931472 4.382027
Formula for s2 used by summarize with aweights
summarize with aweights displays s for the “Std.
Dev.”, where s is calculated according to the formula:
s2 = (1/(n - 1)) sum w*i (xi - xbar) 2
where xi (i = 1, 2, ..., n) are the
data, w*i are "normalized" weights, and xbar is the
weighted mean. That is,
w*i = n wi / (sum wi)
and
xbar = (sum wi xi) / (sum wi)
where wi are the raw weights.
Thus writing the formula for s2 in terms of the raw
weights gives
s2 = {n/[W(n - 1)]} sum wi (xi - xbar)2
where W is the sum of the raw weights.
Paradigm for aweights
The paradigm for aweights is that the data x1, ...,
xn are considered as realizations of random variables
X1, ..., Xn where
Xi ∼ N(mui, sigma2/wi)
where wi are assumed to be known, fixed constants.
If, for example, Xi are group means of
wi random variables each distributed as X ∼ N(mu,
sigma2), then Xi ∼ N(mu,
sigma2/wi).
With this set up, we can compute expectations:
E[ (xi - xbar)2 ] = (mui - mu)2 + sigma2 (1/wi - 1/W)
where mu = (1/W) sum wi mui.
Thus
E[ sum wi (xi - xbar)2 ] = (n - 1) sigma2 + sum wi(mui - mu)2
We are in trouble trying to come up with an estimator for
sigma2. In this model, we can see variation due to
sigma2 or variation due to mui varying
about mu. There is no way to distinguish between these two variance
components.
For example, we could have data:
. list
+--------------------+
| x weight |
|--------------------|
1. | -.1042242 10 |
2. | .0131263 10 |
3. | -.0446007 10 |
4. | -.2504879 10 |
5. | .2510872 10 |
|--------------------|
6. | -.6012362 10 |
7. | .4534686 10 |
8. | -.4625476 10 |
9. | -.2094607 10 |
10. | .266293 10 |
+--------------------+
where each observation could be the mean of 10 N(0,1) variables (which is
how I generated it), or the first x could be the mean of 10
N(−.1042242, 0) variables, the second x could be the mean of 10
N(.0131263, 0) variables, etc. Alternatively, the first x could be
the mean of 10 N(−0.1, 0.5) variables, etc. ...
Thus, in general, there is no way to estimate sigma for a model with
general mui.
For the special case mui = mu for all i, we can estimate
sigma. Here,
E[ sum wi (xi - xbar)2 ] = (n - 1) sigma2
So
u2 = (1/(n - 1)) sum wi (xi - xbar)2
is an unbiased estimator for sigma2. This formula uses
the raw weights. If the scale of the wi changes, the
estimate of sigma2 changes.
This finding makes intuitive sense. For the above data (assuming
mui = mu), the estimate of sigma is 1.047 (the
actual value is 1). If I had data with the same x’s but with
weights = 100
. list
+--------------------+
| x weight |
|--------------------|
1. | -.1042242 100 |
2. | .0131263 100 |
3. | -.0446007 100 |
4. | -.2504879 100 |
5. | .2510872 100 |
|--------------------|
6. | -.6012362 100 |
7. | .4534686 100 |
8. | -.4625476 100 |
9. | -.2094607 100 |
10. | .266293 100 |
+--------------------+
then the estimate of sigma is 3.312. This makes sense because as the
sizes of the groups get larger, we expect that the group means (x)
get closer to mu. Thus, if the spread of the group means stays the
same as weight increases, then sigma must be increasing.
So we have found a problem with Stata’s aweight paradigm.
Stata assumes that with aweights, the scale of the weights does not
matter. This is not true for the estimate of sigma.
John Gleason (1997) wrote an excellent article that shows
the estimate of rho also depends on the scale of the
weights—contrary to the way Stata’s loneway was doing it
prior to Stata 6 with normalized weights.
Logic of summarize’s formula
Now there was a logic behind the use of summarize’s formula for
aweights:
s2 = {n/[W(n - 1)]} sum wi (xi - xbar)2
For the case mui = mu, the variance of xbar = (sum
wi xi)/(sum wi) is
sigma2/W. This can be estimated by
u2/W, where u2 is the estimate of
sigma2 we had before.
This is the logic behind summarize’s formula for s2
for aweights. If you unknowingly stick this s2
into the unweighted formula for the variance of the mean estimator, you will
get the right answer.
pweights and the estimate of sigma
For pweights, the formula
s2 = {n/[W(n - 1)]} sum wi (xi - xbar)2
gives an unbiased estimator for sigma2. This is really a
coincidence—when this formula was implemented, no one was thinking
about pweights.
However, it is not too surprising that this formula is correct for
pweights, because the formula IS invariant to the scale of the
weights, as the formula for pweights must be.
It is easy to see why the scale of the pweights does not matter
for the estimation of sigma: say that X ∼ N(0,1) in a
population of 100 persons. I sample 10 persons and get the following data:
. list
+--------------------+
| x weight |
|--------------------|
1. | -1.09397 10 |
2. | .3670809 10 |
3. | .145398 10 |
4. | .2657781 10 |
5. | .4794085 10 |
|--------------------|
6. | -1.233643 10 |
7. | .3014338 10 |
8. | -1.545905 10 |
9. | .1389086 10 |
10. | 1.133268 10 |
+--------------------+
The weight is 10 because one person in the sample represents 10 in the
population. Since this is just a simple random sample, we can compute
sigma in the standard way. If we do so, we get 0.872 as an estimate
of sigma.
Suppose now that X ∼ N(0,1) in a population of 1,000 persons. I
sample 10 persons and get the following data:
. list
+--------------------+
| x weight |
|--------------------|
1. | -1.09397 100 |
2. | .3670809 100 |
3. | .145398 100 |
4. | .2657781 100 |
5. | .4794085 100 |
|--------------------|
6. | -1.233643 100 |
7. | .3014338 100 |
8. | -1.545905 100 |
9. | .1389086 100 |
10. | 1.133268 100 |
+--------------------+
The weight is 100 since one person in the sample represents 100 in the
population. Obviously, the estimate of sigma is unchanged;
it’s still 0.872.
The same scale invariance applies when persons are sampled with unequal
weights.
The formal proof that
s2 = {n/[W(n - 1)]} sum wi (xi - xbar)2
gives an unbiased estimator for sigma2 is tricky. The set
of weights is not fixed; different samples can lead to different sets of
weights. Thus one has to be careful when proving E(s2) =
sigma2. One starts with the finite population definition of
sigma:
Var(X) = sigma2 = (1/M) sum over population (Xi - Xbar)2
where Xi now denotes everyone in the population and
i = 1, 2, ..., M (the total number of persons in the
population). We then use results like sum over sample
wi*xi2 as an unbiased estimator for
sum over population Xi2. M is generally
unknown; we are also estimating it.
Note:
Do not jump to the conclusion that the scale of pweights
never matters. It does matter for a few computations. E.g.,
the scale of the weights affects the computation of finite population
corrections and the estimation of totals (of course!).
Reference
-
Gleason, J. 1997.
- sg65: Computing intraclass correlations and large ANOVAs.
Stata Technical Bulletin
35: 25–31. Reprinted in
Stata Technical Bulletin Reprints, vol. 6, pp. 167–176.
|