Title | Probability weights, analytic weights, and summary statistics | |
Author | William Sribney, StataCorp |
My data come with probability weights (the inverse of the probability of an observation being selected into the sample). I am trying to compute various summary statistics, including the mean, standard deviation, and various percentiles of the data. Is there any way to compute the mean, standard deviation, and percentiles of a variable with probability weights?
It is important to distinguish among
The command svy: mean provides an estimate of the population mean and an estimate of its standard error. When computing the standard error, consider the effect of clustering and stratification as well as the effect of sampling weights.
An estimate of the population standard deviation can be obtained from estat sd after svy: mean. An estimate of the population standard deviation (sigma) is given by
estimate of sigma = sqrt( n * V_srs )
where V_srs is an estimate of the variance of the mean assuming a simple random sample (of the same size as your survey dataset) and n is the number of observations. svy: mean computes the above variance and saves it in a matrix called e(V_srs). svy: mean also saves the number of observations in e(N). estat sd uses these values to estimate the population standard deviation.
For survey sampling data (i.e., for data that are not from a simple random sample), one has to go back to the basics and carefully think about the terms “mean” and “standard deviation”.
Let me describe the simple case of estimates for the mean and variance for a simple random sample. I want to go over it because we are so used to it that we forget how nicely everything works out.
Consider a variable X with population mean mu and population variance sigma^{2}.
For a simple random sample:
The relationship between (2) and (3) is so simple that we often don’t distinguish between them properly. Indeed, (2) and (3) are estimates for two different things.
The summarize command doesn’t provide (3); it gives only (2). One must use ci or mean to get (3).
When we say we want “the mean and standard deviation of a variable with probability weights”, what we most likely want is an estimate of the population mean and the standard error of this estimator for the population mean. We probably don’t care about an estimate of the standard deviation of the population.
The svy: mean command provides
muhat is the obvious weighted sample mean, and V_db is pretty complicated; see [SVY] variance estimation for details.
Should we also want an estimate of the population standard deviation, we can work backward using the formula that produced V_srs. Recall that
V_srs = s^{2}/n
thus the estimator for sigma that estat sd reports is
sigma = sqrt(n * V_srs)
The svy: mean command handles clustering and stratification. Sampling weights, clustering, and stratification can all have a big effect on the standard error of muhat. Thus, if you want to get the right standard error of the “mean” (i.e., muhat), you must consider clustering and stratification as well as sampling weights. See [SVY] variance estimation for more details.
The clustering and stratification do not affect the point estimate of the mean, and thus if you are interested only in the point estimate, you could use summarize with aweights since it gives the same weighted mean as svy: mean.
For quantiles, summarize with aweights and pctile with aweights or pweights all give the same answers.
It was intentional that summarize does not allow pweights. summarize’s purpose, as I see it, is to provide descriptive statistics for the sample, not to provide inferential statistics for the population. By this criterion, I argue that pweights do not belong here since pweights are used to provide estimates of the population parameter mu.
Furthermore, the “standard deviation” that summarize displays presents a quandary for pweights. What should we put here? An estimate of the population sigma? The estimate of population sigma is only indirectly related to the standard error of the mean. I fear that displaying anything here would lead to misinterpretation.
Thus I believe it is better that summarize not allow pweights so that people are properly steered to the inferential commands svy: mean and estat sd.
The formula used by summarize with aweights for what it labels “Std. Dev.” is the correct formula for estimating the population standard deviation with pweighted data.
The problem is this formula does not give the population standard deviation for aweights.
First, let me show that summarize with aweights gives the same result as estat sd or manually calculating the standard deviation using n and V_srs.
. webuse nhanes2 . svy: mean loglead (running mean on estimation sample) Survey: Mean estimation Number of strata = 31 Number of obs = 4948 Number of PSUs = 62 Population size = 56405414 Design df = 31 -------------------------------------------------------------- | Linearized | Mean Std. Err. [95% Conf. Interval] -------------+------------------------------------------------ loglead | 2.578102 .0196583 2.538008 2.618195 -------------------------------------------------------------- . estat sd ------------------------------------- | Mean Std. Dev. -------------+----------------------- loglead | 2.578102 .4166916 ------------------------------------- . di sqrt(e(N) * el(e(V_srs),1,1)) .41669164 . summarize loglead [aw=finalwgt] Variable | Obs Weight Mean Std. Dev. Min Max -------------+----------------------------------------------------------------- loglead | 4948 56405414 2.578102 .4166916 .6931472 4.382027
summarize with aweights displays s for the “Std. Dev.”, where s is calculated according to the formula:
s^{2} = (1/(n - 1)) sum w*_{i} (x_{i} - xbar)^{2}
where x_{i} (i = 1, 2, ..., n) are the data, w*_{i} are "normalized" weights, and xbar is the weighted mean. That is,
w*_{i} = n w_{i} / (sum w_{i})
and
xbar = (sum (w_{i} x_{i})) / (sum w_{i})
where w_{i} are the raw weights.
Thus writing the formula for s^{2} in terms of the raw weights gives
s^{2} = {n/[W(n - 1)]} sum w_{i} (x_{i} - xbar)^{2}
where W is the sum of the raw weights.
The paradigm for aweights is that the data x_{1}, ..., x_{n} are considered as realizations of random variables X_{1}, ..., X_{n} where
X_{i} ∼ N(mu_{i}, sigma^{2}/w_{i})
where w_{i} are assumed to be known, fixed constants.
If, for example, X_{i} are group means of w_{i} random variables each distributed as X ∼ N(mu, sigma^{2}), then X_{i} ∼ N(mu, sigma^{2}/w_{i}).
With this set up, we can compute expectations:
E[ (x_{i} - xbar)^{2} ] = (mu_{i} - mu)^{2} + sigma^{2} (1/w_{i} - 1/W)
where mu = (1/W) sum w_{i} mu_{i}.
Thus
E[ sum w_{i} (x_{i} - xbar)^{2} ] = (n - 1) sigma^{2} + sum w_{i}(mu_{i} - mu)^{2}
We are in trouble trying to come up with an estimator for sigma^{2}. In this model, we can see variation due to sigma^{2} or variation due to mu_{i} varying about mu. There is no way to distinguish between these two variance components.
For example, we could have data:
. list +--------------------+ | x weight | |--------------------| 1. | -.1042242 10 | 2. | .0131263 10 | 3. | -.0446007 10 | 4. | -.2504879 10 | 5. | .2510872 10 | |--------------------| 6. | -.6012362 10 | 7. | .4534686 10 | 8. | -.4625476 10 | 9. | -.2094607 10 | 10. | .266293 10 | +--------------------+
where each observation could be the mean of 10 N(0,1) variables (which is how I generated it), or the first x could be the mean of 10 N(−.1042242, 0) variables, the second x could be the mean of 10 N(.0131263, 0) variables, etc. Alternatively, the first x could be the mean of 10 N(−0.1, 0.5) variables, etc. ...
Thus, in general, there is no way to estimate sigma for a model with general mu_{i}.
For the special case mu_{i} = mu for all i, we can estimate sigma. Here,
E[ sum w_{i} (x_{i} - xbar)^{2} ] = (n - 1) sigma^{2}
So
u^{2} = (1/(n - 1)) sum w_{i} (x_{i} - xbar)^{2}
is an unbiased estimator for sigma^{2}. This formula uses the raw weights. If the scale of the w_{i} changes, the estimate of sigma^{2} changes.
This finding makes intuitive sense. For the above data (assuming mu_{i} = mu), the estimate of sigma is 1.047 (the actual value is 1). If I had data with the same x’s but with weights = 100
. list +--------------------+ | x weight | |--------------------| 1. | -.1042242 100 | 2. | .0131263 100 | 3. | -.0446007 100 | 4. | -.2504879 100 | 5. | .2510872 100 | |--------------------| 6. | -.6012362 100 | 7. | .4534686 100 | 8. | -.4625476 100 | 9. | -.2094607 100 | 10. | .266293 100 | +--------------------+
then the estimate of sigma is 3.312. This makes sense because as the sizes of the groups get larger, we expect that the group means (x) get closer to mu. Thus, if the spread of the group means stays the same as weight increases, then sigma must be increasing.
So we have found a problem with Stata’s aweight paradigm. Stata assumes that with aweights, the scale of the weights does not matter. This is not true for the estimate of sigma.
John Gleason (1997) wrote an excellent article that shows the estimate of rho also depends on the scale of the weights.
Now there was a logic behind the use of summarize’s formula for aweights:
s^{2} = {n/[W(n - 1)]} sum w_{i} (x_{i} - xbar)^{2}
For the case mu_{i} = mu, the variance of xbar = (sum w_{i} x_{i})/(sum w_{i}) is sigma^{2}/W. This can be estimated by u^{2}/W, where u^{2} is the estimate of sigma^{2} we had before.
This is the logic behind summarize’s formula for s^{2} for aweights. If you unknowingly stick this s^{2} into the unweighted formula for the variance of the mean estimator, you will get the right answer.
For pweights, the formula
s^{2} = {n/[W(n - 1)]} sum w_{i} (x_{i} - xbar)^{2}
gives an unbiased estimator for sigma^{2}.
It is not too surprising that this formula is correct for pweights, because the formula IS invariant to the scale of the weights, as the formula for pweights must be.
It is easy to see why the scale of the pweights does not matter for the estimation of sigma: say that X ∼ N(0,1) in a population of 100 persons. I sample 10 persons and get the following data:
. list +--------------------+ | x weight | |--------------------| 1. | -1.09397 10 | 2. | .3670809 10 | 3. | .145398 10 | 4. | .2657781 10 | 5. | .4794085 10 | |--------------------| 6. | -1.233643 10 | 7. | .3014338 10 | 8. | -1.545905 10 | 9. | .1389086 10 | 10. | 1.133268 10 | +--------------------+
The weight is 10 because one person in the sample represents 10 in the population. Since this is just a simple random sample, we can compute sigma in the standard way. If we do so, we get 0.872 as an estimate of sigma.
Suppose now that X ∼ N(0,1) in a population of 1,000 persons. I sample 10 persons and get the following data:
. list +--------------------+ | x weight | |--------------------| 1. | -1.09397 100 | 2. | .3670809 100 | 3. | .145398 100 | 4. | .2657781 100 | 5. | .4794085 100 | |--------------------| 6. | -1.233643 100 | 7. | .3014338 100 | 8. | -1.545905 100 | 9. | .1389086 100 | 10. | 1.133268 100 | +--------------------+
The weight is 100 since one person in the sample represents 100 in the population. Obviously, the estimate of sigma is unchanged; it’s still 0.872.
The same scale invariance applies when persons are sampled with unequal weights.
The formal proof that
s^{2} = {n/[W(n - 1)]} sum w_{i} (x_{i} - xbar)^{2}
gives an unbiased estimator for sigma^{2} is tricky. The set of weights is not fixed; different samples can lead to different sets of weights. Thus one has to be careful when proving E(s^{2}) = sigma^{2}. One starts with the finite population definition of sigma:
Var(X) = sigma^{2} = (1/M) sum over population (X_{i} - Xbar)^{2}
where X_{i} now denotes everyone in the population and i = 1, 2, ..., M (the total number of persons in the population). We then use results like sum over sample w_{i}*x_{i}^{2} as an unbiased estimator for sum over population X_{i}^{2}. M is generally unknown; we are also estimating it.