
Why doesn’t summarize accept pweights?

What does summarize calculate when you use aweights?

Title   Probability weights, analytic weights, and summary statistics
Author William Sribney, StataCorp

Question

My data come with probability weights (the inverse of the probability of an observation being selected into the sample). I am trying to compute various summary statistics, including the mean, standard deviation, and various percentiles of the data. Is there any way to compute the mean, standard deviation, and percentiles of a variable with probability weights?

Short answer

It is important to distinguish among

  • an estimate of the population mean (mu),
  • an estimate of the population standard deviation (sigma), and
  • the standard error of the estimate of the population mean.

The command svy: mean provides an estimate of the population mean and an estimate of its standard error. When computing the standard error, consider the effect of clustering and stratification as well as the effect of sampling weights.

An estimate of the population standard deviation (sigma) can be obtained from estat sd after svy: mean. It is given by

    estimate of sigma = sqrt( n * V_srs )

where V_srs is an estimate of the variance of the mean assuming a simple random sample (of the same size as your survey dataset) and n is the number of observations. svy: mean computes the above variance and saves it in a matrix called e(V_srs). svy: mean also saves the number of observations in e(N). estat sd uses these values to estimate the population standard deviation.
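If your own data have not yet been declared as survey data, the setup looks something like the following sketch, where psuvar, stratavar, pw, and x are placeholder names for your PSU, stratum, probability-weight, and analysis variables:

. * declare the survey design once, then use svy: prefixed commands
. svyset psuvar [pweight=pw], strata(stratavar)

. * estimate of the population mean and its design-based standard error
. svy: mean x

. * estimate of the population standard deviation
. estat sd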

Long answer

For survey sampling data (i.e., for data that are not from a simple random sample), one has to go back to the basics and carefully think about the terms “mean” and “standard deviation”.

Let me describe the simple case of estimates for the mean and variance for a simple random sample. I want to go over it because we are so used to it that we forget how nicely everything works out.

Simple random samples

Consider a variable X with population mean mu and population variance sigma^2.

For a simple random sample:

  1. An estimate of the population mean (mu) is the sample mean (xbar).
  2. An estimate of the population standard deviation (sigma) is the sample standard deviation (s).
  3. An estimate of the standard error of xbar is s/sqrt(n).

The relationship between (2) and (3) is so simple that we often don’t distinguish between them properly. Indeed, (2) and (3) are estimates for two different things.

The summarize command doesn’t provide (3); it gives only (1) and (2). One must use ci or mean to get (3).
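Here is a minimal sketch of that distinction using the auto dataset that ships with Stata (nothing about it is specific to survey data):

. sysuse auto, clear

. * summarize reports the sample mean (1) and sample standard deviation (2)
. summarize mpg

. * the standard error of the mean (3) is s/sqrt(n), computed here from summarize's results
. display r(sd)/sqrt(r(N))

. * mean reports the sample mean together with this standard error
. mean mpg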

Weighted/clustered/stratified survey sample

When we say we want “the mean and standard deviation of a variable with probability weights”, what we most likely want is an estimate of the population mean and the standard error of this estimator for the population mean. We probably don’t care about an estimate of the standard deviation of the population.

The svy: mean command provides

  1. muhat: an estimate of the population mean (mu)
  2. V_db: an estimate of the variance of muhat accounting for the survey design used to collect the data
  3. V_srs: an estimate of the variance of muhat assuming a simple random sample of the same number of observations

muhat is the obvious weighted sample mean, and V_db is pretty complicated; see [SVY] variance estimation for details.

Should we also want an estimate of the population standard deviation, we can work backward using the formula that produced V_srs. Recall that

    V_srs = s^2/n

thus the estimator for sigma that estat sd reports is

    sigma = sqrt(n * V_srs)

Clustering and stratification

The svy: mean command handles clustering and stratification. Sampling weights, clustering, and stratification can all have a big effect on the standard error of muhat. Thus, if you want to get the right standard error of the “mean” (i.e., muhat), you must consider clustering and stratification as well as sampling weights. See [SVY] variance estimation for more details.

The clustering and stratification do not affect the point estimate of the mean, and thus if you are interested only in the point estimate, you could use summarize with aweights since it gives the same weighted mean as svy: mean.

For quantiles, summarize with aweights and pctile with aweights or pweights all give the same answers.
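For example, using the nhanes2 data from the example below (q_aw and q_pw are names I made up for the generated quantile variables):

. webuse nhanes2, clear

. * weighted percentiles from summarize, detail
. summarize loglead [aweight=finalwgt], detail

. * quartiles from pctile with aweights and with pweights
. pctile q_aw = loglead [aweight=finalwgt], nq(4)
. pctile q_pw = loglead [pweight=finalwgt], nq(4)

. * the two sets of quartiles are identical
. list q_aw q_pw in 1/3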

The summarize command

It was intentional that summarize does not allow pweights. summarize’s purpose, as I see it, is to provide descriptive statistics for the sample, not to provide inferential statistics for the population. By this criterion, I argue that pweights do not belong here since pweights are used to provide estimates of the population parameter mu.

Furthermore, the “standard deviation” that summarize displays presents a quandary for pweights. What should we put here? An estimate of the population sigma? The estimate of population sigma is only indirectly related to the standard error of the mean. I fear that displaying anything here would lead to misinterpretation.

Thus I believe it is better that summarize not allow pweights so that people are properly steered to the inferential commands svy: mean and estat sd.

summarize with aweights

The formula used by summarize with aweights for what it labels “Std. Dev.” is the correct formula for estimating the population standard deviation with pweighted data.

The problem is that this formula does not give the population standard deviation under the aweight paradigm.

First, let me show that summarize with aweights gives the same result as estat sd or manually calculating the standard deviation using n and V_srs.

. webuse nhanes2

. svy: mean loglead
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 31              Number of obs   =      4,948
Number of PSUs   = 62              Population size = 56,405,414
                                   Design df       =         31

----------------------------------------------------------------
             |             Linearized
             |       Mean   std. err.     [95% conf. interval]
-------------+--------------------------------------------------
     loglead |   2.578102   .0196583      2.538008    2.618195
----------------------------------------------------------------

. estat sd

---------------------------------------
             |       Mean   Std. dev.
-------------+-------------------------
     loglead |   2.578102    .4166916
---------------------------------------

. display sqrt(e(N) * el(e(V_srs),1,1))
.41669164

. summarize loglead [aweight=finalwgt]

    Variable |     Obs      Weight        Mean   Std. dev.       Min        Max
-------------+------------------------------------------------------------------
     loglead |   4,948    56405414    2.578102   .4166916   .6931472   4.382027

Formula for s^2 used by summarize with aweights

summarize with aweights displays s for the “Std. Dev.”, where s is calculated according to the formula:

    s^2 = (1/(n - 1)) sum w*_i (x_i - xbar)^2

where x_i (i = 1, 2, ..., n) are the data, w*_i are "normalized" weights, and xbar is the weighted mean. That is,

    w*_i = n w_i / (sum w_i)

and

    xbar = (sum w_i x_i) / (sum w_i)

where w_i are the raw weights.

Thus writing the formula for s^2 in terms of the raw weights gives

    s^2 = {n/[W(n - 1)]} sum w_i (x_i - xbar)^2

where W is the sum of the raw weights.
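As a check, this raw-weight form reproduces the Std. Dev. reported by summarize in the nhanes2 example above (a sketch; dev2 and the locals n and W are names I introduce):

. * weighted mean, number of observations, and sum of the raw weights
. quietly summarize loglead [aweight=finalwgt]
. local n = r(N)
. local W = r(sum_w)

. * sum of w_i*(x_i - xbar)^2, using the weighted mean still stored in r(mean)
. generate double dev2 = finalwgt*(loglead - r(mean))^2
. quietly summarize dev2

. * s = sqrt( n/(W*(n-1)) * sum w_i*(x_i - xbar)^2 ); matches the .4166916 above
. display sqrt( `n'/(`W'*(`n'-1)) * r(sum) )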

Paradigm for aweights

The paradigm for aweights is that the data x_1, ..., x_n are considered as realizations of random variables X_1, ..., X_n where

    X_i ∼ N(mu_i, sigma^2/w_i)

where the w_i are assumed to be known, fixed constants.

If, for example, X_i is the mean of w_i random variables, each distributed as X ∼ N(mu, sigma^2), then X_i ∼ N(mu, sigma^2/w_i).
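For instance, data of this kind could be generated as follows (a sketch; the seed, the group size of 10, and the variable names z and group are my own choices):

. * 100 draws from N(0,1), split into 10 groups of 10
. clear
. set obs 100
. set seed 12345
. generate double z = rnormal(0,1)
. generate int group = ceil(_n/10)

. * keep one observation per group: the group mean and the group size as the aweight
. collapse (mean) x=z (count) weight=z, by(group)
. list x weight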

With this setup, we can compute expectations:

    E[ (x_i - xbar)^2 ] = (mu_i - mu)^2 + sigma^2 (1/w_i - 1/W)

where mu = (1/W) sum w_i mu_i.

Thus

    E[ sum w_i (x_i - xbar)^2 ] = (n - 1) sigma^2 + sum w_i (mu_i - mu)^2

We are in trouble trying to come up with an estimator for sigma^2. In this model, we can see variation due to sigma^2 or variation due to the mu_i varying about mu. There is no way to distinguish between these two variance components.

For example, we could have data:

. list

x weight
1. -.1042242 10
2. .0131263 10
3. -.0446007 10
4. -.2504879 10
5. .2510872 10
6. -.6012362 10
7. .4534686 10
8. -.4625476 10
9. -.2094607 10
10. .266293 10

where each observation could be the mean of 10 N(0,1) variables (which is how I generated it), or the first x could be the mean of 10 N(−.1042242, 0) variables, the second x could be the mean of 10 N(.0131263, 0) variables, etc. Alternatively, the first x could be the mean of 10 N(−0.1, 0.5) variables, etc. ...

Thus, in general, there is no way to estimate sigma for a model with general mu_i.

For the special case mu_i = mu for all i, we can estimate sigma. Here,

    E[ sum w_i (x_i - xbar)^2 ] = (n - 1) sigma^2

So

    u^2 = (1/(n - 1)) sum w_i (x_i - xbar)^2

is an unbiased estimator for sigma^2. This formula uses the raw weights. If the scale of the w_i changes, the estimate of sigma^2 changes.

This finding makes intuitive sense. For the above data (assuming mu_i = mu), the estimate of sigma is 1.047 (the actual value is 1). If I had data with the same x’s but with weights = 100

. list

x weight
1. -.1042242 100
2. .0131263 100
3. -.0446007 100
4. -.2504879 100
5. .2510872 100
6. -.6012362 100
7. .4534686 100
8. -.4625476 100
9. -.2094607 100
10. .266293 100

then the estimate of sigma is 3.312. This makes sense because as the sizes of the groups get larger, we expect that the group means (x) get closer to mu. Thus, if the spread of the group means stays the same as weight increases, then sigma must be increasing.
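Here is how the raw-weight estimator u can be computed for these listings (a sketch assuming the 10 observations are in memory as variables x and weight; dev2 is my own variable name):

. * u = sqrt( (1/(n-1)) * sum w_i*(x_i - xbar)^2 ), with raw weights
. quietly summarize x [aweight=weight]
. generate double dev2 = weight*(x - r(mean))^2
. quietly summarize dev2
. display sqrt( r(sum)/(10 - 1) )

With weights of 10, this displays roughly 1.047; with weights of 100, it displays roughly 3.312, as quoted above.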

So we have found a problem with Stata’s aweight paradigm. Stata assumes that with aweights, the scale of the weights does not matter. This is not true for the estimate of sigma.

John Gleason (1997) wrote an excellent article showing that the estimate of the intraclass correlation rho also depends on the scale of the weights.

Logic of summarize’s formula

Now there was a logic behind the use of summarize’s formula for aweights:

    s^2 = {n/[W(n - 1)]} sum w_i (x_i - xbar)^2

For the case mu_i = mu, the variance of xbar = (sum w_i x_i)/(sum w_i) is sigma^2/W. This can be estimated by u^2/W, where u^2 is the estimate of sigma^2 we had before.

This is the logic behind summarize’s formula for s^2 for aweights: because s^2/n = u^2/W, if you unknowingly stick this s^2 into the unweighted formula s^2/n for the variance of the mean estimator, you will get the right answer.

pweights and the estimate of sigma

For pweights, the formula

    s^2 = {n/[W(n - 1)]} sum w_i (x_i - xbar)^2

gives an unbiased estimator for sigma^2.

It is not too surprising that this formula is correct for pweights, because the formula IS invariant to the scale of the weights, as any formula for pweights must be.

It is easy to see why the scale of the pweights does not matter for the estimation of sigma: say that X ∼ N(0,1) in a population of 100 persons. I sample 10 persons and get the following data:

. list

x weight
1. -1.09397 10
2. .3670809 10
3. .145398 10
4. .2657781 10
5. .4794085 10
6. -1.233643 10
7. .3014338 10
8. -1.545905 10
9. .1389086 10
10. 1.133268 10

The weight is 10 because one person in the sample represents 10 in the population. Since this is just a simple random sample, we can estimate sigma in the standard way. If we do so, we get 0.872 as an estimate of sigma.

Suppose now that X ∼ N(0,1) in a population of 1,000 persons. I sample 10 persons and get the following data:

. list

x weight
1. -1.09397 100
2. .3670809 100
3. .145398 100
4. .2657781 100
5. .4794085 100
6. -1.233643 100
7. .3014338 100
8. -1.545905 100
9. .1389086 100
10. 1.133268 100

The weight is 100 since one person in the sample represents 100 in the population. Obviously, the estimate of sigma is unchanged; it’s still 0.872.

The same scale invariance applies when persons are sampled with unequal weights.
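Because summarize with aweights uses exactly this scale-invariant formula for its Std. Dev., it reports the same 0.872 no matter how the weights are rescaled (a sketch assuming the 10 observations just listed are in memory as variables x and weight):

. * Std. dev. is about .872
. summarize x [aweight=weight]

. * rescale the weights; the reported Std. dev. does not change
. replace weight = weight*10
. summarize x [aweight=weight]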

The formal proof that

    s^2 = {n/[W(n - 1)]} sum w_i (x_i - xbar)^2

gives an unbiased estimator for sigma^2 is tricky. The set of weights is not fixed; different samples can lead to different sets of weights. Thus one has to be careful when proving E(s^2) = sigma^2. One starts with the finite-population definition of sigma:

    Var(X) = sigma^2 = (1/M) sum over population (X_i - Xbar)^2

where X_i now denotes the value for each person in the population and i = 1, 2, ..., M (the total number of persons in the population). We then use results such as: sum over sample w_i x_i^2 is an unbiased estimator of sum over population X_i^2. M is generally unknown; we are also estimating it.

Note:   Do not jump to the conclusion that the scale of pweights never matters. It does matter for a few computations. For example, the scale of the weights affects the computation of finite-population corrections and the estimation of totals (of course!).
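For instance, the estimated population total scales directly with the weights (a sketch, again assuming variables x and weight in memory):

. * estimated population total of x with the original weights
. total x [pweight=weight]

. * multiplying the weights by 10 multiplies the estimated total by 10
. replace weight = weight*10
. total x [pweight=weight]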

Reference

Gleason, J. 1997. sg65: Computing intraclass correlations and large ANOVAs. Stata Technical Bulletin 35: 25–31. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 167–176.