FAQ: summarize and aweights and pweights

Home / Resources & support / FAQs / summarize and aweights and pweights

Why doesn’t summarize accept pweights?

What does summarize calculate when you use aweights?

Title		Probability weights, analytic weights, and summary statistics
Author		William Sribney, StataCorp

Question

My data come with probability weights (the inverse of the probability of an observation being selected into the sample). I am trying to compute various summary statistics, including the mean, standard deviation, and various percentiles of the data. Is there any way to compute the mean, standard deviation, and percentiles of a variable with probability weights?

Short answer

It is important to distinguish among

an estimate of the population mean (mu),
an estimate of the population standard deviation (sigma), and
the standard error of the estimate of the population mean.

The command svy: mean provides an estimate of the population mean and an estimate of its standard error. When computing the standard error, consider the effect of clustering and stratification as well as the effect of sampling weights.

An estimate of the population standard deviation can be obtained from estat sd after svy: mean. An estimate of the population standard deviation (sigma) is given by

    estimate of sigma = sqrt( n * V_srs )

where V_srs is an estimate of the variance of the mean assuming a simple random sample (of the same size as your survey dataset) and n is the number of observations. svy: mean computes the above variance and saves it in a matrix called e(V_srs). svy: mean also saves the number of observations in e(N). estat sd uses these values to estimate the population standard deviation.

Long answer

For survey sampling data (i.e., for data that are not from a simple random sample), one has to go back to the basics and carefully think about the terms “mean” and “standard deviation”.

Let me describe the simple case of estimates for the mean and variance for a simple random sample. I want to go over it because we are so used to it that we forget how nicely everything works out.

Simple random samples

Consider a variable X with population mean mu and population variance sigma².

For a simple random sample:

An estimate of the population mean (mu) is the sample mean (xbar).
An estimate of the population standard deviation (sigma) is the sample standard deviation (s).
An estimate of the standard error of xbar is s/sqrt(n).

The relationship between (2) and (3) is so simple that we often don’t distinguish between them properly. Indeed, (2) and (3) are estimates for two different things.

The summarize command doesn’t provide (3); it gives only (2). One must use ci or mean to get (3).

Weighted/clustered/stratified survey sample

When we say we want “the mean and standard deviation of a variable with probability weights”, what we most likely want is an estimate of the population mean and the standard error of this estimator for the population mean. We probably don’t care about an estimate of the standard deviation of the population.

The svy: mean command provides

muhat: an estimate of the population mean (mu)
V_db: an estimate of the variance of muhat accounting for the survey design used to collect the data
V_srs: an estimate of the variance of muhat assuming a simple random sample of the same number of observations

muhat is the obvious weighted sample mean, and V_db is pretty complicated; see [SVY] variance estimation for details.

Should we also want an estimate of the population standard deviation, we can work backward using the formula that produced V_srs. Recall that

    V_srs = s²/n

thus the estimator for sigma that estat sd reports is

    sigma = sqrt(n * V_srs)

Clustering and stratification

The svy: mean command handles clustering and stratification. Sampling weights, clustering, and stratification can all have a big effect on the standard error of muhat. Thus, if you want to get the right standard error of the “mean” (i.e., muhat), you must consider clustering and stratification as well as sampling weights. See [SVY] variance estimation for more details.

The clustering and stratification do not affect the point estimate of the mean, and thus if you are interested only in the point estimate, you could use summarize with aweights since it gives the same weighted mean as svy: mean.

For quantiles, summarize with aweights and pctile with aweights or pweights all give the same answers.

The summarize command

It was intentional that summarize does not allow pweights. summarize’s purpose, as I see it, is to provide descriptive statistics for the sample, not to provide inferential statistics for the population. By this criterion, I argue that pweights do not belong here since pweights are used to provide estimates of the population parameter mu.

Furthermore, the “standard deviation” that summarize displays presents a quandary for pweights. What should we put here? An estimate of the population sigma? The estimate of population sigma is only indirectly related to the standard error of the mean. I fear that displaying anything here would lead to misinterpretation.

Thus I believe it is better that summarize not allow pweights so that people are properly steered to the inferential commands svy: mean and estat sd.

summarize with aweights

The formula used by summarize with aweights for what it labels “Std. Dev.” is the correct formula for estimating the population standard deviation with pweighted data.

The problem is this formula does not give the population standard deviation for aweights.

First, let me show that summarize with aweights gives the same result as estat sd or manually calculating the standard deviation using n and V_srs.

. webuse nhanes2

. svy: mean loglead
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 31              Number of obs   =      4,948
Number of PSUs   = 62              Population size = 56,405,414
                                   Design df       =         31



                           Linearized                         
                     Mean   std. err.     [95% conf. interval]
   
     loglead     2.578102   .0196583      2.538008    2.618195



. estat sd



                     Mean   Std. dev.
   
     loglead     2.578102    .4166916



. display sqrt(e(N) * el(e(V_srs),1,1))
.41669164

. summarize loglead [aweight=finalwgt]


    Variable       Obs      Weight        Mean   Std. dev.       Min        Max
   
     loglead     4,948    56405414    2.578102   .4166916   .6931472   4.382027

Formula for s² used by summarize with aweights

summarize with aweights displays s for the “Std. Dev.”, where s is calculated according to the formula:

    s² = (1/(n - 1)) sum w*_i (x_i - xbar)²

where x_i (i = 1, 2, ..., n) are the data, w*_i are "normalized" weights, and xbar is the weighted mean. That is,

    w*_i = n w_i / (sum w_i)

and

    xbar = (sum (w_i x_i)) / (sum w_i)

where w_i are the raw weights.

Thus writing the formula for s² in terms of the raw weights gives

    s² = {n/[W(n - 1)]} sum w_i (x_i - xbar)²

where W is the sum of the raw weights.

Paradigm for aweights

The paradigm for aweights is that the data x₁, ..., x_n are considered as realizations of random variables X₁, ..., X_n where

    X_i ∼ N(mu_i, sigma²/w_i)

where w_i are assumed to be known, fixed constants.

If, for example, X_i are group means of w_i random variables each distributed as X ∼ N(mu, sigma²), then X_i ∼ N(mu, sigma²/w_i).

With this set up, we can compute expectations:

    E[ (x_i - xbar)² ] = (mu_i - mu)² + sigma² (1/w_i - 1/W)

where mu = (1/W) sum w_i mu_i.

Thus

    E[ sum w_i (x_i - xbar)² ] = (n - 1) sigma² + sum w_i(mu_i - mu)²

We are in trouble trying to come up with an estimator for sigma². In this model, we can see variation due to sigma² or variation due to mu_i varying about mu. There is no way to distinguish between these two variance components.

For example, we could have data:

. list


             x   weight 
 1.  -.1042242       10 
 2.   .0131263       10 
 3.  -.0446007       10 
 4.  -.2504879       10 
 5.   .2510872       10 
 6.  -.6012362       10 
 7.   .4534686       10 
 8.  -.4625476       10 
 9.  -.2094607       10 
10.    .266293       10

where each observation could be the mean of 10 N(0,1) variables (which is how I generated it), or the first x could be the mean of 10 N(−.1042242, 0) variables, the second x could be the mean of 10 N(.0131263, 0) variables, etc. Alternatively, the first x could be the mean of 10 N(−0.1, 0.5) variables, etc. ...

Thus, in general, there is no way to estimate sigma for a model with general mu_i.

For the special case mu_i = mu for all i, we can estimate sigma. Here,

    E[ sum w_i (x_i - xbar)² ] = (n - 1) sigma²

    u² = (1/(n - 1)) sum w_i (x_i - xbar)²

is an unbiased estimator for sigma². This formula uses the raw weights. If the scale of the w_i changes, the estimate of sigma² changes.

This finding makes intuitive sense. For the above data (assuming mu_i = mu), the estimate of sigma is 1.047 (the actual value is 1). If I had data with the same x’s but with weights = 100

. list


             x   weight 
 1.  -.1042242      100 
 2.   .0131263      100 
 3.  -.0446007      100 
 4.  -.2504879      100 
 5.   .2510872      100 
 6.  -.6012362      100 
 7.   .4534686      100 
 8.  -.4625476      100 
 9.  -.2094607      100 
10.    .266293      100

then the estimate of sigma is 3.312. This makes sense because as the sizes of the groups get larger, we expect that the group means (x) get closer to mu. Thus, if the spread of the group means stays the same as weight increases, then sigma must be increasing.

So we have found a problem with Stata’s aweight paradigm. Stata assumes that with aweights, the scale of the weights does not matter. This is not true for the estimate of sigma.

John Gleason (1997) wrote an excellent article that shows the estimate of rho also depends on the scale of the weights.

Logic of summarize’s formula

Now there was a logic behind the use of summarize’s formula for aweights:

    s² = {n/[W(n - 1)]} sum w_i (x_i - xbar)²

For the case mu_i = mu, the variance of xbar = (sum w_i x_i)/(sum w_i) is sigma²/W. This can be estimated by u²/W, where u² is the estimate of sigma² we had before.

This is the logic behind summarize’s formula for s² for aweights. If you unknowingly stick this s² into the unweighted formula for the variance of the mean estimator, you will get the right answer.

pweights and the estimate of sigma

For pweights, the formula

    s² = {n/[W(n - 1)]} sum w_i (x_i - xbar)²

gives an unbiased estimator for sigma².

It is not too surprising that this formula is correct for pweights, because the formula IS invariant to the scale of the weights, as the formula for pweights must be.

It is easy to see why the scale of the pweights does not matter for the estimation of sigma: say that X ∼ N(0,1) in a population of 100 persons. I sample 10 persons and get the following data:

. list


             x   weight 
 1.   -1.09397       10 
 2.   .3670809       10 
 3.    .145398       10 
 4.   .2657781       10 
 5.   .4794085       10 
 6.  -1.233643       10 
 7.   .3014338       10 
 8.  -1.545905       10 
 9.   .1389086       10 
10.   1.133268       10

The weight is 10 because one person in the sample represents 10 in the population. Since this is just a simple random sample, we can compute sigma in the standard way. If we do so, we get 0.872 as an estimate of sigma.

Suppose now that X ∼ N(0,1) in a population of 1,000 persons. I sample 10 persons and get the following data:

. list


             x   weight 
 1.   -1.09397      100 
 2.   .3670809      100 
 3.    .145398      100 
 4.   .2657781      100 
 5.   .4794085      100 
 6.  -1.233643      100 
 7.   .3014338      100 
 8.  -1.545905      100 
 9.   .1389086      100 
10.   1.133268      100

The weight is 100 since one person in the sample represents 100 in the population. Obviously, the estimate of sigma is unchanged; it’s still 0.872.

The same scale invariance applies when persons are sampled with unequal weights.

The formal proof that

    s² = {n/[W(n - 1)]} sum w_i (x_i - xbar)²

gives an unbiased estimator for sigma² is tricky. The set of weights is not fixed; different samples can lead to different sets of weights. Thus one has to be careful when proving E(s²) = sigma². One starts with the finite population definition of sigma:

    Var(X) = sigma² = (1/M) sum over population (X_i - Xbar)²

where X_i now denotes everyone in the population and i = 1, 2, ..., M (the total number of persons in the population). We then use results like sum over sample w_i*x_i² as an unbiased estimator for sum over population X_i². M is generally unknown; we are also estimating it.

Note: Do not jump to the conclusion that the scale of pweights never matters. It does matter for a few computations. E.g., the scale of the weights affects the computation of finite population corrections and the estimation of totals (of course!).

Reference

Gleason, J. 1997.: sg65: Computing intraclass correlations and large ANOVAs. Stata Technical Bulletin 35: 25–31. Reprinted in Stata Technical Bulletin Reprints, vol. 6, pp. 167–176.

Why doesn’t summarize accept pweights?

What does summarize calculate when you use aweights?

Question

Short answer

Long answer

Simple random samples

Weighted/clustered/stratified survey sample

Clustering and stratification

The summarize command

summarize with aweights

Formula for s² used by summarize with aweights

Paradigm for aweights

Logic of summarize’s formula

pweights and the estimate of sigma

Reference

We use cookies

Privacy policy

Required cookies

Advertising and performance cookies


		Linearized
		Mean std. err. [95% conf. interval]

loglead		2.578102 .0196583 2.538008 2.618195

Variable		Obs Weight Mean Std. dev. Min Max

loglead		4,948 56405414 2.578102 .4166916 .6931472 4.382027

	x weight
1.	-.1042242 10
2.	.0131263 10
3.	-.0446007 10
4.	-.2504879 10
5.	.2510872 10
6.	-.6012362 10
7.	.4534686 10
8.	-.4625476 10
9.	-.2094607 10
10.	.266293 10

	x weight
1.	-.1042242 100
2.	.0131263 100
3.	-.0446007 100
4.	-.2504879 100
5.	.2510872 100
6.	-.6012362 100
7.	.4534686 100
8.	-.4625476 100
9.	-.2094607 100
10.	.266293 100

	x weight
1.	-1.09397 10
2.	.3670809 10
3.	.145398 10
4.	.2657781 10
5.	.4794085 10
6.	-1.233643 10
7.	.3014338 10
8.	-1.545905 10
9.	.1389086 10
10.	1.133268 10

	x weight
1.	-1.09397 100
2.	.3670809 100
3.	.145398 100
4.	.2657781 100
5.	.4794085 100
6.	-1.233643 100
7.	.3014338 100
8.	-1.545905 100
9.	.1389086 100
10.	1.133268 100

Stata/MP4 Annual License (download)

Why doesn’t summarize accept pweights?

What does summarize calculate when you use aweights?

Question

Short answer

Long answer

Simple random samples

Weighted/clustered/stratified survey sample

Clustering and stratification

The summarize command

summarize with aweights

Formula for s2 used by summarize with aweights

Paradigm for aweights

Logic of summarize’s formula

pweights and the estimate of sigma

Reference

We use cookies

Privacy policy

Required cookies

Advertising and performance cookies

Formula for s² used by summarize with aweights