How do I obtain percentiles for survey data?

Title: Calculate percentiles with survey data
Author: Nini Zang, StataCorp
Date: January 2008

With survey data, we can still use pctile or _pctile to obtain percentiles. This works because survey design characteristics other than pweights affect only the variance estimation, not the point estimates. Point estimates of percentiles for survey data can therefore be obtained with pctile or _pctile and pweights.

I will start with an example of how _pctile works with survey data.

. sysuse auto
(1978 Automobile Data)

. rename mpg psu

. rename length strata

. keep price psu strata weight

. keep in 1/4
(70 observations deleted)

. svyset psu [pweight=weight], strata(strata)

      pweight: weight
          VCE: linearized
  Single unit: missing
     Strata 1: strata
         SU 1: psu
        FPC 1: <zero>

. _pctile price [pweight=weight], p(10)

. return list

scalars:
                 r(r1) =  3799

Recall that a percentile is the value of a variable below which a certain percentage of observations fall, so the 10th percentile is the value below which 10% of the observations are found. Although the survey design here includes strata, PSUs, and pweights, only the pweights affect the percentiles. Let's look at the formula that pctile and _pctile use in Stata.

Let $x_{(j)}$ refer to the $x$ in ascending order for $j = 1, 2, \dots, n$. Let $w_{(j)}$ refer to the corresponding weights of $x_{(j)}$; if there are no weights, $w_{(j)} = 1$. Let $N = \sum_{j=1}^{n} w_{(j)}$.

To obtain the $p$th percentile, which we will denote as $x_{[p]}$, we need to find the first index $i$ such that $W_{(i)} > P$, where $P = N \times p/100$ and $W_{(i)} = \sum_{j=1}^{i} w_{(j)}$.

The $p$th percentile is then

$$
x_{[p]} =
\begin{cases}
\dfrac{x_{(i-1)} + x_{(i)}}{2} & \text{if } W_{(i-1)} = P \\[1ex]
x_{(i)} & \text{otherwise}
\end{cases}
$$

From the formula, we can see that a percentile depends only on the observations and their weights.
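
To make the rule concrete, here is a minimal Python sketch of the formula above. It is an illustration only, not Stata's implementation: the function name `weighted_percentile` is hypothetical, the tie condition is read as applying to the cumulative weight W(i−1), and p is assumed to lie strictly between 0 and 100 with all weights positive.

```python
from itertools import accumulate


def weighted_percentile(values, weights, p):
    """Return the p-th percentile (0 < p < 100) under the cumulative-weight rule.

    Illustrative helper, not part of Stata: `values` and `weights` are
    equal-length sequences of numbers, with all weights positive.
    """
    # Sort observations in ascending order, carrying each weight along.
    pairs = sorted(zip(values, weights))
    xs = [x for x, _ in pairs]
    cum = list(accumulate(w for _, w in pairs))  # W(1), ..., W(n)

    target = cum[-1] * p / 100  # P = N * p / 100

    # First (0-based) index i whose cumulative weight exceeds P.
    i = next(k for k, total in enumerate(cum) if total > target)

    # Tie rule: if W(i-1) equals P exactly, average the two neighbours.
    if i > 0 and cum[i - 1] == target:
        return (xs[i - 1] + xs[i]) / 2
    return xs[i]


# The four observations from the example (price, weight).
prices = [3799, 4099, 4749, 4816]
weights = [2640, 2930, 3350, 3250]

p10 = weighted_percentile(prices, weights, 10)
print(p10)
```

With the four observations from the example, this prints 3799, the value returned in r(r1). The averaging branch is only reached when a cumulative weight lands exactly on P; for instance, the unweighted median of 1, 2, 3, 4 (all weights equal to 1) comes out as 2.5 under this rule.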
Let’s manually calculate the percentile obtained above with _pctile.
We first sort the data:
. sort price

. list

     +-------------------------------+
     | price   psu   weight   strata |
     |-------------------------------|
  1. | 3,799    22    2,640      168 |
  2. | 4,099    22    2,930      186 |
  3. | 4,749    17    3,350      173 |
  4. | 4,816    20    3,250      196 |
     +-------------------------------+

Let

$\mathit{price}_{(j)}$ = the variable price in ascending order for $j = 1, 2, 3, 4$
$\mathit{weight}_{(j)}$ = the corresponding weights
$\mathit{price}_{[10]}$ = 10th percentile of price

Then, $N = \sum_{j=1}^{4} \mathit{weight}_{(j)} = 2640 + 2930 + 3350 + 3250 = 12170$ and $P = N \times p/100 = (12170 \times 10)/100 = 1217$. To obtain the 10th percentile, we must find the first index $i$ such that the cumulative weight $W_{(i)} = \sum_{j=1}^{i} \mathit{weight}_{(j)}$ exceeds 1217. When $i = 1$, $W_{(1)} = \mathit{weight}_{(1)} = 2640$, which is greater than 1217. Because there is no earlier observation whose cumulative weight could equal $P$, the 10th percentile is $\mathit{price}_{(1)}$; that is, $\mathit{price}_{[10]} = 3799$.

We can also estimate the median from survey data by using summarize with aweights. summarize does not accept pweights, so we supply the same weights as aweights.

. sysuse auto, clear
(1978 Automobile Data)

. rename mpg psu

. rename length strata

. keep price psu strata weight

. keep in 1/4
(70 observations deleted)

. svyset psu [pweight=weight], strata(strata)

      pweight: weight
          VCE: linearized
  Single unit: missing
     Strata 1: strata
         SU 1: psu
        FPC 1: <zero>

. summarize price [aweight=weight], detail

                            Price
-------------------------------------------------------------
      Percentiles      Smallest
 1%         3799           3799
 5%         3799           4099
10%         3799           4749       Obs                   4
25%         4099           4816       Sum of Wgt.       12170

50%         4749                      Mean            4404.32
                        Largest       Std. Dev.      489.7492
75%         4816           3799
90%         4816           4099       Variance       239854.3
95%         4816           4749       Skewness      -.3284718
99%         4816           4816       Kurtosis       1.321737

From the output, we can see that the median of price is equal to 4749. The 10th percentile of price is equal to 3799, which is the same result that we obtained with _pctile and pweights.
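
As a check on the median, the same cumulative-weight rule can be applied by hand with p = 50. This is a sketch that assumes summarize with aweights uses the same percentile definition as _pctile, which the matching 10th percentile suggests but does not prove. Here $W_{(i)}$ denotes the running sum of the weights over the observations sorted by price, and $P = N \times p/100$ with $N = 12170$.

```latex
\begin{aligned}
P &= 12170 \times 50/100 = 6085 \\
W_{(1)} &= 2640, \qquad W_{(2)} = 2640 + 2930 = 5570, \qquad W_{(3)} = 5570 + 3350 = 8920 > 6085 \\
W_{(2)} &= 5570 \neq P \;\Longrightarrow\; \mathit{price}_{[50]} = \mathit{price}_{(3)} = 4749
\end{aligned}
```

The first index whose cumulative weight exceeds 6085 is $i = 3$, and the tie condition does not hold, so the rule returns the third sorted price, 4749, in agreement with the 50% row of the output.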