Title | Calculate percentiles with survey data |

Author | Nini Zang, StataCorp |

When we have survey data, we can still use
**pctile** or
**_pctile**
to get percentiles. This is the case because survey characteristics, other
than **pweight**s, affect only the variance estimation. Therefore, point
estimation of the percentile for survey data can be obtained with **pctile** or
**_pctile** with **pweight**s.

I will start by presenting an example of how **_pctile** works with
survey data.

    . sysuse auto
    (1978 Automobile Data)

    . rename mpg psu

    . rename length strata

    . keep price psu strata weight

    . keep in 1/4
    (70 observations deleted)

    . svyset psu [pweight=weight], strata(strata)

          pweight: weight
              VCE: linearized
      Single unit: missing
         Strata 1: strata
             SU 1: psu
            FPC 1: <zero>

    . _pctile price [pweight=weight], p(10)

    . return list

    scalars:
                     r(r1) =  3799

As we already know, a percentile is the value of a variable below which a
certain percentage of observations fall. So the 10th percentile is the value
below which 10% of the observations may be found. Although the survey design
includes strata, PSUs, and **pweight**s, the percentiles are affected only by
the **pweight**s. Let’s look at the formula that **pctile** and **_pctile**
use in Stata.

Let *x*_{(j)} refer to the *x* in ascending order for
*j* = 1, 2, ..., *n*. Let *w*_{(j)} refer to the
corresponding weights of *x*_{(j)};
if there are no weights, *w*_{(j)} = 1. Let N =
Σ^{n}_{j=1}*w*_{(j)}.
To obtain the *p*th percentile, which we will denote as
*x*_{[p]}, we need to
find the first index *i* such that
*W*_{(i)} > P, where P = N * *p*/100 and
*W*_{(i)} =
Σ^{i}_{j=1}*w*_{(j)}.

The *p*th percentile is then

$$
x_{[p]} =
\begin{cases}
\dfrac{x_{(i-1)} + x_{(i)}}{2} & \text{if } W_{(i-1)} = P \\
x_{(i)} & \text{otherwise}
\end{cases}
$$

From the formula above, we can see that a percentile depends only on the weights and the observed values.
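To illustrate when the averaging branch applies, here is a small worked case that is not part of the original example. It assumes four unweighted values (so every weight is 1), taken to be the four sorted prices 3799, 4099, 4749, and 4816, and asks for the median (*p* = 50). The cumulative weight just before index *i* equals P exactly, so the two neighboring values are averaged:

```latex
\begin{aligned}
N &= 1 + 1 + 1 + 1 = 4, \qquad P = \frac{N \cdot p}{100} = \frac{4 \cdot 50}{100} = 2 \\
W_{(1)} &= 1, \quad W_{(2)} = 2, \quad W_{(3)} = 3 > P \;\Rightarrow\; i = 3 \\
W_{(i-1)} &= W_{(2)} = 2 = P \;\Rightarrow\; x_{[50]} = \frac{x_{(2)} + x_{(3)}}{2} = \frac{4099 + 4749}{2} = 4424
\end{aligned}
```

With unequal weights, P rarely coincides with a cumulative weight, so the "otherwise" branch is the usual case.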

Let’s manually calculate the percentile obtained above with **_pctile**.
We first sort the data:

    . sort price

    . list

         +-------------------------------+
         | price   psu   weight   strata |
         |-------------------------------|
      1. | 3,799    22    2,640      168 |
      2. | 4,099    22    2,930      186 |
      3. | 4,749    17    3,350      173 |
      4. | 4,816    20    3,250      196 |
         +-------------------------------+

Let

price_{(j)} = the variable price in ascending order for
*j* = 1, 2, 3, 4

weight_{(j)} = the corresponding weights

price_{[10]} = 10th percentile of price

We generate the variable **w**, the cumulative sum of weight:

    . generate w=sum(weight)

    . list

         +---------------------------------------+
         | price   psu   weight   strata       w |
         |---------------------------------------|
      1. | 3,799    22    2,640      168    2640 |
      2. | 4,099    22    2,930      186    5570 |
      3. | 4,749    17    3,350      173    8920 |
      4. | 4,816    20    3,250      196   12170 |
         +---------------------------------------+

Then,
N = Σ^{4}_{j=1}weight_{(j)} =
2640 + 2930 + 3350 + 3250 = 12170 and
P = N * *p*/100 = (12170 * 10)/100 = 1217. To obtain the 10th
percentile, we must find the first index *i* such that
W_{(i)} > 1217.
At index *i* = 1, W_{(1)} = 2640, which
is already greater than 1217. Because no cumulative weight before it equals P,
no averaging is needed. Thus the 10th percentile price_{[10]}
is equal to price_{(1)}; that is,
price_{[10]} = 3799.
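The manual steps above can also be scripted. The following is a minimal sketch, not the internal code of **_pctile**; it is meant to be run as a do-file, and the names `cumw`, `P`, and `i` are illustrative choices rather than anything Stata defines. It assumes all weights are positive, so the cumulative sum is increasing.

```stata
* Sketch of the manual calculation for the 10th percentile
sysuse auto, clear
keep in 1/4
sort price

* running (cumulative) sum of the weights; the last value is N
generate double cumw = sum(weight)

* P = N*p/100 with p = 10
scalar P = cumw[_N] * 10 / 100

* first index i with cumulative weight > P:
* one more than the number of observations with cumw <= P
quietly count if cumw <= scalar(P)
local i = r(N) + 1

* average with the previous value only if its cumulative weight equals P;
* cumw[0] is missing, so the comparison is false when i is 1
if cumw[`i' - 1] == scalar(P) {
    display (price[`i' - 1] + price[`i']) / 2
}
else {
    display price[`i']
}
```

For these four observations the sketch should take the second branch and display the same value as **_pctile**.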

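We can also obtain the median and other percentiles from survey data by using **summarize** with **aweight**s and the **detail** option. **summarize** does not accept **pweight**s, but **aweight**s give the same point estimates: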

    . sysuse auto, clear
    (1978 Automobile Data)

    . rename mpg psu

    . rename length strata

    . keep price psu strata weight

    . keep in 1/4
    (70 observations deleted)

    . svyset psu [pweight=weight], strata(strata)

          pweight: weight
              VCE: linearized
      Single unit: missing
         Strata 1: strata
             SU 1: psu
            FPC 1: <zero>

    . summarize price [aweight=weight], detail

                                Price
    -------------------------------------------------------------
          Percentiles      Smallest
     1%         3799           3799
     5%         3799           4099
    10%         3799           4749       Obs                   4
    25%         4099           4816       Sum of Wgt.       12170

    50%         4749                      Mean            4404.32
                            Largest       Std. Dev.      489.7492
    75%         4816           3799
    90%         4816           4099       Variance       239854.3
    95%         4816           4749       Skewness      -.3284718
    99%         4816           4816       Kurtosis       1.321737

From above, we can see that the median of price is equal to 4749. The 10th
percentile of price is equal to 3799, which is the same result that we
obtained with **_pctile** and **pweight**s.
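As a cross-check, several percentiles can be requested from **_pctile** in one call and compared with the **summarize** output. This is a short sketch that assumes the same four-observation extract of the auto data; the choice of the 10th, 50th, and 90th percentiles is only for illustration.

```stata
* Sketch: request several weighted percentiles at once
sysuse auto, clear
keep in 1/4

* percentiles() takes a list; results are returned in r(r1), r(r2), r(r3)
_pctile price [pweight=weight], percentiles(10 50 90)
return list

display "p10 = " r(r1) "   p50 = " r(r2) "   p90 = " r(r3)
```

The three returned values should agree with the 10%, 50%, and 90% rows of the **summarize** output above.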