How do I obtain percentiles for survey data?

Title   Calculate percentiles with survey data
Author Nini Zang, StataCorp

With survey data, we can still use pctile or _pctile to obtain percentiles, because survey design characteristics other than pweights affect only the variance estimation. The point estimate of a percentile for survey data can therefore be obtained from pctile or _pctile with pweights.

I will start with an example of how _pctile works with survey data.

 . sysuse auto
 (1978 Automobile Data)

 . rename mpg psu

 . rename length strata

 . keep price psu strata weight

 . keep in 1/4
 (70 observations deleted)

 . svyset psu [pweight=weight], strata(strata)

       pweight: weight
           VCE: linearized
   Single unit: missing
      Strata 1: strata
          SU 1: psu
         FPC 1: <zero>

 . _pctile price [pweight=weight], p(10)

 . return list

 scalars:
                  r(r1) =  3799

Recall that a percentile is the value of a variable below which a given percentage of observations fall; the 10th percentile is the value below which 10% of the observations lie. Although survey data have design features such as strata, PSUs, and pweights, only the pweights affect the percentiles. Let's look at the formula that pctile and _pctile use in Stata.

Let x(j) refer to the values of x in ascending order for j = 1, 2, ..., n. Let w(j) refer to the corresponding weights of x(j); if there are no weights, w(j) = 1. Let $N = \sum_{j=1}^{n} w(j)$. To obtain the pth percentile, which we will denote as x[p], we need to find the first index i such that W(i) > P, where P = N * p/100 and $W(i) = \sum_{j=1}^{i} w(j)$.

The pth percentile is then

$$
x[p] =
\begin{cases}
\dfrac{x(i-1) + x(i)}{2} & \text{if } W(i-1) = P \\[6pt]
x(i) & \text{otherwise}
\end{cases}
$$

From above, we can see that the calculation of a percentile depends only on the weights and the observed values.
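
To see when the averaging branch of the formula applies, consider a hypothetical unweighted sample (not the example data) of n = 4 values, so that every w(j) = 1, and take p = 50:

$$
\begin{aligned}
N &= 4, \qquad P = N \cdot \frac{p}{100} = 4 \cdot \frac{50}{100} = 2, \\
W(1) &= 1, \quad W(2) = 2, \quad W(3) = 3 > P \;\Rightarrow\; i = 3, \\
W(i-1) &= W(2) = 2 = P \;\Rightarrow\; x[50] = \frac{x(2) + x(3)}{2}.
\end{aligned}
$$

This is the familiar rule that the median of an even number of unweighted observations is the average of the two middle values.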

Let’s manually calculate the percentile obtained above with _pctile. We first sort the data:

 . sort price

 . list

      +-------------------------------+
      | price   psu   weight   strata |
      |-------------------------------|
   1. | 3,799    22    2,640      168 |
   2. | 4,099    22    2,930      186 |
   3. | 4,749    17    3,350      173 |
   4. | 4,816    20    3,250      196 |
      +-------------------------------+

Let

price(j) = the variable price in ascending order for j = 1, 2, 3, 4

weight(j) = the corresponding weights

price[10] = 10th percentile of price

We generate the variable w, the cumulative sum of weight:

 . generate w = sum(weight)

. list


      +---------------------------------------+
      | price   psu   weight   strata       w |
      |---------------------------------------|
   1. | 3,799    22    2,640      168    2640 |
   2. | 4,099    22    2,930      186    5570 |
   3. | 4,749    17    3,350      173    8920 |
   4. | 4,816    20    3,250      196   12170 |
      +---------------------------------------+

Then $N = \sum_{j=1}^{4} \mathrm{weight}(j) = 2640 + 2930 + 3350 + 3250 = 12170$ and P = N * p/100 = (12170 * 10)/100 = 1217. To obtain the 10th percentile, we must find the first index i such that W(i) > 1217. At i = 1, W(1) = 2640, which is greater than 1217. Because i = 1 is the first index, no preceding cumulative weight can equal P, so the "otherwise" branch of the formula applies. Thus the 10th percentile price[10] is equal to price(1); that is, price[10] = 3799.
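
The manual steps above can also be scripted. The following is a minimal sketch for a do-file, not the internal implementation of _pctile; the names cumw, obsno, P, and i are illustrative, and p = 10 is hard-coded to match the example.

```stata
* Sketch only: reproduce the manual 10th-percentile calculation
sysuse auto, clear
keep price weight
keep in 1/4

* x(j) in ascending order, then W(i) as the running sum of the weights
sort price
generate double cumw = sum(weight)

* P = N * p/100 with p = 10; the last running sum is N
scalar P = cumw[_N] * 10 / 100

* first index i such that W(i) > P
generate long obsno = _n if cumw > scalar(P)
summarize obsno, meanonly
local i = r(min)

* two-branch rule for x[p]; cumw[0] is missing, so i = 1 falls to "otherwise"
if `i' > 1 & cumw[`i' - 1] == scalar(P) {
    display "10th percentile = " (price[`i' - 1] + price[`i']) / 2
}
else {
    display "10th percentile = " price[`i']
}
```

With the four observations listed above, the first index is i = 1, so the sketch should display 3799, in agreement with the hand calculation.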

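We can also estimate the median from survey data by using summarize with aweights. Because summarize does not accept pweights, we specify the same weights as aweights, which give the same point estimates of the percentiles.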

 . sysuse auto, clear
 (1978 Automobile Data)

 . rename mpg psu

 . rename length strata

 . keep price psu strata weight

 . keep in 1/4
 (70 observations deleted)

 . svyset psu [pweight=weight], strata(strata)

       pweight: weight
           VCE: linearized
   Single unit: missing
      Strata 1: strata
          SU 1: psu
         FPC 1: <zero>

 . summarize price [aweight=weight], detail

                             Price
 -------------------------------------------------------------
       Percentiles      Smallest
  1%         3799           3799
  5%         3799           4099
 10%         3799           4749       Obs                   4
 25%         4099           4816       Sum of Wgt.       12170
 
 50%         4749                      Mean            4404.32
                         Largest       Std. Dev.      489.7492
 75%         4816           3799
 90%         4816           4099       Variance       239854.3
 95%         4816           4749       Skewness      -.3284718
 99%         4816           4816       Kurtosis       1.321737

From above, we can see that the median of price is equal to 4749. The 10th percentile of price is equal to 3799, which is the same result that we obtained with _pctile and pweights.
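
As a quick check, the agreement between the two commands can be asserted in a do-file. The sketch below assumes the same four-observation extract; the scalar names p10_pw and p50_pw are illustrative.

```stata
* Sketch only: confirm that _pctile with pweights and summarize with
* aweights give the same 10th percentile and median
sysuse auto, clear
keep price weight
keep in 1/4

* save the _pctile results before summarize replaces r()
_pctile price [pweight=weight], percentiles(10 50)
scalar p10_pw = r(r1)
scalar p50_pw = r(r2)

* summarize, detail stores its percentiles in r(p10), r(p50), ...
summarize price [aweight=weight], detail
assert r(p10) == scalar(p10_pw)
assert r(p50) == scalar(p50_pw)
```

If either assert fails, Stata stops with an error, so a silent run indicates that the two sets of point estimates agree.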