Stata The Stata listserver
[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

st: pctile, xtile methods


From   [email protected]
To   [email protected]
Subject   st: pctile, xtile methods
Date   Mon, 9 Jun 2003 08:30:52 -0400

I could be mistaken but I have noticed two possible issues with -pctile- and
-xtile-:

1. Interpretation of percentiles and percentile ranks
=====================================================

It seems to me that -xtile- gives results that are inconsistent with the
method used by -pctile- for computing quantiles.

Both methods used by -pctile-, the default and -altdef-, are consistent with
the interpretation of the p^th percentile as a value (x_p) for which p
percent of observations (p) are strictly smaller.  (This is also how I have
always interpreted percentiles.)  The manual doesn't say this explicitly but
the rule "find the smallest index i such that i > P where P = N*(p/100)"
ensures that percentiles fits this definition.  This is especially true
since in cases where (i-1) = P, x_p is defined as the mean of x[i] and
x[i-1] which results in x_p always being greater than x[i-1].  Lets call
this definition, definition A.

However, -xtile- seems to be generating its categorical variable based on a
different interpretation which we could call definition B: a value (x_p) for
which p percent of observations (p) are less than *or equal* to x_p.

This can seen from the fairly trivial example which follows.  As a point of
comparison, I wrote an egen routine (_gqrank.ado) which generates percentile
ranks consistent with definition A.  The code appears as a post script.

set obs 100
g nn = _n
pctile x_p = nn, nq(100) genp(p)
xtile   xt = nn, nq(100)

//  -xtile- generates categoricals numbered 1 to 100
//  renumber categories 0 to 99 to correspond to
//  percentile ranks
replace xt = xt-1
egen qrank = qrank(nn), perc
compare xt qrank

. list

     +------------------------------+
     |  nn    x_p    p   xt   qrank |
     |------------------------------|
  1. |   1    1.5    1    0       0 |
  2. |   2    2.5    2    1       1 |
  3. |   3    3.5    3    2       2 |
  4. |   4    4.5    4    3       3 |
  5. |   5    5.5    5    4       4 |
     |------------------------------|
  6. |   6    6.5    6    5       5 |
  7. |   7      8    7    6       6 |
  8. |   8    8.5    8    6       7 |
  9. |   9    9.5    9    7       8 |
 10. |  10   10.5   10    8       9 |

...

     |------------------------------|
 96. |  96   96.5   96   95      95 |
 97. |  97   97.5   97   96      96 |
 98. |  98   98.5   98   97      97 |
 99. |  99   99.5   99   98      98 |
100. | 100      .    .   99      99 |
     +------------------------------+

When varname (nn) is precisely equal to a percentile, -xtile- assigns it the
to the 'previous' percentile rank.  E.g. the 7th percentile (x_7) is 8, but
when nn==8, -xtile- gives it percentile rank 6 rather than 7.  But clearly,
at least 7 percent of observations lie below x_7 = 8.

Hence, it would seems that the commands consider percentile ranks as values
that fall within the intervals:

            interval (p^th perc. rank)   corresponds to
		--------------------------   --------------
   pctile         [ x_p, x_(p+1) )       (definition A)
   xtile          ( x_p, x_(p+1) ]       (definition B)


2. Possibly inaccurate percentiles
==================================

In the example above, the 7th percentile is 8.  But according to the default
method, x_7 should be given by the mean of x[6] and x[7] because P =
_N*(7/100) = 7, the smallest index i > P is i = 8 and (i-1) = P.  This
implies that x_p should be 7.5 for p = 7.

See Methods & Formulas section of [R] -pctile- for info on the methods.


Patrick Joly
[email protected]
[email protected]


P.S.: _gqrank.ado
-----------------

* Careful: your editor may wrap long lines

*! _gqrank: generate quantile ranks
*! version 1.2   28may2003   PJoly
* v.1.2   28may2003   PJoly   renamed _gqrank from _grank2
* v.1.1   26may2003   PJoly
* v.1.0   16jun2002   PJoly
program define _gqrank, sclass
      version 7
      gettoken type 0 : 0
      gettoken g    0 : 0
      gettoken eqs  0 : 0
      syntax varlist(max=1) [if] [in] [, by(varlist) low ALTdef    /*
                  */   QUARtiles QUIntiles  PERCentiles            /*
                  */   QUANtiles(numlist min=1 max=1 <=100 >0) ]

      local  quint = cond("`quintiles'"!="",1,0)
      local  quart = cond("`quartiles'"!="",1,0)
      local  perc  = cond("`percentiles'"!="",1,0)

      if ((`quint' + `quart' + `perc' + ("`quantiles'"!=""))== 0) | /*
     */  ((`quint' + `quart' + `perc' + ("`quantiles'"!="")) > 1 ) {
            di as err "{p}specify one of percentiles, quartiles, quintiles,
"
            di as err "or quantiles(){p_end}"
            exit 198
      }
      marksample touse
      if "`low'"!="" { local sign = -1 }
      else { local sign = 1 }
      local lbl "quantile"

      if `perc' {
            local quantiles = 100
            local lbl "percentile"
      }
      if `quart' {
            local quantiles = 4
            local lbl "quartile"
      }
      if `quint' {
            local quantiles = 5
            local lbl "quintile"
      }
      local divisor = 100/`quantiles'

      qui {
            tempvar vv group
            local type : type `varlist'
            g `type' `vv' = `sign'*`varlist' if `touse'
            g `typlist' `g' = . if `touse'

            if "`by'" != "" {
                  bysort `by' : g long `group' = (_n==1)
                  replace `group' = sum(`group')
            }
            else {
                  g byte `group' = 1
            }

            summ `group', meanonly
            forv k = 1/`r(max)' {

                  tempvar Vpctiles Vpcent

* identify unique percentiles:
* - if pctiles not unique, keep lowest percentage rank
* - store ranks and percent values in 2 parallel lists

                  preserve
                  keep if `group'==`k'
                  pctile `Vpctiles' = `vv' if `touse',       /*
                      */      nq(100) genp(`Vpcent') `altdef'
                  bysort `Vpctiles' (`Vpcent') : keep if _n==1
                  vallist `Vpctiles', local(pctiles) sort
                  vallist `Vpcent'  , local(pcent)   sort
                  restore

                  forv i = `: word count `pcent''(-1)1 {
                        local val  : word `i' of `pctiles'
                        local perc : word `i' of `pcent'

                        replace `g' = `perc' if `vv' >= `val' /*
                          */  &  missing(`g') & `group'==`k'  /*
                          */  & `touse'
                  }
                  replace `g' = 0 if missing(`g')     /*
                          */       & `group'==`k' & `touse'
            }
            replace `g' = int(`g'/`divisor') if `touse'
      }
      la var `g' "`=trim("`lbl' rank of `varlist'")'"

      sret clear
      sret local pctiles "`pctiles'"
      sret local percent "`pcent'"
end

exit

Syntax diagram:
---------------

egen newvarname = qrank(varname),
   {percentiles|quartiles|quintiles|quantiles(#)}
   [altdef low]   (allows by varlist:)

    creates a variable containing quantile ranks of values
    of varname.  All quantiles are calculated using command
    pctile (see help pctile).  qrank(#) specifies the number
    of quantiles with which to break the data (1 <= # <= 100).
    low requests that quantiles be ranked from high to low
    rather than low to high, in terms of values of varname.
    altdef uses an alternative formula for calculating
    quantiles which involves interpolation (see help pctile).
    qrank() is also a sclass command, it returns the list of
    unique quantiles and associated percentage values in s().
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2024 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index