[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

From |
Joly.Patrick@ic.gc.ca |

To |
statalist@hsphsun2.harvard.edu |

Subject |
st: pctile, xtile methods |

Date |
Mon, 9 Jun 2003 08:30:52 -0400 |

I could be mistaken but I have noticed two possible issues with -pctile- and -xtile-: 1. Interpretation of percentiles and percentile ranks ===================================================== It seems to me that -xtile- gives results that are inconsistent with the method used by -pctile- for computing quantiles. Both methods used by -pctile-, the default and -altdef-, are consistent with the interpretation of the p^th percentile as a value (x_p) for which p percent of observations (p) are strictly smaller. (This is also how I have always interpreted percentiles.) The manual doesn't say this explicitly but the rule "find the smallest index i such that i > P where P = N*(p/100)" ensures that percentiles fits this definition. This is especially true since in cases where (i-1) = P, x_p is defined as the mean of x[i] and x[i-1] which results in x_p always being greater than x[i-1]. Lets call this definition, definition A. However, -xtile- seems to be generating its categorical variable based on a different interpretation which we could call definition B: a value (x_p) for which p percent of observations (p) are less than *or equal* to x_p. This can seen from the fairly trivial example which follows. As a point of comparison, I wrote an egen routine (_gqrank.ado) which generates percentile ranks consistent with definition A. The code appears as a post script. set obs 100 g nn = _n pctile x_p = nn, nq(100) genp(p) xtile xt = nn, nq(100) // -xtile- generates categoricals numbered 1 to 100 // renumber categories 0 to 99 to correspond to // percentile ranks replace xt = xt-1 egen qrank = qrank(nn), perc compare xt qrank . list +------------------------------+ | nn x_p p xt qrank | |------------------------------| 1. | 1 1.5 1 0 0 | 2. | 2 2.5 2 1 1 | 3. | 3 3.5 3 2 2 | 4. | 4 4.5 4 3 3 | 5. | 5 5.5 5 4 4 | |------------------------------| 6. | 6 6.5 6 5 5 | 7. | 7 8 7 6 6 | 8. | 8 8.5 8 6 7 | 9. | 9 9.5 9 7 8 | 10. | 10 10.5 10 8 9 | ... |------------------------------| 96. | 96 96.5 96 95 95 | 97. | 97 97.5 97 96 96 | 98. | 98 98.5 98 97 97 | 99. | 99 99.5 99 98 98 | 100. | 100 . . 99 99 | +------------------------------+ When varname (nn) is precisely equal to a percentile, -xtile- assigns it the to the 'previous' percentile rank. E.g. the 7th percentile (x_7) is 8, but when nn==8, -xtile- gives it percentile rank 6 rather than 7. But clearly, at least 7 percent of observations lie below x_7 = 8. Hence, it would seems that the commands consider percentile ranks as values that fall within the intervals: interval (p^th perc. rank) corresponds to -------------------------- -------------- pctile [ x_p, x_(p+1) ) (definition A) xtile ( x_p, x_(p+1) ] (definition B) 2. Possibly inaccurate percentiles ================================== In the example above, the 7th percentile is 8. But according to the default method, x_7 should be given by the mean of x[6] and x[7] because P = _N*(7/100) = 7, the smallest index i > P is i = 8 and (i-1) = P. This implies that x_p should be 7.5 for p = 7. See Methods & Formulas section of [R] -pctile- for info on the methods. Patrick Joly joly.patrick@ic.gc.ca pat.joly@utoronto.ca P.S.: _gqrank.ado ----------------- * Careful: your editor may wrap long lines *! _gqrank: generate quantile ranks *! version 1.2 28may2003 PJoly * v.1.2 28may2003 PJoly renamed _gqrank from _grank2 * v.1.1 26may2003 PJoly * v.1.0 16jun2002 PJoly program define _gqrank, sclass version 7 gettoken type 0 : 0 gettoken g 0 : 0 gettoken eqs 0 : 0 syntax varlist(max=1) [if] [in] [, by(varlist) low ALTdef /* */ QUARtiles QUIntiles PERCentiles /* */ QUANtiles(numlist min=1 max=1 <=100 >0) ] local quint = cond("`quintiles'"!="",1,0) local quart = cond("`quartiles'"!="",1,0) local perc = cond("`percentiles'"!="",1,0) if ((`quint' + `quart' + `perc' + ("`quantiles'"!=""))== 0) | /* */ ((`quint' + `quart' + `perc' + ("`quantiles'"!="")) > 1 ) { di as err "{p}specify one of percentiles, quartiles, quintiles, " di as err "or quantiles(){p_end}" exit 198 } marksample touse if "`low'"!="" { local sign = -1 } else { local sign = 1 } local lbl "quantile" if `perc' { local quantiles = 100 local lbl "percentile" } if `quart' { local quantiles = 4 local lbl "quartile" } if `quint' { local quantiles = 5 local lbl "quintile" } local divisor = 100/`quantiles' qui { tempvar vv group local type : type `varlist' g `type' `vv' = `sign'*`varlist' if `touse' g `typlist' `g' = . if `touse' if "`by'" != "" { bysort `by' : g long `group' = (_n==1) replace `group' = sum(`group') } else { g byte `group' = 1 } summ `group', meanonly forv k = 1/`r(max)' { tempvar Vpctiles Vpcent * identify unique percentiles: * - if pctiles not unique, keep lowest percentage rank * - store ranks and percent values in 2 parallel lists preserve keep if `group'==`k' pctile `Vpctiles' = `vv' if `touse', /* */ nq(100) genp(`Vpcent') `altdef' bysort `Vpctiles' (`Vpcent') : keep if _n==1 vallist `Vpctiles', local(pctiles) sort vallist `Vpcent' , local(pcent) sort restore forv i = `: word count `pcent''(-1)1 { local val : word `i' of `pctiles' local perc : word `i' of `pcent' replace `g' = `perc' if `vv' >= `val' /* */ & missing(`g') & `group'==`k' /* */ & `touse' } replace `g' = 0 if missing(`g') /* */ & `group'==`k' & `touse' } replace `g' = int(`g'/`divisor') if `touse' } la var `g' "`=trim("`lbl' rank of `varlist'")'" sret clear sret local pctiles "`pctiles'" sret local percent "`pcent'" end exit Syntax diagram: --------------- egen newvarname = qrank(varname), {percentiles|quartiles|quintiles|quantiles(#)} [altdef low] (allows by varlist:) creates a variable containing quantile ranks of values of varname. All quantiles are calculated using command pctile (see help pctile). qrank(#) specifies the number of quantiles with which to break the data (1 <= # <= 100). low requests that quantiles be ranked from high to low rather than low to high, in terms of values of varname. altdef uses an alternative formula for calculating quantiles which involves interpolation (see help pctile). qrank() is also a sclass command, it returns the list of unique quantiles and associated percentage values in s(). * * For searches and help try: * http://www.stata.com/support/faqs/res/findit.html * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

- Prev by Date:
**st: RE: -mktab- with -ml maximize-** - Next by Date:
**st: RE: pctile, xtile methods -- correction** - Previous by thread:
**st: RE: -mktab- with -ml maximize-** - Next by thread:
**Re: st: pctile, xtile methods** - Index(es):

© Copyright 1996–2015 StataCorp LP | Terms of use | Privacy | Contact us | What's new | Site index |