Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

Re: st: how to force cutpoint in xtile

From: Nick Cox <[email protected]>
To: [email protected]
Subject: Re: st: how to force cutpoint in xtile
Date: Tue, 26 Jun 2012 10:51:10 +0100

```
Strange though it may seem, -xtile- doesn't try directly to equalize
group frequencies.

Forming quantile-based groups is in principle a clustering problem, so
we might worry about the combinatorics of possible solutions, but
-xtile- sensibly ignores that.

Suppose that we have n ordered values, which can be thought of as laid
down in a series end to end like this:

1 2 3 5 7 ...

The general problem is to group them into k subseries. This can be
thought of as placing (k - 1) markers; there are (n - 1) places
between values where the markers can be placed and so
comb(n - 1, k - 1) possible splits of n into k. That number grows
explosively: with n = 185 and k = 5, comb(184, 4) is about 46 million.

Now in practice we often, as here in Martha's problem, have ties. If
we work with the rule that no distinct value can be "split" into
quantile-based groups, then the problem is also simpler to the extent
that we have ties. In Martha's example her variable has 12 distinct
values and so the number of possible splits comes down to comb(11, 4)
= 330. But no general solution for -xtile- can be based on the
_assumption_ that we will have lots of ties.

The other half of the problem is wanting groups to be of approximately
equal frequencies, about which nothing has been said yet.

As said, -xtile- wisely avoids the problem raised by a combinatorial
explosion of possibilities. Its main idea might be called a SCIBOC(*)
algorithm or "Are we there yet?". Concretely, given a desire to split
at 20(20)80 for 5 groups, it goes through the cumulative percents and
asks: Have we passed 20% yet? If so, then 40%? and so on. (At least
that's what the results seem to imply: I have _not_ read the code
really closely.) -xtile- has to be careful, because, for example,
there might easily be fewer than k distinct values in practice even
though the user asks for k quantile-based groups.

In principle, you could look at possible solutions and choose those
with (e.g.) the lowest variance of group frequencies, but that is not
what -xtile- does, and I would still worry about the combinatorics.

More positively, -xtile- often seems to work to its users'
satisfaction, although there are intermittent naive questions of the
form "Why doesn't -xtile- produce groups of exactly equal size?" when
the values make such a split impossible. (This wasn't Martha's
question.) And there is one easy thing you can try: negate the
variable and apply -xtile- to that.

Nick

(*) SCIBOC = small child in back of car. In my culture, it is standard
that on any non-trivial family car journey, a small child will ask
after some small time or distance "Are we there yet?" I guess that
this is an invariant across cultures and even modes of transport
(small child in back of canoe, small child on back of camel, etc.).

On Tue, Jun 26, 2012 at 4:10 AM, Skiles, Martha Priedeman
<[email protected]> wrote:
> Thank you Nick.
> I was hoping that the xtile command would set the cutpoint at the closest break to the 20% rather than pass the 20% and then choose the closest.  It sounds like that is not an option, rather I need to choose between quintiling as-is or reversing the order.
>
> Per your question about why I'd want to quintile, this is just a very small part of my output that I need in quintiles in order to compare relative (rather than absolute) positions.  The "value" itself has no readily interpretable meaning, rather it is more helpful to think about relative groups and how that classification of quintile changes from one data run to another.
>
> I appreciate your taking the time to respond.
>
> Regards,
> Martha
>
> ________________________________________
> From: [email protected] [[email protected]] on behalf of Nick Cox [[email protected]]
> Sent: Monday, June 25, 2012 7:36 PM
> To: [email protected]
> Subject: Re: st: how to force cutpoint in xtile
>
> The Stata version you are using is immaterial here.
>
> The over-arching problem (for you) is that -xtile- will not split
> observed values and that it declares a boundary when the appropriate
> cumulative percents (here 20(20)80 %) have been passed. With these
> data that bites as very unequal class frequencies.
>
> What you can do, given that algorithm, is negate the variable and apply
> -xtile- going the other way:
>
> . input value freq
>
>         value       freq
>  1.              11          11
>  2.              12           4
>  3.              13          17
>  4.              14          37
>  5.              15           7
>  6.              16          27
>  7.              17          13
>  8.              18           5
>  9.              19          14
>  10.              20          11
>  11.              21          23
>  12.              27          16
>  13. end
>
> . expand freq
> (173 observations created)
>
> . gen negvalue = -value
>
> . xtile nQ5 = negvalue, nq(5)
>
> . tab nQ5
>
> 5 quantiles |
> of negvalue |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>           1 |         39       21.08       21.08
>           2 |         43       23.24       44.32
>           3 |         34       18.38       62.70
>           4 |         37       20.00       82.70
>           5 |         32       17.30      100.00
> ------------+-----------------------------------
>       Total |        185      100.00
>
> . xtile Q5 = value, nq(5)
>
> . tab Q5
>
> 5 quantiles |
>    of value |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>           1 |         69       37.30       37.30
>           2 |          7        3.78       41.08
>           3 |         40       21.62       62.70
>           4 |         53       28.65       91.35
>           5 |         16        8.65      100.00
> ------------+-----------------------------------
>       Total |        185      100.00
>
> But why are you doing this? The data are already in a small number
> of discrete values. Quintiles force 21 and 27 together, which
> underlines that you are throwing away important detail.
>
> Nick
>
> On Mon, Jun 25, 2012 at 10:21 PM, Skiles, Martha Priedeman
> <[email protected]> wrote:
>
>> I've used -xtile- in Stata 11 successfully, but am having difficulty with it in Stata 12.  I have the following variable "S0D0_links" which I'd like to quintile (5 groups), but the -xtile- function is not creating groups where I would expect.  Per below, I expected the first quintile to break at 17.3 cumulative percent rather than 37.3.  Can I force the cutpoint to be either closest to my 20/40/60/80/100 quintiles or always <20/<40/<60/etc?
>> I am able to force it by using "cumul" to generate a cumulative percent, and then write code using "ceil(5*cumpercent)" but I hope there's a better option.  My preference is to have the cutpoint create quintiles as close to 20/40/60/etc as possible.
>>
>> Thank you,
>> Martha Skiles
>>
>> LOG:
>>
>>  S0D0_links |      Freq.     Percent        Cum.
>> ------------+-----------------------------------
>>          11 |         11        5.95        5.95
>>          12 |          4        2.16        8.11
>>          13 |         17        9.19       17.30
>>          14 |         37       20.00       37.30
>>          15 |          7        3.78       41.08
>>          16 |         27       14.59       55.68
>>          17 |         13        7.03       62.70
>>          18 |          5        2.70       65.41
>>          19 |         14        7.57       72.97
>>          20 |         11        5.95       78.92
>>          21 |         23       12.43       91.35
>>          27 |         16        8.65      100.00
>> ------------+-----------------------------------
>>       Total |        185      100.00
>>
>>
>> . tab Q5
>>
>> 5 quantiles |
>>          of |
>>  S0D0_links |      Freq.     Percent        Cum.
>> ------------+-----------------------------------
>>           1 |         69       37.30       37.30
>>           2 |          7        3.78       41.08
>>           3 |         40       21.62       62.70
>>           4 |         53       28.65       91.35
>>           5 |         16        8.65      100.00
>> ------------+-----------------------------------
>>       Total |        185      100.00

```
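The "closest cutpoint" grouping that Martha asks for can be sketched in a few lines of Stata. This is only an illustration under stated assumptions, not how -xtile- works and not anything posted in the thread: it rebuilds the example data from the thread, shows the percentiles that -xtile- appears to cut at, and then, for each target of 20(20)80, cuts after the distinct value whose cumulative percent is nearest the target. The names cum, dist and near5 and the scalars dmin and cutval are hypothetical.

```stata
* Sketch only: example data as posted in the thread
clear
input value freq
11 11
12  4
13 17
14 37
15  7
16 27
17 13
18  5
19 14
20 11
21 23
27 16
end
expand freq

* Number of ways to split n ordered items into k contiguous groups
display comb(184, 4)    // n = 185 observations, k = 5
display comb(11, 4)     // 12 distinct values, k = 5

* The percentiles behind -xtile-, and the groups it produces
_pctile value, nq(5)
return list
xtile Q5 = value, nq(5)

* Cumulative proportion, identical for tied values
cumul value, gen(cum) equal

* Assumed alternative rule: for each target percent, cut after the
* distinct value whose cumulative percent is closest to the target.
* If two values are equally close, the larger one is used.
* If two targets pick the same value, a group number is skipped.
gen byte near5 = 1
foreach p of numlist 20(20)80 {
    gen double dist = abs(100 * cum - `p')
    summarize dist, meanonly
    scalar dmin = r(min)
    summarize value if dist == dmin, meanonly
    scalar cutval = r(max)
    replace near5 = near5 + 1 if value > cutval
    drop dist
}

tab1 Q5 near5
```

On the thread's data, -_pctile- should return 14, 15, 17 and 21, which matches the -xtile- groups of 69, 7, 40, 53 and 16. The sketch should instead cut after 13, 15, 17 and 20, giving frequencies of 32, 44, 40, 30 and 39. Like -xtile-, it never splits tied values, so exactly equal groups remain impossible here.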