

From: Nick Cox <njcoxstata@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: how to force cutpoint in xtile
Date: Tue, 26 Jun 2012 10:51:10 +0100

Strange though it may seem, -xtile- doesn't try directly to equalize group frequencies. The -xtile- problem is a clustering problem, so that we should worry about the combinatorics of possible solutions, but -xtile- sensibly ignores that.

Suppose that we have n ordered values, which can be thought of as laid down in a series end to end like this:

1 2 3 5 7 ...

The general problem is to group them into k subseries. This can be thought of as placing (k - 1) markers; there are (n - 1) places between values where the markers can be placed and so comb(n - 1, k - 1) possible splits of n into k. That number grows explosively: with n = 185 and k = 5, comb(184, 4) is 46 million or so.

Now in practice we often, as here in Martha's problem, have ties. If we work with the rule that no distinct value can be "split" across quantile-based groups, then the problem is simpler to the extent that we have ties. In Martha's example her variable has 12 distinct values and so the number of possible splits comes down to comb(11, 4) = 330. But no general solution for -xtile- can be based on the _assumption_ that we will have lots of ties.

The other half of the problem is wanting groups to be of approximately equal frequencies, about which nothing has been said yet.

As said, -xtile- wisely avoids the problem raised by a combinatorial explosion of possibilities. Its main idea might be called a SCIBOC(*) algorithm or "Are we there yet?". Concretely, given a desire to split at 20(20)80% for 5 groups, it goes through the cumulative percents and asks: Have we passed 20% yet? If so, then 40%? and so on. (At least that's what the results seem to imply: I have _not_ read the code really closely.)

-xtile- has to be careful, because, for example, there might easily be fewer than k distinct values in practice even though the user asks for k quantile-based groups.

In principle, you could look at possible solutions and choose those with (e.g.)
the lowest variance of group frequencies, although I would still worry about the combinatorics; in any case, that is not what -xtile- does.

More positively, -xtile- often seems to work to its users' satisfaction, although there are intermittent naive questions of the form "Why doesn't -xtile- produce groups of exactly equal size?" when the values make such a split impossible. (This wasn't Martha's question.)

And there is one easy thing you can try: negate the variable and apply -xtile- to that.

Nick

(*) SCIBOC = small child in back of car. In my culture, it is standard that on any non-trivial family car journey, a small child will ask after some small time or distance "Are we there yet?" I guess that this is an invariant across cultures and even modes of transport (small child in back of canoe, small child on back of camel, etc.).

On Tue, Jun 26, 2012 at 4:10 AM, Skiles, Martha Priedeman <skiles@live.unc.edu> wrote:
> Thank you Nick.
> I was hoping that the xtile command would set the cutpoint at the closest break to the 20% rather than pass the 20% and then choose the closest. It sounds like that is not an option, rather I need to choose between quintiling as-is or reversing the order.
>
> Per your question about why I'd want to quintile, this is just a very small part of my output that I need in quintiles in order to compare relative (rather than absolute) positions. The "value" itself has no readily interpretable meaning, rather it is more helpful to think about relative groups and how that classification of quintile changes from one data run to another.
>
> I appreciate your taking the time to respond.
>
> Regards,
> Martha
>
> ________________________________________
> From: owner-statalist@hsphsun2.harvard.edu [owner-statalist@hsphsun2.harvard.edu] on behalf of Nick Cox [njcoxstata@gmail.com]
> Sent: Monday, June 25, 2012 7:36 PM
> To: statalist@hsphsun2.harvard.edu
> Subject: Re: st: how to force cutpoint in xtile
>
> The Stata version you are using is immaterial here.
>
> The over-arching problem (for you) is that -xtile- will not split
> observed values and that it declares a boundary when the appropriate
> cumulative percents (here 20(20)80 %) have been passed. With these
> data that bites as very unequal class frequencies.
>
> What you can do, given that algorithm, is negate the variable and apply
> -xtile- going the other way
>
> . input value freq
>
>          value       freq
>   1. 11 11
>   2. 12 4
>   3. 13 17
>   4. 14 37
>   5. 15 7
>   6. 16 27
>   7. 17 13
>   8. 18 5
>   9. 19 14
>  10. 20 11
>  11. 21 23
>  12. 27 16
>  13. end
>
> . expand freq
> (173 observations created)
>
> . gen negvalue = -value
>
> . xtile nQ5 = negvalue, nq(5)
>
> . tab nQ5
>
> 5 quantiles |
> of negvalue |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>           1 |         39       21.08       21.08
>           2 |         43       23.24       44.32
>           3 |         34       18.38       62.70
>           4 |         37       20.00       82.70
>           5 |         32       17.30      100.00
> ------------+-----------------------------------
>       Total |        185      100.00
>
> . xtile Q5 = value, nq(5)
>
> . tab Q5
>
> 5 quantiles |
>    of value |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>           1 |         69       37.30       37.30
>           2 |          7        3.78       41.08
>           3 |         40       21.62       62.70
>           4 |         53       28.65       91.35
>           5 |         16        8.65      100.00
> ------------+-----------------------------------
>       Total |        185      100.00
>
> But why are you doing this? The data are already in a small number
> of discrete values. Quintiles force 21 and 27 together, which
> underlines that you are throwing away important detail.
>
> Nick
>
> On Mon, Jun 25, 2012 at 10:21 PM, Skiles, Martha Priedeman
> <skiles@live.unc.edu> wrote:
>
>> I've used -xtile- in Stata 11 successfully, but am having difficulty with it in Stata 12. I have the following variable "S0D0_links" which I'd like to quintile (5 groups), but the -xtile- function is not creating groups where I would expect. Per below, I expected the first quintile to break at 17.3 cumulative percent rather than 37.3.
>> Can I force the cutpoint to be either closest to my 20/40/60/80/100 quintiles or always <20/<40/<60/etc?
>> I am able to force it by using "cumul" to generate a cumulative percent, and then write code using "ceil(5*cumpercent)" but I hope there's a better option. My preference is to have the cutpoint create quintiles as close to 20/40/60/etc as possible.
>>
>> Thank you,
>> Martha Skiles
>>
>> LOG:
>> tab S0D0_links
>>
>>  S0D0_links |      Freq.     Percent        Cum.
>> ------------+-----------------------------------
>>          11 |         11        5.95        5.95
>>          12 |          4        2.16        8.11
>>          13 |         17        9.19       17.30
>>          14 |         37       20.00       37.30
>>          15 |          7        3.78       41.08
>>          16 |         27       14.59       55.68
>>          17 |         13        7.03       62.70
>>          18 |          5        2.70       65.41
>>          19 |         14        7.57       72.97
>>          20 |         11        5.95       78.92
>>          21 |         23       12.43       91.35
>>          27 |         16        8.65      100.00
>> ------------+-----------------------------------
>>       Total |        185      100.00
>>
>> . xtile Q5=S0D0_links, nq(5)
>>
>> . tab Q5
>>
>> 5 quantiles |
>>          of |
>>  S0D0_links |      Freq.     Percent        Cum.
>> ------------+-----------------------------------
>>           1 |         69       37.30       37.30
>>           2 |          7        3.78       41.08
>>           3 |         40       21.62       62.70
>>           4 |         53       28.65       91.35
>>           5 |         16        8.65      100.00
>> ------------+-----------------------------------
>>       Total |        185      100.00

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
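As a rough sketch of the points made in this thread, the Stata lines below rebuild the data from the posted frequencies, count the possible splits with comb(n - 1, k - 1) for n = 185 observations and for 12 distinct values with k = 5, list the cut points that -_pctile- returns for five groups, and then apply the -cumul- plus ceil() rule described in the original question. It assumes that -xtile- groups on those -_pctile- cut points, which is consistent with the posted tables but is not confirmed here from the -xtile- code. The variable names cum and grp are made up for illustration.

```stata
* Sketch only, not -xtile-'s source: data rebuilt from the posted frequencies
clear
input value freq
11 11
12 4
13 17
14 37
15 7
16 27
17 13
18 5
19 14
20 11
21 23
27 16
end
expand freq

* Ways to place k - 1 = 4 markers in the gaps between ordered values
display comb(184, 4)    // 185 observations, every gap allowed
display comb(11, 4)     // 12 distinct values, ties never split

* Cut points for 5 groups: the values at which the cumulative
* percents 20(20)80 are reached (assumed basis for -xtile-)
_pctile value, nq(5)
return list

* Alternative rule from the original question: a group closes before
* its target percent is passed (cum and grp are hypothetical names)
cumul value, generate(cum) equal
generate byte grp = ceil(5 * cum)
tabulate grp
```

On these frequencies the grp groups hold 32, 37, 34, 43 and 39 observations, which is the nQ5 split from the negated variable read in the opposite order.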

**References:**

- **st: how to force cutpoint in xtile** *From:* "Skiles, Martha Priedeman" <skiles@live.unc.edu>
- **Re: st: how to force cutpoint in xtile** *From:* Nick Cox <njcoxstata@GMAIL.COM>
- **RE: st: how to force cutpoint in xtile** *From:* "Skiles, Martha Priedeman" <skiles@live.unc.edu>
