Re: st: RE: problem with dividing dataset into equally sized groups

From   "David Elliott"
Subject   Re: st: RE: problem with dividing dataset into equally sized groups
Date   Wed, 3 Dec 2008 21:19:49 -0400

Ranking the observations uniquely before cutting forces the groups to be
as nearly equal in size as possible, whereas simply using -egen, cut()-
does not, because tied values always end up in the same group.

sysuse auto, clear
* Per Martin's suggestion: cut mpg directly (tied values stay together)
egen group1 = cut(mpg), group(4)
lab var group1 "Just use cut(mpg)"
tab group1
* Alternative: rank first; -unique- breaks ties arbitrarily (ranks 1..N)
egen rank = rank(mpg), unique
egen group2 = cut(rank), group(4)
lab var group2 "Use rank(mpg), then cut(rank)"
tab group2
* Cross-tabulate the two groupings to compare them
table group2 group1, stubw(15) row col
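
For the original question, the same idea can be written without -cut()-
by mapping unique ranks straight to group numbers. This is only a sketch
on the auto data: the names ngroups, urank and grp are made up for
illustration, and it assumes unweighted data. Because -unique- allocates
ties arbitrarily, which tied observations land in which group may differ
from run to run.

```stata
* Sketch only: equal-sized groups from unique ranks (illustrative names)
sysuse auto, clear
local ngroups 4

* Unique ranks 1..N; tied mpg values are ordered arbitrarily
egen urank = rank(mpg), unique

* Count nonmissing values so that missings do not distort group sizes
quietly count if !missing(mpg)
local N = r(N)

* The observation with rank r goes to group ceil(r * ngroups / N)
generate grp = ceil(urank * `ngroups' / `N') if !missing(urank)
tabulate grp
```

Group sizes built this way differ by at most one observation. For 100
income groups, the same steps should apply with income in place of mpg
and ngroups set to 100, though observations with the same income will
then be split across neighbouring groups.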


On Tue, Dec 2, 2008 at 10:45 AM, Nick Cox wrote:
> Exactly equal-sized groups are only guaranteed if
>
> 1. the number of observations is an exact multiple of the number of
> groups (which usually bites minutely)
>
> 2. there are no problems with ties (which often bites substantially).
>
> Your problem is evidently #2.
>
> You can only force equal-sized groups if you assign the same value to
> different groups in at least some cases. You can always force that by
> perturbing your data with random noise before passing them to -xtile-,
> but that's hardly a satisfactory approach.
>
> But the whole approach is pretty unsatisfactory anyway: this kind of
> subdivision throws away information which is not obviously dispensable.
>
> I've not been following this thread carefully but my impression is that
> you've had some excellent advice from Maarten Buis that you've chosen to
> ignore. That's your prerogative, but you'll get diminishing returns from
> asking small variants on the same question. A modern approach to this
> uses some kind of smoothing to try to get over the granularity in your
> data, which you can do in a controlled way.
>
> Nick
>
> Gisella Young
> I am trying to divide my dataset into equally sized groups on the basis
> of an income variable (eg 100 groups from lowest to highest income). I
> have tried several methods but the groups are not equally sized. For
> example,
>
> -xtile cat=income, n(100)-
>  (similarly with pctile)
> and
> -sumdist income, n(100) qgp(cat)-
>
> It produces the desired number of groups but they are not equally sized.
> (Which I see by looking at the frequencies when I say -tab cat-
> thereafter). The differences are not small - some groups are many times
> larger than others. This is not because of weighting as I have tried
> even without weights. It is also not related to the size of groups. I
> wonder whether it might be because of clustering of incomes around
> certain values (e.g. 10 000, 15 000) and all of those values being
> lumped into certain categories.
>
> Can anyone suggest a way to partition the sample into equally sized
> groups?
>
> This actually stems from an earlier thread (but no need to read that for
> the above) about plotting a chart of income distribution with the
> occupational composition of each percentile. Austin's suggestion (below)
> comes close to that. However, even with his code the groups are not
> equally sized, but they are sized the same as when I use the sumdist or
> xtile commands mentioned above.

David Elliott