# Re: st: RE: problem with dividing dataset into equally sized groups

From: "David Elliott"
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: RE: problem with dividing dataset into equally sized groups
Date: Wed, 3 Dec 2008 21:19:49 -0400

```
Ranking the values uniquely before cutting forces the groups to be as
nearly equal in size as possible, whereas simply using -egen ... cut()-
on the original variable does not, because tied values stay together.

E.g.:
sysuse auto, clear
* Per Martin's suggestion
egen group1 = cut(mpg), group(4)
lab var group1 "Just use cut(mpg)"
tab group1
* Alternate using ranking first
egen rank = rank(mpg), unique
egen group2=cut(rank), group(4)
lab var group2 "Use rank(mpg), then cut(rank)"
tab group2
table group2 group1,stubw(15) row col

DCE

On Tue, Dec 2, 2008 at 10:45 AM, Nick Cox <n.j.cox@durham.ac.uk> wrote:
> Exactly equal-sized groups are only guaranteed if
>
> 1. the number of observations is an exact multiple of the number of
> groups (which usually bites minutely)
>
> 2. there are no problems with ties (which often bites substantially).
>
> Your problem is evidently #2.
>
> You can only force equal-sized groups if you assign the same value to
> different groups in at least some cases. You can always force that by
> perturbing your data with random noise before passing them to -xtile-,
> but that's hardly a satisfactory approach.
>
> But the whole approach is pretty unsatisfactory anyway: this kind of
> subdivision throws away information which is not obviously dispensable.
>
> I've not been following this thread carefully but my impression is that
> you've had some excellent advice from Maarten Buis that you've chosen to
> ignore. That's your prerogative, but you'll get diminishing returns from
> asking small variants on the same question. A modern approach to this
> uses some kind of smoothing to try to get over the granularity in your
> data, which you can do in a controlled way.
>
> Nick
> n.j.cox@durham.ac.uk
>
> Gisella Young
>
> I am trying to divide my dataset into equally sized groups on the basis
> of an income variable (eg 100 groups from lowest to highest income). I
> have tried several methods but the groups are not equally sized. For
> example,
>
> -xtile cat=income, n(100)-
>  (similarly with pctile)
> and
> -sumdist income, n(100) qgp(cat)-
>
> It produces the desired number of groups but they are not equally sized.
> (Which I see by looking at the frequencies when I say -tab cat-
> thereafter). The differences are not small - some groups are many times
> larger than others. This is not because of weighting as I have tried
> even without weights. It is also not related to the size of groups. I
> wonder whether it might be because of clustering of incomes around
> certain values (e.g. 10 000, 15 000) and all of those values being
> lumped into certain categories.
>
> Can anyone suggest a way to partition the sample into equally sized
> groups?
>
> This actually stems from an earlier thread (but no need to read that for
> the above) about plotting a chart of income distribution with the
> occupational composition of each percentile. Austin's suggestion (below)
> comes close to that. However, even with his code the groups are not
> equally sized, but they are sized the same as when I use the sumdist or
> xtile commands mentioned above.
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>

--
David Elliott
```
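
A minimal sketch of how the rank-then-cut idea above might carry over to the income question quoted in the thread. The original dataset is not shown, so the data here are simulated; the variable names (`income`, `cat_xtile`, `rank_income`, `cat_rank`), the seed, the sample size, the rounding to 5000, and the choice of 10 groups are all assumptions for illustration only.

```stata
clear
set seed 12345
set obs 1000

* Simulated incomes heaped at round values (assumption: rounding to the
* nearest 5000 stands in for the clustering described in the question)
gen income = round(exp(rnormal(10, 0.5)), 5000)

* -xtile- keeps tied values in the same group, so group sizes can differ a lot
xtile cat_xtile = income, nquantiles(10)
tab cat_xtile

* Unique ranks break ties arbitrarily, so cutting the ranks gives
* groups that are much closer to equal in size
egen rank_income = rank(income), unique
egen cat_rank = cut(rank_income), group(10)
tab cat_rank
```

Because `rank(), unique` breaks ties arbitrarily, observations with identical incomes can land in different groups. That is the trade-off Nick points out in the quoted message: equal-sized groups are only possible if the same value is sometimes assigned to different groups.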