# st: RE: problem with dividing dataset into equally sized groups

From: "Nick Cox"

Subject: st: RE: problem with dividing dataset into equally sized groups

Date: Tue, 2 Dec 2008 14:45:58 -0000

```
Exactly equal-sized groups are only guaranteed if

1. the number of observations is an exact multiple of the number of
groups (which usually bites minutely)

2. there are no problems with ties (which often bites substantially).

You can only force equal-sized groups if you are willing to assign tied
(identical) values to different groups in at least some cases. You can
always force that by perturbing your data with random noise before
passing them to -xtile-, but that's hardly a satisfactory approach.

But the whole approach is pretty unsatisfactory anyway: this kind of
subdivision throws away information which is not obviously dispensable.

I've not been following this thread carefully but my impression is that
you've had some excellent advice from Maarten Buis that you've chosen to
ignore. That's your prerogative, but you'll get diminishing returns from
asking small variants on the same question. A modern approach to this
uses some kind of smoothing to try to get over the granularity in your
data, which you can do in a controlled way.

Nick
n.j.cox@durham.ac.uk

Gisella Young wrote:

I am trying to divide my dataset into equally sized groups on the basis
of an income variable (eg 100 groups from lowest to highest income). I
have tried several methods but the groups are not equally sized. For
example,

-xtile cat=income, nq(100)-
(similarly with pctile)
and
-sumdist income, n(100) qgp(cat)-

Each produces the desired number of groups, but they are not equally
sized (which I see from the frequencies when I run -tab cat-
afterwards). The differences are not small: some groups are many times
larger than others. This is not because of weighting, as the same thing
happens without weights. It is also not related to the size of groups. I
wonder whether it might be because of clustering of incomes around
certain values (e.g. 10 000, 15 000) and all of those values being
lumped into certain categories.

Can anyone suggest a way to partition the sample into equally sized
groups?

This actually stems from an earlier thread (but no need to read that for
the above) about plotting a chart of income distribution with the
occupational composition of each percentile. Austin's suggestion (below)
comes close to that. However, even with his code the groups are not
equally sized, but they are sized the same as when I use the sumdist or
xtile commands mentioned above.

```
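
The effect of ties described in the reply can be reproduced on made-up data. The following is only a minimal sketch, not code from the thread: the toy `income` values, the seed, and the variable names `rnk` and `grp` are assumptions for illustration. It first shows -xtile- producing unequal groups when incomes are heaped on round values, then forces equal sizes by breaking ties arbitrarily with the `unique` option of `egen`'s `rank()` function. As the reply warns, forcing equal sizes this way means identical incomes can land in different groups, so it is an illustration of the trade-off rather than a recommendation.

```stata
* Toy data: 1,000 incomes heaped on round thousands (hypothetical values)
clear
set seed 2008
set obs 1000
generate double income = 1000 * round(exp(rnormal(2.5, 0.5)))

* -xtile- keeps tied values together, so group sizes vary widely
* and some of the 100 categories may not occur at all
xtile cat = income, nq(100)
tabulate cat

* Forcing equal sizes: rank with ties broken arbitrarily,
* then cut the ranks 1..N into 100 consecutive bins
egen long rnk = rank(income), unique
quietly count if !missing(income)
generate int grp = ceil(100 * rnk / r(N))
tabulate grp
```

With 1,000 nonmissing observations, the second tabulation shows 10 observations in each of the 100 groups. When the number of observations is not an exact multiple of 100, group sizes will still differ by one, which is the first of the two conditions listed in the reply.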