# st: problem with dividing dataset into equally sized groups

| From | To | Subject | Date |
|---|---|---|---|
| Gisella Young | statalist@hsphsun2.harvard.edu | st: problem with dividing dataset into equally sized groups | Tue, 2 Dec 2008 06:28:41 -0800 (PST) |

```I am trying to divide my dataset into equally sized groups on the basis of an income variable (eg 100 groups from lowest to highest income). I have tried several methods but the groups are not equally sized. For example,

-xtile cat=income, n(100)-
(similarly with pctile)
and
-sumdist income, n(100) qgp(cat)-

It produces the desired number of groups but they are not equally sized. (Which I see by looking at the frequencies when I say -tab cat- thereafter). The differences are not small - some groups are many times larger than others. This is not because of weighting as I have tried even without weights. It is also not related to the size of groups. I wonder whether it might be because of clustering of incomes around certain values (e.g. 10 000, 15 000) and all of those values being lumped into certain categories.

Can anyone suggest a way to partition the sample into equally sized groups?

This actually stems from an earlier thread (but no need to read that for the above) about plotting a chart of income distribution with the occupational composition of each percentile. Austin's suggestion (below) comes close to that. However, even with his code the groups are not equally sized, but they are sized the same as when I use the sumdist or xtile commands mentioned above.

best,
Gisella

--- On Mon, 12/1/08, Austin Nichols <austinnichols@gmail.com> wrote:

> From: Austin Nichols <austinnichols@gmail.com>
> Subject: Re: st: how to make an area graph showing distribution?
> To: statalist@hsphsun2.harvard.edu
> Date: Monday, December 1, 2008, 2:02 AM
> Gisella Young <gisellayoung@yahoo.com>:
> It may be that you are looking for a simple stacked bar graph over
> income quintiles or deciles or the like, as opposed to a parametric
> smooth over income quantiles.  If so, you might want to adapt
> this pair of example graphs to your needs:
>
> clear all
> sysuse nlsw88
> ren industry i
> tab i, g(ind)
> g w=round(uniform()*20)
> la var w "fake survey weight"
> _pctile wage [pw=w], nq(5)
> g q=1 if wage<=r(r1)
> forv i=2/5 {
>  replace q=`i' if wage>r(r`=`i'-1') & wage<=r(r`i')
>  }
> loc y
> forv i=1/12 {
>  loc l "`=substr("`: var la ind`i''",4,.)'"
>  loc y `"`y' lab(`i' "`l'")"'
>  loc lv`i' `"la var ind`i' "`l'" "'
>  }
> gr bar ind* [pw=w], stack over(q) name(b) leg(`y')
> collapse ind* [pw=w], by(q)
> forv i=2/12 {
>  replace ind`i'=ind`i'+ind`=`i'-1'
>  }
> loc v
> forv i=1/12 {
>  `lv`i''
>  loc v "ind`i' `v'"
>  }
> tw bar `v' q, name(tw)
>
> Note that the commands above destroy the data in memory, so make sure
> you -preserve- or -save- first as appropriate.  Also note that there
> is no guarantee that the distributions of income by occupation, or
> occupation by income category, display any sort of stochastic
> dominance that would allow easy ranking of occupations.
>
> http://www.stata.com/capabilities/graphexamples.html
>
>
> On Sun, Nov 30, 2008 at 10:37 AM, Maarten Buis
> <maartenbuis@yahoo.co.uk> wrote:
> > --- Gisella Young <gisellayoung@yahoo.com> wrote:
> >> On Maarten Buis's suggestion, I am not sure why I would really need
> >> a regression - I get from his email that this is basically for
> >> smoothing?
> >
> > Yes, as income in the example dataset (and I assume in yours as
> > well) is a continuous variable, there just aren't enough cases for each
> > income value to estimate the proportions.
> >
> >> Since I actually want to plot the actual data (but realise
> >> that this needs smoothing),
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/

```
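
The suspicion about clustered incomes is consistent with how -xtile- and -pctile- work: they classify observations by cutpoints, so all observations sharing one income value fall into the same category, and heavily tied values (e.g. 10 000) make some groups much larger than others. One possible workaround, sketched below and not taken from the thread, is to break ties at random and cut on the resulting rank. The sketch uses `wage` from the `nlsw88` example dataset as a stand-in for income; the variable names `u`, `rank`, and `cat` and the seed are arbitrary, and survey weights are ignored.

```stata
* A minimal sketch, assuming unweighted data and that splitting tied
* incomes across adjacent groups at random is acceptable.
sysuse nlsw88, clear

set seed 12345                      // arbitrary seed, for reproducibility
generate double u = runiform()      // random tie-breaker (use uniform() in older Stata)

sort wage u                         // ties in wage are ordered at random; missings sort last
generate long rank = _n if !missing(wage)

count if !missing(wage)
local N = r(N)                      // number of observations with a valid wage

* 100 groups whose sizes differ by at most one observation
generate int cat = ceil(100 * rank / `N')

tabulate cat
```

The trade-off is that observations with identical incomes can end up in different groups, so group membership near heavily tied values is arbitrary and depends on the seed.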