
From: "Nick Cox" <n.j.cox@durham.ac.uk>

To: <statalist@hsphsun2.harvard.edu>

Subject: RE: st: RE: problem with dividing dataset into equally sized groups

Date: Fri, 5 Dec 2008 16:53:07 -0000

Yes indeed; but this is still arbitrary and (I believe) not reproducible.
Inside -egen, rank()- there is a -sort- that can not be made stable.
Normally this does not bite but with "unique" ranks it could.

Fuzzing with random noise as suggested is at least reproducible in that
you can set the seed.

Nick
n.j.cox@durham.ac.uk

David Elliott

Unique ranking before cutting will force as equal size groups as
possible while simply using -egen ..cut()- will not.

Eg:

    sysuse auto, clear
    * Per Martin's suggestion
    egen group1 = cut(mpg), group(4)
    lab var group1 "Just use cut(mpg)"
    tab group1
    * Alternate using ranking first
    egen rank = rank(mpg), unique
    egen group2=cut(rank), group(4)
    lab var group2 "Use rank(mpg), then cut(rank)"
    tab group2
    table group2 group1,stubw(15) row col

On Tue, Dec 2, 2008 at 10:45 AM, Nick Cox <n.j.cox@durham.ac.uk> wrote:
> Exactly equal-sized groups are only guaranteed if
>
> 1. the number of observations is an exact multiple of the number of
> groups (which usually bites minutely)
>
> 2. there are no problems with ties (which often bites substantially).
>
> Your problem is evidently #2.
>
> You can only force equal-sized groups if you assign the same value to
> different groups in at least some cases. You can always force that by
> perturbing your data with random noise before passing them to -xtile-,
> but that's hardly a satisfactory approach.
>
> But the whole approach is pretty unsatisfactory anyway: this kind of
> subdivision throws away information which is not obviously dispensable.
>
> I've not been following this thread carefully but my impression is that
> you've had some excellent advice from Maarten Buis that you've chosen to
> ignore. That's your prerogative, but you'll get diminishing returns from
> asking small variants on the same question. A modern approach to this
> uses some kind of smoothing to try to get over the granularity in your
> data, which you can do in a controlled way.
>
> Gisella Young
>
> I am trying to divide my dataset into equally sized groups on the basis
> of an income variable (eg 100 groups from lowest to highest income). I
> have tried several methods but the groups are not equally sized. For
> example,
>
> -xtile cat=income, n(100)-
> (similarly with pctile)
> and
> -sumdist income, n(100) qgp(cat)-
>
> It produces the desired number of groups but they are not equally sized.
> (Which I see by looking at the frequencies when I say -tab cat-
> thereafter). The differences are not small - some groups are many times
> larger than others. This is not because of weighting as I have tried
> even without weights. It is also not related to the size of groups. I
> wonder whether it might be because of clustering of incomes around
> certain values (e.g. 10 000, 15 000) and all of those values being
> lumped into certain categories.
>
> Can anyone suggest a way to partition the sample into equally sized
> groups?
>
> This actually stems from an earlier thread (but no need to read that for
> the above) about plotting a chart of income distribution with the
> occupational composition of each percentile. Austin's suggestion (below)
> comes close to that. However, even with his code the groups are not
> equally sized, but they are sized the same as when I use the sumdist or
> xtile commands mentioned above.

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
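The point at the top of this message, that random tie-breaking is at least reproducible once the seed is set, could be sketched roughly as follows. This is only an illustration and not code posted in the thread: the seed value and the variable names u, pos and group3 are assumptions chosen for the example, and the auto dataset is reused from the example above.

```stata
sysuse auto, clear

* Fix the seed so the random tie-breaker, and hence the grouping,
* can be reproduced exactly (2008 is an arbitrary, assumed value)
set seed 2008

* Random key used only to break ties in mpg (hypothetical name)
generate double u = runiform()

* Stop with an error if (mpg, u) fails to identify observations uniquely
isid mpg u

* Order by mpg, breaking ties with the random key
sort mpg u

* Position in the sorted data, then four groups as equal as possible
generate long pos = _n
generate byte group3 = ceil(4 * pos / _N)

tab group3
```

With the 74 observations in the auto data, 4 does not divide the sample size, so the groups can only differ in size by one. Tied values of mpg are still split across groups arbitrarily; the seed only makes that arbitrary split repeatable.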

**References**:

- **st: problem with dividing dataset into equally sized groups**, *From:* Gisella Young <gisellayoung@yahoo.com>
- **st: RE: problem with dividing dataset into equally sized groups**, *From:* "Nick Cox" <n.j.cox@durham.ac.uk>
- **Re: st: RE: problem with dividing dataset into equally sized groups**, *From:* "David Elliott" <dcelliott@gmail.com>
