Statalist The Stata Listserver



Re: st: correct way to divide the sample into deciles?


From   n j cox <n.j.cox@durham.ac.uk>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: correct way to divide the sample into deciles?
Date   Tue, 31 Oct 2006 22:00:49 +0000

You are correct. -group()- went undocumented in Stata 9.
This means the function -group()-, not the -egen- function
-group()-.
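
To make the distinction concrete, here is a small sketch using the auto dataset shipped with Stata. The variable names cat and quart are made up for illustration. The line using the function is left commented out, because the function is undocumented and I am only assuming, from the report in this thread, that it still runs.

```stata
sysuse auto, clear

* -egen- function -group()-: one integer code per distinct
* combination of the listed variables (missing if any is missing)
egen cat = group(foreign rep78)
tabulate cat, missing

* function -group()-: the undocumented one discussed in this thread;
* it works on the current sort order of the data
sort price
* generate quart = group(4)
```

The two share a name and nothing else: the -egen- function classifies on the values of variables, while the function -group()- cuts the data as currently sorted into a given number of pieces.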

It's still there, naturally, so that previous programs
and do files are not broken. But...

In essence, as I interpret the situation, -group()- is problematic
for various reasons. Here I take the standard use to be yours, namely

sort myvar
gen group = group(#)

1. The definition wasn't nearly precise enough to be any
use for really careful work. The on-line help for Stata 8
says

"group(x) creates a categorical variable that divides the data into x as
nearly equal-sized subsamples as possible, numbering the first group 1, the second group 2, etc."

but that's too vague for anyone to understand or reproduce. As -group()- is part of the executable, the code is not inspectable. The documentation could have been fixed, but that wasn't the only problem.

2. Examples show that -group()- can assign observations with
the same value of -myvar- to different groups. That would widely
be considered pathological, i.e. bad. It's only reproducible,
presumably, if you -set seed- and record that.

3. -group()- doesn't seem to pay special attention to missing values.

4. The name is overloaded. There is, as said, an -egen, group()- which
is different. Svend Juul threw some stones at StataCorp which
started a small avalanche of re-naming, in which StataCorp
tried to tackle various inconsistencies whereby the same
thing had different names and different things had the same
name among various functions and -egen- functions. -group()-
is far less useful, really, than -egen, group()-, so it was
a marked function from the start.

5. As in your case, people who want this really want quantiles
(e.g. deciles) instead, in most cases, and there are much better
documented Stata commands to do that. -search quantile- to
get some suggestions, but be warned that agreed-to-be-correct methods don't exist. There is a literature on different definitions of quantile,
hinging on what is to be done about ties and what you do when
the number of values is too awkward to be divisible in the way
you want.

6. Probably some more problems. Really, -group()- had passed its
sell-by date.
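
As a rough illustration of points 2, 3 and 5, here is a sketch with toy data. The rank-based split below is my own stand-in, not the algorithm inside -group()-, which is not documented. The names byrank, byxtile and split are invented for the example. -xtile- is one of the documented commands that -search quantile- should turn up.

```stata
clear
input myvar
1
2
2
2
3
3
4
5
.
.
end

* Rank-based split into 4 bins over nonmissing values only.
* An assumption for illustration: NOT what -group()- does internally.
sort myvar, stable
count if !missing(myvar)
local n = r(N)
generate byte byrank = ceil(4 * _n / `n') if !missing(myvar)

* Quantile-based split: bins are defined by cutpoints on myvar
xtile byxtile = myvar, nquantiles(4)

* Flag values of myvar that straddle more than one rank-based bin
bysort myvar (byrank): generate byte split = byrank[1] != byrank[_N] if !missing(myvar)

list, sepby(myvar)
```

The rank-based bins are equal in size but cut through the tied values of 2. -xtile- keeps tied values together and leaves missing values missing, at the cost of bins that need not be equal in size. Which trade-off you prefer is exactly the definitional question raised in point 5.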

Nick
n.j.cox@durham.ac.uk

Shourun Guo

I am wondering what is the correct way to divide a sample into 10 deciles based on the value of variable xyz. What I would do is:

sort xyz
gen decile=group(10)

The 'group' function will divide the sample into 10 as-nearly equal size
subgroups. Given the variable of interest is sorted beforehand, it looks fine to me. I am not sure whether this is the right way. Is there any other more accurate way to do the job?

Another question is that I just upgraded from STATA7 to STATA9. I couldn't find an explanation of the function 'group' in the STATA9 manuals or online documentation. The 'group' function under 'generate' still works as under STATA7 though. I am wondering whether 'group' is called by another name under STATA9.




