st: RE: gen newvar=group()
First, please note that you are sending MIME to
the list, contrary to a frequent and permanent request.
The rule is plain text only. Your posting is speckled
with e-gunk making it difficult to read.
The situation of the -group()- function has been explained
various times on this list, for example in
Here is a reprise based on those posts.
A note on the -group()- function (N.B. not -egen, group()-)
The -group()- function went undocumented in Stata 9.
You can blame Svend Juul. Svend gave a very
witty talk at a Berlin users' meeting pointing
out functions and -egen- functions with the same name, but
different definitions; the same definition,
but different names; and much else besides.
His talk is here:
So, StataCorp looked in the stables, and saw
that he was right. It was a mess. So, they
went to work renaming, and tidying a few
ancient oddities out of sight.
Actually, you can blame Nick Cox, who recalls
suggesting that -group()- be hidden, the
argument being that -egen, group()-,
although it came later, was far more
useful and far more widely used.
-group()-, which is a function,
is documented at version 8 [R] p.454.
group(n) divides the data into n nearly
equal groups, with integer values 1 to ceil(n).
That depends on the current sort order.
-group(varname)- appears equivalent to
... = group(max of varname)
<sort back again>
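
For comparison, nearly equal groups can be built explicitly from a stated sort order. What follows is only a sketch, assuming a variable -xb- with no missing values, as in the original question. It is not claimed to reproduce whatever -group()- does internally, and -order- and -gr_explicit- are made-up names.

```stata
* Sketch only: explicit, reproducible near-equal groups.
* Assumes xb exists and has no missing values.

* Remember the current order of the observations.
gen long order = _n

* Sort on xb; ties are broken by the original order, so the
* result is the same every time.
sort xb order

* Integers 1 to 5, each covering about a fifth of the data.
gen byte gr_explicit = ceil(5 * _n / _N)

* Sort back again.
sort order
drop order
```

Even with this construction, observations with the same value of -xb- can still fall on either side of a group boundary.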
It's still there, naturally, so that previous programs
and do files are not broken. But -group()- is problematic
for various reasons.
1. The definition wasn't nearly precise enough to be any
use for really careful work. The on-line help for Stata 8 says
"group(x) creates a categorical variable that divides the
data into x as nearly equal-sized subsamples as possible,
numbering the first group 1, the second group 2, etc."
but that's too vague for anyone to understand or reproduce.
As -group()- is part of the executable, the code is not
inspectable. The documentation could have been fixed,
but that wasn't the only problem.
2. Examples show that -group()- can assign observations with
the same value of -myvar- to different groups. That would widely
be considered pathological, i.e. bad. It's only reproducible,
presumably, if you -set seed- and record that.
3. -group()- doesn't seem to pay special attention to missing values.
That's very bad.
4. The name is overloaded. There is, as stated, an -egen, group()- which
5. In most cases, people who want this really want quantiles
(e.g. deciles) instead, and there are much better
documented Stata commands to do that. -search quantile- to
get some suggestions, but be warned that agreed-to-be-correct
methods don't exist. There is a literature on different definitions
of quantile, hinging on what is to be done about ties and what you do when
the number of values is too awkward to be divisible in the way
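
As one illustration of points 2, 3 and 5, here is a sketch using -xtile-, one of the documented commands for quantile-based groups. It assumes the variable -xb- from the original question; -gr_q- is a made-up name, and quintiles are just the example at hand.

```stata
* Sketch only: quintile groups of xb via -xtile-.
xtile gr_q = xb, nquantiles(5)

* Check that observations with missing xb were not assigned
* to any group.
assert missing(gr_q) if missing(xb)

* Check that each distinct value of xb sits in exactly one group.
bysort xb (gr_q): assert gr_q[1] == gr_q[_N]
```

The last check can equally be pointed at a variable produced by -group()-, by putting its name in place of -gr_q-, to see whether tied values were split across groups.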
> Dear Stata Users
> I am using this commands to group observations for risk score
> stcox age treat
> predict xb,xb
> drop if xb==.
> gen gr_risk = group(5)
> This last command works but I do not find where
> the function group() is documented (and why I still use it)