Statalist The Stata Listserver


st: RE: gen newvar=group()

From   "Nick Cox" <>
To   <>
Subject   st: RE: gen newvar=group()
Date   Sun, 11 Mar 2007 17:04:43 -0000

First, please note that you are sending MIME to 
the list, contrary to a frequent and permanent request. 
The rule is plain text only. Your posting is speckled
with e-gunk making it difficult to read. 

The situation of the -group()- function has been explained 
various times on this list, for example in

Here is a reprise based on those posts. 

A note on the -group()- function (N.B. not -egen, group()-)

The -group()- function went undocumented in Stata 9. 

You can blame Svend Juul. Svend gave a very 
witty talk at a Berlin users' meeting pointing 
out functions and -egen- functions with the same name, but 
different definitions; the same definition, 
but different names; and much else besides. 

His talk is here:

So, StataCorp looked in the stables, and saw
that he was right. It was a mess. So, they 
went to work renaming, and tidying a few 
ancient oddities out of sight. 

Actually, you can blame Nick Cox, who recalls
suggesting that -group()- be hidden, the 
argument being that -egen, group()-, 
although it came later, was far more 
useful and far more widely used. 

-group()-, which is a function,  
is documented at version 8 [R] p.454. 

group(n) divides the data into n nearly 
equal groups, with integer values 1 to ceil(n). 
That depends on the current sort order. 
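
As a rough illustration of what "nearly equal groups, depending on the
current sort order" can mean, here is a sketch using the auto data
shipped with Stata. The formula ceil(5 * _n / _N) is an assumption made
for illustration only and is not claimed to reproduce -group(5)-
exactly; -chunk- is a hypothetical variable name.

```stata
* Sketch only: an order-based split into 5 nearly equal chunks.
* Assumption: this is NOT the documented definition of group().
sysuse auto, clear
sort mpg
generate chunk = ceil(5 * _n / _N)
tabulate chunk
```

Because a split like this is made on observation number, tied values
of -mpg- can land in different chunks, and which ones do depends on
how -sort- happens to order the ties.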

-group(varname)- appears equivalent to 

sort varname 
... = group(max of varname)
<sort back again>

It's still there, naturally, so that previous programs
and do files are not broken. But -group()- is problematic
for various reasons. 

1. The definition wasn't nearly precise enough to be any
use for really careful work. The on-line help for Stata 8 said

"group(x) creates a categorical variable that divides the 
data into x as nearly equal-sized subsamples as possible, 
numbering the first group 1, the second group 2, etc."

but that's too vague for anyone to understand or reproduce. 
As -group()- is part of the executable, the code is not 
inspectable. The documentation could have been fixed, 
but that wasn't the only problem.

2. Examples show that -group()- can assign observations with
the same value of -myvar- to different groups. That would widely
be considered pathological, i.e. bad. It's only reproducible,
presumably, if you -set seed- and record that.

3. -group()- doesn't seem to pay special attention to missing values.
That's very bad. 

4. The name is overloaded. There is, as stated, an -egen, group()- which
is different. 

5. In most cases, people who want this really want quantiles
(e.g. deciles) instead, and there are much better
documented Stata commands to do that. -search quantile- to
get some suggestions, but be warned that agreed-to-be-correct 
methods don't exist. There is a literature on different definitions 
of quantile, hinging on what is to be done about ties and what you do when
the number of values is too awkward to be divisible in the way
you want.
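
For the risk score example in the question, a sketch of the better
documented alternatives might look like this. It uses the auto data
shipped with Stata, with -mpg- standing in for the poster's -xb- (an
assumption made only so that the example is self-contained); -quint-
and -cat- are hypothetical variable names.

```stata
sysuse auto, clear

* Quintile groups of a continuous variable; tied values share a group
xtile quint = mpg, nquantiles(5)

* By contrast, egen's group() numbers the distinct combinations of
* values of its varlist; missing in any variable gives missing
egen cat = group(foreign rep78)

tabulate quint
tabulate cat, missing
```

-xtile- implements only one of several possible definitions of
quantile groups, so check its documentation for how the cutpoints
are computed before relying on it.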


Enzo Coviello 
> Dear Stata Users
> I am using these commands to group observations by risk score:
> stcox age treat
> predict xb,xb
> drop if xb==.
> gen gr_risk = group(5)
> This last command works but I do not find where
> the function group() is documented (and why I still use it)
