# st: RE: how to generate groups based on some characteristics and obtain the mean/median value for each group

 From: "Paul Seed" | Subject: st: RE: how to generate groups based on some characteristics and obtain the mean/median value for each group | Date: Thu, 20 Jun 2002 15:41:49 +0100

```
Yi, Bingsheng <byi@coba.usf.edu> wrote:

> I've tried the codes provided by N Winter, it works but there is still a
> problem. The codes cannot ensure that there are at least 10 firms within
> each final group, I also tried other way, but the results are similar. I
> can't figure out the reason and have to seek your helps again.
>
<snip>

The issue seems to be that the code splits a group into smaller groups
if _any_ subgroup has 10 or more members, while what is wanted is to split
it only if _all_ subgroups are that big.

To make this clearer, consider a much simpler data set.
Suppose there are 11 firms with code 1041, 1 firm with code 1044,
and no others.

This is what you get:
ind4 ind3 ind2 ind1 f   industry
1041 104  10   1    11  1041
1044 104  10   1    1   104

The program first correctly identifies 104 as a proper group with
at least 10 members, and gives all 12 firms the industry code 104.
It then identifies a further group, 1041, with at least 10 members,
and gives those firms the code 1041, but does not change the code of
the other entry, which is left alone in group 104.  This, it seems,
is not wanted.

Presumably what is wanted is this:

ind4 ind3 ind2 ind1 f   industry
1041 104  10   1    11  104
1044 104  10   1    1   104

This can be achieved by changing the code only slightly.

gen str4 ind3=substr(ind4,1,3)
gen str4 ind2=substr(ind4,1,2)
gen str4 ind1=substr(ind4,1,1)
* count the firms in each group at every level
forval i=1/4 {
sort ind`i'
by ind`i': gen num`i'=_N
}

* group the records
gen str4 industry=ind1
* exclude an industry if it contains fewer than 10 firms
drop if num1<10
* move to a finer code only if every subgroup of the parent has 10 or more firms
forval i=2/4 {
local j = `i' - 1
egen num`i'_min = min(num`i'), by(ind`j')
replace industry=ind`i' if num`i'_min>=10
}
sort industry
by industry: gen _freq=_N
list ind4 industry _freq if _freq<10

However, it might be that what is wanted is:

ind4 ind3 ind2 ind1 f   industry
1041 104  10   1    11  1041
1044 104  10   1    1   104       <dropped>

In that case, it is sufficient to follow the old code with

bysort industry : gen num = _N
drop if num < 10

Paul Seed
```
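The 12-firm example from the post can be checked directly. What follows is only a minimal sketch: it builds the toy data by hand, assumes `ind4` is stored as a string, and uses a variable named `freq` that is not in the original code. The grouping steps follow the rule described in the post (move to a finer code only if every subgroup of the parent group has at least 10 firms).

```stata
* Toy data: 11 firms with code 1041 and 1 firm with code 1044
clear
set obs 12
gen str4 ind4 = "1041"
replace ind4 = "1044" in 12

* Coarser codes are prefixes of the 4-digit code
gen str3 ind3 = substr(ind4, 1, 3)
gen str2 ind2 = substr(ind4, 1, 2)
gen str1 ind1 = substr(ind4, 1, 1)

* Number of firms in each group at every level
forvalues i = 1/4 {
    bysort ind`i': gen num`i' = _N
}

* Start from the 1-digit group; drop it if it has fewer than 10 firms
gen str4 industry = ind1
drop if num1 < 10

* Refine only if the smallest subgroup of the parent has 10 or more firms
forvalues i = 2/4 {
    local j = `i' - 1
    egen num`i'_min = min(num`i'), by(ind`j')
    replace industry = ind`i' if num`i'_min >= 10
}

* Final group sizes (freq is a hypothetical name used only in this sketch)
bysort industry: gen freq = _N
tabulate industry

* In this toy data the split into 1041/1044 is refused, so all firms stay in 104
assert industry == "104"
assert freq == 12
```

With these data the loop accepts the moves to `10` and `104` (each has 12 firms) but refuses the move to the 4-digit level, because the smallest 4-digit subgroup of 104 has only 1 firm.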