
# Re: st: how to group variables into equal number groups

From: Nick Cox
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: how to group variables into equal number groups
Date: Tue, 26 Mar 2013 15:25:56 +0000

```Thanks to Marcello for the mention, but I think at best that kind of
graph will illustrate the problem, not solve it.

However, the problem is, as I understand it, at root insoluble. There
is a longer discussion in

SJ-12-4 pr0054  . . . . . . . . . . Speaking Stata: Matrices as look-up tables
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J. Cox
Q4/12   SJ 12(4):748--758                                (no commands)
illustrates the use of matrices as look-up tables

but the nub of the matter is a single word: ties!

Here is the example from my paper above. If you want to get the
executive summary now, my advice is

1. Don't use this lousy method. It entails discarding information.

2. If you ignore #1, it is possible that you might improve on -xtile-
by using a different criterion at bin boundaries.

First, we use a moderately large dataset as example, so no one can
dismiss the phenomenon as characteristic of small datasets.

.  webuse nlswork, clear
(National Longitudinal Survey.  Young Women 14-26 years of age in 1968)

We use 10 groups.

.  xtile q_age=age, nq(10)

We first show that -xtile- is using the same results as -_pctile-:

.  _pctile age, nq(10)

.  ret li

scalars:
r(r1) =  21
r(r2) =  23
r(r3) =  24
r(r4) =  26
r(r5) =  28
r(r6) =  31
r(r7) =  33
r(r8) =  36
r(r9) =  38

and put these in a matrix:

.  matrix q = r(r1), r(r2), r(r3), r(r4), r(r5), r(r6), r(r7), r(r8), r(r9)

What did -xtile- do? This is a long way from equal frequencies! But
clearly if someone is (say) 24, they must be in the same group as
everybody else of the same age.

.  tab q_age

         10 |
  quantiles |
     of age |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      4,122       14.46       14.46
          2 |      3,062       10.74       25.20
          3 |      1,636        5.74       30.94
          4 |      2,980       10.45       41.39
          5 |      2,567        9.00       50.39
          6 |      3,614       12.68       63.07
          7 |      2,357        8.27       71.34
          8 |      3,543       12.43       83.76
          9 |      1,824        6.40       90.16
         10 |      2,805        9.84      100.00
------------+-----------------------------------
      Total |     28,510      100.00

We can reproduce that using the results of -_pctile-.

.  gen q_age2 = 10 if age < .
(24 missing values generated)

.  quietly forval i = 9(-1)1 {
       replace q_age2 = `i' if age <= q[1, `i']
   }

.  assert q_age == q_age2

No news is good news here.

I have _one_ suggestion here (apart from not using this lousy method).
Try a different criterion at the boundary.

.  gen q_age3 = 10 if age < .
(24 missing values generated)

.  quietly forval i = 9(-1)1 {
       replace q_age3 = `i' if age < q[1, `i']
   }

We now have a different classification.

.  tab q_age3

     q_age3 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      2,805        9.84        9.84
          2 |      2,775        9.73       19.57
          3 |      1,604        5.63       25.20
          4 |      3,202       11.23       36.43
          5 |      2,731        9.58       46.01
          6 |      3,662       12.84       58.85
          7 |      2,314        8.12       66.97
          8 |      3,677       12.90       79.87
          9 |      2,067        7.25       87.12
         10 |      3,673       12.88      100.00
------------+-----------------------------------
      Total |     28,510      100.00

But many people changed decile groups!

.  count if q_age != q_age3
11647

.  qui tab q_age, matcell(freq)

.  qui tab q_age3, matcell(freq3)

.  gen freq = freq[_n,1]
(28524 missing values generated)

.  gen freq3 = freq3[_n,1]
(28524 missing values generated)

.  su freq*

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        freq |        10        2851    788.4865       1636       4122
       freq3 |        10        2851    716.4114       1604       3677

At best, we see that the second classification has groups of rather
more equal size (as measured by the SD of group frequency).

Here is the code in one:

webuse nlswork, clear
xtile q_age=age, nq(10)
_pctile age, nq(10)
ret li
matrix q = r(r1), r(r2), r(r3), r(r4), r(r5), r(r6), r(r7), r(r8), r(r9)
tab q_age
gen q_age2 = 10 if age < .
quietly forval i = 9(-1)1 {
    replace q_age2 = `i' if age <= q[1, `i']
}
assert q_age == q_age2

gen q_age3 = 10 if age < .
quietly forval i = 9(-1)1 {
    replace q_age3 = `i' if age < q[1, `i']
}

tab q_age3
count if q_age != q_age3
qui tab q_age, matcell(freq)
qui tab q_age3, matcell(freq3)
gen freq = freq[_n,1]
gen freq3 = freq3[_n,1]
su freq*

On Tue, Mar 26, 2013 at 2:43 PM, Marcello Pagano
<pagano@hsph.harvard.edu> wrote:
> Try
>
> findit eqprhistogram
>
> it will lead you to Nick Cox's plot of what you are looking for.
>
> m.p.
>
>
>
> On 3/26/2013 10:30 AM, Xixi Lin wrote:

>>   I am trying to make independent variables into decile groups, and I
>> used xtile decile=x1 if Period==`z', nq(10); however, it turns out
>> that xtile does not make equal number of the 10 groups, is there any
>> way to force stata to divide them into equal number of obs or almost
>> equal number of obs?
```
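The post's point that ties are the nub of the matter can be checked directly. The following is a minimal sketch, not part of the original message, that counts how many observations sit exactly on each decile cut point of age. The local macro names `cut1` to `cut9` are illustrative assumptions. The cut points are copied into locals first because `count` is itself an r-class command and would overwrite the results left behind by `_pctile`.

```stata
webuse nlswork, clear
_pctile age, nq(10)

* Save the nine cut points before any other r-class command replaces r()
forvalues i = 1/9 {
    local cut`i' = r(r`i')
}

* Count the observations tied at each cut point
forvalues i = 1/9 {
    quietly count if age == `cut`i''
    display "cut point `i' (age `cut`i''): " r(N) " observations tied"
}
```

Each of these tied blocks has to go wholly into one group or the other, depending on whether the boundary rule is `<=` or `<`, which is why the two classifications in the post differ so much and why neither yields equal frequencies.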