Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: how to group variables into equal number groups


From   Nick Cox <[email protected]>
To   [email protected]
Subject   Re: st: how to group variables into equal number groups
Date   Tue, 26 Mar 2013 15:25:56 +0000

Thanks to Marcello for the mention, but I think at best that kind of
graph will illustrate the problem, not solve it.

However, the problem is, as I understand it, at root insoluble. There
is a longer discussion in

SJ-12-4 pr0054  . . . . . . . . . . Speaking Stata: Matrices as look-up tables
        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J. Cox
        Q4/12   SJ 12(4):748--758                                (no commands)
        illustrates the use of matrices as look-up tables

but the nub of the matter is a single word: ties!

Here is the example from my paper above. If you want to get the
executive summary now, my advice is

1. Don't use this lousy method. It entails discarding information.

2. If you ignore #1, it is possible that you might improve on -xtile-
by using a different criterion at bin boundaries.

First, we use a moderately large dataset as example, so no one can
dismiss the phenomenon as characteristic of small datasets.

.  webuse nlswork, clear
(National Longitudinal Survey.  Young Women 14-26 years of age in 1968)

We use 10 groups.

.  xtile q_age=age, nq(10)

We first show that -xtile- is using the same results as -_pctile-:

.  _pctile age, nq(10)

.  ret li

scalars:
                 r(r1) =  21
                 r(r2) =  23
                 r(r3) =  24
                 r(r4) =  26
                 r(r5) =  28
                 r(r6) =  31
                 r(r7) =  33
                 r(r8) =  36
                 r(r9) =  38

and put these in a matrix:

.  matrix q = r(r1), r(r2), r(r3), r(r4), r(r5), r(r6), r(r7), r(r8), r(r9)

What did -xtile- do? This is a long way from equal frequencies! But
clearly if someone is (say) 24, they must be in the same group as
everybody else of the same age.

.  tab q_age

         10 |
  quantiles |
     of age |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      4,122       14.46       14.46
          2 |      3,062       10.74       25.20
          3 |      1,636        5.74       30.94
          4 |      2,980       10.45       41.39
          5 |      2,567        9.00       50.39
          6 |      3,614       12.68       63.07
          7 |      2,357        8.27       71.34
          8 |      3,543       12.43       83.76
          9 |      1,824        6.40       90.16
         10 |      2,805        9.84      100.00
------------+-----------------------------------
      Total |     28,510      100.00
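The role of ties can be seen directly by tabulating age itself. This is only a small sketch using the same dataset: the Cum. column shows the only cumulative percentages at which a group boundary can fall, because a boundary cannot split observations that share the same age.

```stata
* Sketch: distribution of the distinct values of age.
* Any bin boundary must coincide with one of the cumulative
* percentages shown, so exact 10% steps are generally unattainable.
webuse nlswork, clear
tabulate age
```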

We can reproduce that using the results of -_pctile-.

.  gen q_age2 = 10 if age < .
(24 missing values generated)

.  quietly forval i = 9(-1)1 {
replace q_age2 = `i' if age <= q[1, `i']
}

.  assert q_age == q_age2

No news is good news here.
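As an aside, the same check can be sketched without a loop by using the built-in function irecode(), which returns 0 if its first argument is <= the first cut point, 1 if it lies above the first and <= the second, and so on. The variable name q_age2b is made up for this illustration; it is not from the paper.

```stata
* Sketch: reproduce -xtile- from the -_pctile- results with irecode().
* q_age2b is a hypothetical name used only for this illustration.
webuse nlswork, clear
xtile q_age = age, nq(10)
_pctile age, nq(10)

* irecode() counts from 0, so add 1; it returns missing when age is missing
gen q_age2b = 1 + irecode(age, r(r1), r(r2), r(r3), r(r4), r(r5), r(r6), r(r7), r(r8), r(r9))

assert q_age == q_age2b
```

Note that irecode() only offers the <= criterion, so it is no help with the alternative boundary rule.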

I have _one_ suggestion here (apart from not using this lousy method).
Try a different criterion at the boundary: use < rather than <= when
comparing with each cut point.

.  gen q_age3 = 10 if age < .
(24 missing values generated)

.  quietly forval i = 9(-1)1 {
replace q_age3 = `i' if age < q[1, `i']
}

We now have a different classification.

.  tab q_age3

     q_age3 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      2,805        9.84        9.84
          2 |      2,775        9.73       19.57
          3 |      1,604        5.63       25.20
          4 |      3,202       11.23       36.43
          5 |      2,731        9.58       46.01
          6 |      3,662       12.84       58.85
          7 |      2,314        8.12       66.97
          8 |      3,677       12.90       79.87
          9 |      2,067        7.25       87.12
         10 |      3,673       12.88      100.00
------------+-----------------------------------
      Total |     28,510      100.00

But many people changed decile groups!

.  count if q_age != q_age3
11647
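To see where those changes occur, a cross-tabulation of the two classifications may help. This is a self-contained sketch that repeats the steps above. Because the cut points here are distinct, observations whose age equals a cut point should move up exactly one group and all others should stay put; the final -assert- checks that.

```stata
* Sketch: compare the <= and < boundary rules group by group.
webuse nlswork, clear
xtile q_age = age, nq(10)
_pctile age, nq(10)
matrix q = r(r1), r(r2), r(r3), r(r4), r(r5), r(r6), r(r7), r(r8), r(r9)

* classification using < at each cut point
gen q_age3 = 10 if age < .
quietly forval i = 9(-1)1 {
    replace q_age3 = `i' if age < q[1, `i']
}

* rows: -xtile- groups; columns: groups under the < rule
tabulate q_age q_age3

* each observation either stays or moves up by one group
assert inlist(q_age3 - q_age, 0, 1) if age < .
```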

.  qui tab q_age, matcell(freq)

.  qui tab q_age3, matcell(freq3)

.  gen freq = freq[_n,1]
(28524 missing values generated)

.  gen freq3 = freq3[_n,1]
(28524 missing values generated)

.  su freq*

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        freq |        10        2851    788.4865       1636       4122
       freq3 |        10        2851    716.4114       1604       3677

At best, we see that the second classification has groups of rather
more equal size (as measured by the SD of group frequency).
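The group frequencies can also be summarized without matrices, for example with -contract-. The following is only a sketch for q_age; the variable name n is made up for illustration, and the same steps could be applied to q_age3.

```stata
* Sketch: summarize group sizes using -contract- instead of matcell().
webuse nlswork, clear
xtile q_age = age, nq(10)

preserve
* one observation per group, with its frequency in n;
* nomiss drops the observations with missing q_age
contract q_age, freq(n) nomiss
summarize n
restore
```

The mean and SD reported should agree with the freq row of the -summarize- output above.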

Here is the code in one:

webuse nlswork, clear
xtile q_age = age, nq(10)
_pctile age, nq(10)
ret li
matrix q = r(r1), r(r2), r(r3), r(r4), r(r5), r(r6), r(r7), r(r8), r(r9)
tab q_age
gen q_age2 = 10 if age < .
quietly forval i = 9(-1)1 {
    replace q_age2 = `i' if age <= q[1, `i']
}
assert q_age == q_age2

gen q_age3 = 10 if age < .
quietly forval i = 9(-1)1 {
    replace q_age3 = `i' if age < q[1, `i']
}

tab q_age3
count if q_age != q_age3
qui tab q_age, matcell(freq)
qui tab q_age3, matcell(freq3)
gen freq = freq[_n,1]
gen freq3 = freq3[_n,1]
su freq*




On Tue, Mar 26, 2013 at 2:43 PM, Marcello Pagano
<[email protected]> wrote:
> Try
>
> findit eqprhistogram
>
> it will lead you to Nick Cox's plot of what you are looking for.
>
> m.p.
>
>
>
> On 3/26/2013 10:30 AM, Xixi Lin wrote:

>>   I am trying to make independent variables into decile groups, and I
>> used xtile decile=x1 if Period==`z', nq(10); however, it turns out
>> that xtile does not make equal number of the 10 groups, is there any
>> way to force stata to divide them into equal number of obs or almost
>> equal number of obs?