

From: Nick Cox <njcoxstata@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: how to group variables into equal number groups
Date: Tue, 26 Mar 2013 15:25:56 +0000

Thanks to Marcello for the mention, but I think at best that kind of graph will illustrate the problem, not solve it. However, the problem is, as I understand it, at root insoluble.

There is a longer discussion in

    SJ-12-4 pr0054  . . . . . . . . . .  Speaking Stata: Matrices as look-up tables
            . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J. Cox
            Q4/12   SJ 12(4):748--758                                (no commands)
            illustrates the use of matrices as look-up tables

but the nub of the matter is a single word: ties!

Here is the example from my paper above. If you want to get the executive summary now, my advice is

1. Don't use this lousy method. It entails discarding information.

2. If you ignore #1, it is possible that you might improve on -xtile- by using a different criterion at bin boundaries.

First, we use a moderately large dataset as example, so no one can dismiss the phenomenon as characteristic of small datasets.

    . webuse nlswork, clear
    (National Longitudinal Survey.  Young Women 14-26 years of age in 1968)

We use 10 groups.

    . xtile q_age=age, nq(10)

We first show that -xtile- is using the same results as -_pctile-:

    . _pctile age, nq(10)

    . ret li

    scalars:
                     r(r1) =  21
                     r(r2) =  23
                     r(r3) =  24
                     r(r4) =  26
                     r(r5) =  28
                     r(r6) =  31
                     r(r7) =  33
                     r(r8) =  36
                     r(r9) =  38

and put these in a matrix:

    . matrix q = r(r1), r(r2), r(r3), r(r4), r(r5), r(r6), r(r7), r(r8), r(r9)

What did -xtile- do? This is a long way from equal frequencies! But clearly if someone is (say) 24, they must be in the same group as everybody else of the same age.

    . tab q_age

             10 |
      quantiles |
         of age |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |      4,122       14.46       14.46
              2 |      3,062       10.74       25.20
              3 |      1,636        5.74       30.94
              4 |      2,980       10.45       41.39
              5 |      2,567        9.00       50.39
              6 |      3,614       12.68       63.07
              7 |      2,357        8.27       71.34
              8 |      3,543       12.43       83.76
              9 |      1,824        6.40       90.16
             10 |      2,805        9.84      100.00
    ------------+-----------------------------------
          Total |     28,510      100.00

We can reproduce that using the results of -_pctile-.

    . gen q_age2 = 10 if age < .
    (24 missing values generated)

    . quietly forval i = 9(-1)1 {
        replace q_age2 = `i' if age <= q[1, `i']
      }

    . assert q_age == q_age2

No news is good news here.

I have _one_ suggestion here (apart from not using this lousy method). Try a different criterion at the boundary.

    . gen q_age3 = 10 if age < .
    (24 missing values generated)

    . quietly forval i = 9(-1)1 {
        replace q_age3 = `i' if age < q[1, `i']
      }

We now have a different classification.

    . tab q_age3

         q_age3 |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |      2,805        9.84        9.84
              2 |      2,775        9.73       19.57
              3 |      1,604        5.63       25.20
              4 |      3,202       11.23       36.43
              5 |      2,731        9.58       46.01
              6 |      3,662       12.84       58.85
              7 |      2,314        8.12       66.97
              8 |      3,677       12.90       79.87
              9 |      2,067        7.25       87.12
             10 |      3,673       12.88      100.00
    ------------+-----------------------------------
          Total |     28,510      100.00

But many people changed decile groups!

    . count if q_age != q_age3
    11647

    . qui tab q_age, matcell(freq)

    . qui tab q_age3, matcell(freq3)

    . gen freq = freq[_n,1]
    (28524 missing values generated)

    . gen freq3 = freq3[_n,1]
    (28524 missing values generated)

    . su freq*

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
            freq |        10        2851    788.4865       1636       4122
           freq3 |        10        2851    716.4114       1604       3677

At best, we see that the second classification has groups of rather more equal size (as measured by the SD of group frequency).

Here is the code in one:

    webuse nlswork, clear
    xtile q_age=age, nq(10)
    _pctile age, nq(10)
    ret li
    matrix q = r(r1), r(r2), r(r3), r(r4), r(r5), r(r6), r(r7), r(r8), r(r9)
    tab q_age
    gen q_age2 = 10 if age < .
    quietly forval i = 9(-1)1 {
        replace q_age2 = `i' if age <= q[1, `i']
    }
    assert q_age == q_age2
    gen q_age3 = 10 if age < .
    quietly forval i = 9(-1)1 {
        replace q_age3 = `i' if age < q[1, `i']
    }
    tab q_age3
    count if q_age != q_age3
    qui tab q_age, matcell(freq)
    qui tab q_age3, matcell(freq3)
    gen freq = freq[_n,1]
    gen freq3 = freq3[_n,1]
    su freq*

On Tue, Mar 26, 2013 at 2:43 PM, Marcello Pagano <pagano@hsph.harvard.edu> wrote:
> Try
>
> findit eqprhistogram
>
> it will lead you to Nick Cox's plot of what you are looking for.
>
> m.p.
>
> On 3/26/2013 10:30 AM, Xixi Lin wrote:
>> I am trying to make independent variables into decile groups, and I
>> used xtile decile=x1 if Period==`z', nq(10); however, it turns out
>> that xtile does not make equal number of the 10 groups, is there any
>> way to force stata to divide them into equal number of obs or almost
>> equal number of obs?
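
A footnote on why so many observations move between the two classifications in the reply above: the only observations that can change group when <= is swapped for < are those whose age is exactly equal to one of the nine cut points. The following is a minimal sketch that counts them directly. The local macros cut1 to cut9 and ntied are names made up for this illustration, and the loop assumes that the nine cut points are distinct, as they are for age in this dataset. If the reasoning holds, the total should agree with the result of -count if q_age != q_age3-.

```stata
webuse nlswork, clear
_pctile age, nq(10)

* -count- is r-class and overwrites r(), so copy the cut points
* into local macros first (cut1 ... cut9 are illustrative names)
forval i = 1/9 {
    local cut`i' = r(r`i')
}

* add up the observations tied at each cut point;
* this assumes the cut points are distinct, otherwise ties are double-counted
local ntied = 0
forval i = 1/9 {
    quietly count if age == `cut`i''
    local ntied = `ntied' + r(N)
}

display "observations tied at a cut point: `ntied'"
```

With age recorded in whole years, large blocks of ties at the boundaries are unavoidable, which is why neither criterion can deliver equal frequencies.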

