Statalist The Stata Listserver



RE: st: xtile and "by" question


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   RE: st: xtile and "by" question
Date   Wed, 9 May 2007 13:41:39 +0100

As it happens, there is a better way here: Uli Kohler's 
-egen, xtile()-, part of the -egenmore- package on SSC. 
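
For concreteness, a minimal sketch of that route. It assumes the
variable names used in this thread (income, date, city) and that
deciles are wanted, as in the original question. The new variable
name decile is hypothetical, and the by() and nquantiles() options
should be checked against -help egenmore- after installing.

* install -egenmore- from SSC (needed once only)
ssc install egenmore
* deciles of income, computed separately within each date-city group
egen decile = xtile(income), by(date city) nquantiles(10)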

But an approach from first principles, as in Sergio's post,
is always welcome. Nevertheless, it can be improved. 

The point I am going to make arises frequently, and has featured
more than once recently on Statalist. So, perhaps 
I should not raise it again, except that some
inefficient habits are being encouraged here 
(and elsewhere). At worst, this approach may not 
work with a very large number of groups. 

The point is the choice between 

* -foreach- and cycling over a list (including, 
possibly, some use of -levelsof-) 

and 

* -forval- and looping over an integer range. 

The advice is: Whenever you have a choice, always
go for -forval-. 

Part of the attraction of -egen, group()- 
is that the resulting groups are guaranteed 
to run from 1 upwards, as consecutive integers. 
This can, and should, be exploited. 

Now I yield to no-one in admiring -levelsof-, but
its use is overkill here, given that. 
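
A quick check of that property may be sketched as follows. The auto 
dataset and its variables foreign and rep78 are stand-ins chosen only 
for illustration; they are not part of the original problem. 

sysuse auto, clear
* group() numbers the distinct combinations 1, 2, 3, ...
egen levels = group(foreign rep78)
* r(min) and r(max) give the first and last group number
su levels, meanonly
di "groups run from " r(min) " to " r(max)
* observations missing on any of the variables get a missing
* group, unless the missing option of group() is specified
count if missing(levels)

Observations with a missing group are never selected by a condition 
such as levels == 1, so a loop over 1 to r(max) leaves them alone. 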

Also, why does -forval- exist at all, as examples
like 

foreach i of num 1/1000 { 
	...
}

show that everything -forval- can do can also be done by
-foreach-? 

The answer is efficiency. -forval- is set up to be very 
fast. The list of arguments is not constructed in total as a 
macro, or even any equivalent, so it can be as fast as 
possible, and you never hit limits on size of macro. 

Suppose that there are 5000 distinct levels in this
example. Then -levelsof- will construct a macro with
5000 elements: 

1 2 3 4 ... 5000

and -foreach- cycles over that. And so on. In some datasets, the 
macro constructed may be too big to handle. 
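
The contrast can be seen on a small scale. This is only a sketch: 
the auto dataset and the macro name lv are assumptions made for 
illustration, and with so few groups either route works. 

sysuse auto, clear
egen levels = group(foreign rep78)

* -levelsof- route: every level is written out into one macro
levelsof levels, local(lv)
di "macro holds `: word count `lv'' elements: `lv'"

* -forval- route: only the endpoint is needed, and no list is built
su levels, meanonly
forval l = 1/`r(max)' {
	di "group `l'"
}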

So, I would rewrite Sergio's example 

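* number the date-city groups as consecutive integers from 1
egen levels = group(date city)
* leave the number of groups in r(max)
su levels, meanonly 
gen output = .
forval l = 1/`r(max)' {
	* nq(10) gives deciles, as in the original question;
	* the -xtile- default is nq(2)
	xtile temp = income if levels == `l', nq(10)
	replace output = temp if levels == `l'
	drop temp
}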

Doing it that way will cut down on the number of times
you get bitten. 

Nick 
n.j.cox@durham.ac.uk 

Sergio Correia
 
> Well, it seems that you should hack around that problem.
> Something like this may help
> 
> * 1) Let's create a variable with the levels, and save them in a local macro
> egen levels = group(date city)
> levelsof levels, local(levels)
> 
> * 2) Now let's run a loop
> gen output = .
> foreach l of local levels {
> 	xtile temp=income if levels==`l'
> 	replace output = temp if output==.
> 	drop temp
> }
> 
> I tried it on some dummy data and it appeared to work.

> On 5/8/07, Edgard Alfonso Polanco Aguilar 
> <e-polanc@uniandes.edu.co> wrote:

> > I'm working with a database which consists of income by month and
> > region. I need to classify the income by percentiles by city and month
> > because there's a wide dispersion in income levels across cities, to the
> > point that the highest income in one is at the median of another. I know
> > the command of choice for this task is xtile, but Stata doesn't allow me
> > to use it with "by" like in "by city date: xtile var=income nq(10)". I'm
> > working in Stata 9 SE for Windows.



