
RE: st: xtile and "by" question

From	"Nick Cox" <[email protected]>
To	<[email protected]>
Subject	RE: st: xtile and "by" question
Date	Wed, 9 May 2007 13:41:39 +0100

As it happens, there is a better way here: Uli Kohler's 
-egen, xtile()-, which is part of -egenmore- on SSC. 
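
A minimal sketch of that route on made-up data, assuming that -egenmore- is available from SSC and that its xtile() function accepts by() and nq() options (check -help egenmore- after installing). The toy dataset and its values are hypothetical; only the variable names come from this thread.

```stata
* toy data: 4 cities x 5 dates, 10 observations per combination (hypothetical)
clear
set obs 200
set seed 12345
gen city = mod(_n, 4) + 1
gen date = mod(_n, 5) + 1
gen income = runiform()

* install the package from SSC (needs an internet connection)
ssc install egenmore, replace

* deciles of income computed separately within each date-city group
egen output = xtile(income), by(date city) nq(10)
```

No explicit loop is needed on this route, because by() handles the grouping.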

But an approach from first principles, as in Sergio's post,
is always welcome. Nevertheless, it can be improved. 

The point I am going to make arises frequently, and has featured
more than once recently on Statalist. So, perhaps 
I should not raise it again, except that some
inefficient habits are being encouraged here 
(and elsewhere). At worst, the -levelsof- approach may not 
work with a very large number of groups. 

The point is the choice between 

* -foreach- and cycling over a list (including, 
possibly, some use of -levelsof-) 

and 

* -forval- and looping over an integer range. 

The advice is: Whenever you have a choice, always
go for -forval-. 

Part of the attraction of -egen, group()- 
is that the resulting groups are guaranteed 
to run from 1 upwards, as consecutive integers. 
This can, and should, be exploited. 
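
A small sketch of that property, using a made-up dataset (the dates and city names are hypothetical). It shows that -egen, group()- numbers the distinct combinations 1, 2, 3, ... and that -summarize, meanonly- then leaves the number of groups in r(max).

```stata
* made-up data: four distinct date-city combinations in five observations
clear
input date str8 city
200701 "Lima"
200701 "Quito"
200702 "Lima"
200702 "Quito"
200702 "Quito"
end

* group() assigns consecutive integers to the distinct combinations
egen levels = group(date city)
list date city levels

* r(max) now holds the number of groups
su levels, meanonly
display r(max)
```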

Now I yield to no-one in admiring -levelsof-, but
its use is overkill here, given that guarantee. 

Also, why does -forval- exist at all, as examples
like 

foreach i of numlist 1/1000 { 
	...
}

show that everything -forval- can do can also be done by
-foreach-? 

The answer is efficiency. -forval- is set up to be very 
fast. The list of values is never built in full as a 
macro, or as anything equivalent, so the loop runs as fast as 
possible, and you never hit limits on macro size. 

Suppose that there are 5000 distinct levels in this
example. Then -levelsof- will construct a macro with
5000 elements: 

1 2 3 4 ... 5000

and -foreach- cycles over that, one element at a time. In some 
datasets, the macro constructed may be too big to handle. 
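
The contrast can be sketched on an artificial dataset with 5000 levels (the data and the macro names lv, n and count are made up for illustration). The first route has to hold every level in a macro; the second needs only the upper bound.

```stata
* artificial data: 5000 groups already numbered 1 to 5000
clear
set obs 5000
gen levels = _n

* route 1: -levelsof- stores every distinct level in a local macro
quietly levelsof levels, local(lv)
local n : word count `lv'
display "elements held in the macro: `n'"

* route 2: -forval- needs only the maximum, taken from r(max)
su levels, meanonly
local count = 0
forval l = 1/`r(max)' {
    local ++count
}
display "iterations of -forval-: `count'"
```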

So, I would rewrite Sergio's example 

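* number the date-city groups 1, 2, ... as consecutive integers
egen levels = group(date city)
* -summarize, meanonly- leaves the number of groups in r(max)
su levels, meanonly 
gen output = .
forval l = 1/`r(max)' {
	* add -, nq(10)- for deciles: -xtile- defaults to 2 quantiles
	xtile temp = income if levels == `l'
	replace output = temp if levels == `l'
	drop temp
}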
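
To try the loop on dummy data, something like the following sketch should do. The dataset is invented (3 cities, 4 dates, 25 observations per combination), and nq(10) is added to request deciles as in the original question.

```stata
* invented data: 12 date-city combinations with 25 observations each
clear
set obs 300
set seed 2007
gen city = mod(_n, 3) + 1
gen date = mod(_n, 4) + 1
gen income = exp(rnormal())

egen levels = group(date city)
su levels, meanonly
gen output = .
forval l = 1/`r(max)' {
    * deciles of income within group `l' only
    xtile temp = income if levels == `l', nq(10)
    replace output = temp if levels == `l'
    drop temp
}

* check: every observation carries a decile between 1 and 10
assert inrange(output, 1, 10)
```

The range 1/`r(max)' is expanded once, when the -forval- line is read, so it does not matter that -xtile- changes the r() results inside the loop.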

Doing it that way will cut down on the number of times
you get bitten. 

Nick 
[email protected] 

Sergio Correia

> Well, it seems that you should hack around that problem.
> Something like this may help
> 
> * 1) Let's create a variable with the levels, and save them on a local macro
> egen levels = group(date city)
> levelsof levels, local(levels)
> 
> * 2) Now let's run a loop
> gen output = .
> foreach l of local levels {
> 	xtile temp=income if levels==`l'
> 	replace output = temp if output==.
> 	drop temp
> }
> 
> I tried it on some dummy data and it appeared to work.

> On 5/8/07, Edgard Alfonso Polanco Aguilar 
> <[email protected]> wrote:

> > I'm working with a database which consists on income by month and
> > region. I need to classify the income by percentiles by city and month
> > because there's a wide dispersion on income levels across cities, to the
> > point that the highest income in one is on the median of another. I know
> > the command of choice for this task is xtile, but Stata doesn't allow me
> > to use it with "by" like in "by city date: xtile var=income nq(10)". I'm
> > working in Stata 9 SE for Windows.
