
Re: st: xtile and "by" question

From	"Sergio Correia" <[email protected]>
To	[email protected]
Subject	Re: st: xtile and "by" question
Date	Fri, 11 May 2007 17:53:32 -0400

Nick,

Thanks for the reply.
As usual, very instructive!

Sergio


On 5/9/07, Nick Cox <[email protected]> wrote:

As it happens, there is a better way here, Uli Kohler's
-egen, xtile()- as part of -egenmore- on SSC.

But an approach from first principles, as in Sergio's post,
is always welcome. Nevertheless, it can be improved.

The point I am going to make arises frequently, and has featured
more than once recently on Statalist. So, perhaps
I should not raise it again, except that some
inefficient habits are being encouraged here
(and elsewhere). At worst, this approach may not
work with a very large number of groups.

The point is the choice between

* -foreach- and cycling over a list (including,
possibly, some use of -levelsof-)

and

* -forval- and looping over an integer range.

The advice is: Whenever you have a choice, always
go for -forval-.

Part of the attraction of -egen, group()-
is that the resulting groups are guaranteed
to run from 1 upwards, as consecutive integers.
This can, and should, be exploited.
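
A minimal sketch of that property, not part of the original example: it
assumes the auto dataset shipped with Stata as stand-in data, and the
variable name g is made up for illustration. After -egen, group()- the
largest group code left in r(max) is also the number of groups, so an
integer range 1/r(max) visits every group exactly once.

```stata
* Sketch only: auto.dta stands in for the poster's data
sysuse auto, clear

* g numbers the observed foreign-rep78 combinations 1, 2, 3, ...
* (observations with missing rep78 get missing g by default)
egen g = group(foreign rep78)

* -summarize, meanonly- leaves the largest group code in r(max)
su g, meanonly
display "number of groups = " r(max)

* every integer from 1 to r(max) should appear in this table
tab g
```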

Now I yield to no-one in admiring -levelsof-, but
its use is overkill here, given that.

Also, why does -forval- exist at all, as examples
like

foreach i of num 1/1000 {
        ...
}

show that everything -forval- can do can also be done by
-foreach-?

The answer is efficiency. -forval- is set up to be very
fast. The list of arguments is not constructed in total as a
macro, or even any equivalent, so it can be as fast as
possible, and you never hit limits on size of macro.

Suppose that there are 5000 distinct levels in this
example. Then -levelsof- will construct a macro with
5000 elements:

1 2 3 4 ... 5000

and -foreach- then cycles over that list, one element at a time. In some
datasets, the macro constructed may be too big to handle.
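
To make the contrast concrete, here is a small sketch of the two loop
styles side by side. It again assumes auto.dta as stand-in data; the
names g, lv, G, n1 and n2 are illustrative only. Both loops visit the
same groups, but only the first has to hold every level in a macro.

```stata
* Sketch only: auto.dta stands in for the poster's data
sysuse auto, clear
egen g = group(foreign rep78)

* List-based: -levelsof- stores every distinct level in local macro lv
levelsof g, local(lv)
local n1 = 0
foreach l of local lv {
        local ++n1
}

* Range-based: only the upper limit is stored, never the full list
su g, meanonly
local G = r(max)
local n2 = 0
forval l = 1/`G' {
        local ++n2
}

display "foreach iterations: `n1'   forval iterations: `n2'"
```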

So, I would rewrite Sergio's example

egen levels = group(date city)
su levels, meanonly
gen output = .
forval l = 1/`r(max)' {
        xtile temp=income if levels==`l'
        replace output = temp if levels==`l'
        drop temp
}

Doing it that way will cut down on the number of times
you get bitten.
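
For completeness, a self-contained sketch of the same loop applied to
the original question (deciles within each city and date). The data are
simulated purely for illustration: the sample size, the seed, and the
variable names other than income, city and date are assumptions, and
rnormal() needs a more recent Stata than version 9. The nq(10) option
is added because -xtile- otherwise defaults to two categories.

```stata
* Sketch only: simulated data, 3 cities x 4 dates, 50 obs per cell
clear
set seed 12345
set obs 600
gen city = 1 + mod(_n, 3)
gen date = 1 + mod(floor((_n - 1)/3), 4)
gen income = exp(rnormal(10 + city, 1))

* groups run 1, 2, ..., G as consecutive integers
egen levels = group(date city)
su levels, meanonly
local G = r(max)

gen decile = .
forval l = 1/`G' {
        * deciles computed within the current group only
        xtile temp = income if levels == `l', nq(10)
        quietly replace decile = temp if levels == `l'
        drop temp
}

* each row (group) should be spread across deciles 1 to 10
tab levels decile
```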

Nick
[email protected]

Sergio Correia

> Well, it seems that you should hack around that problem.
> Something like this may help
>
> * 1) Let's create a variable with the levels, and save them in a local macro
> egen levels = group(date city)
> levelsof levels, local(levels)
>
> * 2) Now let's run a loop
> gen output = .
> foreach l of local levels {
>       xtile temp=income if levels==`l'
>       replace output = temp if output==.
>       drop temp
> }
>
> I tried it on some dummy data and it appeared to work.

> On 5/8/07, Edgard Alfonso Polanco Aguilar
> <[email protected]> wrote:

> > I'm working with a database which consists of income by month and
> > region. I need to classify the income by percentiles by city and
> > month because there's a wide dispersion in income levels across
> > cities, to the point that the highest income in one is at the median
> > of another. I know the command of choice for this task is xtile, but
> > Stata doesn't allow me to use it with "by" like in
> > "by city date: xtile var=income nq(10)". I'm working in Stata 9 SE
> > for Windows.

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


