
st: RE: RE: RE: pctile and xtile question again

From   "Nick Cox" <[email protected]>
To   <[email protected]>
Subject   st: RE: RE: RE: pctile and xtile question again
Date   Fri, 18 Jan 2008 19:21:04 -0000

I am happy if my code was helpful or instructive, but I don't see much
connection between your problem as stated and the code.

First, let's be clear on terminology. A quantile is a particular value x
on a variable X which has an associated probability pr(X <= x). Any such
value lies within any number of categories based on the ordered data.
Thus on some variable we might find that Foobar Corp is at the 70%
point, meaning 70% of values are less than Foobar's and 30% are greater.
But that is within any number of categories, e.g. those based on 68%-72%
or 66%-74% or 64%-76%, to mention only some centred on 70%. This
arbitrariness is what worries me most, before the problem is made
bivariate or multivariate by combining categories for different

It would seem to me more direct to model Foobar together with other
firms and assess Foobar in terms of its residual given a model for all
those firms. Exercising economic judgement on what firms are
(qualitatively) comparable then would seem essential as well as

(I did study economics fairly intensively in my youth.) 
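
A minimal sketch of the residual idea, assuming the auto data as a
stand-in for a set of comparable firms. The choice of -price- as
response and -weight- and -mpg- as predictors is illustrative only, and
the name resid is made up here.

. sysuse auto, clear
. regress price weight mpg
. predict double resid, residuals
. list make price resid in 1

The residual for one observation then measures how far it sits above or
below what the model predicts, given all the others.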

[email protected] 

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Rajesh
Sent: 17 January 2008 23:42
To: [email protected]
Subject: st: RE: RE: pctile and xtile question again

Thanks very much for the suggestion, Nick. That is very elegant and
straightforward. I will remember to explain user written commands in the

As for why I am doing it: in finance this sort of analysis is quite
common. One common application is to assess the performance of a
company's shares by comparing it with the performance of a
portfolio of shares of companies which fall in the same size quantile as
that company during that period (assuming size is the main
factor determining performance). If you believed there were two
factors which are important, then you could create quantiles based on
two variables. Of course, after three variables things become quite
complicated and one runs out of companies. So you are clearly right,
there are other (better) ways of doing this.

Thank you very much indeed

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Nick Cox
Sent: 17 January 2008 17:45
To: [email protected]
Subject: st: RE: pctile and xtile question again

I have comments on two levels. 

First, on how to do this. As always, it is easiest for list members to
see code in terms of datasets everyone can use. 

Your first bit seems rather indirect. I would use -centile- instead.
Individual percentiles are left behind in memory as r class results by 
-centile-. Thus you need not put them into a variable and then take them
out again, or create any variables you only need for one purpose. 

. sysuse auto 
. centile weight, centile(70) 
. gen byte weight_group = weight > r(c_1) if weight < . 

Then you can proceed directly to something like 

. egen mpg_group = xtile(mpg), by(weight_group) nq(3) 
. egen both_group = group(mpg_group weight_group) label 

Remember the request to explain where non-official commands you use come
from. Thus -egen, xtile()- is a user-written function (by Ulrich Kohler)
in the -egenmore- package on SSC. 
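
If -egenmore- is not yet installed, it can be fetched from SSC with the
official -ssc- command:

. ssc install egenmore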

Extending this to two percentiles: 

. capture drop weight_group mpg_group both_group
. centile weight, centile(30 70) 
. gen byte weight_group = cond(weight < r(c_1), 1, cond(weight < r(c_2), 2, 3)) if weight < . 

and you can proceed as before

. egen mpg_group = xtile(mpg), by(weight_group) nq(3) 
. egen both_group = group(mpg_group weight_group) label

Note that in the auto dataset there are not in fact any missing values
of -weight-, but excluding them explicitly is going to be the right
thing in most problems, and at worst does nothing. In fact, with two
variables, a double restriction 

... if weight < . & mpg < . 

is usually going to be the right thing; if nothing is missing, it will
not bite. 
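
Spelled out as a sketch with the same auto data and the same made-up
variable name, the double restriction might look like this. Stata
treats numeric missing as larger than any number, so without the -if-
qualifier any observation with missing -weight- would fail both
comparisons inside -cond()- and land in group 3.

. sysuse auto, clear
. centile weight if mpg < ., centile(30 70)
. gen byte weight_group = cond(weight < r(c_1), 1, cond(weight < r(c_2), 2, 3)) if weight < . & mpg < .
. tabulate weight_group, missing

The -missing- option on -tabulate- makes any excluded observations
visible as their own row.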

Second, on why you are doing this. It may be impertinent, but I am
curious. Under what circumstances must you do precisely this?
Categorisation by quantiles throws away data. Seemingly arbitrary
quantiles or numbers of quantiles do that capriciously. When is this the
right thing to do in any data analysis? 

[email protected] 
