Statalist



Re: st: How to calculate 75 percentile of other individuals on the same


From   n j cox <n.j.cox@durham.ac.uk>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: How to calculate 75 percentile of other individuals on the same
Date   Tue, 02 Oct 2007 19:47:46 +0100

Note that the general issue is also discussed at

How do I create variables summarizing for each individual properties of the other members of a group?
http://www.stata.com/support/faqs/data/members.html

Apart from sums and means -- when we can use short-cuts based
on some rearrangement of, or implication of,

sum for everyone = sum for others + value for this individual

-- this kind of problem usually requires a loop. In the FAQ
just cited, it is shown that you can do it by looping
over within-group identifiers, rather than the whole
dataset.
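
For the sum and mean case, the identity above can be sketched as follows.
This is only a minimal illustration, not the code from the FAQ: it uses the
variable names from the example data quoted below (ID, Group, X), the names
total_X, n_X and mean_others are made up for the illustration, and it
assumes no missing values of X and at least two observations per group.

```stata
* Sketch: mean of X over the OTHER members of each group, without a loop.
* Assumes X is never missing and every group has at least two members.
clear
input ID str1 Group X
1 a 5
2 a 7
3 a 9
4 a 8
5 b 3
6 b 4
7 b 9
end

* group total and group size, repeated on every observation of the group
bysort Group: egen double total_X = total(X)
by Group: egen n_X = count(X)

* sum for others = sum for everyone - value for this individual
generate double mean_others = (total_X - X) / (n_X - 1)

list, sepby(Group) noobs
```

No such rearrangement is available for percentiles, which is why a loop
of some kind is needed there.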

However, the trade-offs are not very clear to me.

-_pctile- is built in, while any call to -egen- involves
an interpretative overhead. On the other hand, -_pctile-
can only emit one 75th percentile at a time, and -egen-
with -by()- can calculate several at a time by side-stepping
-_pctile-. The precise trade-offs would probably depend on the size
of the dataset and the number of groups.

No doubt you could also speed it up using Mata or writing
more direct code.

Nick
n.j.cox@durham.ac.uk

Quang Nguyen asked

A simplified version of my data looks as follows:

ID Group X
1 a 5
2 a 7
3 a 9
4 a 8
5 b 3
6 b 4
7 b 9
..........................

I would like to generate a new variable whose value is the 75th percentile of
other individuals in the same group as the concerned individual. For
example, for the first individual (ID=1), this will be: 75th percentile
of {7, 9, 8}.

and Joseph Coveney replied

-findit percentile- turns up a lot to pore over. But among the results
is -egen <varname> = pctile(exp), p(#)-, which can take a -by- varlist.

Try something like:
bysort Group: egen p75 = pctile(X), p(75)

To finish: an observation is going to lie beneath, above or on a given
percentile for its group, so there's a smarter (more efficient)
algorithm, but a brute-force approach is shown below.

clear *
set more off
set seed `=date("2007-09-29", "YMD")'
set obs 100
generate byte pid = _n
generate byte group = mod(_n, 10)
generate double response = uniform()
*
* Begin here
*
tempvar tmpvar0 tmpvar1
sort group
generate double p75 = .
generate double `tmpvar0' = .
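quietly forvalues i = 1/`=_N' {
    * copy response for every observation except the i-th
    replace `tmpvar0' = response if _n != `i'
    * 75th percentile within each group, ignoring observation i
    by group: egen double `tmpvar1' = pctile(`tmpvar0'), p(75)
    * keep only the result for observation i
    replace p75 = `tmpvar1' in `i'
    drop `tmpvar1'
    replace `tmpvar0' = .
}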
drop `tmpvar0'
list in 1/20, noobs sepby(group)
exit

Although my suggestion was centered around -egen-, which is very often a
convenience, you can usually do things more efficiently. For example, in
this case, -_pctile if . . ., percentiles(75)- and then
-replace p75 = r() in . . . - would avoid the redundancy of
-by . . .: egen . . . pctile()-, where all of the other groups' results are
calculated and discarded each time. There are other ways to polish the
suggestion, too, and the difference would be noticeable with large datasets
and many groups.
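
A minimal sketch of the -_pctile- route just described might look like the
following. It is an illustration under assumptions, not tested on large
data: it uses the small example dataset from the original question, the
name p75_others is made up here, r(r1) is where -_pctile- leaves the first
requested percentile, and groups with only one member are not handled.

```stata
* Sketch: 75th percentile of X among the other members of the same group,
* one -_pctile- call per observation.
* Assumes every group has at least two members.
clear
input ID str1 Group X
1 a 5
2 a 7
3 a 9
4 a 8
5 b 3
6 b 4
7 b 9
end

generate double p75_others = .
quietly forvalues i = 1/`=_N' {
    * same group as observation i, but excluding observation i itself
    _pctile X if Group == Group[`i'] & _n != `i', percentiles(75)
    * -_pctile- returns the requested percentile in r(r1)
    replace p75_others = r(r1) in `i'
}

list, sepby(Group) noobs
```

Each pass touches only one group's observations in the calculation, so
nothing is computed for the other groups and then thrown away.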

*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/
