
st: RE: RE: cumulative distribution function (assigned values)


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: RE: RE: cumulative distribution function (assigned values)
Date   Tue, 3 Jun 2003 16:58:57 +0100

Manuel Kast
  
> > how can I generate in Stata 8.0 a new variable which 
> > contains the values 
> > assigned by the observed cumulative distribution of one 
> > variable? In other 
> > words, I would like to get those values stored in a new 
> > variable, that are 
> > used by the command "cdf varname" to plot the sample cumulative 
> > distribution function of varname.
> > I don't think the command "cumul varname" will work for 
> > my case, since 
> > my variable contains several observations with the same 
> > values, but "cumul 
> > varname" assigns different values to these, depending 
> > on how they were 
> > initially ordered.

Nick Cox 

> sort varname 
> gen cumul = sum(varname < .) 
> by varname: replace cumul = cumul[_N] 
> replace cumul = cumul / cumul[_N] 

Here is another way to do it, exploiting 
the fact that ranking and calculation 
of cumulative probabilities are sibling 
problems. (The messy small details arise 
from ways of handling ties.) 

The FAQ at 
http://www.stata.com/support/faqs/stat/pcrank.html
explores other connections. 

First the code: 

egen cumul = rank(varname), field 
egen n = count(varname) 
replace cumul = (n + 1 - cumul) / n 

A way to see this is that for cumulative 
probabilities we want a "rank" that looks 
like this: 

data "rank" 
1     1 
2     2
3     5
3     5 
3     5  
4     6 
5     7 

Here "rank"(x) is the number of observations 
with values <= x. This doesn't look like any 
of the usual ranks, until you compare it with 
the field rank, i.e. the ranking if this were 
a field event in which the highest value wins: 

data  "rank"  field rank 
1     1       7
2     2       6
3     5       3
3     5       3
3     5       3
4     6       2
5     7       1

from which it is clear that 

"rank" + field rank = n + 1 

and the rest is immediate. 
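
As a check on the identity, here is a minimal sketch that enters 
the seven toy values from the tables above and compares 
n + 1 - field rank with a brute-force count of observations <= x. 
The variable names (x, field, cumrank, direct) are made up for 
illustration only, and the loop is there purely as a cross-check, 
not as something to use on real data. 

```stata
* toy data from the tables above; all variable names are hypothetical
clear
input x
1
2
3
3
3
4
5
end

* field rank: highest value gets rank 1, ties share a rank
egen field = rank(x), field
* number of nonmissing values
egen n = count(x)
* "rank" = number of observations <= x
gen cumrank = n + 1 - field
* cumulative probability
gen cumul = cumrank / n

* brute-force count of observations <= x, for comparison only
gen direct = .
local N = _N
forvalues i = 1/`N' {
    quietly count if x <= x[`i']
    quietly replace direct = r(N) in `i'
}
assert cumrank == direct

list x cumrank field cumul
```

If the -assert- passes silently, the two calculations agree 
on every observation, tied values included. 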

The advantage of an -egen- approach 
here is that it easily takes 
care of any or all of 

	-if- or -in- restrictions 
	doing it -by:- 
	missing values 
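
To make that concrete, here is a sketch using the auto data 
shipped with Stata (loaded with -sysuse-). The choice of 
rep78, foreign, and the price restriction is arbitrary and 
only for illustration: rep78 has ties and missing values, 
foreign defines the groups, and the -if- condition stands in 
for whatever restriction you need. 

```stata
sysuse auto, clear

* the same -if- and -by:- go on both -egen- calls
bysort foreign: egen field = rank(rep78) if price < 10000, field
bysort foreign: egen n = count(rep78) if price < 10000

* missing rep78 or excluded observations end up missing here
gen cumul = (n + 1 - field) / n

* sanity checks: probabilities lie in (0, 1] wherever defined
assert cumul > 0 & cumul <= 1 if cumul < .
assert cumul == . if rep78 == .

sort foreign rep78
list foreign rep78 cumul if cumul < .
```

The point to watch is that the restriction and the -by:- 
prefix must be identical on the rank and on the count, or 
the denominator will not match the numerator. 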

Having said all that, there should, in my 
view, be an option (say -equal-) to -cumul- 
which ensures that equal values get 
equal probabilities (frequencies) assigned. 
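
For readers on a later release: as far as I know, -cumul- 
has since acquired an -equal- option that does this. 
The sketch below assumes your version supports it (check 
-help cumul- first) and sets its result beside the -egen- 
calculation so the two can be compared. The auto data and 
the variable names c_equal and c_egen are for illustration 
only. 

```stata
sysuse auto, clear

* assumes -cumul- in your Stata supports the -equal- option
cumul rep78, generate(c_equal) equal

* the -egen- route, for comparison
egen field = rank(rep78), field
egen n = count(rep78)
gen c_egen = (n + 1 - field) / n

sort rep78
list rep78 c_equal c_egen if rep78 < .
```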

Nick 
n.j.cox@durham.ac.uk 


