# st: RE: RE: cumulative distribution function (assigned values)

From: "Nick Cox"
Subject: st: RE: RE: cumulative distribution function (assigned values)
Date: Tue, 3 Jun 2003 16:58:57 +0100

Manuel Kast

> > How can I generate in Stata 8.0 a new variable which contains the
> > values assigned by the observed cumulative distribution of one
> > variable? In other words, I would like to get those values stored
> > in a new variable that are used by the command "cdf varname" to
> > plot the sample cumulative distribution function of varname.
> > I don't think the command "cumul varname" will work for my case,
> > since my variable contains several observations with the same
> > values, but "cumul varname" assigns different values to these,
> > depending on how they were initially ordered.

Nick Cox

> sort varname
> gen cumul = sum(varname < .)
> by varname: replace cumul = cumul[_N]
> replace cumul = cumul / cumul[_N]

Here is another way to do it, exploiting
the fact that ranking and calculation
of cumulative probabilities are sibling
problems. (The messy small details arise
from ways of handling ties.)

The FAQ at
http://www.stata.com/support/faqs/stat/pcrank.html
explores other connections.

First the code:

egen cumul = rank(varname), field
egen n = count(varname)
replace cumul = (n + 1 - cumul) / n

A way to see this is that for cumulative
probabilities we want a "rank" that looks
like this:

data  "rank"
1     1
2     2
3     5
3     5
3     5
4     6
5     7

Here "rank"(x) is the number of values <= x. This
doesn't look like any of the usual ranks,
until you compare it with the field rank,
i.e. the ranking if this were a field event
in which highest value wins:

data  "rank"   field rank
1     1        7
2     2        6
3     5        3
3     5        3
3     5        3
4     6        2
5     7        1

from which it is clear that

"rank" + field rank = n + 1

and the rest is immediate.
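
To check the identity on the toy data above, here is a minimal
sketch. The variable names x, frank, n and cumul are arbitrary
choices made for this illustration only.

```stata
* Toy data from the table above
clear
input x
1
2
3
3
3
4
5
end

* Field rank: 1 + number of values strictly higher; ties share a rank
egen frank = rank(x), field

* Number of nonmissing values
egen n = count(x)

* n + 1 - field rank is the number of values <= x; divide by n
* to get the cumulative proportion
gen cumul = (n + 1 - frank) / n

list x frank cumul, sepby(x)
```

The three tied values of 3 should each get 5/7, and the largest
value should get 1.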

The advantage of an -egen- approach
here is that it easily takes
care of any or all of

-if- or -in- restrictions
doing it -by:-
missing values
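
As a sketch of how these combine, suppose there is a grouping
variable and a condition selecting observations. Both group and
keep are hypothetical names, not part of the original question.

```stata
* group and keep are hypothetical variables, for illustration only

* Field rank within each group, restricted to the selected observations
bysort group: egen frank = rank(varname) if keep == 1, field

* Count of nonmissing values within each group, same restriction
bysort group: egen n = count(varname) if keep == 1

* Cumulative proportion within each group
gen cumul = (n + 1 - frank) / n
```

Observations that are excluded by the -if- condition, or that are
missing on varname, should end up with missing cumul.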

Having said all that, there should, in my
view, be an option (say -equal-) to -cumul-
which ensures that equal values get
equal probabilities (frequencies) assigned.

Nick
n.j.cox@durham.ac.uk