Statalist


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Generate Random Variable Given Percentile Values


From   wgould@stata.com (William Gould, StataCorp LP)
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Generate Random Variable Given Percentile Values
Date   Tue, 17 Feb 2009 13:27:44 -0600

Peter Graven <grave165@umn.edu> asks, 

> How do you create a random variable given just the percentile
> statistics for a distribution? For example the 1%, 5%, 10%, 25%, 50%,
> 75,%, 90%, 95%, 99% percentiles are, respectively, 795, 1834, 2172,
> 2641, 3174, 3822, 4900, 5981, 8522. That is, I would like to create a
> variable with this distribution in my dataset.

What a great question!  Stata really should have a function that will 
do this for you, but it doesn't.  So we'll have to write our own.

First, let's figure out how we are going to proceed.  Forgive me for 
being so didactic;I'm talking to myself.  We have, 

    q   .01   .05   .10   .25   .50   .75   .90  .95   .99
    ------------------------------------------------------
    y   795  1834  2172  2641  3174  3822  4900 4981  8522

Let's imagining pulling a value u at random from uniform().  If that 
value were .01, or .05, or ..., we know what our funtion should 
return:  795, or 1834, or ...

What if we get a value between two of the predefined quantiles?  
Well, we could assume the values that y can take on are fixed and 
those values are 695, 1834, ..., but looking at those values, I 
bet y is continuous, so let's linearly interpolate.

That leaves on more problem.  What if we u=.002 or u=.995?  Peter 
didn't tell us the range of y, but we'll need to know it.  I'm going 
to pretend the table reads

    q   0  .01   .05   .10   .25   .50   .75   .90  .95   .99      1
    ----------------------------------------------------------------
    y   0  795  1834  2172  2641  3174  3822  4900 4981  8522  10000

So I'm asserting that y is bounded by [0, 10000).  Peter can substitute 
more reasonable end points.

Okay, I know what I need to code.  We could write this as a Stata program --
an ado-file -- but the result would run slowly and be difficult to read.  I'm
going to use Mata.  Here's my program:

------------------------------------------------ myprog.do -------------------
capture mata mata drop u_empirical()
mata:
real colvector u_empirical(real scalar n, real vector q, real vector y)
{
        real scalar     i, k
        real scalar     u, a
        real colvector  result

        if (length(q)!=length(y)) _error(3200)
        if (length(q)<2) _error(3300)

        result = J(n, 1, .)
        for (i=1; i<=n; i++) {
                u = runiform(1,1)
                if (u==0) result[i] = y[1]
                else {
                        for (k=2; 1; k++) {
                                if (q[k] > u) break
                        }
                        a = (u-q[k-1])/(q[k]-q[k-1])
                        result[i] = (1-a)*y[k-1] + a*y[k]
                }
        }
        return(result)
}
end
------------------------------------------------------------------------------

Let's try it, 

        . do myprog 
        (output omitted)

        . mata

        : q = (   0,  .1,  .5,  .1, .25,  .5, .75, .90, .95, .99, 1)

        : y = (   0, 795,1834,2172,2641,3174,3822,4900,5981,8522,10000)

        : random = u_empirical(5, q, y)
                         1
            +---------------+
          1 |  891.0661438  |
          2 |  3545.227971  |
          3 |  3323.821994  |
          4 |  3445.628493  |
          5 |  3651.384132  |
            +---------------+

        : end

I think that's right.  Actually, I'm pretty sure it's right because,
behind the scenes, I included the line -u, result[i]- at the end of the
outside for loop so that I could see the value of u chosen and verify that
that the program calculated the correct result.

Doubtlessly Peter wants these values a Stata dataset:

        . drop _all

        . set obs 50 
        obs was 0, now 50

        . gen x = . 
        (50 missing values generated)

	. set seed 20399

	. mata

        : st_view(myx=., ., "x")

        : myx[.] = u_empirical(50, q, y)

	: end

	. list in 1/5

             +----------+
             |        x |
             |----------|
          1. | 3954.497 |
          2. | 1015.753 |
          3. | 7914.299 |
          4. | 1564.397 |
          5. | 995.6785 |
             +----------+

-- Bill
wgould@stata.com
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index