Re: st: Generate Random Variable Given Percentile Values

 From wgould@stata.com (William Gould, StataCorp LP) To statalist@hsphsun2.harvard.edu Subject Re: st: Generate Random Variable Given Percentile Values Date Tue, 17 Feb 2009 13:27:44 -0600

```Peter Graven <grave165@umn.edu> asks,

> How do you create a random variable given just the percentile
> statistics for a distribution? For example the 1%, 5%, 10%, 25%, 50%,
> 75,%, 90%, 95%, 99% percentiles are, respectively, 795, 1834, 2172,
> 2641, 3174, 3822, 4900, 5981, 8522. That is, I would like to create a
> variable with this distribution in my dataset.

What a great question!  Stata really should have a function that will
do this for you, but it doesn't.  So we'll have to write our own.

First, let's figure out how we are going to proceed.  Forgive me for
being so didactic;I'm talking to myself.  We have,

q   .01   .05   .10   .25   .50   .75   .90  .95   .99
------------------------------------------------------
y   795  1834  2172  2641  3174  3822  4900 4981  8522

Let's imagining pulling a value u at random from uniform().  If that
value were .01, or .05, or ..., we know what our funtion should
return:  795, or 1834, or ...

What if we get a value between two of the predefined quantiles?
Well, we could assume the values that y can take on are fixed and
those values are 695, 1834, ..., but looking at those values, I
bet y is continuous, so let's linearly interpolate.

That leaves on more problem.  What if we u=.002 or u=.995?  Peter
didn't tell us the range of y, but we'll need to know it.  I'm going

q   0  .01   .05   .10   .25   .50   .75   .90  .95   .99      1
----------------------------------------------------------------
y   0  795  1834  2172  2641  3174  3822  4900 4981  8522  10000

So I'm asserting that y is bounded by [0, 10000).  Peter can substitute
more reasonable end points.

Okay, I know what I need to code.  We could write this as a Stata program --
an ado-file -- but the result would run slowly and be difficult to read.  I'm
going to use Mata.  Here's my program:

------------------------------------------------ myprog.do -------------------
capture mata mata drop u_empirical()
mata:
real colvector u_empirical(real scalar n, real vector q, real vector y)
{
real scalar     i, k
real scalar     u, a
real colvector  result

if (length(q)!=length(y)) _error(3200)
if (length(q)<2) _error(3300)

result = J(n, 1, .)
for (i=1; i<=n; i++) {
u = runiform(1,1)
if (u==0) result[i] = y[1]
else {
for (k=2; 1; k++) {
if (q[k] > u) break
}
a = (u-q[k-1])/(q[k]-q[k-1])
result[i] = (1-a)*y[k-1] + a*y[k]
}
}
return(result)
}
end
------------------------------------------------------------------------------

Let's try it,

. do myprog
(output omitted)

. mata

: q = (   0,  .1,  .5,  .1, .25,  .5, .75, .90, .95, .99, 1)

: y = (   0, 795,1834,2172,2641,3174,3822,4900,5981,8522,10000)

: random = u_empirical(5, q, y)
1
+---------------+
1 |  891.0661438  |
2 |  3545.227971  |
3 |  3323.821994  |
4 |  3445.628493  |
5 |  3651.384132  |
+---------------+

: end

I think that's right.  Actually, I'm pretty sure it's right because,
behind the scenes, I included the line -u, result[i]- at the end of the
outside for loop so that I could see the value of u chosen and verify that
that the program calculated the correct result.

Doubtlessly Peter wants these values a Stata dataset:

. drop _all

. set obs 50
obs was 0, now 50

. gen x = .
(50 missing values generated)

. set seed 20399

. mata

: st_view(myx=., ., "x")

: myx[.] = u_empirical(50, q, y)

: end

. list in 1/5

+----------+
|        x |
|----------|
1. | 3954.497 |
2. | 1015.753 |
3. | 7914.299 |
4. | 1564.397 |
5. | 995.6785 |
+----------+

-- Bill
wgould@stata.com
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
```