# Random number routines [was: st: Simple Question - Use of rndbin]

 From n j cox To statalist@hsphsun2.harvard.edu, pr@ctu.mrc.ac.uk Subject Random number routines [was: st: Simple Question - Use of rndbin] Date Fri, 21 Sep 2007 16:16:39 +0100

This is long. It starts off grumpy but ends with positive
solutions and suggestions. People interested in generating random
numbers will, with probability almost 1, find something
interesting or useful here.

Tiago Pereira wrote
===================

I am writing a do file that uses -rndbin- to generate a second
variable, xb. As you all know -rndbin- runs as:

rndbin obs prob numb

However, the do file repeats that task X times and, in some cases,
_n is greater than obs. Hence, for example, the following error
shows up:

obs must be between 32 and 582539
r(198);

Indeed, in one out of X runs of the loop, the local `observations'
assumes value of 19. The problem is that _n=32 in the dataset.

Thence,

rndbin 19 0.05 1

obs must be between 32 and 582539
r(198);

So, I would like to know if there is a way to solve this
problem... say... run -rndbin- in a way that the variable xb is
not created.

"Simple questions"
==================

I'd advise strongly against labelling something a "simple
question". It doesn't usually make posts more appealing.

* You can't know that your question really is simple until it's

* For many Statalisters, the likely response is going to be
"Well, if it's simple, someone else will answer this" or "If it's
simple, answer it yourself". Such responses will be private
thoughts, followed by deletion of your post. You decrease interest
in your post, as many Statalisters filter on titles.

* In this case, the question isn't simple at all, or at least not
especially simple to explain.

Tiago's troubles with -rndbin-
==============================

-rndbin-, contrary to statement, is unlikely to be familiar to all
Statalisters.

"Say what command(s) you are using. If they are not part of
official Stata, say where they come from: the STB/SJ, SSC, or
other archives."

-rndbin- is a program written by Joe Hilbe, published in the -rnd-
package (as we now say) in STB-28 in 1995 and revised in STB-41 in
1998 and also accessible from SSC, as -findit rndbin- will show.
W. Linde-Zwirble is also credited as joint author of this package.
Joe is, at least intermittently, a member of Statalist, but my
Statascope or iStata indicates that he may be travelling, en route
to the Italian Stata Users Group meeting next week.

Be that as it may, Tiago's issue is getting -rndbin- to do what
is wanted in a .do file. I am not very clear on what Tiago is
trying to do, in the absence of code. Again, there is advice in
the Statalist FAQ, which is probably the most quoted part of that
FAQ:

"Say exactly what you typed and exactly what Stata typed (or did)
in response. N.B. exactly!"

The local -observations- must be Tiago's, as it does not appear in
-rndbin-. For mentions of _n, understand _N.

However, the code of -rndbin- is accessible to me, so let me work
from that towards the question. Here it is, so anyone interested
can see what we are talking about:

*!version 1.2 1999 Joseph Hilbe
* version 1.1.1 1993 Joseph Hilbe rev:7-7-95 (sg44: STB-28)
* binomial distribution random number generator
* Example: rndbin 10000 .5 1 [set obs 10000; p = 0.5; n = 1]

program define rndbin
version 4.0
set type double
cap drop xb
qui {
local cases `1'
set obs `cases'
mac shift
local pp `1'
mac shift
local n `1'
if `pp' < 0.5 { local p = `pp' }
else { local p = 1.0 - `pp' }
local am = `n'*`p'
noi di in gr "( Generating " _c
if `n' < 25 {
tempvar bn1 ran1
gen `bn1' = 0
gen `ran1' = uniform()
local j=0
while `j' < `n' {
replace `bn1' = `bn1' + 1 if (`ran1' < `p')
local j = `j' + 1
replace `ran1' = uniform()
noi di in gr "." _c
}
}
else if `am' < 1.0 {
local g = exp(-`am')
tempvar t em ds sum1 ran1 bn1
gen `t' = 1.0
gen `em' = -1
gen `ran1'=uniform()
gen `ds' = 1
egen `sum1' = sum(`ds')
while `sum1' > 0 {
replace `em' = `em' + 1 if (`ds'==1)
replace `t' = `t' * `ran1' if (`ds'==1)
replace `ds' = 0 if (`g' > `t')
replace `ran1' = uniform()
drop `sum1'
egen `sum1' = sum(`ds')
}
gen `bn1' = `em'
replace `bn1' = `n' if (`em' > `n')
}
else {
tempvar ran1 ran2 ds ts sum1 e y em bn1
local en = `n'
local oldg = lngamma(`en'+1.0)
local pc=1.0-`p'
local plog = log(`p')
local pclog = log(`pc')
local sq = sqrt(2.0*`am'*`pc')
gen `em' = -1
gen `e' = -1
gen `ran1' = uniform()
gen `ran2' = uniform()
gen `ds' = 1
gen `ts' = 1
gen `y' = -1
egen `sum1' = sum(`ds')
while `sum1' > 0 {
replace `y' = sin(_pi*`ran1')/cos(_pi*`ran1')
replace `em' = `sq'*`y' + `am' if (`ds'==1)
replace `ts' =0 if (((0>`em') | (`em' >=(`en'+1.0))) & (`ds'==1))
#delimit ;
replace `e' = 1.2*`sq'*(1.0+(`y'*`y'))*exp(`oldg'-lngamma(`em'+1.0)
-lngamma(`en'-`em'+1.0) + (`em' *`plog')+(`en'-`em')*`pclog') if
(( `ds'==1) & (`ts'==1));
#delimit cr
replace `ds'=0 if ((`ran2'<`e') & (`ds'==1) & (`ts'==1))
replace `ran1' = uniform()
replace `ran2' = uniform()
replace `e'=-1
replace `ts' = 1
drop `sum1'
egen `sum1' = sum(`ds')
noi di in gr "." _c
}
gen `bn1' = int(`em'+.5)
}
replace `bn1' = `n'-`bn1' if `p' ~= `pp'
gen xb = `bn1'
noi di in gr " )"
noi di in bl "Variable " in ye "xb " in bl "created."
lab var xb "Binomial random variable"
set type float
}
end
----------------------------------

Some key facts:

1. -rndbin- produces a variable -xb- which is a binomial random
variable. That's wired in. If you have an existing variable -xb-,
it will get -drop-ped. I can't see that behaviour being explained
in the help. (Also, if you had say an existing variable -xbin- and
no other -xb*-, that would get -drop-ped, unless you had -set
varabbrev off-.) One of Tiago's questions is whether this can be
changed, and the answer is: No, not without rewriting -rndbin-
(which is not ours to rewrite). (You could clone and change, but
the end of this story is that you don't need to do that.)

2. -rndbin- tacitly assumes that you have -set type float- and
temporarily issues -set type double-. There are various small
issues here:

* If you had earlier -set type double-, or anything else other
than -float-, -rndbin- would override that earlier setting on
exit. In practice, this is unlikely (but certainly not
impossible).

* If -rndbin- exits prematurely, then (depending on when it
exits) you might well be working under -set type double- from then
on in your session. It is very likely to exit prematurely if you
misunderstand its expectations.

(Some readers may want more explanation here. -set type- sets the
default type for new variables. By default this is -float-. Most
users probably never touch this, nor should they. The best
practice is usually to indicate, explicitly in individual
commands, that you want a datatype other than -float-, as in -gen
double foo = <whatever>-.)

3. -rndbin- early on issues

set obs `cases'

where `cases' is the first argument you fed it. Consider some
possibilities:

* You specify `cases' to be the existing number of
observations. -set obs- changes nothing. No problem.

* You specify `cases' larger than the number of
observations. -set obs- increases it. No problem. (The first
possibility is a special case of this one.)

* You specify `cases' smaller than the number of
observations. -set obs- refuses, and crashes -rndbin-. This is
what is biting Tiago.

An immediate solution to this problem is: Don't do that then! If
you want _fewer_ binomial random numbers than your number of
observations, just -generate- as many as the number of
observations, then ignore what you don't want after running
-rndbin-, using -if- or -in- to select.

However, I've started, so I'll finish. We are all on learning
curves. The Stata code I was writing in 1995 might look pretty
lousy to me now if I looked at it again. As a matter of style or
taste I suggest -- in 2007, clearly --

1'. A program should let you specify your own variable name and
never -drop- an existing variable, unless that is part of its
purpose and that is explicitly documented.

2',3'. A program should not mess with settings except temporarily,
unless (again) that is part of its purpose and that is explicitly
documented.

Alternatives to -rndbin-
========================

In any case, a program is barely necessary if there is a direct
alternative. I will now show that there are currently at least two
direct ways of doing it.

I started reading -rndbin- until it got complicated. I guess the
main complications in the algorithm are not to do things the
direct way when in principle there is a faster way. As the
complication involves a lot more code, which also includes much
looping, so there is extra interpretative overhead, it is not
obvious to me that you gain overall.

Incidentally, the 3rd edition of "Numerical recipes" from William
H. Press and friends, Cambridge University Press, 2007 goes over
the top in showing how binomial randoms can be done ultra-fast
with some convoluted code (in C++).

There is a large interpretative overhead in -rndbin- from repeated
calls to -egen, sum()- to get sums. That is, Stata works through
invoked. (In fact, in Stata 9 up, a call to -egen, sum()- also
-summarize, meanonly- and save on dozens if not hundreds of lines
of code.

Let's start again from first principles. A canonical binomial
problem I take to be something like this. I have 1000 families
with 6 people. The probability of being left-handed I take to be
0.1. What is the distribution of the number of left-handed people
in each family? (I don't know enough genetics to know if this null
model is as gauche as it may sound, but there is no sinister
motive here.)

How should you do this the modern Stata way? We can use -forval-,
not available to Joe in the mid-1990s:

set obs 1000
gen xb = 0
qui forval i = 1/6 {
replace xb = xb + (uniform() < 0.1)
}

Even without -forval-, this was possible in early Stata:

set obs 1000
gen xb = 0
local i = 0
qui while `i' < 6 {
local i = `i' + 1
replace xb = xb + (uniform() < 0.1)
}

As from Stata 9, we can also use Mata as our favourite calculator:

set obs 1000
gen xb = 0
mata: st_store(., "xb", ., rowsum(uniform(1000, 6) :<= 0.1))

In the code, ":<=" means either "Bill Gould is grinning at you,
because Mata is smart" or "elementwise <=" or both.

Naturally, in most cases something like -set obs 1000- will not
be necessary, as the number of observations is already what we want.

There is no need to speculate about speed. We can experiment:

-------------------------------------
clear
set obs 1000
set rmsg on
forval j = 1/100 {
gen xb`j' = 0
qui forval i = 1/6 {
replace xb`j' = xb`j' + (uniform() < 0.1)
}
}

drop xb*
set obs 1000

forval j = 1/100 {
gen xb`j' = 0
mata: st_store(., "xb`j'", ., rowsum(uniform(1000, 6) :<= 0.1))
}

qui forval j = 1/100 {
rndbin 1000 0.1 6
}
--------------------------------------

The timings on my somewhat mesolithic machine are
suppressed to save my embarrassment, but the order I get is
fastest is the Mata route (surprise)
second is a direct solution with -forval-
third is -rndbin-.

In short, direct alternatives to -rndbin- are possible, explicit
and fast. (As indicated, -rndbin- could be speeded up, but
that doesn't affect the main point.)

However, don't pester me for simple solutions in the same spirit
for your favourite distribution. Manifestly, the binomial is
especially easy.

======================

(* any big boss, and some little ones)

But there's a broader issue that will interest many.

Isn't it high time for a decent set of routines from StataCorp for
random numbers from the more basic distributions? This grumble
keeps re-surfacing, for example at the 2007 Boston users' meeting.
Sometimes it seems that StataCorp find the request too boring to
be really interesting, compared with some exotic modish model that
1% of the community really, really want. (Or say they do: next
year, something else will be hot, or is it cool.)

For a long time, StataCorp have been relying, at least implicitly,
on three lines of defence when asked similar questions:

1. Given the existence of -uniform()-, random numbers for many
distributions can be got in just one line, as
-invnorm(uniform())-, -exp(invnorm(uniform()))-, etc., etc.

2. It's rather difficult to decide on a standard list. Weibull or
Pareto? Does anybody but you and two friends really work with the
Singh-Maddala distribution? And the issue is complicated in
many instances by tribal preferences for two or even more different
parameterisations.

3. See Joe Hilbe's commands.

To which the answers are, Sure, sure, sure, but StataCorp could now
look at it this way:

* Start with issuing Mata functions. That cuts out any need for the
programming of inbuilt Stata functions, strict sense, which
can only be done by StataCorp. (For example, -log()- is an inbuilt strict sense function; -egen- functions are not; commands are commands, and not functions.)

* Publicise for those not yet into Mata how easy it is to feed a
Mata vector into a variable. Or, better, from most points of view,
this could all be gracefully hidden by something like

rgen mygamma = rgamma(<alpha>, <beta>)

which could be a wrapper for

* example with 1000 obs
gen mygamma = .
mata : st_store(., "mygamma", ., rgamma(1000, <alpha>, <beta>))

and all the extensions that needs (support for -if-, -in-, default
parameter values, options, treatment of missing values). All you then need are the Mata functions like -rgamma()-. (Homage to S and R, for
which this has been standard for (say) 20 years.)

* Once a standard syntax is public with a decent set of examples,
user-programmers would happily add other distributions. Different
parameterisations would not be an issue. (No-one need tussle over
the better -beta- parameterisation, for example. It could be handled via
options or alternative -rgen- functions.)

In short, -rgen- could flourish in the way that -egen- has
flourished. (There are more user-written -egen- functions than
official, and more official -egen- functions that were once
user-written than were originally written by StataCorp.
StataCorp's key -- indeed essential -- contribution was to provide
the -egen- scaffolding and some examples to emulate.)

Various conversations with Kit Baum helped crystallise some of the
ideas in this last section.

Nick
n.j.cox@durham.ac.uk

*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/