Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
st: New command -fastsample- available from SSC

From	Andrew Maurer <[email protected]>
To	"[email protected]" <[email protected]>
Subject	st: New command -fastsample- available from SSC
Date	Tue, 1 Apr 2014 14:17:55 +0000
Thanks to Kit Baum, -fastsample- is now available from the SSC.

fastsample: module to draw random samples 100x faster than Stata's sample and bsample commands

Description:
fastsample and fastbsample were written using an algorithm that is significantly faster than the algorithm behind sample and bsample. The performance gains of fastsample and fastbsample relative to their built-in counterparts increases greater than linearly with population size. Thus, the commands are particularly useful for sampling large datasets.

For a population size of 10,000,000; sample size of 100,000; and 5 repetitions, computation times were recorded as:

sample: 111.4 seconds vs fastsample: 3.4 seconds
bsample: 211.5 seconds vs fastbsample: 1.0 seconds

Best,
Andrew Maurer

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Joe Canner
Sent: Thursday, March 27, 2014 11:10 AM
To: [email protected]
Subject: st: RE: fastsample: Sampling 100x faster than Stata's sample and bsample

Andrew et al.,

I take back what I said about -selectindex()-.  The benchmark worked in Stata 12 for -fastbsample- because that routine does not use -selectindex()-.  The benchmark failed when I ran -fastsample- because -selectindex()- is new with Stata 13.

Regards,
Joe

-----Original Message-----
From: Joe Canner 
Sent: Thursday, March 27, 2014 11:50 AM
To: '[email protected]'
Subject: RE: fastsample: Sampling 100x faster than Stata's sample and bsample

Andrew,

This looks a very useful discovery, one that might be worth making available for public download.  It would also make a good Stata Conference presentation, but the abstract deadline closed last month.

One thing I have found is that it is often possible to improve on Stata's performance if you simplify the problem.  Stata devotes a lot of overhead to allowing for a variety of situations, error-checking, etc.  Accordingly, I wouldn't hold my breath waiting for them to adopt this method.  On the other hand, Stata is only recently devoting some effort to making code more efficient for large data sets and since your idea goes more to the underlying algorithm it may be possible for Stata to incorporate a new algorithm without mucking up their flexibility and error-checking.  I will be interested to see if others have any comments on any potential pitfalls with your algorithm.

One small curiosity I noticed.  Since you mentioned the -selectindex()- Mata function, I was looking for information about it in the Stata 12 help and the Stata 12 Mata manual and couldn't find it.  I found it in the online manual for Stata 13, so I expected that I wouldn't be able to run your benchmark in Stata 12.  However, it ran just fine, so it appears that-selectindex()- was an undocumented feature prior to Stata 13.

Regards,
Joe Canner
Johns Hopkins University School of Medicine

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Andrew Maurer
Sent: Thursday, March 27, 2014 11:09 AM
To: [email protected]
Subject: st: fastsample: Sampling 100x faster than Stata's sample and bsample

Hi Statalist,

Results:
fastbsample: 0.2 seconds vs bsample: 55.6 seconds
fastsample: 1.6 seconds vs sample 58.9 seconds
(note: relative performance increases with population size)

This is partly a follow up to yesterday's thread "st: Sampling problem". Daniela was asking how to efficiently sample with replacement. I took a look at Stata's sample.ado and bsample.ado files and realized that the algorithms are painfully inefficient (at least in the simple case with no "by", no weights, etc). Both programs essentially sort on a newly generated runiform() variate and, I'm simplifying, keep the first N observations, for sample size N.

Sorting large datasets is a very intensive process. For bsample with sample size 3 on a population of 1million, the computer must 1) create 2 million random variates (see bsample.ado), 2) compound sort those million into ascending order, and 3) keep the subset of size 3. Why not just make a list of 3 random integers between 1 and 1 million and select those observations? Ie: obstokeep = ceil(10^6*runiform(3,1)) (using mata's runiform()).

Sampling without replacement is slightly harder. If we follow the above idea, it's possible that of sample size 3, we selected 2 of the same integer. Eg randomly sampled observation numbers = {14, 66, 66}. In this case, we can calculate the difference between the number of distinct values in the sampled set (2) and the requested sample size (3), and then iterate over the set {1..1mil} - {14,66,66} until we have the size of the sampled set equal to the requested sample size.

See below for the code. I'd be very interested to hear if anyone has any thoughts on its efficiency. One thing that I don't fully understand is why the line -R = selectindex(!I)- is so fast. Even with !I on the order of 10^7 or 10^8, selectindex(!I) is almost instant (assuming the required memory is already allocated to Stata). I don't understand why it's so much faster than a for loop, looping through the rows of !I and putting the indices into R.

************** Define mata functions **********************
mata

void fastbsample(real scalar n)
// faster alternative to stata's bsample.
// .2s vs 55.6s in one test
{
	
	real scalar origN
	real vector allnum, allstr
	
	// check for errors
	if ((n-ceil(n)!=0) | (n <= 0)) _error(9,"n must be a positive integer")
	if (st_nobs() == 0) _error(9,"cannot sample empty dataset")
	
	// declare objects
	allnum = J(1,0,.)
	allstr = J(1,0,.)
	v = st_nvar()

	// separate string and numeric variables
	for (i=1;i<=v;i++) if (st_isnumvar(i)==1) allnum = allnum, i; else allstr = allstr, i;
	
	st_view(Nvars=.,.,allnum)
	st_sview(Svars=.,.,allstr)
	
	origN = rows(Nvars)
	
	// manually add extra obs if origN < n
	if (origN < n) {
		st_addobs(n - origN,1)
		st_view(Nvars=.,.,allnum)
		st_sview(Svars=.,.,allstr)
	}

	/*
	Below: slightly less efficient to store vector obstokeep than direct
	subscripting. We only have to if Nvars and Svars are both nonempty
	*/
	if (cols(allstr) == 0) Nvars[|1,.\n,.|] = Nvars[ceil(origN*runiform(n,1)),.]
	else if (cols(allnum) == 0) Svars[|1,.\n,.|] = Svars[ceil(origN*runiform(n,1)),.]
	else {
		real vector obstokeep
		obstokeep = ceil(origN*runiform(n,1))
		Nvars[|1,.\n,.|] = Nvars[obstokeep,.]
		Svars[|1,.\n,.|] = Svars[obstokeep,.]
	}

	st_keepobsin((1,n))
	
}

void fastsample(real scalar N)
// faster alternative to stata's sample.
// 1.6s vs 58.9s in one test
{
	
	real scalar origN, L, n, i
	real vector allnum, allstr, obstokeep
	
	// check for errors
	L = st_nobs()
	if ((N-ceil(N)!=0) | (N <= 0)) _error(9,"N must be a positive integer")
	if (L < N) _error(9,"cannot sample more observations than entire dataset")
	
	// initialize index	
	I = J(L,1,0) // index of rows to keep
	i = 0

	// first try - there could be collisions (the same row being called twice)
	obstokeep = ceil(L*runiform(N,1))
	I[obstokeep] = J(N,1,1)

	// iterate until we have a sample of N index values of L
	while (n!=0) {
		R = selectindex(!I) // remaining indices that may be chosen
		l = length(R) // total subindices
		n = N - (L-l) // remaining obs to get

		obstokeep = R[ceil(l*runiform(n,1))]
		I[obstokeep] = J(n,1,1)
		i++
	}
	printf("total iterations: %f\n", i)

	obstokeep = selectindex(I)
	st_keepobsin(obstokeep)
		
}

end
************** End function definitions *******************


*********** Benchmark Stata's bsample vs fastbsample ********
local reps 5
tempfile temp

set obs 10000000
gen r=runiform()
save `temp', replace


forval i = 1/`reps' {
	use `temp', clear
	timer on 1
	bsample 5000
	//sample 5000, count
	timer off 1
}

forval i = 1/`reps' {
	use `temp', clear
	timer on 2
	mata: fastbsample(5000)
	//mata: fastsample(5000)
	timer off 2
}

timer list
*********** End benchmark ***********************************

Thank you,

Andrew Maurer



*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/



*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/
Prev by Date: RE: st: RE: eform
Next by Date: Re: st: Specifying Regression model as an equation
Previous by thread: RE: st: RE: BHPS
Next by thread: Re: st: Specifying Regression model as an equation
Index(es):
- Date
- Thread