Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

st: fastsample: Sampling 100x faster than Stata's sample and bsample

From	Andrew Maurer
Subject	st: fastsample: Sampling 100x faster than Stata's sample and bsample
Date	Thu, 27 Mar 2014 15:09:19 +0000

Hi Statalist,

Results:
fastbsample: 0.2 seconds vs bsample: 55.6 seconds
fastsample: 1.6 seconds vs sample: 58.9 seconds
(note: the relative speedup grows with population size)

This is partly a follow-up to yesterday's thread "st: Sampling problem", in which Daniela asked how to sample with replacement efficiently. I took a look at Stata's sample.ado and bsample.ado files and realized that the algorithms are painfully inefficient (at least in the simple case with no "by", no weights, etc.). Both programs essentially sort on a newly generated runiform() variate and, simplifying somewhat, keep the first N observations, for sample size N.
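The sort-then-keep idea can be sketched in a few lines of Mata. This is a simplified illustration only, not the code in sample.ado or bsample.ado; the function name sort_based_sample() is hypothetical, and it assumes 1 <= N <= L.

```stata
mata:
// Simplified sketch of sort-based sampling without replacement.
// NOT the actual sample.ado logic; sort_based_sample() is a made-up name.
real colvector sort_based_sample(real scalar L, real scalar N)
{
	real colvector p

	p = order(runiform(L,1), 1)   // permutation that sorts L random keys
	return(p[|1 \ N|])            // first N rows of the random ordering
}

rseed(42)
s = sort_based_sample(10, 3)
s
end
```

The cost is dominated by generating and ordering L random keys, even when N is tiny relative to L.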

Sorting large datasets is very expensive. For bsample with sample size 3 on a population of 1 million, the computer must 1) create 2 million random variates (see bsample.ado), 2) compound-sort those million observations into ascending order, and 3) keep the subset of size 3. Why not just make a list of 3 random integers between 1 and 1 million and select those observations? I.e., obstokeep = ceil(10^6*runiform(3,1)) (using Mata's runiform()).
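As a minimal sketch of that idea, the draw can be wrapped in a small Mata function. The name draw_with_replacement() is hypothetical and used only for illustration; the body is the ceil(L*runiform(n,1)) expression from the paragraph above.

```stata
mata:
// Hypothetical helper: draw n row numbers with replacement from 1..L
// without sorting anything.
real colvector draw_with_replacement(real scalar L, real scalar n)
{
	// ceil(L*u) maps each uniform draw u to an integer row number
	return(ceil(L*runiform(n,1)))
}

rseed(12345)
idx = draw_with_replacement(10^6, 3)
idx
end
```

The work here is proportional to the sample size n rather than to the population size L. The returned row numbers may repeat, which is what sampling with replacement requires.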

Sampling without replacement is slightly harder. If we follow the above idea, a sample of size 3 may contain the same integer twice, e.g., randomly sampled observation numbers = {14, 66, 66}. In this case, we can calculate the shortfall, i.e., the requested sample size (3) minus the number of distinct values in the sampled set (2), draw that many additional observations from the remaining set {1..1 million} - {14, 66}, and repeat until the size of the sampled set equals the requested sample size.
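The top-up procedure can be sketched on a small population where collisions are likely. The function name draw_without_replacement() and the variable names are assumptions made for this illustration, and the sketch assumes 1 <= N <= L.

```stata
mata:
// Hypothetical helper illustrating the collision top-up described above.
real colvector draw_without_replacement(real scalar L, real scalar N)
{
	real colvector flag, rest
	real scalar    need

	flag = J(L,1,0)                          // flag[j] = 1 once row j is chosen
	flag[ceil(L*runiform(N,1))] = J(N,1,1)   // first pass; duplicate draws collapse
	need = N - sum(flag)                     // shortfall caused by collisions
	while (need > 0) {
		rest = selectindex(!flag)            // rows not yet chosen
		flag[rest[ceil(rows(rest)*runiform(need,1))]] = J(need,1,1)
		need = N - sum(flag)
	}
	return(selectindex(flag))                // N distinct row numbers, ascending
}

rseed(2014)
s = draw_without_replacement(100, 60)
rows(s), rows(uniqrows(s))
end
```

Both counts displayed on the last line should equal 60: the loop stops only once exactly N distinct rows are flagged, and each pass redraws only the shortfall.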

See below for the code. I'd be very interested to hear any thoughts on its efficiency. One thing that I don't fully understand is why the line -R = selectindex(!I)- is so fast. Even with I having on the order of 10^7 or 10^8 rows, selectindex(!I) is almost instant (assuming the required memory is already allocated to Stata). I don't understand why it's so much faster than a for loop that steps through the rows of !I and puts the indices into R.
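The comparison can be reproduced with a small sketch. The function loopindex() is a hypothetical loop-based equivalent written for this illustration. A plausible explanation for the gap, stated here as an assumption rather than something confirmed in this post, is that selectindex() is a built-in routine that runs as compiled code, while every iteration of a Mata for loop is interpreted.

```stata
mata:
// Hypothetical loop-based equivalent of selectindex(!I), for comparison only.
real colvector loopindex(real colvector I)
{
	real colvector R
	real scalar    j, k

	R = J(rows(I),1,.)               // preallocate the worst case
	k = 0
	for (j=1; j<=rows(I); j++) {
		if (!I[j]) R[++k] = j        // record the index of each zero entry
	}
	return(k ? R[|1 \ k|] : J(0,1,.))
}

rseed(1)
I = runiform(10^6,1) :< 0.5          // random 0/1 flags
timer_clear()
timer_on(1)
R1 = selectindex(!I)
timer_off(1)
timer_on(2)
R2 = loopindex(I)
timer_off(2)
R1 == R2                             // the two results should be identical
timer()                              // timer 1: selectindex, timer 2: loop
end
```

Timings depend on the machine and Stata version, so none are quoted here.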

************** Define mata functions **********************
mata

void fastbsample(real scalar n)
// faster alternative to stata's bsample.
// .2s vs 55.6s in one test
{
	
	real scalar origN
	real vector allnum, allstr
	
	// check for errors
	if ((n-ceil(n)!=0) | (n <= 0)) _error(9,"n must be a positive integer")
	if (st_nobs() == 0) _error(9,"cannot sample empty dataset")
	
	// declare objects
	allnum = J(1,0,.)
	allstr = J(1,0,.)
	v = st_nvar()

	// separate string and numeric variables
	for (i=1;i<=v;i++) {
		if (st_isnumvar(i)) allnum = allnum, i
		else allstr = allstr, i
	}
	
	st_view(Nvars=.,.,allnum)
	st_sview(Svars=.,.,allstr)
	
	origN = rows(Nvars)
	
	// manually add extra obs if origN < n
	if (origN < n) {
		st_addobs(n - origN,1)
		st_view(Nvars=.,.,allnum)
		st_sview(Svars=.,.,allstr)
	}

	/*
	Below: storing the draws in the vector obstokeep is slightly less
	efficient than subscripting directly. It is only needed when Nvars and
	Svars are both nonempty, so that the numeric and string columns
	receive the same sampled rows.
	*/
	if (cols(allstr) == 0) Nvars[|1,.\n,.|] = Nvars[ceil(origN*runiform(n,1)),.]
	else if (cols(allnum) == 0) Svars[|1,.\n,.|] = Svars[ceil(origN*runiform(n,1)),.]
	else {
		real vector obstokeep
		obstokeep = ceil(origN*runiform(n,1))
		Nvars[|1,.\n,.|] = Nvars[obstokeep,.]
		Svars[|1,.\n,.|] = Svars[obstokeep,.]
	}

	st_keepobsin((1,n))
	
}

void fastsample(real scalar N)
// faster alternative to stata's sample.
// 1.6s vs 58.9s in one test
{
	
	real scalar origN, L, n, i
	real vector allnum, allstr, obstokeep
	
	// check for errors
	L = st_nobs()
	if ((N-ceil(N)!=0) | (N <= 0)) _error(9,"N must be a positive integer")
	if (L < N) _error(9,"cannot sample more observations than entire dataset")
	
	// initialize index
	I = J(L,1,0) // I[j] = 1 if row j is in the sample
	i = 0

	// first try - there could be collisions (the same row being drawn twice)
	obstokeep = ceil(L*runiform(N,1))
	I[obstokeep] = J(N,1,1)

	// top up from the unselected rows until exactly N rows are flagged
	n = N - sum(I) // remaining obs to get
	while (n > 0) {
		R = selectindex(!I) // remaining indices that may be chosen
		l = length(R) // number of remaining indices

		obstokeep = R[ceil(l*runiform(n,1))]
		I[obstokeep] = J(n,1,1)
		n = N - sum(I) // recompute the shortfall after this pass
		i++
	}
	printf("total iterations: %f\n", i)

	obstokeep = selectindex(I)
	st_keepobsin(obstokeep)
		
}

end
************** End function definitions *******************


*********** Benchmark Stata's bsample vs fastbsample ********
clear
timer clear // reset timers so earlier runs in the session are not included
local reps 5
tempfile temp

set obs 10000000
gen r=runiform()
save `temp', replace


forval i = 1/`reps' {
	use `temp', clear
	timer on 1
	bsample 5000
	//sample 5000, count
	timer off 1
}

forval i = 1/`reps' {
	use `temp', clear
	timer on 2
	mata: fastbsample(5000)
	//mata: fastsample(5000)
	timer off 2
}

timer list
*********** End benchmark ***********************************

Thank you,

Andrew Maurer


