Statalist



RE: st: RE: distance calculation and reshape


From   Martin Hällsten <martin.hallsten@sofi.su.se>
To   "'Austin Nichols'" <austinnichols@gmail.com>, <statalist@hsphsun2.harvard.edu>
Subject   RE: st: RE: distance calculation and reshape
Date   Fri, 19 Oct 2007 21:52:20 +0200

Austin, 

First, the machine I use has 34 GB of physical RAM, so Stata should not need to use the hard drive as memory. 

Second, this seems to be exactly what I need. Some preliminary runs on my own desktop (not on the
more powerful machine used previously) revealed that the computation time of the "post" procedure is
an almost perfectly linear function of the number of distances, whereas the time for my lousy
"reshape" procedure grows much faster, closer to a quadratic function. Extrapolating linearly, the
estimated time for "post" with 9 230 points is slightly below 4 hours. 

n points             100       200       350       500       750      1000
n distances       10 000    40 000   122 500   250 000   562 500 1 000 000
post (s)               1         6        20        39        90       161
reshape (s)            2        12        41        96       324       850
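
As a rough check of the 4-hour figure, the extrapolation can be sketched as follows. It assumes the per-distance cost of "post" stays at the rate measured for 1000 points; the local names rate and ndist are only illustrative.

```stata
* Back-of-envelope extrapolation of the -post- timing.
* Assumption: cost per distance is constant at the 1000-point rate.
* Seconds per distance, from 161 seconds for 1 000 000 distances:
local rate = 161/1000000
* All ordered pairs of 9 230 points, including each point with itself:
local ndist = 9230^2
display "Distances: " %12.0fc `ndist'
display "Hours:     " %5.2f (`rate'*`ndist'/3600)
```

This gives about 85.2 million distances and roughly 3.8 hours.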

I can't wait to try this next week. Thank you very much!

Martin Hällsten

BTW, isn't a "linear" version of reshape warranted? 

-----Original Message-----
From: Austin Nichols [mailto:austinnichols@gmail.com] 
Sent: 19 October 2007 16:25
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: RE: distance calculation and reshape

Martin Hällsten <martin.hallsten@sofi.su.se>:

Plain text email only, please!

You want to wind up with 87m obs, which may tax your
computer no matter how you do it. That said, you are
calculating _N^2 (100 in the example, 87m in the data) x
and y differences, then computing the distance off those.
However, there are only _N*(_N-1)/2 distinct values (45 in the
example, about 43m in the data) to calculate for the x and y
differences, as the relevant matrix of calculations is symmetric.  I.e. the
distance from i to j is the same as the distance from j to
i. It would be faster to do fewer calculations, then
expand the data, since copying values is faster than
computing them.
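
A minimal sketch of that symmetric approach, not timed or tested on the real data. It assumes a toy data set of 5 random points; the handle `h', the scalar `d' and the tempfile `half' are made-up names. Each unordered pair is computed once and written in both orderings, and the zero distance from a point to itself is posted without any calculation.

```stata
* Sketch: compute each unordered pair once, then post both orderings.
clear
set seed 12347
local n = 5
set obs `n'
gen long point = _n
gen long x = int(uniform()*10000000)
gen long y = int(uniform()*10000000)
tempname h d
tempfile half
postfile `h' long p long r double dist using `half', replace
forvalues i = 1/`n' {
    * distance from a point to itself is zero, so nothing to compute
    post `h' (point[`i']) (point[`i']) (0)
    * only pairs with j > i; the range is empty when i equals n
    forvalues j = `=`i'+1'/`n' {
        scalar `d' = sqrt((x[`i']-x[`j'])^2 + (y[`i']-y[`j'])^2)
        * same value for (i,j) and (j,i)
        post `h' (point[`i']) (point[`j']) (`d')
        post `h' (point[`j']) (point[`i']) (`d')
    }
}
postclose `h'
use `half', clear
sort p r
* one row per ordered pair, i.e. n^2 rows
count
```

The result still has one row per ordered pair, but the square root is evaluated only _N*(_N-1)/2 times.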

But let's do every calculation as you request, and just
use -post- to write the results of calculations to disk,
instead of making variables and using -reshape-.
Depending on your memory and hard disk configuration,
you may get a big speed improvement
this way.  I am guessing you don't have 25GB of physical
memory, which means Stata is using your hard drive as
memory, which makes everything much slower.  Try setting
memory to no more than one half of your physical RAM.

clear
set mem 60m
set seed 12347
local n = 500
range point 1 `n' `n'
gen long x = int(abs(uniform()*10000000))
gen long y = int(abs(uniform()*10000000))
local rows = _N
loc tm=real(substr("$S_TIME",4,2))
loc t=60*`tm'+real(substr("$S_TIME",7,2))
set rmsg off
tempfile dta
postfile t p r px py rx ry dist using `dta', replace
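* Loop over every ordered pair of points and post one row per pair.
forvalues n = 1/`rows' {
    forvalues i = 1/`rows' {
        loc p=point[`n']
        loc r=point[`i']
        * coordinates of the first (p) and second (r) point
        loc px=x[`n']
        loc py=y[`n']
        loc rx=x[`i']
        loc ry=y[`i']
        loc d=sqrt(((`px'-`rx')^2)+((`py'-`ry')^2))
        post t (`p') (`r') (`px') (`py') (`rx') (`ry') (`d')
    }
}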
postclose t
loc tm=real(substr("$S_TIME",4,2))
loc t=60*`tm'+real(substr("$S_TIME",7,2))-`t'

loc tm=real(substr("$S_TIME",4,2))
loc s=60*`tm'+real(substr("$S_TIME",7,2))
forvalues n1 = 1/`rows' {
     gen int point_`n1' = point[`n1']
     gen long x_`n1' = x[`n1']
     gen long y_`n1' = y[`n1']
}
reshape long point_ x_ y_ , i(point) j(r)
gen xdiff = abs(x-x_)
gen ydiff = abs(y-y_)
gen distance  = sqrt((xdiff^2)+(ydiff^2))
loc tm=real(substr("$S_TIME",4,2))
loc s=60*`tm'+real(substr("$S_TIME",7,2))-`s'

ren point p
sort p r
joinby p r using `dta'
compare d*
di as res "Timings:"
di "Post: " `t' _n "Reshape: " `s'
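
The timings above are built from the minutes and seconds of $S_TIME only, so they would come out wrong if a run crossed an hour boundary. A possible alternative, sketched here under the assumption that your Stata version provides the -timer- command; the timer number 1 and the toy loop are placeholders, not part of the code above.

```stata
* Time a block with -timer- instead of parsing $S_TIME.
timer clear 1
timer on 1
forvalues i = 1/1000 {
    * stand-in for the work being timed
    local d = sqrt(`i')
}
timer off 1
* -timer list- leaves the elapsed seconds in r(t1)
quietly timer list 1
display "Elapsed seconds: " r(t1)
```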





