Statalist



RE: st: Does Blasnik's Law apply to -use-?


From   "Newson, Roger B" <r.newson@imperial.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   RE: st: Does Blasnik's Law apply to -use-?
Date   Sun, 16 Sep 2007 15:19:09 +0100

Thanks to David Elliott, Mike Blasnik and David Airey for their very
helpful and detailed replies to my query. These shall be used to inform
the first Stata 10 update to -parmby-, when I have Stata 10.

And thanks also to Vince Wiggins, who warned me (during the 13th UK
Stata User Meeting last week) of the dangers of ordinary users trying to
get too deep into the undocumented _prefix suite of commands, used
internally by StataCorp for -statsby- and other prefixes. (In Stata,
type

whelp _prefix

to find out more about these.)

Best wishes

Roger


Roger Newson
Lecturer in Medical Statistics
Respiratory Epidemiology and Public Health Group
National Heart and Lung Institute
Imperial College London
Royal Brompton campus
Room 33, Emmanuel Kaye Building
1B Manresa Road
London SW3 6LR
UNITED KINGDOM
Tel: +44 (0)20 7352 8121 ext 3381
Fax: +44 (0)20 7351 8322
Email: r.newson@imperial.ac.uk 
Web page: www.imperial.ac.uk/nhli/r.newson/
Departmental Web page:
http://www1.imperial.ac.uk/medicine/about/divisions/nhli/respiration/popgenetics/reph/

Opinions expressed are those of the author, not of the institution.

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu
[mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of David Elliott
Sent: 14 September 2007 15:07
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Does Blasnik's Law apply to -use-?

Being Stata users, we should approach this in a rigorous scientific
fashion:

X-----begin-----X

program define intest
version 9.0

*! version 1.0.0  2007.09.13
*! Simulate using part of file with in #/##
*! by David C. Elliott
*!
*! using name of trial dataset
*! postname specifies filename of postfile
*! numblocks is number of file blocks to create


syntax using/ ,POSTname(string) NUMblocks(int)

local more `c(more)'
set more off

use `using', clear // Load first to eliminate any first-pass caching effects
local recblock = round(`c(N)'/`numblocks',1)

tempname post
postfile  `post' double block float timein timeif using `postname', ///
  every(10) replace

timer clear 1
n di _n(2) "{txt}{col 11}{center 10:-- IF --}{center 10:-- IN --}" _n ///
  "{center 10:Block}{center 10:Time}{center 10:Time}" _n ///
  "{hline 30}"
local lastblock = `c(N)' - `recblock'
forvalues i=1(`recblock')`lastblock' {
	local block = `i'
	foreach I in if in {
		if "`I'" == "in" {
			local ifin in `i'/`=`i'+`recblock''
		}
		else {
			local ifin if inrange(_n, `i', `=`i'+`recblock'')
		}
		timer on 1
		use `using' `ifin', clear
		timer off 1
		qui timer list 1
		local time`I' :display %5.2f round(`r(t1)',.01)
		timer clear 1
		}
	post `post' (`block') (`timein') (`timeif')
	n di "{res}{ralign 10:`block'}{ralign 10:`timeif'}{ralign 10:`timein'}"
	}
postclose `post'
set more `more'
use `postname', clear
lab var block "Record Block"
lab var timein "Load Time using IN"
lab var timeif "Load Time using IF"
tw line timein block || line timeif block
end

X-----end-----X

e.g.:

. intest using dss_data_06_07.dta , postname(intest.dta) numblocks(100)


           -- IN --  -- IF --
  Block      Time      Time
------------------------------
         1      0.64      0.88
     17278      0.47      0.77
     34555      0.47      0.77
     51832      0.47      0.78
     69109      0.45      0.78
     86386      0.45      0.78
    103663      0.47      0.78
    120940      0.47      0.77
 ...

This ado-file will run an -if- versus -in- simulation and graph the
results.  From my findings I can confirm a speed advantage of about
50% using -in- on a dataset with obs:1,727,673 vars:28 size:266,061,642

However, things get murkier.  Run a simulation, then max out Stata's
memory setting with as much memory as the system will give you and run
the simulation again.  When you do this, you eliminate the system's
ability to cache the file.  Ordinarily, subject to filesize and
available memory, Stata may be reading the file from cache.  If this
is the case, one will see an advantage to using -in-.  However, if the
caching advantage is eliminated by increasing Stata memory, my
simulations show the speed advantage of -in- disappears.  I also
tested this on large network databases and was unable to demonstrate
any advantage to -in-.
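
A sketch of the two-pass comparison described above, written as a
do-file.  It assumes a Stata release that still honours -set memory-
(Stata 9 and 10 do); the 1g value and the two postfile names are
placeholders chosen for illustration, not settings from the runs
reported here.

```stata
* Pass 1: default memory, so the operating system can cache the file
intest using dss_data_06_07.dta, postname(intest_default.dta) numblocks(100)

* Pass 2: give Stata as much memory as the system allows, leaving
* little room for the operating system's file cache
clear                 // -set memory- requires that no data be in memory
set memory 1g         // hypothetical value; use the largest your system accepts
intest using dss_data_06_07.dta, postname(intest_bigmem.dta) numblocks(100)
```

Comparing the two postfiles afterwards shows whether the gap between
the -if- and -in- timings narrows once caching is squeezed out.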

So back to Roger's initial question.  It would appear that for
cacheable filesizes and large numbers of bygroups a strategy using
-in- might be feasible.  There is an overhead penalty of setting up
the bygroups to make them selectable using -in- involving sorts and
the like.  For a small number of bygroups the speed advantages might
be lost, but for many levels and a large number of iterations there
would be an advantage.
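
A minimal sketch of what such a strategy could look like, assuming a
dataset mydata.dta with a by-variable groupvar (both names are
hypothetical).  The sort and the collapse are the setup overhead
mentioned above; after that, each by-group is loaded with -in- alone.

```stata
* Hypothetical names: mydata.dta, groupvar
use mydata, clear
sort groupvar
tempfile sorted
save `sorted'

* Record the first and last observation number of each by-group
gen long obsno = _n
collapse (min) first=obsno (max) last=obsno, by(groupvar)
local ngroups = _N
forvalues g = 1/`ngroups' {
	local first`g' = first[`g']
	local last`g'  = last[`g']
}

* Load one by-group at a time using -in-
forvalues g = 1/`ngroups' {
	use in `first`g''/`last`g'' using `sorted', clear
	* ... per-group work goes here ...
}
```

Whether this beats -use ... if- depends on the caching behaviour
discussed above, so it is worth timing on your own data first.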

DC Elliott
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/