Statalist



st: Re: Does Blasnik's Law apply to -use-?


From   "Michael Blasnik" <michael.blasnik@verizon.net>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: Re: Does Blasnik's Law apply to -use-?
Date   Wed, 12 Sep 2007 13:45:21 -0400

...
Based on a few tests, it does appear to apply. The -in- approach reduced execution time by about 50% when selecting 100K observations from the middle of a file with 7 million observations.
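
A comparison of this kind could be timed roughly as follows. This is only a sketch, not the code used for the tests above: the synthetic file, its size, the variable names id and x, and the observation range are all assumptions made for illustration. It uses the documented -use ... using filename- form, and the -if- version selects on a stored id variable rather than on _n. Run it as a do-file so that the temporary file macro stays defined.

```stata
* Sketch only: a synthetic file stands in for the real data
clear
set obs 1000000
generate long id = _n
generate double x = runiform()
tempfile big
save "`big'"

timer clear

* Timer 1: select a block from the middle of the file with -in-
timer on 1
use in 450001/550000 using "`big'", clear
timer off 1

* Timer 2: select the same block with -if-
timer on 2
use if id >= 450001 & id <= 550000 using "`big'", clear
timer off 2

* Report elapsed seconds for both timers
timer list
```

Timings will depend on the machine, the file size, and disk caching, so each version is best run several times before drawing conclusions.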

In many cases, the difference in execution time for a single command is fairly trivial; in my tests it was only about 0.8 seconds. The real speed benefits occur when the command is executed many times in a loop over a large dataset, such as when identifying the members of each panel in a dataset with thousands of panels. If -parmby- is similar to -statsby-, then the speed benefits will be substantial for users working with large datasets with many levels of the -by- variable, but modest for those with few levels or smaller datasets.
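
One way to get the benefit of -in- inside such a loop is to sort once, so that each panel occupies a contiguous block of observations, and then address each block by its observation range. The sketch below assumes this sort-then-range approach; auto.dta and rep78 merely stand in for a large dataset and its panel variable, and the helper variables obsno and last are hypothetical names.

```stata
* Sketch only: auto.dta and rep78 stand in for a large panel dataset
sysuse auto, clear
sort rep78

* After sorting, each group is one contiguous block of observations.
* Store the observation number of the block's last row in every row.
generate long obsno = _n
by rep78: generate long last = obsno[_N]

* Walk through the blocks, using -in- rather than -if- for each group
local start = 1
while `start' <= _N {
    local stop = last[`start']
    quietly summarize mpg in `start'/`stop', meanonly
    display "rep78 = " rep78[`start'] "  obs `start'-`stop'  mean mpg = " %6.2f r(mean)
    local start = `stop' + 1
}
```

Each pass touches only the observations in its own block, so Stata does not have to evaluate a condition on every observation in memory once per group. Observations with missing rep78 sort last and form their own block.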

Michael Blasnik
of Blasnik's law ;)


----- Original Message -----
From: "Newson, Roger B" <r.newson@imperial.ac.uk>
To: <statalist@hsphsun2.harvard.edu>
Sent: Wednesday, September 12, 2007 10:03 AM
Subject: st: Does Blasnik's Law apply to -use-?



I have a query re Blasnik's Law, first named in the Statalist archives
by Nick Cox at
http://www.stata.com/statalist/archive/2007-08/msg00668.html
which states that using the -in- qualifier uses less computing time than
the equivalent -if- qualifier. For instance

regress mpg weight in 53/74

uses less time than

regress mpg weight if _n>=53 & _n<=74

because Stata does not have to check every observation in the dataset in
memory the first way, but has to do so the second way. My query is: Does
Blasnik's Law apply to the -use- command? That is to say, does the
statement

use in 3959/4030 using mybigdata.dta

use much less computing time than the statement

use if _n>=3959 & _n<=4030 using mybigdata.dta

which should input the same data into memory? I ask because, as I
understand it, Stata datasets are sequential-access files (unlike SAS
datasets which I understand are random-access, with the option of having
multiple indices), and this should imply that Stata has to read through
observations 1 to 3958 before reading observation 3959.

My motivation is that I wish to streamline the command -parmby-, which
currently processes multiple by-groups by inputting the whole dataset
repeatedly, using the -restore, preserve- command, and then dropping all
by-groups except one. I am trying to think of a better way.
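
For readers unfamiliar with the pattern, a loop of the kind described above might look roughly like this. It is a sketch of the general preserve-and-subset approach only, not the actual code of -parmby-; auto.dta, the by-variable foreign, and the regression are placeholders chosen for illustration.

```stata
* Sketch only: not -parmby-'s code; auto.dta and foreign are placeholders
sysuse auto, clear
levelsof foreign, local(levels)

preserve
foreach lev of local levels {
    * Bring back the full dataset, keeping the preserved copy for the next pass
    restore, preserve
    * Drop every by-group except the current one
    keep if foreign == `lev'
    quietly regress mpg weight
    display "foreign = `lev': slope on weight = " _b[weight]
}
* Final restore returns the full dataset and discards the preserved copy
restore
```

Every pass reloads the whole dataset and then evaluates the -if- condition on all observations, so the cost grows with both the size of the dataset and the number of by-groups.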

Best wishes (and thanks in advance)

Roger