RE: st: Does Blasnik's Law apply to -use-?

From   "Newson, Roger B" <[email protected]>
To   <[email protected]>
Subject   RE: st: Does Blasnik's Law apply to -use-?
Date   Thu, 13 Sep 2007 19:18:45 +0100

IMHO, Michael's results can be rationalized by hypothesizing that the
-in- qualifier causes -use- to read until it gets to the beginning of
the -in- range (throwing the input away), and then to read the -in-
range (copying the input to the dataset in memory), and then to close
the file containing the dataset. This would not have much effect on the
amount of file input required to read the last 1000 observations from a
file dataset containing millions of observations, but will approximately
halve the amount of file input required to read the middle 1000
observations, and have a spectacular effect on the time required to read
the first 1000 observations, which might then be negligible compared to
the fixed cost of opening and closing the file.
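
To make the hypothesis concrete, here is a minimal Python sketch of the implied cost model. It is an illustration only, not a measurement of what Stata actually does: it assumes the file input for -use ... in first/last- is proportional to the number of observations read from the start of the file up to `last`. The function name `obs_read_with_in` and the file size of 4,000,000 observations are hypothetical.

```python
def obs_read_with_in(first, last):
    """Observations read from file under the hypothesis.

    Assumption: -use- reads and discards observations 1..first-1,
    copies observations first..last into memory, then closes the file.
    """
    skipped = first - 1          # read but thrown away
    copied = last - first + 1    # read and kept in memory
    return skipped + copied


N = 4_000_000  # hypothetical number of observations in the file

first_1000 = obs_read_with_in(1, 1000)
middle_1000 = obs_read_with_in(N // 2 - 499, N // 2 + 500)
last_1000 = obs_read_with_in(N - 999, N)

# Fraction of the whole file that has to be read in each case
for label, cost in [("first", first_1000),
                    ("middle", middle_1000),
                    ("last", last_1000)]:
    print(label, cost / N)
```

Under these assumptions the first block needs a negligible fraction of the file, the middle block about half, and the last block all of it.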

If my hypothesis is correct, then using -in- to read every one of a
large number of small by-groups will approximately halve the total
required file input. Unfortunately, the time taken will still be
quadratic in the number of by-groups. That is to say, doubling the
number of by-groups (and keeping the average by-group size constant)
will approximately quadruple the file input, not approximately double
it. This would be different from Blasnik's law (as I have always
understood it to apply to datasets already in memory), which implies
that -statsby- can process each by-group without processing any of the
other by-groups, implying an execution time linear in the number of
by-groups. Therefore, using the -in- qualifier with -use- will not have
the spectacular effect observed earlier with -statsby-.
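
The quadratic growth can be checked with a short Python sketch. Again this is only a model under the hypothesis above, assuming equal-sized by-groups stored contiguously in the file and a cost proportional to observations read; `total_obs_read`, `n_groups` and `group_size` are hypothetical names chosen for illustration.

```python
def total_obs_read(n_groups, group_size, use_in=True):
    """Total observations read when each by-group is loaded in turn.

    With use_in=True, loading group k is assumed to read the file from
    the start up to the end of group k (k * group_size observations).
    With use_in=False, every load reads the whole file.
    """
    file_size = n_groups * group_size
    total = 0
    for k in range(1, n_groups + 1):
        total += k * group_size if use_in else file_size
    return total


g, m = 1000, 10  # hypothetical: 1000 by-groups of 10 observations each

with_in = total_obs_read(g, m, use_in=True)      # m * g * (g + 1) / 2
without_in = total_obs_read(g, m, use_in=False)  # m * g * g

print("ratio with -in- vs. whole file:", with_in / without_in)
print("effect of doubling the by-groups:",
      total_obs_read(2 * g, m, use_in=True) / with_in)
```

In this model -in- roughly halves the total input, but doubling the number of by-groups still roughly quadruples it, whereas a cost linear in the number of by-groups would only double.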

The bottom-line consequence of my hypothesis appears to be that, if the
user is working for the Office of Galactic Statistics and has a by-group
for each of millions of planets, then the user should use a conventional
indexed SQL-based database to create a separate Stata dataset for each
planetary by-group, and then call -parmby- separately for each planetary
dataset (either serially or in parallel).

Is my hypothesis correct?

Best wishes


Roger Newson
Lecturer in Medical Statistics
Respiratory Epidemiology and Public Health Group
National Heart and Lung Institute
Imperial College London
Royal Brompton campus
Room 33, Emmanuel Kaye Building
1B Manresa Road
London SW3 6LR
Tel: +44 (0)20 7352 8121 ext 3381
Fax: +44 (0)20 7351 8322
Email: [email protected] 
Web page:
Departmental Web page:

Opinions expressed are those of the author, not of the institution.

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Michael
Sent: 13 September 2007 17:52
To: [email protected]
Subject: Re: st: Does Blasnik's Law apply to -use-?

These results are different than mine and do not directly address the
You compare opening the entire file vs. opening a part of the file using
But the goal is to select only a subset of observations.  For that, you
need a second command after opening the entire file or you would need to
the -use if _n>xxx & _n<yyy- construct.  I find that using the -if-
takes more time than using -in- or simply opening the file.  By the way,
you can more accurately test the timing of individual commands using
-set rmsg on- rather than simply displaying the time.

M Blasnik

----- Original Message ----- 
From: "David Elliott" <[email protected]>
To: <[email protected]>
Sent: Thursday, September 13, 2007 12:28 PM
Subject: Re: st: Does Blasnik's Law apply to -use-?

> I was alerted offlist by a member that the mailer had truncated my
> previous reply in this thread - here it is again:
> Having used -parmby- recently and having some understanding of what
> Roger is discussing, I'd like to offer the following.
> From my interpretation of how Stata stores data, the ability to -use
> in ##/##- would require the record indexes to be created by completely
> loading the data.  I am currently working on a 4 million record
> dataset and was able to run a quick test with a little program:
> n di "Begin: " _n c(current_date) " " c(current_time) _n
> use dss_data_05_06 in 1/1000, clear
> n di "Load using in 1/1000" _n c(current_date) " " c(current_time) _n
> use dss_data_05_06, clear
> n di "Ordinary load" _n c(current_date) " " c(current_time)
> Output:
> Begin:
> 12 Sep 2007 15:02:46
> Load using in 1/1000
> 12 Sep 2007 15:02:56
> Ordinary load
> 12 Sep 2007 15:03:06
> I switched the loading order and regardless, the load took 10 seconds
> either way.  I don't think you can use this optimization.
> DC Elliott
