
Re: st: Does Blasnik's Law apply to -use-?

From   "Michael Blasnik" <[email protected]>
To   <[email protected]>
Subject   Re: st: Does Blasnik's Law apply to -use-?
Date   Thu, 13 Sep 2007 14:40:48 -0400

No, at least not based on my testing. The position of the observations within the file has no impact. The -in- construct takes the same time to load the last 1000 observations as the first 1000 observations. I would think that the file does not need to be read sequentially, since the byte position of any observation can be calculated directly from the width of an observation (which is the same for all records).
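To illustrate the byte-position argument, here is a minimal Python sketch. It assumes a toy fixed-width layout (a hypothetical 6-byte header followed by 12-byte records), not the real .dta format, and the names HEADER, RECORD, build_file and read_range are invented for the illustration. It shows only that one seek reaches any range of observations when every record has the same width. It makes no claim about how -use- is actually implemented.

```python
import io
import struct

HEADER = b"TOYHDR"             # hypothetical fixed-size header, NOT the real .dta header
RECORD = struct.Struct("<id")  # one observation: int32 id + float64 value = 12 bytes


def build_file(n_obs):
    """Return an in-memory 'file': header followed by n_obs fixed-width records."""
    buf = io.BytesIO()
    buf.write(HEADER)
    for i in range(1, n_obs + 1):
        buf.write(RECORD.pack(i, i * 0.5))
    buf.seek(0)
    return buf


def read_range(f, first, last):
    """Read observations first..last (1-based, inclusive) using a single seek."""
    n = last - first + 1
    # The byte offset depends only on the record width, not on the file's contents.
    offset = len(HEADER) + (first - 1) * RECORD.size
    f.seek(offset)
    data = f.read(n * RECORD.size)
    return [RECORD.unpack_from(data, k * RECORD.size) for k in range(n)]


if __name__ == "__main__":
    f = build_file(10_000)
    last_block = read_range(f, 9_001, 10_000)
    print(len(last_block), last_block[0], last_block[-1])
```

In this layout, reading the last 1000 observations costs one seek plus 1000 records, exactly as reading the first 1000 does. That is consistent with the timing reported above.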

My tests on a different 7 million observation file find a speed improvement of 80%. Perhaps dataset width has an impact.

I think you would need to test the two approaches on your problem to find out the speed improvement. I wouldn't be surprised by a dramatic time savings for large files with thousands of panels. I don't think your concern about execution time growing in N vs. N^2 is the real issue -- what matters is the relative speed improvement of one method vs. another. Even with the -in- construct, -statsby- processing time increases much faster than linearly in N, but -in- still provides the greatest time savings over -if- in those large files.

M Blasnik

----- Original Message -----
From: "Newson, Roger B" <[email protected]>
To: <[email protected]>
Sent: Thursday, September 13, 2007 2:18 PM
Subject: RE: st: Does Blasnik's Law apply to -use-?

IMHO, Michael's results can be rationalized by hypothesizing that the
-in- qualifier causes -use- to read until it gets to the beginning of
the -in- range (throwing the input away), and then to read the -in-
range (copying the input to the dataset in memory), and then to close
the file containing the dataset. This would not have much effect on the
amount of file input required to read the last 1000 observations from a
file dataset containing millions of observations, but will approximately
halve the amount of file input required to read the middle 1000
observations, and have a spectacular effect on the time required to read
the first 1000 observations, which might then be negligible compared to
the fixed cost of opening and closing the file.

If my hypothesis is correct, then using -in- to read every one of a
large number of small by-groups will approximately halve the total
required file input. Unfortunately, the time taken will still be
quadratic in the number of by-groups. That is to say, doubling the
number of by-groups (and keeping the average by-group size constant)
will approximately quadruple the file input, not approximately double
it. This would be different from Blasnik's law (as I have always
understood it to apply to datasets already in memory), which implies
that -statsby- can process each by-group without processing any of the
other by-groups, implying an execution time linear in the number of
by-groups. Therefore, using the -in- qualifier with -use- will not have
the spectacular effect observed earlier with -statsby-.
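The arithmetic behind this argument can be sketched in Python by counting records read. This is a rough sketch that assumes equal-sized by-groups stored contiguously. The function name records_touched and the three strategy labels are invented for the illustration: "full_scan" reads the whole file for every by-group, "skip_then_read" follows the hypothesis stated above, and "direct_seek" is the linear case of jumping straight to each by-group. It counts records only and does not measure anything about Stata itself.

```python
def records_touched(n_groups, group_size, strategy):
    """Count records read from the file when loading every by-group once."""
    total = 0
    for g in range(1, n_groups + 1):
        if strategy == "full_scan":
            # Read the whole file each time and keep only one by-group.
            total += n_groups * group_size
        elif strategy == "skip_then_read":
            # Hypothesis: read and discard groups 1..g-1, then read group g.
            total += g * group_size
        elif strategy == "direct_seek":
            # Jump straight to group g and read only its records.
            total += group_size
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return total


if __name__ == "__main__":
    for n_groups in (1_000, 2_000):
        counts = {s: records_touched(n_groups, 10, s)
                  for s in ("full_scan", "skip_then_read", "direct_seek")}
        print(n_groups, counts)
```

With 10 observations per by-group, "skip_then_read" reads about half as many records as "full_scan". Doubling the number of by-groups still roughly quadruples its count, while the "direct_seek" count only doubles.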

The bottom-line consequence of my hypothesis appears to be that, if the
user is working for the Office of Galactic Statistics and has a by-group
for each of millions of planets, then the user should use a conventional
indexed SQL-based database to create a separate Stata dataset for each
planetary by-group, and then call -parmby- separately for each planetary
dataset (either serially or in parallel).

Is my hypothesis correct?

Best wishes
