No, at least not based on my testing. The position of the observations within the file has no impact. The -in- construct takes the same time to load the last 1000 observations as the first 1000 observations. I would think that the file does not need to be read sequentially, since the byte position of any observation can be calculated directly from the width of an observation (which is the same for all records).
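The byte-position arithmetic can be illustrated with a toy fixed-width file. This is a sketch in Python and says nothing about how Stata actually implements -use ... in-. The record layout, the header size, and the names write_file, read_range and demo are all hypothetical, chosen only to show that the offset of observation i is header + (i - 1) * width.

```python
import os
import struct
import tempfile

# Hypothetical fixed-width record: int32 id + float64 value, no padding (12 bytes).
RECORD = struct.Struct("<id")
# Hypothetical fixed-size header that precedes the first record.
HEADER_BYTES = 16


def write_file(path, n_obs):
    """Write a toy file: a header followed by n_obs fixed-width records."""
    with open(path, "wb") as f:
        f.write(b"\0" * HEADER_BYTES)
        for i in range(n_obs):
            f.write(RECORD.pack(i, i * 0.5))


def read_range(path, first, last):
    """Read observations first..last (1-based, inclusive) by seeking directly.

    No earlier observation is read: the start offset is computed from the
    record width alone.
    """
    n = last - first + 1
    with open(path, "rb") as f:
        f.seek(HEADER_BYTES + (first - 1) * RECORD.size)
        data = f.read(n * RECORD.size)
    return [RECORD.unpack_from(data, k * RECORD.size) for k in range(n)]


def demo():
    """Read the first and the last 1000 observations of a 100,000-record file."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "toy.dat")
        write_file(path, 100_000)
        first_block = read_range(path, 1, 1000)
        last_block = read_range(path, 99_001, 100_000)
    return first_block, last_block
```

In this sketch both calls to read_range do one seek and one read of 12,000 bytes, which is consistent with the first and last 1000 observations taking the same time to load.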

My tests on a different 7 million observation file find a speed improvement of 80%. Perhaps dataset width has an impact.

I think you would need to test the two approaches on your problem to find out the speed improvement. I wouldn't be surprised by dramatic time savings for large files with thousands of panels. I don't think your concern about execution time growing in N vs. N^2 is the real issue -- it is the relative speed improvement of one method vs. another. Even with the -in- construct, -statsby- processing time increases much faster than linearly in N, but it still provides the greatest time savings over -if- in those large files.

M Blasnik

----- Original Message ----- From: "Newson, Roger B" <r.newson@imperial.ac.uk>

To: <statalist@hsphsun2.harvard.edu>

Sent: Thursday, September 13, 2007 2:18 PM

Subject: RE: st: Does Blasnik's Law apply to -use-?

IMHO, Michael's results can be rationalized by hypothesizing that the -in- qualifier causes -use- to read until it gets to the beginning of the -in- range (throwing the input away), and then to read the -in- range (copying the input to the dataset in memory), and then to close the file containing the dataset. This would not have much effect on the amount of file input required to read the last 1000 observations from a file dataset containing millions of observations, but will approximately halve the amount of file input required to read the middle 1000 observations, and have a spectacular effect on the time required to read the first 1000 observations, which might then be negligible compared to the fixed cost of opening and closing the file.

If my hypothesis is correct, then using -in- to read every one of a large number of small by-groups will approximately halve the total required file input. Unfortunately, the time taken will still be quadratic in the number of by-groups. That is to say, doubling the number of by-groups (and keeping the average by-group size constant) will approximately quadruple the file input, not approximately double it. This would be different from Blasnik's law (as I have always understood it to apply to datasets already in memory), which implies that -statsby- can process each by-group without processing any of the other by-groups, implying an execution time linear in the number of by-groups. Therefore, using the -in- qualifier with -use- will not have the spectacular effect observed earlier with -statsby-.

The bottom-line consequence of my hypothesis appears to be that, if the user is working for the Office of Galactic Statistics and has a by-group for each of millions of planets, then the user should use a conventional indexed SQL-based database to create a separate Stata dataset for each planetary by-group, and then call -parmby- separately for each planetary dataset (either serially or in parallel). Is my hypothesis correct?
Best wishes Roger
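
The scaling argument in the hypothesis above can be checked with a small count of observations read. This is a sketch in Python, not a measurement of Stata. The function total_obs_read and the three strategy labels are hypothetical. The "full" baseline (reading the whole file once per by-group) is an assumption about what the hypothesis compares against, "scan" is the hypothesized read-and-discard behaviour of -in-, and "seek" is direct positioning by byte offset.

```python
def total_obs_read(n_groups, group_size, strategy):
    """Count observations read from file when each by-group is loaded in turn.

    strategy (all labels hypothetical):
      "full": read the entire file for every by-group (assumed baseline)
      "scan": read and discard everything before the group, then the group
      "seek": jump straight to the group and read only its observations
    """
    total = 0
    for g in range(1, n_groups + 1):
        if strategy == "full":
            total += n_groups * group_size
        elif strategy == "scan":
            total += g * group_size  # (g - 1) groups skipped + 1 group kept
        elif strategy == "seek":
            total += group_size
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return total


if __name__ == "__main__":
    for strategy in ("full", "scan", "seek"):
        small = total_obs_read(1000, 10, strategy)
        large = total_obs_read(2000, 10, strategy)
        print(strategy, small, large, round(large / small, 3))
```

With 1000 by-groups of 10 observations, "scan" reads 5,005,000 observations against 10,000,000 for "full", so roughly half. Doubling to 2000 by-groups multiplies the "scan" total by about 4, while the "seek" total only doubles. That is the quadratic versus linear distinction drawn in the message, and which of the two applies to -use ... in- is exactly what the timing tests are meant to settle.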

