
From: "Newson, Roger B" <r.newson@imperial.ac.uk>
To: <statalist@hsphsun2.harvard.edu>
Subject: RE: st: Does Blasnik's Law apply to -use-?
Date: Thu, 13 Sep 2007 19:18:45 +0100

IMHO, Michael's results can be rationalized by hypothesizing that the -in- qualifier causes -use- to read until it reaches the beginning of the -in- range (throwing the input away), then to read the -in- range (copying the input to the dataset in memory), and then to close the file containing the dataset. This would not have much effect on the amount of file input required to read the last 1000 observations from a file dataset containing millions of observations, but would approximately halve the amount of file input required to read the middle 1000 observations, and would have a spectacular effect on the time required to read the first 1000 observations, which might then be negligible compared to the fixed cost of opening and closing the file.

If my hypothesis is correct, then using -in- to read every one of a large number of small by-groups will approximately halve the total required file input, compared with reading the whole file once for each by-group. Unfortunately, the time taken will still be quadratic in the number of by-groups. That is to say, doubling the number of by-groups (and keeping the average by-group size constant) will approximately quadruple the file input, not approximately double it. This would be different from Blasnik's law (as I have always understood it to apply to datasets already in memory), which implies that -statsby- can process each by-group without processing any of the other by-groups, implying an execution time linear in the number of by-groups. Therefore, using the -in- qualifier with -use- will not have the spectacular effect observed earlier with -statsby-.

The bottom-line consequence of my hypothesis appears to be that, if the user is working for the Office of Galactic Statistics and has a by-group for each of millions of planets, then the user should use a conventional indexed SQL-based database to create a separate Stata dataset for each planetary by-group, and then call -parmby- separately for each planetary dataset (either serially or in parallel).

Is my hypothesis correct?
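The arithmetic behind the "halve, but still quadratic" claim can be sketched with a small cost model. This is only an illustration of the hypothesis stated above, not a measurement of what -use- actually does. It is written in Python rather than Stata so that it can be run anywhere, and the function names and the choice of 10 observations per by-group are assumptions made for the example.

```python
# Cost model for the hypothesis: -use ... in a/b- reads the file from the
# first observation up to observation b, then closes the file.
# All names here are hypothetical; the unit of cost is "observations read".

def obs_read_with_in(n_groups: int, group_size: int) -> int:
    """Total observations read when each by-group is loaded with -in-.

    Under the hypothesis, loading by-group g means reading everything up to
    the last observation of that group, i.e. g * group_size observations.
    """
    total = 0
    for g in range(1, n_groups + 1):
        last_obs = g * group_size  # end of the -in- range for group g
        total += last_obs
    return total


def obs_read_full_file(n_groups: int, group_size: int) -> int:
    """Total observations read when the whole file is read once per by-group."""
    file_size = n_groups * group_size
    return n_groups * file_size


def obs_read_direct(n_groups: int, group_size: int) -> int:
    """Idealised linear case: each by-group is touched exactly once."""
    return n_groups * group_size


for n_groups in (1000, 2000):
    with_in = obs_read_with_in(n_groups, 10)
    full = obs_read_full_file(n_groups, 10)
    direct = obs_read_direct(n_groups, 10)
    print(n_groups, with_in, full, direct, with_in / full)
```

In this model the ratio of the -in- cost to the full-file cost comes out close to one half, while doubling the number of by-groups from 1000 to 2000 multiplies the -in- cost by close to four. Only the idealised direct-access case grows linearly.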
Best wishes

Roger

Roger Newson
Lecturer in Medical Statistics
Respiratory Epidemiology and Public Health Group
National Heart and Lung Institute
Imperial College London
Royal Brompton campus
Room 33, Emmanuel Kaye Building
1B Manresa Road
London SW3 6LR
UNITED KINGDOM
Tel: +44 (0)20 7352 8121 ext 3381
Fax: +44 (0)20 7351 8322
Email: r.newson@imperial.ac.uk
Web page: www.imperial.ac.uk/nhli/r.newson/
Departmental Web page:
http://www1.imperial.ac.uk/medicine/about/divisions/nhli/respiration/popgenetics/reph/

Opinions expressed are those of the author, not of the institution.

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Michael Blasnik
Sent: 13 September 2007 17:52
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Does Blasnik's Law apply to -use-?

These results are different than mine and do not directly address the question. You compare opening the entire file vs. opening a part of the file using -in-. But the goal is to select only a subset of observations. For that, you would need a second command after opening the entire file or you would need to use the -use if _n>xxx & _n<yyy- construct. I find that using the -if- approach takes more time than using -in- or simply opening the file.

By the way, you can more accurately test the timing of individual commands using -set rmsg on- rather than simply displaying the time.

M Blasnik

----- Original Message -----
From: "David Elliott" <dcelliott@gmail.com>
To: <statalist@hsphsun2.harvard.edu>
Sent: Thursday, September 13, 2007 12:28 PM
Subject: Re: st: Does Blasnik's Law apply to -use-?

> I was alerted offlist by a member that the mailer had truncated my
> previous reply in this thread - here it is again:
>
> Having used -parmby- recently and having some understanding of what
> Roger is discussing, I'd like to offer the following.
>
> From my interpretation of how Stata stores data, the ability to
> -use in ##/##- would require the record indexes to be created by
> completely loading the data. I am currently working on a 4 million
> record dataset and was able to run a quick test with a little program:
>
> n di "Begin: " _n c(current_date) " " c(current_time) _n
> use dss_data_05_06 in 1/1000, clear
> n di "Load using in 1/1000" _n c(current_date) " " c(current_time) _n
> use dss_data_05_06, clear
> n di "Ordinary load" _n c(current_date) " " c(current_time)
>
> Output:
>
> Begin:
> 12 Sep 2007 15:02:46
>
> Load using in 1/1000
> 12 Sep 2007 15:02:56
>
> Ordinary load
> 12 Sep 2007 15:03:06
>
> I switched the loading order and regardless, the load took 10 seconds
> either way. I don't think you can use this optimization.
>
> DC Elliott

*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/

**Follow-Ups**:**Re: st: Does Blasnik's Law apply to -use-?***From:*"Michael Blasnik" <michael.blasnik@verizon.net>

**References**:**st: Does Blasnik's Law apply to -use-?***From:*"Newson, Roger B" <r.newson@imperial.ac.uk>

**Re: st: Does Blasnik's Law apply to -use-?***From:*"David Elliott" <dcelliott@gmail.com>

**Re: st: Does Blasnik's Law apply to -use-?***From:*"David Elliott" <dcelliott@gmail.com>

**Re: st: Does Blasnik's Law apply to -use-?***From:*"Michael Blasnik" <michael.blasnik@verizon.net>
