Statalist



RE: st: using the first n observations in a dataset w/o evaluating the whole thing?


From   "Rodini, Mark" <[email protected]>
To   <[email protected]>
Subject   RE: st: using the first n observations in a dataset w/o evaluating the whole thing?
Date   Thu, 3 Apr 2008 17:28:04 -0700

Thanks for the replies, and the long explanation.

(Yes I did mean little _n: typo!)

Anyway, I tried the suggestion: use in 1/99 using mydata

and I did indeed find it took time.  In fact, I tried to apply the idea
to a 20MB dataset after having only set the memory to 10MB, and it
completely froze up.  It only worked on the 20MB dataset if I set memory
to >20MB, and it was slow --as though it were reading the whole thing
first.
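
One way to put a number on the slowness is with Stata's timer commands. This is only a sketch: it assumes the file is mydata.dta in the working directory, that enough memory has been set for the load to succeed, and that timer slot 1 is free (the slot number is an arbitrary choice).

```stata
* Report observation and variable counts for the file on disk,
* without loading the data into memory
describe using mydata, short

* Time the partial load
timer clear 1
timer on 1
use in 1/99 using mydata, clear
timer off 1
timer list 1
```

Repeating the timing with a larger range, such as in 1/100000, would show whether the elapsed time depends on the number of observations requested or mostly on the size of the file.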

Oh well.


-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of David Kantor
Sent: Thursday, April 03, 2008 5:20 PM
To: [email protected]
Subject: Re: st: using the first n observations in a dataset w/o
evaluating the whole thing?

At 07:31 PM 4/3/2008, Mark Rodini wrote:
>Greetings.
>
>Suppose I have a large Stata dataset (e.g. 3,000,000 observations) and I
>only wish to read in the first, say, 100 observations.
>
>I have tried the code, which works:
>
>use mydata if ( _N<100 )
>
>However, evidently, this code goes through ALL 3 million observations to
>evaluate the expression in parentheses, which can be very time consuming
>(and sort of defeats the purpose).  Is there a way to only read the
>first 100 observations without having to evaluate the entire dataset?
>
>Perhaps some application of the "set obs 100"?  But I have not been
>successful.
>
>Thank you.
>-Mark

First, that is not officially valid syntax, though it is accepted. I 
find that it gets you 0 observations, though it does read through the 
whole file.
You probably mean _n (little n), rather than _N. (I suppose _N is . 
during the loading process, so _N <100 is false).

Officially correct syntax is,
use if _n <100 using mydata

or, better yet,
use in 1/99 using mydata

(This latter syntax is much more efficient.)
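
For reference, the difference between _n and _N, and between the if and in qualifiers, can be checked on any small dataset. A sketch, assuming the auto example dataset that ships with Stata is available through sysuse (it is not part of the original question):

```stata
sysuse auto, clear

* _N is the total number of observations currently in memory
display _N

* "in" selects observations by position
list make in 1/3

* "if" evaluates the expression for every observation;
* _n is the number of the current observation
list make if _n <= 3
```

Both list commands show the same three observations; only the way they are selected differs.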

But in any case, my experience has been that it always reads through 
the whole file. And you can tell it's doing that if you have 3000000 
observations. The reason is that, in the file structure, there are 
some important elements that come after the data (value labels, I 
believe, for example), so there is a reason to have to read the whole 
file.  At least that's how it's been as far as I know; I don't know 
if they've changed the file structure in that regard in version 10.

I may have written to Stata Corp. about this some time in the past; 
if I had my way, there would either...
  be nothing after the end of the data segment, or
  be some way to jump directly to the part of the file that lies 
after the data.
(The latter idea may or may not work, depending on file-system issues.)
In either case, I would want it to not read the whole file if you 
asked for an initial subset.

But as things stand now, we are stuck with this behavior.

The only thing you can do is, if you plan to experiment on a small 
segment of the file (and want to load it many times), load a small 
segment and save it under a different name. Thus, you go through the 
lengthy process just once.

use in 1/99 using mydata
save mydata_shortversion

Later...
use mydata_shortversion
-- should load quickly.
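
If this goes at the top of a do-file that is run many times, the slow step can be skipped whenever the short file already exists. A minimal sketch, assuming both files live in the working directory and use the names above:

```stata
* Build the small extract only if it does not already exist
capture confirm file "mydata_shortversion.dta"
if _rc != 0 {
    use in 1/99 using mydata, clear
    save mydata_shortversion
}

* From here on, work with the small file
use mydata_shortversion, clear
```

The short file will not pick up later changes to mydata, so it would need to be deleted and rebuilt if the original is updated.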

Hope this helps.
--David

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



