Statalist


Re: st: using the first n observations in a dataset w/o evaluating the whole thing?


From   David Kantor <kantor.d@att.net>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: using the first n observations in a dataset w/o evaluating the whole thing?
Date   Thu, 03 Apr 2008 20:19:34 -0400

At 07:31 PM 4/3/2008, Mark Rodini wrote:
Greetings.

Suppose I have a large Stata dataset (e.g. 3,000,000 observations) and I
only wish to read in the first, say, 100 observations.

I have tried the code, which works:

use mydata if ( _N<100 )

However, evidently, this code goes through ALL 3 million observations to
evaluate the expression in parentheses, which can be very time consuming
(and sort of defeats the purpose).  Is there a way to only read the
first 100 observations without having to evaluate the entire dataset?

Perhaps some application of the "set obs 100"?  But I have not been
successful.

Thank you.
-Mark

First, that is not officially valid syntax, though it is accepted. I find that it gets you 0 observations, even though it does read through the whole file.
You probably mean _n (little n, the current observation number), rather than _N (the total number of observations). (I suppose _N is . during the loading process, so _N < 100 is false.)

Officially correct syntax is,
use if _n <100 using mydata

or, better yet,
use in 1/99 using mydata

(This latter syntax is much more efficient.)

But in any case, my experience has been that it always reads through the whole file, and with 3,000,000 observations you can tell it's doing that from the delay. The reason is that, in the file structure, there are some important elements that come after the data (value labels, I believe, for example), so there is a reason to have to read the whole file. At least that's how it's been as far as I know; I don't know if they've changed the file structure in that regard in version 10.
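
One way to check this on your own machine is to time a subset load against a full load. This is only a rough sketch: it assumes your Stata version provides the timer command and that a file named mydata.dta is in the current directory. Operating-system file caching can make whichever load runs second look faster, so treat the numbers as indicative only.

```stata
* Rough timing comparison; "mydata" is the dataset name used in this thread.
timer clear

timer on 1
use in 1/99 using mydata, clear    // load only the first 99 observations
timer off 1

timer on 2
use mydata, clear                  // load the whole file
timer off 2

timer list                         // elapsed seconds for timers 1 and 2
```

If the two times are similar, the subset load is still passing through the whole file.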

I may have written to StataCorp about this some time in the past; if I had my way, there would either
 - be nothing after the end of the data segment, or
 - be some way to jump directly to the part of the file that lies after the data.
(The latter idea may or may not work, depending on file-system issues.)
In either case, I would want it to not read the whole file if you asked for an initial subset.

But as things stand now, we are stuck with this behavior.

The only thing you can do is, if you plan to experiment on a small segment of the file (and want to load it many times), load a small segment and save it under a different name. Thus, you go through the lengthy process just once.

use in 1/99 using mydata
save mydata_shortversion

Later...
use mydata_shortversion
-- should load quickly.
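
If this goes at the top of a do-file that gets rerun many times, the two steps above can be combined so that the slow load happens only when the short file does not exist yet. A minimal sketch, assuming the file names used above and that both files live in the current directory:

```stata
* Create the small extract only if it is not already on disk.
capture confirm file "mydata_shortversion.dta"
if _rc {
    * File not found: pay the cost of reading the big file once.
    use in 1/99 using mydata, clear
    save mydata_shortversion
}

* From here on, work with the small file.
use mydata_shortversion, clear
```

Delete mydata_shortversion.dta whenever mydata changes, since this check does not notice that the original has been updated.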

Hope this helps.
--David
