
Re: st: Missed opportunities for Stata I/O


From   Daniel Feenberg <[email protected]>
To   [email protected]
Subject   Re: st: Missed opportunities for Stata I/O
Date   Mon, 9 Sep 2013 20:07:42 -0400 (EDT)


On Mon, 9 Sep 2013, David Kantor wrote:

At 06:18 PM 9/8/2013, Daniel Feenberg wrote, among many other interesting things:

I should note that the -in- qualifier isn't as good as it could be. That is:

  use med2009 in 1/100

doesn't stop reading at record 100. Instead it seems to read all 143 million records, but then drops the records past 100.

I have noticed this problem myself when loading large files, though not quite that large. I understand that the reason it reads the entire file is that the file format puts the value labels at the end. The file format has several segments, of which the data is the second-to-last; the final segment holds the value labels. (See -help dta-.) So to load a file properly, the -use- routine must read through the entire file. I think that was a poor choice. (StataCorp, please pay attention.) It would have been preferable to place the data as the final segment, so that all the ancillary information could be read before the data, and the command...
       use med2009 in 1/100
would be able to quit reading after the 100th record; it should take negligible time.

Alternatively, without changing the file format, it may be possible to calculate where the value labels are located and skip directly to that location; whether this is possible may depend on the operating system. (The recent trend has been to view a file as a stream. This has some advantages, but has cast aside features such as the ability to read a specified location directly.)
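For what it's worth, Stata's own -file- command can reposition within a binary file via -file seek-, so direct access is at least available at the Stata level; whether -use- could exploit it internally is a separate question. A rough sketch, with a made-up byte offset standing in for the computed location of the value labels:

  file open fh using med2009.dta, read binary
  file seek fh 1048576     // hypothetical offset of the value-label segment
  file read fh %8s tag     // read 8 bytes there into local macro `tag'
  file close fh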


We like to read compressed data from a pipe - so random access to the using file would be a great disadvantage to us. Other users have used this feature for encryption, and it has many other uses. I would rather see a "nolabel" option that would suppress reading the labels. -Append- already has such an option.
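As a sketch of what the documented -append- option already allows (appending into an empty dataset in memory is permitted):

  clear
  append using med2009, nolabel    // load the data, skipping value-label definitions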


Note that the assessment that -use- "then drops the records past 100" may be a bit off the mark. I believe that it stores only the first 100; the rest are read but ignored. Also, Daniel's remark is not so much about the -in- qualifier in general, but about the -in- qualifier in the -use- command. In all other contexts -- when addressing data already in memory -- it is very effective.

Yes, that was a thinko - no memory is used by the unused records.


As long as this problem persists, and if you frequently need that initial segment (say, for testing of code), then, at the risk of telling you what you already know, the thing to do is to run that command once and save the results in a separate file with a distinct name (e.g., med2009_short).
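In Stata terms, using the names from this thread, that one-time step is just

  use med2009 in 1/100, clear
  save med2009_short          // small extract for testing code against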


There is a workaround for every problem, of course. In our SAS implementation of this system we maintain .01%, 5%, 20%, and 100% subsets to satisfy different levels of user patience, but it would be nice to avoid that extra complication. In fact, every comment in my posting was about avoiding a complication.
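The Stata analogue of those subset files, sketched with made-up filenames, would be something like

  use med2009, clear
  set seed 20130909        // arbitrary seed, for a reproducible draw
  sample 5                 // keep a 5% random subset of the observations
  save med2009_5pct, replace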

I posted a slightly revised version of my comments as the latest entry
in my collection of pieces on working with large datasets. It is at

  http://www.nber.org/stata/efficient

It now includes a link to David's insightful explanation, from a long-ago Statalist thread, of why -merge- takes so much memory.

Daniel Feenberg
NBER


HTH
--David

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/
