Statalist The Stata Listserver



st: Action of -preserve-


From   "David Elliott" <dcelliott@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   st: Action of -preserve-
Date   Mon, 5 Feb 2007 23:34:17 -0400

I'm writing a routine to infile about 20K scanned questionnaires,
which the scanning service saves one questionnaire per file.
Obviously a bit of appending will be needed.

My question concerns how -preserve- works.  When a dataset in memory
is -preserve-d, is it simply written to disk as a temporary file, or
is it tucked into a corner of Stata's memory?

Consider the following:

/* Edit the following directory to the root of the scans */
set more off
capture file close scan
cd "c:\data\scans"
local firsttime 1
// list all scan filenames; the list is not named *.txt so -dir- does not pick it up
!dir *.txt /s/b >scans.lst
tempname scan
tempfile next
file open `scan' using scans.lst, read text
file read `scan' line
while !r(eof) {
	if `firsttime' {
		// the first scan seeds the accumulating dataset in memory
		infile using survey_dict, using(`"`macval(line)'"') clear
		local firsttime 0
		}
	else {
		// set the accumulated data aside, read one scan, then append it
		preserve
		infile using survey_dict, using(`"`macval(line)'"') clear
		save `next', replace
		restore
		append using `next'
		}
	n di `"read `macval(line)'"'
	file read `scan' line
	}
file close `scan'
save survey_2007, replace

versus

/* Edit the following directory to the root of the scans */
set more off
capture file close scan
cd "c:\data\scans"
local firsttime 1
// list all scan filenames; the list is not named *.txt so -dir- does not pick it up
!dir *.txt /s/b >scans.lst
tempname scan
file open `scan' using scans.lst, read text
file read `scan' line
while !r(eof) {
	infile using survey_dict, using(`"`macval(line)'"') clear
	if `firsttime' {
		// nothing has been saved yet, so there is nothing to append
		local firsttime 0
		}
	else {
		// pull the accumulated dataset back in from disk
		append using survey_2007
		}
	save survey_2007, replace
	n di `"read `macval(line)'"'
	file read `scan' line
	}
file close `scan'

In the first case, if -preserve- keeps the data in memory, then only
the file to be appended needs to be written to disk and then appended
after the -restore-.  In the second case, each newly infiled scan
becomes the dataset in memory, the steadily accumulating dataset is
appended to it from disk, and the combined result is saved again.  The
file I/O overhead therefore grows with every iteration as the dataset
gets larger.

I have used -preserve- and -restore- with large (>200MB) files before,
and my impression is that the first -preserve- takes longer than
subsequent ones.  That suggests to me either that Stata does something
in memory with a -preserve-, or that what I am seeing is an effect of
caching by the OS and/or the HDD controller.
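
One way to turn that impression into numbers is to time two successive
-preserve- calls with -timer-.  The following is only a minimal sketch:
the synthetic dataset and its size are assumptions made for
illustration, and the timings show only whether the second -preserve-
is faster, not whether Stata or the OS cache is responsible.

```stata
* Hypothetical test data: 5 million observations of one double (about 40MB)
clear
set obs 5000000
generate double x = _n

* Time the first and second -preserve- separately (timers 1 and 2)
timer clear
forvalues i = 1/2 {
	timer on `i'
	preserve
	timer off `i'
	restore
}

* Report elapsed seconds for each timer
timer list
```

Running the same do-file again after a reboot, or after reading other
large files, might help separate a Stata effect from a disk-cache
effect.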

People rarely infile tens of thousands of files, but when faced with
this situation, taking some time to achieve efficiency can pay off.

--
David Elliott