Re: st: Problems in load large data or read several fields from CSV data


From   David Elliott <[email protected]>
To   [email protected]
Subject   Re: st: Problems in load large data or read several fields from CSV data
Date   Wed, 21 Jan 2009 11:16:39 -0400

Here is a revised version of my file-chunking program, chunky.ado:

*----------------- Begin listing ----------------*
program define chunky, rclass
version 8.0

*! version 1.0.0  2008.04.26
*! version 1.0.1  2009.01.20
*!
*! by David C. Elliott
*! Text file chunking algorithm
*!
*! syntax:
*! chunky using filename [, index(#) chunk(#) saving(filename[, replace]) list]
*! index() is the starting line in the file to be read
*! chunk() is the number of lines to be read
*! saving() is the file name of the chunk to be saved;
*!   defaults to chunk.txt
*! list displays a line-by-line listing of the file to the screen;
*!   used to display the first line or in debugging
*! returns r(index) as the index of the last line read + 1
*! returns r(eof) as 1 if index() was past the end of file, else 0
*!
*! note - this works on text files only
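*!
*! example call (file names here are illustrative only):
*!   chunky using big.txt, index(1) chunk(1000) saving(piece.txt, replace)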

syntax using [, Index(numlist max=1 >0 integer) ///
    Chunk(numlist max=1 >0 integer) Saving(string) List]
local infile `using', read

if `"`saving'"'=="" {
    local savefile using chunk.txt, write replace
    }
else {
    local 0 `saving'
    syntax [anything(name=savefile id="file to save")] [, REPLACE]
    local savefile using `savefile', write `replace'
    }

tempname in out
file open `in' `infile'
file open `out' `savefile'

if "`index'"=="" {
    local index 1
    }
if "`chunk'"=="" {
    local chunk 5
    }
if "`list'"=="list" {
    local list        // empty prefix lets the di line below execute
    }
else {
    local list *      // a "*" prefix comments out the di line below
    }
local end = `index' + `chunk'
local i 0

while `i++'<`index' {  // Move pointer to index line
    file read `in' line
    if r(eof) != 0 {
        di _n "{err:Index `index' is past end of file}" ///
            _n "{err:Last line attempted was `i'}" _n
        file close `in'   // close both handles before the early exit
        file close `out'
        return scalar eof = 1
        exit
        }
    }

while r(eof) == 0 & `index' < `end' {  // copy lines until chunk is full or eof
    file write `out' `"`macval(line)'"' _n
    `list' di in ye `index' `" `line'"'
    local ++index
    file read `in' line
    }

file close `in'
file close `out'

return scalar index = `index'
return scalar eof = 0

end

*-----------------End listing ----------------*
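
For anyone who wants to try -chunky- on its own before embedding it in
a do-file, here is a minimal sketch (the file names are placeholders):

*----------------- Begin listing ----------------*
// copy lines 1-1000 of big.txt into piece1.txt
chunky using big.txt, index(1) chunk(1000) saving(piece1.txt, replace)
di r(index)   // index of the last line read + 1
di r(eof)     // 1 only if index() was past the end of file
// pick up where the previous call left off
chunky using big.txt, index(`r(index)') chunk(1000) saving(piece2.txt, replace)
*-----------------End listing ----------------*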

And here is a revised version of a do-file that uses it to chunk and
reassemble a very large file, keeping only certain variables:

*----------------- Begin listing ----------------*
// Do file using chunky.ado to piece together
//   parts of a very large file
// Pay particular attention to the edit points
//   marked with ****
//   for infile and chunksize and keep

**** edit VeryLargeFile.csv on the following line to your filename
local infile VeryLargeFile.csv

// edit to size of chunk you want
local chunksize 100000

// Get just the first line if it has variable names
chunky using `"`infile'"', index(1) chunk(1) ///
  saving("varnames.csv",replace) list

local chunk 1
local nextrow 2
tempfile chunkfile chunkappend
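// chunkfile holds each raw chunk; chunkappend holds varnames + chunk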
while !`r(eof)' {  // stop once the end of file is reached
    chunky using `"`infile'"', ///
        index(`r(index)') chunk(`chunksize') saving("`chunkfile'", replace)
    if `r(eof)' {
        continue, break
        }
    else {
        local nextrow `=`r(index)'+1'
        }
    // shell command to prepend the varnames line to the chunk
    !copy varnames.csv+`chunkfile' `chunkappend'
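    // note: "copy a+b c" is Windows-specific shell syntax; on a Unix-like
    //   system an (untested) equivalent would be:
    //   !cat varnames.csv "`chunkfile'" > "`chunkappend'"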

    **** edit the following to conform to your csv delimiter
    insheet using "`chunkappend'", clear comma names

    **** edit the following to keep specific variables
    keep *

    // save part of file and increment chunk count
    save part`chunk++', replace
    }
// Append parts together
local nparts `--chunk'  // undo the final increment to get the part count
use part1, clear
forvalues i=2/`nparts' {
    append using part`i'
    **** uncomment the following line to erase part2.dta...part##.dta
    // erase part`i'.dta
    }
describe
// You will probably want to save part1.dta to a different name
//   once all the parts are appended to it.
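// For example (the name combined.dta is just illustrative):
// save combined, replace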
*-----------------End listing ----------------*


Typical output (in this run, 28 of the 88 variables were kept):

(88 vars, 100000 obs)
file part1.dta saved
(88 vars, 100000 obs)
file part2.dta saved
(88 vars, 100000 obs)
file part3.dta saved
(88 vars, 100000 obs)
file part4.dta saved
(88 vars, 100000 obs)
...

Contains data from part1.dta
  obs:      1,244,282
 vars:             28                          21 Jan 2009 10:50
 size:    562,519,688 (27.8% of memory free)

As you can see, truly large datasets can be processed in this manner,
entirely from within Stata.

As an aside: might this ado be useful enough to warrant a help file and
a submission to SSC?

DC Elliott