
Re: st: avoiding StatTransfer: huge / large / big dataset from SAS / csv


From   Nick Winter <nw53@cornell.edu>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: avoiding StatTransfer: huge / large / big dataset from SAS / csv
Date   Tue, 26 Oct 2004 12:23:41 -0400

One approach is to use Stata to split up the giant CSV file into chunks, using the -file- command. The program pasted below should do it:

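program splitmyfiles
    * Usage: splitmyfiles infilename outputstub chunk_size
    * Lines are copied verbatim, so a header row ends up only in the first piece.
    version 8.2

    args input outstub size

    * open the big file for reading and the first piece for writing
    tempname in out
    file open `in' using `"`input'"' , read text
    qui file open `out' using `"`outstub'_1.csv"' , write text replace

    local fnum 1
    local i 1
    file read `in' line
    while !r(eof) {
        file write `out' `"`line'"' _n
        local ++i
        * after every `size' lines, close this piece and start the next one
        * (if the line count is an exact multiple of `size', the last piece is empty)
        if !mod(`i'-1,`size') {
            file close `out'
            local ++fnum
            qui file open `out' using `"`outstub'_`fnum'.csv"' , write text replace
            di "." _c
        }
        file read `in' line
    }
    file close `in'
    file close `out'

end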


The syntax would be something like:

. splitmyfiles rawdata.csv piece 1000000

This would take "rawdata.csv" and split it into piece_1.csv, piece_2.csv, etc., each with 1 million lines.
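Once the pieces exist, each one can be read, shrunk with -compress-, and stacked with -append-. The following is only a sketch under some assumptions: the program name loadpieces is made up for illustration, the pieces are assumed to have no header row (so -insheet- names the variables v1, v2, ...), and memory must already be set high enough to hold the combined data.

```stata
program loadpieces
    * Hypothetical helper: loadpieces outputstub number_of_pieces
    * Assumes the pieces contain no header row (variables become v1, v2, ...)
    version 8.2
    args stub n

    * read each piece, shrink storage types, and save it as a .dta file
    forvalues k = 1/`n' {
        insheet using `"`stub'_`k'.csv"' , comma nonames clear
        compress
        qui save `"`stub'_`k'"' , replace
    }

    * stack the saved pieces into one dataset in memory
    use `"`stub'_1"' , clear
    forvalues k = 2/`n' {
        append using `"`stub'_`k'"'
    }
end
```

With three pieces this would be called as

. loadpieces piece 3

If the last piece happens to be empty, it may need to be left out of the count.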

There may be better ways, of course.
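One such variation, if the CSV starts with a row of variable names, is to copy that row into every piece so each can be read on its own. This is a rough sketch rather than a tested replacement: the name splitwithheader is made up, and it assumes the first line of the file is the header. It also opens each piece only when there is a line to put in it, so no empty piece is left at the end.

```stata
program splitwithheader
    * Hypothetical variant: splitwithheader infilename outputstub chunk_size
    * Assumes the first line of the input holds the variable names
    version 8.2
    args input outstub size

    tempname in out
    file open `in' using `"`input'"' , read text
    file read `in' header

    local fnum 0
    local i 0
    file read `in' line
    while !r(eof) {
        * start a new piece every `size' data lines, header first
        if !mod(`i',`size') {
            if `fnum' > 0 {
                file close `out'
            }
            local ++fnum
            qui file open `out' using `"`outstub'_`fnum'.csv"' , write text replace
            file write `out' `"`header'"' _n
        }
        file write `out' `"`line'"' _n
        local ++i
        file read `in' line
    }
    file close `in'
    if `fnum' > 0 {
        file close `out'
    }
end
```

Each piece then holds the header plus up to chunk_size data lines, so -insheet- can pick up the variable names from every file.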

--NW






At 11:23 AM 10/26/2004 -0400, you wrote:

Hello,

I am trying to get a ~3 GB .csv dataset into Stata. I don't think it
will be anywhere near 3 GB once in Stata, but there it is, on my
computer, taunting me. It is too big to open even using a text editor.
When I set my memory to 750M, I am able to read nearly 7 million
observations into Stata, and then it's full.

The original dataset is actually in EBCDIC. I used a very simple SAS
routine to read the zoned decimal data (that is key), and then
exported the dataset to a .csv file. I have pretty much _no_
experience with SAS whatsoever. The only reason I got involved with it
is because it can read zoned/packed decimal data.


I believe I have four options to get the data into Stata, all of which
are missing a vital step that I am not sure how to do, or have
available:

1) Export the data to csv files from SAS in segments, i.e. the 1st
million obs, the 2nd million obs, etc. Then import each of these into
Stata and merge. I am not sure how to tell SAS to sort and then export
based on a criterion, however.

2) Do the analogous method in Stata, but using -infile-. The problem
is that -infile- with [in] requires the data to be in a fixed format.
As far as I know, SAS can only export delimited. If I could export the
data from SAS in a fixed format, that would work.

3) I have seen various work-arounds in Statalist/FAQs for large
datasets using ODBC. I do not know anything about ODBC, but if it's the
only way to go, I will learn.

4) I know about StatTransfer, but I am not the one making decisions
about buying new software/licenses, and don't particularly want to go
through that if I don't have to.


Any guidance, suggestions, or clever responses are very much appreciated.

Regards,
Dan
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
--------------------------------------------------------
Nicholas Winter 607.255.8819 t
Assistant Professor 607.255.4530 f
Department of Government nw53@cornell.edu e
308 White Hall falcon.arts.cornell.edu/nw53 w
Cornell University
Ithaca, NY 14853-4601




