Re: st: CSV read with limits

From	David Elliott <[email protected]>
To	[email protected]
Subject	Re: st: CSV read with limits
Date	Sun, 13 Mar 2011 14:07:04 -0300

Thank you for your interest and mention of -chunky- in this context.
I believe -chunky- will perform the tasks requested in Mike Lacy's
post, or that the users can set up the necessary conditions.

>>1) allow reading of just one chunk, with size specified by line length;
The -peek()- option was designed to do this.  While intended to let
one look at the header and an arbitrary number of lines, one can
specify a full chunk after doing a bit of math.  The -analyze- option
was designed to give some chunksize options and give the expected
number of chunks.
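
As a rough sketch of that workflow (this assumes -analyze- and -peek()- combine with -using- the same way -chunksize()- does in Mike's example below, and that -peek()- takes a line count; the value 5 and the path are only illustrative, so check -help chunky- for the exact syntax):

```stata
// Sketch only: option syntax is assumed from the description above
// Report candidate chunk sizes and the expected number of chunks
chunky using c:\temp\big.csv, analyze
// Look at the header plus a handful of lines
chunky using c:\temp\big.csv, peek(5)
```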

>>2) use tempfiles to store the pieces.
This would actually be the user's responsibility at present.  If you
have done the -analyze- and know the number of chunks, you could do a
-tempfile chunk001.csv chunk002.csv... chunknnn.csv- beforehand (I use
"chunk" as the default stubname - the user could specify another).

I will give the tempfiles issue some further thought, however, and
consider the logistics of how one might use tempfiles.  Normally I
would not want to use -tempfile- except once I had my import routine
debugged and was convinced that I had properly imported all the rows
of the original source file - one of the reasons to leave behind
permanent chunks at the beginning.

I welcome further discussion on this.  -chunky- was originally a
really pitifully slow clunky routine that I had created for my own
use, but user feedback spurred me to redevelop it using Mata to the
point where it is almost respectable. Those of us still toiling within
the constraints of Win32 systems will continue to need to chunk,
infile, drop, save and append to create workable datasets from
multi-GB dumps.  I welcome any further user suggestions and bug
reports.
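
For what it is worth, the chunk, read, keep, save and append cycle can be sketched with built-in commands only. The file names piece0001.txt to piece0003.txt, the count of three chunks, and the variable list are assumptions carried over from Mike's example, not something -chunky- guarantees, and -insheet- stands in for whatever import routine fits the data:

```stata
// Assumes chunks piece0001.txt ... piece0003.txt exist in the working directory
clear
tempfile combined part
// Start from an empty dataset so every chunk can be appended the same way
save "`combined'", emptyok
forvalues i = 1/3 {
    // Build the zero-padded suffix: 1 -> 0001
    local n : display %04.0f `i'
    insheet using "piece`n'.txt", clear comma names
    // Keep only the wanted variables before appending, to limit memory use
    keep x1 x100 x150 x200
    save "`part'", replace
    // Reload what has been accumulated and add the new piece at the end
    use "`combined'", clear
    append using "`part'"
    save "`combined'", replace
}
```

Only the intermediate datasets are tempfiles here; the chunk files themselves stay on disk, so the row counts can still be checked against the source.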

Regards,

DC Elliott

On 12 March 2011 19:24, Mike Lacy <[email protected]> wrote:
>
> Argyn Kuketayev <[email protected]> wrote:
>
> >I have a CSV file (comma separated). I need Stata to read the 1st line
> >with variable names, then import only selected variables. Also I want
> >to limit the number of observations to read.
> >
> >I can't figure out how to do it in Stata. In SAS it would be easy with
> >DATA, var list and OBS option.
> >
> >thanks
> >
> >- --
> >Argyn Kuketayev
>
> I agree with Argyn that this is more difficult than it should be, as I've faced similar problems myself.
> Considering that many large data sets are distributed as csv, having an option on insheet to read a limited number of lines and/or variables would be natural.  That being said, here's a way to solve Argyn's problem with available tools.
>
> The user written program -chunky- can break CSV files into chunks, while retaining the header on each one.  It can be used easily (if a bit inefficiently) to address the current problem, as follows:
>
>
> // Make a large-ish CSV file to work with
> clear
> set obs 10000
> forval i = 1/200 {
>   gen x`i' = 100 * runiform()
> }
> outsheet using c:\temp\big.csv, comma names nolabel replace
> //
> //
> // Get the user-written program -chunky- and use it to break up the file into chunks
> ssc install chunky
> // ********** Real work starts here
> cd c:\temp  // need somewhere to put the chunks
> // choose # of bytes in each chunk; larger is faster
> local size = 10000000
> chunky using c:\temp\big.csv, chunksize(`size') header(include) stub(piece) replace
> insheet using "piece0001.txt", clear comma names  // chunky names files consecutively
> keep x1 x100 x150 x200
> keep in 1/500 // retain the desired lines
> //
> foreach f in `s(filelist)' {
>   erase `f'
> }
>
> The preceding would be less clumsy if -chunky- had options to
> 1) allow reading of just one chunk, with size specified by line length;
> 2) use tempfiles to store the pieces.
>
> These are not complaints, just some thoughts about useful options that I suspect are consistent with the way -chunky- works.
>
> Regards,
>
>
>
>
> =-=-=-=-=-=-=-=-=-=-=-=-=
> Mike Lacy, Assoc. Prof.
> Soc. Dept., Colo. State. Univ.
> Fort Collins CO 80523 USA
> (970)-491-6721
