Statalist



Re: st: Use a few observations from a tab-delimited or csv file


From   Maarten buis <maartenbuis@yahoo.co.uk>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Use a few observations from a tab-delimited or csv file
Date   Wed, 20 Aug 2008 15:59:42 +0100 (BST)

It should not be too much of a problem: using the formulae from
http://www.stata.com/support/faqs/data/howbig.html you can see that if
-insheet- stores your variables as floats your dataset should be about
672 MB, which is too much for your computer, but if the variables can be
stored as bytes the size reduces to 168 MB, which is well within the
limit of your computer. If your dataset contains many dummy variables
you could import the data easily using Stat/Transfer, or you could split
the data up into two parts, -insheet- them separately, -compress- them,
and -merge-/-append- the parts to create one dataset. Even before you
start considering this, you should seriously think about whether you
really need all 2200 variables...
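
A rough sketch of the arithmetic and of the split-and-append route, not
tested on your data. The file names part1.csv, part2.csv, part1, part2
and full are hypothetical, and I am assuming the csv has already been cut
into two sets of rows outside Stata, each with the header line repeated.
The size check uses only observations x variables x bytes per value and
ignores any small overhead, so it lands close to the figures above rather
than exactly on them.

```stata
* approximate size in MB if every variable is a float (4 bytes each)
display 80000*2200*4/1024^2
* approximate size in MB if every variable is a byte (1 byte each)
display 80000*2200*1/1024^2

* part1.csv and part2.csv are hypothetical names for the two halves
clear
set memory 600m

* read the first half, shrink the storage types, and save it
insheet using part1.csv, comma clear
compress
save part1, replace

* same for the second half
insheet using part2.csv, comma clear
compress
save part2, replace

* stack the two compressed halves into one dataset
use part1, clear
append using part2
save full, replace
```

If a half still does not fit, cut the file into more parts. -append- is
the right tool when the parts are split by observations; if you split by
variables instead, you would need -merge- and an identifier variable
present in every part.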

-- Maarten

--- "Todd D. Kendall" <todddavidkendall@gmail.com> wrote:

> Dear Statalisters,
> 
> I have a file that is currently in csv format (or I could easily
> convert it to tab-delimited).  It is fairly large: roughly 80,000
> observations and 2,200 variables.
> 
> In fact, it is too large to fit into Stata (I am running Stata 9.2 on
> a Windows XP machine with 1 GB of RAM).  The maximum memory I can
> allocate to Stata is -set mem 636m-.  When I try to simply insheet the
> file at this setting, I get only 16,276 observations read in -- not
> anywhere close to the whole file, so I don't think there are any easy
> tweaks to make this work.
> 
> However, it turns out that, for roughly the last 2,000 variables, I
> really don't need every single variable; instead, I just need a few
> summary statistics calculated over these 2,000 variables (e.g., the
> mean or standard deviation).  My idea is to write a simple do file
> that loads in, say, the first 15,000 observations, computes the mean
> and standard deviation of the 2,000 variables, then drops these
> variables and saves as a .dta file.  I would then repeat on the next
> 15,000 observations, and so on.  Then I could just append all the
> little files together, and I would assume I could fit this into Stata,
> as it would only have around 200 variables instead of 2,200.
> 
> My problem is that insheet doesn't work with "in" -- i.e., I can't
> write -insheet filename.csv in 1/15000-.  Alternatively, if I could
> convert the file from csv into a fixed format, I could write a
> dictionary and use infix, but my Google search for how to convert a
> csv file into a fixed-column file has come up pretty dry.
> 
> Am I barking up the wrong tree completely here, or am I missing
> something obvious?  I greatly appreciate any suggestions.
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
> 


-----------------------------------------
Maarten L. Buis
Department of Social Research Methodology
Vrije Universiteit Amsterdam
Boelelaan 1081
1081 HV Amsterdam
The Netherlands

visiting address:
Buitenveldertselaan 3 (Metropolitan), room Z434

+31 20 5986715

http://home.fsw.vu.nl/m.buis/
-----------------------------------------



