"Sergiy Radyakin" <serjradyakin@gmail.com>

statalist@hsphsun2.harvard.edu

Re: st: Use a few observations from a tab-delimited or csv file

Wed, 20 Aug 2008 11:06:15 -0400

Dear Todd,

option 1 - consider 64-bit Stata if you are working with large files

option 2 - split the file into manageable chunks (say, 10,000 observations each), read them in separately, and write each one out in Stata format, keeping only the variables that you need

option 3 - split the file "vertically" by reading only the required variables first

Stat/Transfer (if available) might be a good option too. It does not load all the data into memory but rather processes it in small chunks, and it lets you select particular variables, so you can prepare a batch file and run it from Stata when needed.

Regards,
Sergiy Radyakin

On 8/20/08, Todd D. Kendall <todddavidkendall@gmail.com> wrote:
> Dear Statlisters,
>
> I have a file that is currently in csv format (or I could easily
> convert it to tab-delimited). It is fairly large: roughly 80,000
> observations and 2,200 variables.
>
> In fact, it is too large to fit into Stata (I am running Stata 9.2 on
> a Windows XP machine with 1 GB of RAM). The maximum memory I can
> allocate to Stata is -set mem 636m-. When I try to simply insheet the
> file at this setting, I get only 16,276 observations read in -- not
> anywhere close to the whole file, so I don't think there are any easy
> tweaks to make this work.
>
> However, it turns out that, for roughly the last 2,000 variables, I
> really don't need every single variable; instead, I just need a few
> summary statistics calculated over these 2,000 variables (e.g., the
> mean or standard deviation). My idea is to write a simple do file
> that loads in, say, the first 15,000 observations, computes the mean
> and standard deviation of the 2,000 variables, then drops these
> variables and saves as a .dta file. I would then repeat on the next
> 15,000 observations, and so on. Then I could just append all the
> little files together, and I would assume I could fit this into Stata,
> as it would only have around 200 variables instead of 2,200.
>
> My problem is that insheet doesn't work with "in" -- i.e., I can't
> write -insheet filename.csv in 1/15000-. Alternatively, if I could
> convert the file from csv into a fixed format, I could write a
> dictionary and use infix, but my Google search for how to convert a
> csv file into a fixed-column file has come up pretty dry.
>
> Am I barking up the wrong tree completely here, or am I missing
> something obvious? I greatly appreciate any suggestions.
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
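
As an illustration of option 2 (and, through the optional column list, option 3), here is a minimal sketch of splitting the csv file outside Stata. It assumes Python is available on the machine, which the thread does not say; the function name split_csv, the keep argument and the chunk_001.csv naming scheme are all made up for this example. The file is read one row at a time, so it never has to fit in memory.

```python
import csv
from pathlib import Path


def split_csv(src, out_dir, rows_per_chunk=10000, keep=None):
    """Split src into CSV chunks of at most rows_per_chunk data rows.

    Every chunk repeats the header row. If keep is a list of column
    names, only those columns are written. Returns the chunk paths.
    (Hypothetical helper; not part of Stata or Stat/Transfer.)
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    with open(src, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        # Positions of the columns to keep; all of them when keep is None.
        if keep is None:
            idx = list(range(len(header)))
        else:
            idx = [header.index(name) for name in keep]
        out, writer, rows_in_chunk = None, None, 0
        for row in reader:
            # Start a new chunk file at the first row and whenever
            # the current chunk is full.
            if writer is None or rows_in_chunk == rows_per_chunk:
                if out is not None:
                    out.close()
                path = out_dir / f"chunk_{len(paths) + 1:03d}.csv"
                out = open(path, "w", newline="")
                writer = csv.writer(out)
                writer.writerow([header[i] for i in idx])
                paths.append(path)
                rows_in_chunk = 0
            writer.writerow([row[i] for i in idx])
            rows_in_chunk += 1
        if out is not None:
            out.close()
    return paths
```

Calling split_csv("filename.csv", "chunks", rows_per_chunk=10000) would write chunks/chunk_001.csv, chunks/chunk_002.csv and so on. Each chunk should then be small enough to read with -insheet using chunks/chunk_001.csv, comma clear-, after which the summary statistics can be computed, the unneeded variables dropped, and the result saved as a .dta file to be appended later.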

