The Stata listserver

st: How to infile/insheet a huge csv that has a few problems...

From   Sergio Correia <[email protected]>
To   [email protected]
Subject   st: How to infile/insheet a huge csv that has a few problems...
Date   Tue, 19 Jul 2005 16:54:35 -0500

Hi everyone,

Sorry for bothering you again but I have a tricky problem here (at
least for me).

I'm trying to load a 500 MB text file into Stata (actually, I need to
open 20 files like that). I have already read the manual, the Statalist
archive, the FAQ, etc., and still can't get what I want.

Characteristics of the data:
- It is in tab-separated values (TSV) format.
- I don't want to load the entire variable list (only 4 variables from
a total of 50). Note that those vars are not in consecutive order.
- The first line is used for the headers. The headers CONTAIN spaces
and no quotes (e.g., User Code).
- Instead of dots for missing values, there are spaces.
- When a value is missing in the last variable from a line, the file
just omits the missing value (instead of putting a TAB plus a SPACE).

Failed Attempts:
- First I tried to use -insheet-, but the data is too big and won't
fit in memory. I suspect part of the problem is that "float" is used
even for variables that could be stored as "byte".
- As the data is not in fixed format, I ruled out -infix-.
- When I tried to use the free-format -infile- (infile1), Stata got
confused by the spaces used as missing values. So I used the
-filefilter- command to replace the spaces with dots.
- Afterwards, I discovered that when there was a missing value in the
last variable in observation "N", infile1 would use the first value of
observation "N+1" instead of the missing value! But as the manual
states, that behaviour is correct, since the program allows for
observations to span multiple lines.
- In a last attempt, I tried using infile2 with a dictionary. However,
I couldn't use _skip because it would skip columns, not variables.
Also, even if I only wanted 4 variables, the names of all 50 variables
would need to be stated in the dictionary.
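
To illustrate the free-format problem described above, here is a
simplified model written in Python. It is not Stata's actual parser,
and the name read_free_format is made up for this sketch. It only
assumes that values are whitespace-separated and that line breaks are
ignored when filling observations:

```python
def read_free_format(text, n_vars):
    """Group whitespace-separated tokens into observations of n_vars values.

    Line breaks are treated like any other whitespace, so an observation
    may span several lines (a simplified model of free-format reading).
    """
    tokens = text.split()
    return [tokens[i:i + n_vars] for i in range(0, len(tokens), n_vars)]


# Three variables per observation; the first line omits its last value.
sample = "1 2\n4 5 6\n"
observations = read_free_format(sample, 3)
print(observations)  # [['1', '2', '4'], ['5', '6']]
```

The first observation picks up "4" from the second line, which is the
same spill-over I saw with infile1.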

My last hope is to copy/paste names for the 50 vars, and to use -in-
to open chunks of the entire file, dropping the unwanted variables
afterwards. However, even if it works, the solution will probably be
very slow, and I feel like there must be a better way.
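
One possible alternative would be to shrink each file before it ever
reaches Stata. The following is only a rough sketch in Python, not a
Stata solution. The function name keep_columns, the "." missing marker
and the column names in the demonstration are all assumptions made for
illustration. It reads one line at a time, keeps the wanted columns by
header name, pads rows whose last value was omitted, and turns blank
cells into dots:

```python
import io


def keep_columns(src, dst, wanted, missing="."):
    """Copy only the `wanted` columns of a tab-separated stream to `dst`."""
    # Read the header line and locate each wanted column by name.
    header = src.readline().rstrip("\r\n").split("\t")
    positions = [header.index(name) for name in wanted]
    # Header names contain spaces, so write underscore versions instead.
    dst.write("\t".join(name.replace(" ", "_") for name in wanted) + "\n")
    for line in src:
        line = line.rstrip("\r\n")
        if line == "":
            continue  # skip completely empty lines
        cells = line.split("\t")
        # Pad rows whose trailing missing value was omitted.
        cells += [""] * (len(header) - len(cells))
        # Blank or space-only cells become the missing marker.
        dst.write("\t".join(cells[i].strip() or missing for i in positions) + "\n")


# Tiny in-memory demonstration with hypothetical column names.
src = io.StringIO("User Code\tAge\tCity\tScore\n1\t \tLima\t9\n2\t30\tCusco\n")
dst = io.StringIO()
keep_columns(src, dst, ["User Code", "Score"])
result = dst.getvalue()
print(result)
```

With real data, src and dst would be file objects returned by open().
Since only one line is held at a time, memory use should not depend on
the file size, and the reduced file might then be small enough for
-insheet-.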

Any ideas? Or should I go with the "split/append" logic?

Thanks a lot!

Sergio Correia
