
From: Sergio Correia <[email protected]>
To: [email protected]
Subject: st: How to infile/insheet a huge csv that has a few problems...
Date: Tue, 19 Jul 2005 16:54:35 -0500

Hi everyone,

Sorry for bothering you again, but I have a tricky problem here (at least for me). I'm trying to load a 500 MB text file into Stata (actually, I need to open 20 files like that). I have already read the manual, the Statalist archive, the FAQ, etc., and still can't get what I want.

Characteristics of the data:

- It is in tab-separated values (TSV) format.
- I don't want to load the entire variable list (only 4 variables out of a total of 50). Note that those variables are not in consecutive order.
- The first line holds the headers. The headers CONTAIN spaces and no quotes (e.g. User Code).
- Instead of dots for missing values, there are spaces.
- When the value of the last variable on a line is missing, the file simply omits it (instead of writing a TAB plus a SPACE).

Failed attempts:

- First I tried -insheet-, but the data is too big and won't fit in memory. I suspect part of the problem is that it uses "float" even for variables that would fit in "byte".
- As the data is not in fixed format, I ruled out -infix-.
- When I tried free-format -infile- (infile1), Stata got confused by the spaces used as missing values, so I used the -filefilter- command to replace the spaces with dots.
- Afterwards, I discovered that when the last variable of observation "N" was missing, infile1 would take the first value of observation "N+1" instead of a missing value! As the manual states, that behaviour is correct, since the command allows observations to span multiple lines.
- In a last attempt, I tried infile2 with a dictionary. However, I couldn't use _skip because it skips columns, not variables. Also, even though I only want 4 variables, the names of all 50 variables need to be stated in the dictionary.

My last hope is to copy/paste names for the 50 variables and use -in- to open chunks of the entire file, dropping the unwanted variables afterwards. However, even if it works, that solution will probably be very slow, and I feel like there must be a better way.

Any ideas? Or should I go with the "split/append" logic?

Thanks a lot!

Sergio Correia

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
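For illustration only, here is one way the file characteristics listed above could be handled before the data ever reaches Stata: a minimal Python sketch that streams the file line by line, keeps only the wanted columns, pads lines whose trailing values were omitted, and writes a dot for blank values. This is not something proposed in the post itself. The function name `extract_columns`, the sample column names other than "User Code", and the file names in the comments are hypothetical, and the sketch assumes that the header line lists every column and that the wanted columns are numeric (a dot would be read as a literal string in a string column).

```python
import io


def extract_columns(src, dst, wanted, missing="."):
    """Copy only the `wanted` columns of a tab-separated stream to `dst`.

    Assumptions (not from the original post): the header line names every
    column, and the wanted columns are numeric, so "." can stand for missing.
    """
    header = src.readline().rstrip("\r\n").split("\t")
    positions = [header.index(name) for name in wanted]
    n_columns = len(header)

    # Stata variable names cannot contain spaces, so use underscores instead.
    dst.write("\t".join(name.strip().replace(" ", "_") for name in wanted) + "\n")

    for line in src:
        fields = line.rstrip("\r\n").split("\t")
        # Lines that drop trailing missing values are padded to full width.
        fields += [""] * (n_columns - len(fields))
        # Blank or space-only values become the missing-value marker.
        dst.write("\t".join(fields[i].strip() or missing for i in positions) + "\n")


# Small in-memory demonstration; real use would pass open file objects, e.g.
#   with open("big.tsv") as src, open("small.tsv", "w") as dst: ...
sample = io.StringIO("User Code\tAge\tCity Name\tScore\n1\t \tLima\t5\n2\t30\tCusco\n")
result = io.StringIO()
extract_columns(sample, result, ["User Code", "Score"])
# result.getvalue() == "User_Code\tScore\n1\t5\n2\t.\n"
```

Because the input is read one line at a time, memory use does not grow with file size. The output is a small rectangular tab-separated file with header names that contain no spaces, which -insheet- with the tab option should in principle be able to read directly.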
