Statalist



Re: st: Does -insheet- read data incorrectly?


From   Johannes Geyer <JGeyer@diw.de>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Does -insheet- read data incorrectly?
Date   Fri, 13 Mar 2009 12:12:46 +0100

I don't know whether this can be considered a bug. It seems that quotes 
take priority over the delimiter. You could replace them with -filefilter-. 
This may not be the most efficient way, but it keeps the data 
structured. I was not able to keep the quotes directly, so I first replace 
them with a placeholder string and then restore them with the string 
function -subinstr-. When I tried to do it with -filefilter- in one step, 
I lost the quotes; I don't know why.

***********************
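* \Q is the -filefilter- code for a double quote; SINGLEQUOTE is only a placeholder
filefilter file.txt file_cleaned.txt, from(\Q) to(SINGLEQUOTE)
* the cleaned file no longer contains double quotes that could bind rows together
insheet using file_cleaned.txt, clear
* put the double quotes back into the fifth variable (v5 here)
replace v5 = subinstr(v5,"SINGLEQUOTE",`"""',.)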
***********************
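
A quick way to check whether the cleaned file is now read row by row is to compare the number of lines in the raw file with the number of observations. This is only a sketch: it assumes the file names used above, a single header row, and no blank lines at the end of the file; the names fh, line and nlines are arbitrary.

```stata
* count the physical lines in the raw text file
tempname fh
local nlines = 0
file open `fh' using "file.txt", read text
file read `fh' line
while r(eof) == 0 {
    local ++nlines
    file read `fh' line
}
file close `fh'

* read the cleaned file and show both counts side by side
insheet using file_cleaned.txt, clear
display "lines in raw file: `nlines'   observations read: " _N
```

If the two numbers differ by more than the header row, some rows have probably still been merged into one observation.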


HTH,

Johannes






----------------------
Johannes Geyer
Deutsches Institut für Wirtschaftsforschung (DIW Berlin)
German Institute for Economic Research 
Department of Public Economics
DIW Berlin
Mohrenstraße 58
10117 Berlin
Tel: +49-30-89789-258

owner-statalist@hsphsun2.harvard.edu wrote on 12/03/2009 22:26:18:

> I encountered the following problem:
> 
> I'm using the following command to import the data of a tab-delimited 
> text file into Stata:
> 
> --------------------------------------------------------------------
> insheet using "file.txt", tab clear
> --------------------------------------------------------------------
> 
> "file.txt" contains data delimited by tabs, the first row contains the 
> following names of the variables (also separated by tabs):
> 
> --------------------------------------------------------------------
> recfile time LfdNr field note
> --------------------------------------------------------------------
> 
> Except for "LfdNr" all variables should be string variables.
> 
> In each row the "values" (better: "columns") are separated by four tabs. 
> An example of the data of a row is as follows (to show what the data look 
> like, in this mail I separate each "column" of the row by using a line 
> break, in the data file they are separated by tabs, of course):
> 
> --------------------------------------------------------------------
> D:\DATENEINGABE\HH08\HH08_SF9_05.REC
> 20 Dez 2008 15:43
> 570
> vermnb
> .; #2-3
> --------------------------------------------------------------------
> 
> The problem: In some rows the last "column" (here containing ".; #2-3") 
> contains double quotes ("), but sometimes they don't occur in pairs 
> enclosing other characters but as lonesome singles. If this is the case, 
> -insheet- does not start the new case with the new row of data but 
> continues to read the data of the text-file into the variable "note". 
> Only if again a single double quote occurs in a row of data, -insheet- 
> continues to create new cases by reading new rows.
> 
> For example, if a row contains the following data (again, in this mail 
> separated by line breaks instead of tabs to show clearly what the data 
> look like):
> 
> --------------------------------------------------------------------
> D:\DATENEINGABE\HH08\HH08_SF9_05.REC
> 13 Dez 2008 14:37
> 325
> glaeubig
> 97; "#4-5
> --------------------------------------------------------------------
> 
> ignoring line breaks or tabs, all data of the text file starting with 
> "97; "#4-5" will be read into the variable "note" until another line of 
> the text file contains a string with only one double quote, such as
> 
> --------------------------------------------------------------------
> D:\DATENEINGABE\HH08\HH08_SF9_05.REC
> 15 Dez 2008 14:05
> 373
> beten
> .; "2-3
> --------------------------------------------------------------------
> 
> (of course, the length of the string variable "note" will automatically 
> be restricted to 244 and everything which exceeds this will be lost, but 
> this is not the issue).
> 
> To my mind a tab-delimited file is a tab-delimited file, i.e. data will 
> be read as *separated* by tabs (and/or line-breaks). Obviously, 
> -insheet- does not respect the tabs as delimiters in all instances.
> 
> Is this a correct behavior of -insheet- which I don't understand 
> correctly or is it a bug? What should I do if it is the former?
> 
> Yours,
> Dirk
> 
> *************************************************
> Dr. Dirk Enzmann
> Institute of Criminal Sciences
> Dept. of Criminology
> Schlueterstr. 28
> D-20146 Hamburg
> Germany
> 
> phone: +49-(0)40-42838.7498 (office)
>         +49-(0)40-42838.4591 (Mrs Billon)
> fax:   +49-(0)40-42838.2344
> email: dirk.enzmann@uni-hamburg.de
> www: http://www2.jura.uni-hamburg.de/instkrim/kriminologie/Mitarbeiter/Enzmann/Enzmann.html
> *************************************************
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/




