Statalist



Re: st: importing LONG string variables


From   "Mindruta, Denisa Constanta" <mindruta@uiuc.edu>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: importing LONG string variables
Date   Fri, 24 Aug 2007 14:08:29 -0500 (CDT)

Thank you both Friedrich and Sergiy!
Friedrich, you got right to the point, but unfortunately, with the method you suggested I won't be able to align the information properly. Is there any way of telling Stata to leave v3 blank in row 1 and move Address1 to v4?
A few clarifications for Sergiy:
1) The final file is intended to be used in Stata, but first I need to retrieve all the information from the long string variables. Splitting these strings into their constituent words seemed like a natural solution. The problem is that the number of words (delimited within the strings by "|") varies from row to row. Hence the problem with Friedrich's current method.
2) A comma is the current separator between columns/variables (thus, two string variables are currently separated by a comma). It could be replaced with any other delimiter (tab, etc.). "|" is the delimiter between the "words" that constitute each string. Each "word" is less than 244 characters long, so if we can put these words in separate columns, everything will be fine.

Denisa
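
One way to get the alignment described above might be to pre-process the file outside Stata before importing it. The following is a minimal sketch in Python, not a tested solution for the real files. It assumes that the "Row" label lines have already been dropped, that every data line has the same number of comma-separated fields, and that no word contains a comma or "|". The function name pad_fields and the in-memory example lines are hypothetical, chosen only for illustration. Each field is split on "|" and padded with empty strings up to the largest word count observed for that field, so Address1 lands in the fourth column whenever the widest name field has three words.

```python
import csv
import io


def pad_fields(lines, field_sep=",", word_sep="|"):
    """Split every field into words and pad each field to a common width.

    `lines` is a list of strings, one per observation. Assumes all lines
    have the same number of fields (a simplifying assumption).
    """
    # Split each line into fields, then each field into words.
    # A completely empty field counts as zero words, not one empty word.
    rows = [
        [field.split(word_sep) if field else [] for field in line.split(field_sep)]
        for line in lines
    ]
    n_fields = len(rows[0])
    # Largest number of words seen in each field across all rows.
    widths = [max(len(row[j]) for row in rows) for j in range(n_fields)]
    # Pad short fields with empty strings so the columns line up.
    padded = []
    for row in rows:
        flat = []
        for words, width in zip(row, widths):
            flat.extend(words + [""] * (width - len(words)))
        padded.append(flat)
    return padded


# The two example observations from the original question.
example = [
    "Name1|Name2,Address1|Address2,PatClass1|PatClass2|PatClass3",
    "Name3|Name4|Name5,Address3|Address4|Address5,PatClass4",
]
table = pad_fields(example)

# Write the aligned rows as comma-separated text.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(table)
print(buffer.getvalue())
```

For the two example rows, the first output line has an empty third and sixth column and the second has two empty trailing columns, which is the structure asked for in the original question. The padded file could then be read with -insheet- using a comma delimiter; since each word is under 244 characters, every resulting column should fit within the string limit.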

---- Original message ----
>Date: Fri, 24 Aug 2007 14:00:05 -0400
>From: "Friedrich Huebler" <fhuebler@gmail.com>  
>Subject: Re: st: importing LONG string variables  
>To: statalist@hsphsun2.harvard.edu
>
>Denisa,
>
>Is your example an accurate representation of your data? If so, you
>have a problem because there are no delimiters around fields with
>missing data. Here is a partial answer to your question that will read
>the data into Stata, but the columns won't line up.
>
>Step 1: Open the file in a text editor and replace all occurrences of
>" comma " by "|" (without quotes). This will yield the following file:
>
>Row1
>Name1|Name2|Address1|Address2|PatClass1|PatClass2|PatClass3
>Row 2
>Name3|Name4|Name5|Address3|Address4|Address5|PatClass4
>
>Step 2: Read the file into Stata with -insheet-
>
>. insheet using test.txt, delimit("|")
>. clist, noobs
>
>   v1         v2         v3         v4         v5         v6         v7
> Row1
>Name1      Name2   Address1   Address2  PatClass1  PatClass2  PatClass3
>Row 2
>Name3      Name4      Name5   Address3   Address4   Address5  PatClass4
>
>Step 3: Delete the "Row" entries.
>
>. drop if mod(_n,2)>0
>(2 observations deleted)
>
>. clist, noobs
>
>   v1         v2         v3         v4         v5         v6         v7
>Name1      Name2   Address1   Address2  PatClass1  PatClass2  PatClass3
>Name3      Name4      Name5   Address3   Address4   Address5  PatClass4
>
>Step 4: Save the data as a comma-separated file.
>
>. outsheet using test.csv, comma
>
>When you open the CSV file in a text editor you see this:
>
>v1,v2,v3,v4,v5,v6,v7
>"Name1","Name2","Address1","Address2","PatClass1","PatClass2","PatClass3"
>"Name3","Name4","Name5","Address3","Address4","Address5","PatClass4"
>
>Variable v3 should have a missing value in the first observation.
>Instead it contains Address1. Variables v4 to v7 also contain wrong
>data. I do not know how you can address this problem without
>information on missing values in your original data.
>
>Friedrich
>
>On 8/23/07, Mindruta, Denisa Constanta <mindruta@uiuc.edu> wrote:
>> Greetings!
>> I would appreciate any help on the following problem: I need to import a (.csv) file containing several string variables that go well beyond Stata's limits. Is there a way to import the file and, at the same time, parse these string variables into their constituent words (delimited by "|") before saving it as a Stata file?
>>
>> A simple example might help:
>> Row1
>> Name1|Name2 comma Address1|Address2 comma PatClass1|PatClass2|PatClass3
>> Row 2
>> Name3|Name4|Name5 comma Address3|Address4|Address5 comma PatClass4
>>
>> I want to get the following structure:
>> Row1
>> Name1 comma Name2 comma "missing info" comma Address1 comma Address2 comma "missing info" comma PatClass1 comma PatClass2 comma PatClass3
>> Row 2
>> Name3 comma Name4 comma Name5 comma Address3 comma Address4 comma Address5 comma PatClass4 comma "missing info" comma "missing info"
>>
>> Any suggestions on how to approach this problem? (This is just a simple example; the text in a cell could go up to 200 words of 30 characters each, and I have 15 of these variables and 600 files...) Thanks!
>>
>> Denisa
>> University of Illinois Urbana-Champaign
>*
>*   For searches and help try:
>*   http://www.stata.com/support/faqs/res/findit.html
>*   http://www.stata.com/support/statalist/faq
>*   http://www.ats.ucla.edu/stat/stata/


