Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: Stripping ASCII characters

From	"Thomas, Anthony" <[email protected]>
To	[email protected]
Subject	Re: st: Stripping ASCII characters
Date	Tue, 25 Feb 2014 10:55:46 -0500

Hi Ronan and Sergiy,

I'm not sure if my response yesterday made it through to the list; I
got a bounce notification this morning. In any event, thanks for the
suggestions. Sergiy: perhaps I am not using filefilter correctly. I
tried the following:

filefilter "f1.csv" "f2.csv", from(026) to() replace // 026 is ^Z's decimal code (hex 1A)

filefilter "f1.csv" "f2.csv", from(\255d) to() replace

and

filefilter "f1.csv" "f2.csv", from(^Z) to() replace // which I didn't really expect to work

In all three cases, hexdump shows the same number of control characters
in f2.csv as in f1.csv. I'll give reading the file byte-by-byte a try,
though. And Ronan, thanks for the suggestion. I tried using "sed" (a
command-line stream editor), which removed some of the "^Z" characters
but not all.
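
For anyone following along, here is a rough sketch of what the byte-by-byte approach could look like. It is written in Python rather than Stata, which is a substitution and not something used in this thread, and the names strip_bytes and clean_file are made up for illustration. It assumes the only byte to remove is ^Z (SUB, decimal 26, hex 1A); any other control bytes would have to be added to the unwanted set.

```python
SUB = 0x1A  # Ctrl-Z: ASCII 26 decimal, 1A hex, 032 octal


def strip_bytes(data: bytes, unwanted: bytes = bytes([SUB])) -> bytes:
    """Return data with every byte listed in `unwanted` removed."""
    # bytes.translate(None, delete) deletes the given bytes and maps nothing else.
    return data.translate(None, unwanted)


def clean_file(src: str, dst: str,
               unwanted: bytes = bytes([SUB]),
               chunk_size: int = 1 << 20) -> int:
    """Copy src to dst without the unwanted bytes; return how many were dropped."""
    removed = 0
    # Binary mode on both ends, so no newline or encoding translation happens.
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            cleaned = strip_bytes(chunk, unwanted)
            removed += len(chunk) - len(cleaned)
            fout.write(cleaned)
    return removed
```

Calling clean_file("f1.csv", "f2.csv") would write the filtered copy and return the number of bytes dropped, which can be compared against the ^Z count that hexdump reports for f1.csv. Reading in chunks keeps memory use flat for a file with millions of lines.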

Thanks,

Anthony

On Tue, Feb 25, 2014 at 8:52 AM, Ronan Conroy <[email protected]> wrote:
>
> Prof. Ronan Conroy
> Associate Professor of Biostatistics
>
>
> RCSI Department of Epidemiology and Public Health Medicine
> Royal College of Surgeons in Ireland
> Lower Mercer Street, Dublin 2, Ireland
> T: 01-402-2431
> E: [email protected]  W: www.rcsi.ie
>
> RCSI DEVELOPING HEALTHCARE LEADERS
> WHO MAKE A DIFFERENCE WORLDWIDE
> On 24 Feb 2014, at 21:03, Thomas, Anthony wrote:
>
>> When insheeting a csv file using Stata 11 - Unix, Stata aborts with the error:
>>
>> too many variables specified
>> error in line 5000000 of file
>>
>> Output of "hexdump" indicated the file contained control characters
>> (^Z), and was in binary format, when it was expected to be ASCII. I
>> tried using "filefilter "f1.csv" "f2.csv", from(^Z) to() replace" to
>> strip the problem characters, but a hexdump on f2.csv indicates the
>> (^Z) are still present. From what I understand, ^Z (SUB) is used in
>> place of a character that cannot be read by Stata. Is this the case?
>> If so, is there any way to strip these characters from my file prior
>> to import?
>
> This is the place where a good text editor comes in handy. Many have a 'strip non-ASCII' command that does what you want.
>
> I ended up with 4,500 text files of which about 10% were corrupted. BBEdit (free, lite version=TextWrangler) processed the whole lot in a second or two!
>
> r
>
> Ronán Conroy
> [email protected]
> Associate Professor
> Division of Population Health Sciences
> Royal College of Surgeons in Ireland
> Beaux Lane House
> Dublin 2
>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/


