Re: st: Insheeting Japanese

Tue, 23 Sep 2008 12:01:04 -0700

Thanks all, VERY helpful. I am going to take a crack at parsing it. I'll post the results for posterity.

On Tue, Sep 23, 2008 at 11:55 AM, Sergiy Radyakin <[email protected]> wrote:

Right. -insheet- is for ASCII (text) data only.

With 2-byte codes, the second byte can be one that has a special (control) meaning in ASCII (e.g. field separator, or end of line), and this will confuse -insheet-.

Joseph Coveney gave a presentation on using ODBC and datasets with Unicode, "Working with ODBC data sources in Stata--tips and techniques". If you have access to it, you may find some tips there. (I don't have it, but I'd love to see it myself.) Here is a link to the abstract:
http://ideas.repec.org/p/boc/asug04/10.html

Perhaps you could just configure an ODBC data source and read your data from there, rather than parsing UTF-16 yourself (which should not be very difficult anyway).

Most importantly, you need to know whether one character is always two bytes (according to http://en.wikipedia.org/wiki/UTF-16 it can be 4 as well). If it is always 2, determine which 2 stand for the field separator, read a line, skip until you meet the separator, start reading data and stop when you hit the next separator, decode/output the number, and discard the rest of the line.

Best regards,
Sergiy Radyakin
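
The recipe above (skip to the first separator, collect until the next one, discard the rest) can be sketched in a few lines of Python. This is only an illustration, not code from the thread: it assumes big-endian UTF-16 with the byte-order mark already stripped, and the function name `second_field` is made up for the example.

```python
def second_field(line_bytes):
    """Return the text between the first and second comma of one UTF-16BE line."""
    collected = []
    separators_seen = 0
    # Walk the line two bytes at a time; each pair is one 16-bit code unit.
    for i in range(0, len(line_bytes) - 1, 2):
        unit = (line_bytes[i] << 8) | line_bytes[i + 1]
        if unit == 0x002C:               # UTF-16 comma: bytes 00 2C
            separators_seen += 1
            if separators_seen == 2:     # end of the wanted field
                break
        elif unit in (0x000D, 0x000A):   # CR or LF: end of the line
            break
        elif separators_seen == 1:
            # Inside the wanted field; it is numeric, so plain ASCII codes.
            collected.append(chr(unit))
        # Units in the surrogate range 0xD800-0xDFFF (the 4-byte case) can
        # never equal 0x002C, so they are skipped harmlessly in field one.
    return "".join(collected)


# The bytes from a hexdump line quoted in this thread:
fragment = bytes.fromhex("304bff1f002c0031002c0032000d000a")
value = second_field(fragment)   # "1"
```

Because the scan compares whole 16-bit units rather than single bytes, a character that merely contains the byte 2C is not mistaken for a separator.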

On Tue, Sep 23, 2008 at 2:35 PM, Austin Nichols <[email protected]> wrote:
Dan Weitzenfeld :
Stata's -file- command can deal with this file; see -help file- for examples of writing a loop to process a file. But converting in another program, then using -infile- or -insheet-, is likely easier. The optimal approach depends on how often you expect to face this situation in the future.
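
As one possible version of the "convert in another program" route, here is a minimal Python sketch that writes only the numeric column to a plain ASCII file that -insheet- or -infile- could then read. It assumes the file starts with a byte-order mark (so the `utf-16` codec can pick the byte order) and that the wanted value is the second column; `convert_numeric_column` and the file paths are hypothetical names.

```python
import csv


def convert_numeric_column(src_path, dst_path, column=1):
    """Copy one column of a UTF-16 CSV file into an ASCII text file."""
    # encoding="utf-16" reads the byte-order mark to choose the byte order.
    with open(src_path, encoding="utf-16", newline="") as src, \
         open(dst_path, "w", encoding="ascii", newline="\n") as dst:
        for row in csv.reader(src):
            if len(row) > column:
                # Raises UnicodeEncodeError if the column is not pure ASCII,
                # which would signal that the wrong column was chosen.
                dst.write(row[column] + "\n")


# Hypothetical usage:
# convert_numeric_column("survey_utf16.csv", "survey_numeric.txt")
```

The output holds one value per line, so nothing Stata has to read contains multi-byte characters.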

On Tue, Sep 23, 2008 at 2:28 PM, Steven Samuels <[email protected]> wrote:
Dan, I don't know if Stata can read Unicode. The -help- for -insheet- states it is for ASCII text. One possibility: use a text editor to add double quotes (") at the beginning and end of lines and on either side of the commas. This may read everything in as character. Then convert back to real only the variable you want.

-Steve

On Sep 23, 2008, at 2:19 PM, Dan Weitzenfeld wrote:

I've been informed that the files are written in Unicode (UTF-16). Can Stata read this?

On Tue, Sep 23, 2008 at 11:08 AM, Dan Weitzenfeld <[email protected]> wrote:

Thanks Sergiy, I did not know about that command. Below is a line from my hexdump:

130 | 304b ff1f 002c 0031 002c 0032 000d 000a | 0K...,.1.,.2....
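
For reference, the sixteen bytes on that hexdump line can be decoded directly. The short Python sketch below assumes the pairs are shown in file order and that the encoding is big-endian UTF-16; the variable names are illustrative only.

```python
# The byte pairs exactly as printed in the hexdump line.
raw = bytes.fromhex("304b ff1f 002c 0031 002c 0032 000d 000a")

# Interpret every two bytes as one big-endian UTF-16 code unit.
text = raw.decode("utf-16-be")
# text is 'か？,1,2\r\n': two Japanese characters, then ",1,2" and CR LF

fields = text.rstrip("\r\n").split(",")
second = fields[1]   # '1'
```

Read this way, the separators are ordinary commas, each stored as the two bytes 00 2C.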

I also noticed this when I ran with option Analyze:

Line-end characters
\r\n (Windows) 0
\r by itself (Mac) 5
\n by itself (Unix) 5

which looks suspicious to me. I'll talk to the tech guys who made this file.
Thanks again Sergiy.
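
One likely explanation for those counts: in UTF-16 a Windows line end is stored as the four bytes 00 0D 00 0A, so a byte-oriented scan never sees 0D immediately followed by 0A. The Python sketch below illustrates the effect on made-up sample text; `count_line_ends` is a hypothetical helper, not what -hexdump- does internally.

```python
def count_line_ends(data):
    """Count line-end bytes the way a byte-oriented scan would classify them."""
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,
        "cr_alone": data.count(b"\r") - crlf,
        "lf_alone": data.count(b"\n") - crlf,
    }


sample = "a,0,b\r\n" * 5   # five lines with Windows line ends

as_ascii = count_line_ends(sample.encode("ascii"))
# {'crlf': 5, 'cr_alone': 0, 'lf_alone': 0}

as_utf16 = count_line_ends(sample.encode("utf-16-be"))
# {'crlf': 0, 'cr_alone': 5, 'lf_alone': 5}
```

Equal counts of lone \r and lone \n with zero \r\n pairs are therefore what Windows line ends in a UTF-16 file would produce.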

On Tue, Sep 23, 2008 at 10:51 AM, Sergiy Radyakin <[email protected]> wrote:

Dear Dan,

How data "looks" depends on which software "looks" at it. From what I see in your message, there is a double-byte encoding of letters, which may cause a problem.

I suggest you first "look" at your data byte-by-byte, to find a pattern you need, then filter your data based on that pattern.
Use
-hexdump- filename
to see how your data is structured. Check that you are using the correct separator ("comma" and not "tab"), that the "comma" in your file is indeed a standard ASCII comma and not some weird two-byte comma, that the "comma" byte (44) is not used for encoding other characters, etc.

Perhaps you could post a portion of output from hexdump here if this does not contradict any rules of the list.

Regards, Sergiy Radyakin
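
The check that a separator byte is not reused inside other characters can be automated. The Python sketch below assumes big-endian UTF-16 and looks only at comma (44, hex 2C), CR and LF; the names `SPECIAL` and `colliding_characters` are made up for this illustration.

```python
# Byte values that a byte-oriented reader treats as structure.
SPECIAL = {0x2C: "comma", 0x0D: "CR", 0x0A: "LF"}


def colliding_characters(text):
    """List characters whose UTF-16BE bytes include a separator byte."""
    hits = []
    for ch in text:
        if ord(ch) in SPECIAL:
            continue                      # a genuine comma, CR or LF
        encoded = ch.encode("utf-16-be")
        for byte in encoded:
            if byte in SPECIAL:
                hits.append((ch, encoded.hex(), SPECIAL[byte]))
    return hits


# The Japanese closing quote U+300D is stored as the bytes 30 0D,
# so a byte-oriented reader sees a carriage return inside it.
example = colliding_characters("\u300d,0")
# [('」', '300d', 'CR')]
```

An empty result means none of the three bytes occurs inside any other character of the text that was checked.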

On Tue, Sep 23, 2008 at 1:09 PM, Dan Weitzenfeld <[email protected]> wrote:

Hi All,
Quick but strange question. I'm trying to insheet a comma-delimited file with Japanese in it. For example, the first line looks like:

あなたはこのＣＭが好きですか？,0,とても好き

The only information I need is the second variable, the 0, which will always be numeric.

However, when I insheet the file, I get nonsense:

þÿ0B0j0_0o0S0nÿ#ÿ-0LY}0M0g0Y0Kÿ 0h0f0‚Y}0M

which would be okay, except that the second variable always comes in as blank.

Does anyone know of a solution for this?

Thanks in advance,
Dan
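
The "nonsense" shown above is consistent with UTF-16 bytes being displayed one byte per character. A minimal Python sketch, assuming big-endian UTF-16 with a byte-order mark and a Windows-1252 display (both assumptions, as is the helper name `bytes_as_cp1252`), gives the same characters for the first and third fields:

```python
def bytes_as_cp1252(text, bom=False):
    """Show how UTF-16BE bytes appear when each byte is read as one character."""
    raw = text.encode("utf-16-be")
    if bom:
        raw = b"\xfe\xff" + raw   # big-endian byte-order mark
    return raw.decode("cp1252")


first = bytes_as_cp1252("あなたはこのＣＭが好きですか", bom=True)
# 'þÿ0B0j0_0o0S0nÿ#ÿ-0LY}0M0g0Y0K'

third = bytes_as_cp1252("とても好き")
# '0h0f0‚Y}0M'
```

The leading "þÿ" is the byte-order mark FE FF, and each "0" that follows is the high byte 30 shared by the hiragana characters.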

