
From: Steven Samuels <sjhsamuels@earthlink.net>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Insheeting Japanese
Date: Tue, 23 Sep 2008 15:23:26 -0400

Taking Dan's sample line, below, I copied it into BBEdit, saved it as UTF-8, added double quotes at the start and end of the line and on either side of the commas, and then zapped gremlins (which stripped the non-ASCII characters).

Start:

あなたはこのＣＭが好きですか？,0,とても好き

Result:

"","0",""

which -insheet- was able to read.

-Steve
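For readers without BBEdit, the same quote-and-strip step can be approximated in a short script. This is only a minimal sketch, assuming that zapping gremlins deletes every non-ASCII character (which matches the result shown above) and that no field contains an embedded comma. The function name `quote_and_zap` is hypothetical.

```python
def quote_and_zap(line: str) -> str:
    """Quote each comma-separated field and drop all non-ASCII characters."""
    fields = line.rstrip("\r\n").split(",")
    # Keep only characters in the 7-bit ASCII range (code points below 128).
    cleaned = ["".join(ch for ch in field if ord(ch) < 128) for field in fields]
    return ",".join(f'"{field}"' for field in cleaned)


sample = "あなたはこのＣＭが好きですか？,0,とても好き"
print(quote_and_zap(sample))  # prints "","0",""
```

The output is plain ASCII, so a reader such as -insheet- no longer sees any multi-byte characters.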

On Sep 23, 2008, at 2:55 PM, Sergiy Radyakin wrote:

Right. -insheet- is for ASCII (text) data only. With 2-byte codes, the second byte can be one that has a special (control) meaning in ASCII (e.g. field separator or end of line), and this will confuse -insheet-.

Joseph Coveney has given a presentation on using ODBC and datasets with Unicode, "Working with ODBC data sources in Stata--tips and techniques". If you have access to it, you may find some tips there. (I don't have it, but I'd love to see it myself.) Here is a link to the abstract:

http://ideas.repec.org/p/boc/asug04/10.html

Perhaps you could just configure an ODBC data source and read your data from there, rather than parsing UTF-16 yourself (which should not be very difficult anyway). Most importantly, you need to know whether one character is always two bytes (according to http://en.wikipedia.org/wiki/UTF-16 it can be 4 as well). If it is always 2, determine which 2 bytes stand for the field separator, read a line, skip until you meet the separator, start reading data and stop when you hit the next separator, decode/output the number, and trash the rest of the line.

Best regards,

Sergiy Radyakin
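The byte-level procedure described above (find the separator, read up to the next separator, decode the number, discard the rest) can be sketched outside Stata as follows. This is an illustration rather than a tested solution for Dan's file: it assumes big-endian UTF-16 with no byte-order mark inside the line, as the hexdump elsewhere in this thread suggests, and the name `second_field` is hypothetical.

```python
COMMA = b"\x00\x2c"  # U+002C as one big-endian UTF-16 code unit


def second_field(line: bytes) -> int:
    """Return the numeric second field of one UTF-16BE comma-delimited line."""
    # Cut the line into 2-byte code units so the separator is only matched
    # on a code-unit boundary, never across two neighbouring characters.
    units = [line[i:i + 2] for i in range(0, len(line) - 1, 2)]
    start = units.index(COMMA) + 1          # skip the first field
    try:
        stop = units.index(COMMA, start)    # the next separator ends the field
    except ValueError:
        stop = len(units)                   # no third field on this line
    digits = b"".join(units[start:stop]).decode("utf-16-be")
    return int(digits)                      # the rest of the line is ignored


line = "あなたはこのＣＭが好きですか？,0,とても好き\r\n".encode("utf-16-be")
print(second_field(line))  # prints 0
```

On the 4-byte question: UTF-16 surrogate code units all lie in the range D800 to DFFF, so a `00 2C` unit on a 2-byte boundary is always a real comma, and scanning by 2-byte units stays safe even if some characters take 4 bytes.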

On Tue, Sep 23, 2008 at 2:35 PM, Austin Nichols <austinnichols@gmail.com> wrote:

Dan Weitzenfeld:

Stata's -file- command can deal with this file; see -help file- for examples of writing a loop to process a file. But converting in another program, then using -infile- or -insheet-, is likely easier. The optimal approach depends on how often you will face this situation in the future...
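As one example of "converting in another program", a short script can pull out the numeric column and write it as plain ASCII that -infile- or -insheet- should be able to read. This is a sketch under assumptions: the file starts with a byte-order mark (the leading þÿ in Dan's garbled output suggests it does), the second field is always numeric, and the function and file names are hypothetical.

```python
import csv


def extract_second_column(src: str, dst: str) -> int:
    """Copy the second field of each row of a UTF-16 CSV into an ASCII file."""
    rows = 0
    # The "utf-16" codec reads the byte-order mark to pick the byte order;
    # newline="" lets the csv module handle the line endings itself.
    with open(src, encoding="utf-16", newline="") as fin, \
            open(dst, "w", encoding="ascii", newline="\n") as fout:
        for row in csv.reader(fin):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            fout.write(row[1].strip() + "\n")
            rows += 1
    return rows  # number of data lines written
```

A call such as `extract_second_column("survey_utf16.csv", "second_column.txt")` would produce a one-column text file. If the second field ever held non-ASCII text, the write would raise an error instead of silently producing garbage.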

On Tue, Sep 23, 2008 at 2:28 PM, Steven Samuels <sjhsamuels@earthlink.net> wrote:

Dan, I don't know if Stata can read Unicode. The -help- for -insheet- states that it is for ASCII text. One possibility: use a text editor to add double quotes (") at the beginning and end of each line and on either side of the commas. This may read everything in as character (string) variables. Then convert only the variable you want back to real.

-Steve

On Sep 23, 2008, at 2:19 PM, Dan Weitzenfeld wrote:

I've been informed that the files are written in Unicode (UTF-16). Can Stata read this?

On Tue, Sep 23, 2008 at 11:08 AM, Dan Weitzenfeld <dan.weitzenfeld@emsense.com> wrote:

Thanks Sergiy, I did not know about that command. Below is a line from my hexdump:

130 | 304b ff1f 002c 0031 002c 0032 000d 000a | 0K...,.1.,.2....

I also noticed this when I ran with option Analyze:

Line-end characters

\r\n (Windows) 0

\r by itself (Mac) 5

\n by itself (Unix) 5

which looks suspicious to me. I'll talk to the tech guys who made this file.

Thanks again Sergiy.
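The hexdump line above is consistent with big-endian UTF-16, and that would also account for the odd line-end counts. A minimal check, using only the 16 bytes quoted in the hexdump (whether the rest of the file follows the same pattern is an assumption):

```python
raw = bytes.fromhex("304b ff1f 002c 0031 002c 0032 000d 000a")

# Decoded as big-endian UTF-16, the bytes are ordinary comma-separated text.
print(repr(raw.decode("utf-16-be")))  # prints 'か？,1,2\r\n'

# Counted byte by byte, CR and LF are never adjacent: a 00 byte sits between.
print(raw.count(b"\r\n"), raw.count(b"\r"), raw.count(b"\n"))  # prints 0 1 1
```

If the analysis option counts single bytes, every UTF-16 line ending would be reported as one lone \r plus one lone \n and no \r\n pair, which matches the 0 / 5 / 5 pattern for a five-line file.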

On Tue, Sep 23, 2008 at 10:51 AM, Sergiy Radyakin <serjradyakin@gmail.com> wrote:

Dear Dan,

What data "looks like" depends on which software is "looking" at it. From what I see in your message, the letters are in a double-byte encoding, which may be causing the problem.

I suggest you first "look" at your data byte by byte to find the pattern you need, then filter your data based on that pattern.

Use

-hexdump- filename

to see how your data is structured. Check that you are using the correct separator ("comma" and not "tab"), that the "comma" in your file is indeed a standard ASCII comma and not some weird two-byte comma, that the comma byte (44) is not used in encoding other characters, etc.

Perhaps you could post a portion of the hexdump output here, if this does not contradict any rules of the list.

Regards, Sergiy Radyakin
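The last check, whether the comma byte (44, hex 2C) also occurs inside the encoding of other characters, can be sketched like this. The sketch assumes the data is big-endian UTF-16 without a byte-order mark at the front; the function name `comma_byte_report` is hypothetical.

```python
def comma_byte_report(data: bytes) -> tuple[int, int]:
    """Count real UTF-16BE commas and other occurrences of byte 0x2C."""
    real = 0
    stray = 0
    for i in range(0, len(data) - 1, 2):
        hi, lo = data[i], data[i + 1]
        if hi == 0x00 and lo == 0x2C:
            real += 1                               # the code unit U+002C
        else:
            stray += (hi == 0x2C) + (lo == 0x2C)    # 0x2C inside another unit
    return real, stray


# U+4E2C is a CJK ideograph whose low byte happens to be 0x2C.
print(comma_byte_report("\u4e2c,1".encode("utf-16-be")))  # prints (1, 1)
```

A nonzero second count means that a byte-oriented reader would split fields in the wrong places, even though the text contains no extra commas.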

On Tue, Sep 23, 2008 at 1:09 PM, Dan Weitzenfeld <dan.weitzenfeld@emsense.com> wrote:

Hi All,

Quick but strange question. I'm trying to insheet a comma-delimited file with Japanese in it. For example, the first line looks like:

あなたはこのＣＭが好きですか？,0,とても好き

The only information I need is the second variable, the 0, which will always be numeric.

However, when I insheet the file, I get nonsense:

þÿ0B0j0_0o0S0nÿ#ÿ-0LY}0M0g0Y0Kÿ 0h0f0‚Y} 0M

which would be okay, except that the second variable always comes in as blank.

Does anyone know of a solution for this?

Thanks in advance,

Dan

*

* For searches and help try:

* http://www.stata.com/help.cgi?search

* http://www.stata.com/support/statalist/faq

* http://www.ats.ucla.edu/stat/stata/




© Copyright 1996–2016 StataCorp LP