Statalist



Re: st: Insheeting Japanese


From   Steven Samuels <[email protected]>
To   [email protected]
Subject   Re: st: Insheeting Japanese
Date   Tue, 23 Sep 2008 15:23:26 -0400

Taking Dan's sample line, below, I copied it into BBEdit, saved it as UTF-8, added double quotes at the start and end of the line and on either side of the commas, and then zapped gremlins.


Start:
あなたはこのCMが好きですか?,0,とても好き

Result:
"","0",""

which -insheet- was able to read.
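
For anyone without BBEdit, the same two steps can be sketched in Python. This is only an illustration under assumptions: it treats "zap gremlins" as deleting every non-ASCII character, and the function name zap_and_quote is made up here. The sample string uses full-width Ｃ, Ｍ and ？, which is what the garbled output in Dan's first message suggests the file contains.

```python
def zap_and_quote(line: str) -> str:
    """Drop non-ASCII characters, then wrap each comma-separated field in quotes.

    Assumption: "zapping gremlins" is modeled as deleting every character
    outside the 7-bit ASCII range.
    """
    ascii_only = "".join(ch for ch in line if ord(ch) < 128)
    fields = ascii_only.split(",")
    return ",".join(f'"{field}"' for field in fields)


# Sample line with full-width C, M and question mark (assumed, see above).
sample = "あなたはこのＣＭが好きですか？,0,とても好き"
print(zap_and_quote(sample))
```

The Japanese text is lost, which is acceptable here because only the numeric second field is needed.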


-Steve

On Sep 23, 2008, at 2:55 PM, Sergiy Radyakin wrote:


Right. Insheet is for ASCII (text) data only.

With 2-byte codes, second byte can be a byte which has a special
(control) meaning in ASCII (e.g. fields separator, or end of line) and
this will confuse -insheet-.
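
A small Python illustration of that point, assuming big-endian UTF-16. The sample text is hypothetical: it was chosen because U+300A encodes as the bytes 30 0A, and 0A is the ASCII line feed.

```python
# Hypothetical sample: U+300A (left double angle bracket) is 30 0A in UTF-16BE.
text = "\u300a好き\u300b,0\n"
raw = text.encode("utf-16-be")

# A byte-oriented reader splits on every 0x0A byte, including the one that is
# only the second half of U+300A.
byte_lines = raw.split(b"\n")

# Decoding first and splitting afterwards sees a single line.
text_lines = raw.decode("utf-16-be").splitlines()

print(len(byte_lines), len(text_lines))
```

The byte-wise split cuts the line in the middle of a character, while the decoded text has exactly one line.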

Joseph Coveney gave a presentation on using ODBC and datasets with
Unicode, "Working with ODBC data sources in Stata--tips and techniques".
If you have access to it, you may find some tips there. (I don't have
it, but I'd love to see it myself.) Here is a link to the abstract:
http://ideas.repec.org/p/boc/asug04/10.html

Perhaps you could just configure an ODBC datasource and read your data
from there, rather than parsing UTF-16 yourself (which should not be
very difficult anyway). Most importantly, you need to know whether one
character is always two bytes (according to
http://en.wikipedia.org/wiki/UTF-16 it can be 4 as well). If it is
always 2, determine which 2 bytes stand for the field separator, read a
line, skip until you meet the separator, start reading data and stop
when you hit the next separator, decode/output the number, and discard the
rest of the line.
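
The steps above can be sketched in Python as follows. This is a rough sketch, not a tested tool: it assumes big-endian UTF-16 with an optional FE FF byte-order mark, a comma (00 2C) as the separator, and a purely numeric second field. The name second_fields is invented for the example. Characters that take 4 bytes are stored as two 2-byte surrogate units, neither of which can equal 00 2C, so scanning 2-byte units is still safe for locating separators.

```python
import struct

COMMA = 0x002C            # field separator as a 16-bit code unit
CR, LF = 0x000D, 0x000A   # line-end code units


def second_fields(raw: bytes) -> list[int]:
    """Return the numeric second field of every line in UTF-16BE data."""
    if raw[:2] == b"\xfe\xff":        # skip the big-endian byte-order mark
        raw = raw[2:]
    n_units = len(raw) // 2
    units = struct.unpack(f">{n_units}H", raw[: n_units * 2])

    values = []
    field_index = 0                   # 0 = first field, 1 = second, ...
    digits = []
    for unit in units + (LF,):        # sentinel LF flushes an unterminated last line
        if unit in (CR, LF):
            if digits:
                values.append(int("".join(digits)))
            field_index, digits = 0, []
        elif unit == COMMA:
            field_index += 1
        elif field_index == 1:        # keep only characters of the second field
            digits.append(chr(unit))
    return values


sample = "あなたはこのＣＭが好きですか？,0,とても好き\r\nか？,1,2\r\n"
raw = b"\xfe\xff" + sample.encode("utf-16-be")
print(second_fields(raw))
```

Everything outside the second field is skipped without being decoded, which matches the "trash the rest of the line" step.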

Best regards,
Sergiy Radyakin

On Tue, Sep 23, 2008 at 2:35 PM, Austin Nichols <[email protected]> wrote:

Dan Weitzenfeld :
Stata's -file- command can deal with this file; see -help file- for
examples of writing a loop to process a file. But converting in
another program, then using -infile- or -insheet-, is likely easier.
The optimal approach depends on how often you will face this situation
again in future...
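
As one example of "converting in another program", here is a minimal Python sketch that decodes UTF-16 data and keeps only the second column as plain ASCII text, which -insheet- can then read. The function name and the column header "response" are placeholders invented for this example, and the "utf-16" codec relies on the byte-order mark at the start of the data to pick the byte order.

```python
import csv
import io


def utf16_second_column_to_ascii(raw: bytes) -> str:
    """Decode UTF-16 bytes and return an ASCII CSV holding only column 2."""
    text = raw.decode("utf-16")                 # BOM decides the byte order
    reader = csv.reader(io.StringIO(text, newline=""))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["response"])               # placeholder variable name
    for row in reader:
        if len(row) >= 2:                       # ignore blank or short lines
            writer.writerow([row[1]])
    return out.getvalue()


sample = "あなたはこのＣＭが好きですか？,0,とても好き\r\nか？,1,2\r\n"
raw = b"\xfe\xff" + sample.encode("utf-16-be")
print(utf16_second_column_to_ascii(raw))
```

For a real file, the bytes would come from open(path, "rb").read() and the returned string would be written to a new file before running -insheet- on it.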

On Tue, Sep 23, 2008 at 2:28 PM, Steven Samuels
<[email protected]> wrote:

Dan, I don't know if Stata can read Unicode. The -help- for -insheet-
states it is for ASCII text. One possibility: use a text editor to add
double quotes (") at the beginning and end of lines and on either side of
the commas. This may read everything as character. Then convert
back to real only the variable you want.

-Steve

On Sep 23, 2008, at 2:19 PM, Dan Weitzenfeld wrote:


I've been informed that the files are written in unicode, utf-16. Can
Stata read this?

On Tue, Sep 23, 2008 at 11:08 AM, Dan Weitzenfeld
<[email protected]> wrote:


Thanks Sergiy, I did not know about that command. Below is a line
from my hexdump:

130 | 304b ff1f 002c 0031 002c 0032 000d 000a | 0K...,.1.,.2....
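
For reference, the sixteen bytes in that hexdump line can be decoded with a few lines of Python. The sketch assumes big-endian UTF-16, which is consistent with the "þÿ" (bytes FE FF) at the start of the garbled -insheet- output in the first message.

```python
# The hex words copied from the hexdump line above.
hex_words = "304b ff1f 002c 0031 002c 0032 000d 000a"
raw = bytes.fromhex(hex_words.replace(" ", ""))

# Assumption: big-endian UTF-16, two bytes per character here.
decoded = raw.decode("utf-16-be")
print(repr(decoded))
```

Decoded this way, the bytes are the end of a Japanese question (か and a full-width question mark), then ",1,2" and a Windows line end, so the commas and digits are ordinary characters stored with a leading 00 byte.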

I also noticed this when I ran with option Analyze:

Line-end characters
\r\n (Windows) 0
\r by itself (Mac) 5
\n by itself (Unix) 5

which looks suspicious to me. I'll talk to the tech guys who made this
file.
Thanks again Sergiy.
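
One plausible explanation for those line-end counts, sketched in Python: in UTF-16 the carriage return and line feed are each stored as two bytes (00 0D and 00 0A), so a tool that counts bytes never sees 0D immediately followed by 0A. The five-line sample and the function count_line_ends are invented for the illustration, and this is not a claim about how -hexdump- itself is implemented.

```python
def count_line_ends(raw: bytes) -> dict:
    """Count line ends byte by byte, as an ASCII-oriented tool might."""
    crlf = raw.count(b"\r\n")
    return {
        "crlf": crlf,
        "cr_alone": raw.count(b"\r") - crlf,
        "lf_alone": raw.count(b"\n") - crlf,
    }


# Hypothetical file: five lines, each ending in a Windows line end.
text = "a,1\r\n" * 5
raw = text.encode("utf-16-be")

print(count_line_ends(raw))                 # counted on the raw bytes
print(text.count("\r\n"))                   # counted on the decoded text
```

On the raw bytes the sketch finds no \r\n pairs and five each of \r and \n alone, the same pattern as in the Analyze output, while the decoded text contains five ordinary Windows line ends.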



On Tue, Sep 23, 2008 at 10:51 AM, Sergiy Radyakin
<[email protected]> wrote:


Dear Dan,

How data "looks" depends on which software "looks" at it. From
what I see in your message, the letters are stored in a double-byte
encoding, which may cause a problem.

I suggest you first "look" at your data byte-by-byte, to find a
pattern you need, then filter your data based on that pattern.
Use
-hexdump- filename
to see how your data is structured. Check that you are using the correct
separator ("comma" and not "tab"), that the "comma" in your file is indeed a
standard ASCII "comma" and not some weird two-byte comma, that a
"comma" byte (44) is not used for encoding other characters, etc.

Perhaps you could post a portion of output from hexdump here if this
does not contradict any rules of the list.
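
The check mentioned above, that the comma byte (44, hex 2C) is not also used inside other characters, can be sketched in Python like this. It assumes big-endian UTF-16, and the function name stray_comma_bytes is made up for the example. U+4E2C is used as a test case only because its second byte happens to be 2C.

```python
def stray_comma_bytes(text: str) -> int:
    """Count 0x2C bytes in the UTF-16BE encoding that are not real commas."""
    raw = text.encode("utf-16-be")
    total = raw.count(b"\x2c")      # every byte equal to 44, wherever it sits
    real = text.count(",")          # each ASCII comma contributes one such byte
    return total - real


print(stray_comma_bytes("\u3042,0"))   # hiragana A (30 42): no stray byte
print(stray_comma_bytes("\u4e2c,0"))   # U+4E2C (4E 2C): one stray byte
```

A nonzero result means that a reader which treats every byte 44 as a separator would split some field in the wrong place.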

Regards, Sergiy Radyakin


On Tue, Sep 23, 2008 at 1:09 PM, Dan Weitzenfeld
<[email protected]> wrote:


Hi All,
Quick but strange question. I'm trying to insheet a comma-delimited
file with Japanese in it. For example, the first line looks like:

あなたはこのCMが好きですか?,0,とても好き

The only information I need is the second variable, the 0, which will
always be numeric.

However, when I insheet the file, I get nonsense:

þÿ0B0j0_0o0S0nÿ#ÿ-0LY}0M0g0Y0Kÿ 0h0f0‚Y} 0M

which would be okay, except that the second variable always comes in as
blank.

Does anyone know of a solution for this?

Thanks in advance,
Dan

*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/



