
From: "Sergiy Radyakin" <serjradyakin@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Insheeting Japanese
Date: Tue, 23 Sep 2008 14:55:45 -0400

Right. -insheet- is for ASCII (text) data only. With 2-byte codes, the second byte can be one that has a special (control) meaning in ASCII (e.g. field separator or end of line), and this will confuse -insheet-.

Joseph Coveney gave a presentation on using ODBC with Unicode datasets, "Working with ODBC data sources in Stata--tips and techniques". If you have access to it, you may find some tips there. (I don't have it, but I'd love to see it myself.) Here is a link to the abstract:
http://ideas.repec.org/p/boc/asug04/10.html

Perhaps you could just configure an ODBC data source and read your data from there, rather than parsing UTF-16 yourself (which should not be very difficult anyway).

Most importantly, you need to know whether one character is always two bytes (according to http://en.wikipedia.org/wiki/UTF-16 it can be 4 as well). If it is always 2, determine which 2 bytes stand for the field separator, read a line, skip until you meet the separator, start reading data and stop when you hit the next separator, decode/output the number, and trash the rest of the line.

Best regards, Sergiy Radyakin

On Tue, Sep 23, 2008 at 2:35 PM, Austin Nichols <austinnichols@gmail.com> wrote:
> Dan Weitzenfeld :
> Stata's -file- command can deal with this file; see -help file- for
> examples of writing a loop to process a file. But converting in
> another program, then using -infile- or -insheet-, is likely easier.
> The optimal approach depends on how often you will face this situation
> again in future...
>
> On Tue, Sep 23, 2008 at 2:28 PM, Steven Samuels
> <sjhsamuels@earthlink.net> wrote:
>> Dan, I don't know if Stata can read unicode. The -help- for -insheet-
>> states it is for ASCII text. One possibility: use a text editor to add
>> double quotes (") at the beginning and end of lines and on either side of
>> the commas. This may read everything as character. Then convert
>> back to real only the variable you want.
>>
>> -Steve
>>
>> On Sep 23, 2008, at 2:19 PM, Dan Weitzenfeld wrote:
>>
>>> I've been informed that the files are written in unicode, utf-16. Can
>>> Stata read this?
>>>
>>> On Tue, Sep 23, 2008 at 11:08 AM, Dan Weitzenfeld
>>> <dan.weitzenfeld@emsense.com> wrote:
>>>>
>>>> Thanks Sergiy, I did not know about that command. Below is a line
>>>> from my hexdump:
>>>>
>>>> 130 | 304b ff1f 002c 0031 002c 0032 000d 000a | 0K...,.1.,.2....
>>>>
>>>> I also noticed this when I ran with option Analyze:
>>>>
>>>> Line-end characters
>>>>   \r\n (Windows)       0
>>>>   \r by itself (Mac)   5
>>>>   \n by itself (Unix)  5
>>>>
>>>> which looks suspicious to me. I'll talk to the tech guys who made this
>>>> file.
>>>> Thanks again Sergiy.
>>>>
>>>> On Tue, Sep 23, 2008 at 10:51 AM, Sergiy Radyakin
>>>> <serjradyakin@gmail.com> wrote:
>>>>>
>>>>> Dear Dan,
>>>>>
>>>>> how data "looks" depends on which software "looks" at it. From
>>>>> what I see in your message, there is double-byte encoding of letters
>>>>> which may cause a problem.
>>>>>
>>>>> I suggest you first "look" at your data byte-by-byte, to find a
>>>>> pattern you need, then filter your data based on that pattern.
>>>>> Use
>>>>> -hexdump- filename
>>>>> to see how your data is structured. Check that you are using the correct
>>>>> separator "comma" and not "tab", that "comma" in your file is indeed a
>>>>> standard ASCII "comma" and not some weird two-byte comma, that a
>>>>> "comma" byte (44) is not used for encoding other characters, etc.
>>>>>
>>>>> Perhaps you could post a portion of output from hexdump here if this
>>>>> does not contradict any rules of the list.
>>>>>
>>>>> Regards, Sergiy Radyakin
>>>>>
>>>>> On Tue, Sep 23, 2008 at 1:09 PM, Dan Weitzenfeld
>>>>> <dan.weitzenfeld@emsense.com> wrote:
>>>>>>
>>>>>> Hi All,
>>>>>> Quick but strange question. I'm trying to insheet a comma-delimited
>>>>>> file with Japanese in it. For example, the first line looks like:
>>>>>>
>>>>>> あなたはこのＣＭが好きですか？,0,とても好き
>>>>>>
>>>>>> The only information I need is the second variable, the 0, which will
>>>>>> always be numeric.
>>>>>>
>>>>>> However, when I insheet the file, I get nonsense:
>>>>>>
>>>>>> þÿ0B0j0_0o0S0nÿ#ÿ-0LY}0M0g0Y0Kÿ 0h0f0‚Y}0M
>>>>>>
>>>>>> which would be okay, except that the second variable always comes in as
>>>>>> blank.
>>>>>>
>>>>>> Does anyone know of a solution for this?
>>>>>>
>>>>>> Thanks in advance,
>>>>>> Dan
>>>>>>
>>>>>> *
>>>>>> * For searches and help try:
>>>>>> * http://www.stata.com/help.cgi?search
>>>>>> * http://www.stata.com/support/statalist/faq
>>>>>> * http://www.ats.ucla.edu/stat/stata/
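The "convert in another program, then -insheet-" route mentioned in this thread could be sketched roughly as below. This is only an illustration under stated assumptions, not anyone's tested solution from the thread: Python stands in for the "other program", the function names `second_column` and `convert_file` and the output column header `answer` are made up for the example, and the file is assumed to start with a byte-order mark (the leading þÿ in Dan's -insheet- output suggests FE FF, i.e. big-endian UTF-16, which also matches the `002c` commas in the hexdump). Decoding with the standard `utf-16` codec replaces the hand-written 2-byte parsing described above, and it also copes with 4-byte characters.

```python
import csv
import io


def second_column(raw: bytes) -> list[int]:
    """Decode UTF-16 bytes (BOM expected) and return field 2 of each row as int."""
    # The "utf-16" codec reads the byte-order mark to pick big- or little-endian
    # and handles surrogate pairs (4-byte characters) transparently.
    text = raw.decode("utf-16")
    # newline="" leaves \r\n untouched so the csv module can split rows itself.
    rows = csv.reader(io.StringIO(text, newline=""))
    # Skip blank or short rows; the second field is assumed to be numeric.
    return [int(row[1]) for row in rows if len(row) > 1]


def convert_file(src: str, dst: str) -> None:
    """Write the second column of a UTF-16 CSV file to a plain ASCII file."""
    with open(src, "rb") as f:
        values = second_column(f.read())
    with open(dst, "w", encoding="ascii", newline="\n") as out:
        out.write("answer\n")  # hypothetical variable name for the header row
        out.writelines(f"{v}\n" for v in values)
```

After something like `convert_file("survey_utf16.csv", "survey_ascii.csv")` (both file names hypothetical), the output contains only ASCII digits and newlines and should be readable with -insheet-. This also suggests why the Analyze report shows 5 lone `\r` and 5 lone `\n` but no `\r\n`: in UTF-16 the pair is stored as `00 0d 00 0a`, so a zero byte sits between the two.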

**Follow-Ups**:
- **Re: st: Insheeting Japanese** *From:* Steven Samuels <sjhsamuels@earthlink.net>
- **Re: st: Insheeting Japanese** *From:* "Dan Weitzenfeld" <dan.weitzenfeld@emsense.com>

**References**:
- **st: Insheeting Japanese** *From:* "Dan Weitzenfeld" <dan.weitzenfeld@emsense.com>
- **Re: st: Insheeting Japanese** *From:* "Sergiy Radyakin" <serjradyakin@gmail.com>
- **Re: st: Insheeting Japanese** *From:* "Dan Weitzenfeld" <dan.weitzenfeld@emsense.com>
- **Re: st: Insheeting Japanese** *From:* "Dan Weitzenfeld" <dan.weitzenfeld@emsense.com>
- **Re: st: Insheeting Japanese** *From:* Steven Samuels <sjhsamuels@earthlink.net>
- **Re: st: Insheeting Japanese** *From:* "Austin Nichols" <austinnichols@gmail.com>


© Copyright 1996–2014 StataCorp LP