
From: "Dan Weitzenfeld" <dan.weitzenfeld@emsense.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Insheeting Japanese
Date: Tue, 23 Sep 2008 12:01:04 -0700

Thanks all, VERY helpful. I am going to take a crack at parsing it.
I'll post the results for posterity.

On Tue, Sep 23, 2008 at 11:55 AM, Sergiy Radyakin <serjradyakin@gmail.com> wrote:
> Right. Insheet is for ASCII (text) data only.
>
> With 2-byte codes, second byte can be a byte which has a special
> (control) meaning in ASCII (e.g. fields separator, or end of line) and
> this will confuse -insheet-.
>
> Joseph Coveney has had a presentation on using ODBC and datasets with
> unicode "Working with ODBC data sources in Stata--tips and techniques"
> if you have access to it, you may find some tips there. (I don't have
> it, but I'd love to see it myself). Here is a link to the abstract:
> http://ideas.repec.org/p/boc/asug04/10.html
>
> Perhaps you could just configure an ODBC datasource and read your data
> from there, rather than parsing UTF-16 yourself (which should not be
> very difficult anyways). Most importantly you need to know whether one
> character is always two bytes (according to
> http://en.wikipedia.org/wiki/UTF-16 it can be 4 as well). If it is
> always 2, determine which 2 stand for the field separator, read a
> line, skip until you meet the separator, start reading data and stop
> when you hit the next separator, decode/output the number, trash the
> rest of the line.
>
> Best regards,
> Sergiy Radyakin
>
> On Tue, Sep 23, 2008 at 2:35 PM, Austin Nichols <austinnichols@gmail.com> wrote:
>> Dan Weitzenfeld :
>> Stata's -file- command can deal with this file; see -help file- for
>> examples of writing a loop to process a file. But converting in
>> another program, then using -infile- or -insheet-, is likely easier.
>> The optimal approach depends on how often you will face this situation
>> again in future...
>>
>> On Tue, Sep 23, 2008 at 2:28 PM, Steven Samuels
>> <sjhsamuels@earthlink.net> wrote:
>>> Dan, I don't know if Stata can read unicode. The -help- for -insheet-
>>> states it is for ASCII text. One possibility: use a text editor to add
>>> double quotes (") at the beginning and end of lines and on either side of
>>> the commas. This may read everything as character. Then convert
>>> back to real only the variable you want.
>>>
>>> -Steve
>>>
>>> On Sep 23, 2008, at 2:19 PM, Dan Weitzenfeld wrote:
>>>
>>>> I've been informed that the files are written in unicode, utf-16. Can
>>>> Stata read this?
>>>>
>>>> On Tue, Sep 23, 2008 at 11:08 AM, Dan Weitzenfeld
>>>> <dan.weitzenfeld@emsense.com> wrote:
>>>>>
>>>>> Thanks Sergiy, I did not know about that command. Below is a line
>>>>> from my hexdump:
>>>>>
>>>>> 130 | 304b ff1f 002c 0031 002c 0032 000d 000a | 0K...,.1.,.2....
>>>>>
>>>>> I also noticed this when I ran with option Analyze:
>>>>>
>>>>> Line-end characters
>>>>>   \r\n         (Windows)   0
>>>>>   \r by itself (Mac)       5
>>>>>   \n by itself (Unix)      5
>>>>>
>>>>> which looks suspicious to me. I'll talk to the tech guys who made this
>>>>> file.
>>>>> Thanks again Sergiy.
>>>>>
>>>>> On Tue, Sep 23, 2008 at 10:51 AM, Sergiy Radyakin
>>>>> <serjradyakin@gmail.com> wrote:
>>>>>>
>>>>>> Dear Dan,
>>>>>>
>>>>>> how data "looks like" depends on, which software "looks" at it. From
>>>>>> what I see in your message, there is double-byte encoding of letters
>>>>>> which may cause a problem.
>>>>>>
>>>>>> I suggest you first "look" at your data byte-by-byte, to find a
>>>>>> pattern you need, then filter your data based on that pattern.
>>>>>> Use
>>>>>> -hexdump- filename
>>>>>> to see how your data is structured. Check that you are using correct
>>>>>> separator "comma" and not "tab", that "comma" in your file is indeed a
>>>>>> standard ASCII "comma" and not some weird two-bytes comma, that a
>>>>>> "comma" byte (44) is not used for encoding other characters, etc.
>>>>>>
>>>>>> Perhaps you could post a portion of output from hexdump here if this
>>>>>> does not contradict any rules of the list.
>>>>>>
>>>>>> Regards, Sergiy Radyakin
>>>>>>
>>>>>> On Tue, Sep 23, 2008 at 1:09 PM, Dan Weitzenfeld
>>>>>> <dan.weitzenfeld@emsense.com> wrote:
>>>>>>>
>>>>>>> Hi All,
>>>>>>> Quick but strange question. I'm trying to insheet a comma-delimited
>>>>>>> file with Japanese in it. For example, the first line looks like:
>>>>>>>
>>>>>>> あなたはこのＣＭが好きですか？,0,とても好き
>>>>>>>
>>>>>>> The only information I need is the second variable, the 0, which will
>>>>>>> always be numeric.
>>>>>>>
>>>>>>> However, when I insheet the file, I get nonsense:
>>>>>>>
>>>>>>> þÿ0B0j0_0o0S0nÿ#ÿ-0LY}0M0g0Y0Kÿ 0h0f0‚Y}0M
>>>>>>>
>>>>>>> which would be okay, except that the second variable always comes in as
>>>>>>> blank.
>>>>>>>
>>>>>>> Does anyone know of a solution for this?
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>> Dan

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
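The parsing approach discussed in this thread can also be sketched outside Stata. What follows is a minimal Python sketch, not code posted by any participant. It assumes the file is UTF-16 with a byte-order mark (the leading þÿ in the -insheet- output is consistent with the big-endian mark FE FF) and that the second field is always present. The names decode_hexdump_words and second_field are hypothetical, chosen only for illustration. The first function decodes the hexdump line quoted above; the second pulls the second field out of each row.

```python
import csv
import io

# 16-bit words copied from the hexdump line quoted in the thread.
HEXDUMP_WORDS = "304b ff1f 002c 0031 002c 0032 000d 000a"


def decode_hexdump_words(words):
    """Turn space-separated 16-bit hex words into text.

    Assumption: the words are big-endian UTF-16 code units.
    """
    raw = bytes.fromhex(words.replace(" ", ""))
    return raw.decode("utf-16-be")


def second_field(raw):
    """Decode UTF-16 CSV bytes and return the second field of every row.

    Assumption: the data starts with a byte-order mark, which the
    "utf-16" codec uses to pick the byte order and then discards.
    """
    text = raw.decode("utf-16")
    reader = csv.reader(io.StringIO(text, newline=""))
    return [row[1] for row in reader if len(row) > 1]


if __name__ == "__main__":
    # Shows the characters hidden behind the dots in the hexdump.
    print(repr(decode_hexdump_words(HEXDUMP_WORDS)))

    # Rebuild the first line from the original question as UTF-16 bytes
    # with a big-endian byte-order mark, then extract the numeric field.
    line = "\ufeffあなたはこのＣＭが好きですか？,0,とても好き\r\n"
    print(second_field(line.encode("utf-16-be")))
```

In the decoded hexdump line the comma is the ordinary U+002C, stored as the two bytes 00 2c, and the line ends with 00 0d 00 0a. Because a 00 byte sits between the carriage return and the line feed, a byte-oriented tool would likely count them as a lone \r and a lone \n, which may explain the Analyze output above. The extracted values could then be written to a plain ASCII file for -insheet-.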


