Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

Re: st: reading HTML source in Chinese but get a messy code

From	Sergiy Radyakin <[email protected]>
To	"[email protected]" <[email protected]>
Subject	Re: st: reading HTML source in Chinese but get a messy code
Date	Mon, 10 Jun 2013 02:14:04 -0400

following statement in the documentation: "str# variables require #
bytes per observation". Sergiy.
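
To make the byte count concrete, here is a minimal sketch in Python (not Stata) of what the quoted messages below describe: text stored as UTF-8 and displayed by a byte-oriented program as Windows-1252. It assumes the page is encoded as UTF-8, which the thread does not confirm; the sample string and the function names are made up for illustration.

```python
def as_seen_by_byte_tool(text: str) -> str:
    """Show UTF-8 text the way a single-byte (Windows-1252) reader displays it."""
    return text.encode("utf-8").decode("cp1252")


def restore(messy: str) -> str:
    """Write the messy text back out byte for byte and read it as UTF-8 again."""
    return messy.encode("cp1252").decode("utf-8")


def find_bytes(haystack: bytes, needle_text: str) -> int:
    """Search for a substring by its UTF-8 byte sequence; returns a byte offset or -1."""
    return haystack.find(needle_text.encode("utf-8"))


sample = "中文"                      # hypothetical sample, not taken from the page
raw = sample.encode("utf-8")         # assumption: the file is UTF-8
messy = as_seen_by_byte_tool(sample)

print(len(sample), "characters,", len(raw), "bytes")
print("messy view:", repr(messy))
print("round trip ok:", restore(messy) == sample)
print("byte offset:", find_bytes("x=中文".encode("utf-8"), "文"))
```

Under that assumption each Chinese character takes three bytes, so a two-character value would need a str6 rather than a str2. A few byte values (0x81, for instance) are undefined in Python's cp1252 codec, so other strings may fail to decode this way even though the bytes themselves are intact.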

On Mon, Jun 10, 2013 at 1:06 AM, Sergiy Radyakin <[email protected]> wrote:
> Tony,
>
> if your choice of package is based solely on whether it supports
> Unicode, I would probably recommend Microsoft's Excel or
> OpenOffice's Calc. However, since you are in this forum, you probably
> intend to do some statistical processing of that information. In that
> case, what is that analysis? In many cases you don't actually need to
> see the text, but can just rely on the package to handle it. If you
> don't want to specify, consider SPSS or SAS, which (according to the
> manufacturers) both support Unicode. I have also asked whether the new
> Stata 13 supports Unicode and hope for the best. If you want to harvest
> information about the user profiles, you will need to check with the
> site owner whether this would be permitted, and if you have valid
> scientific needs to do it, perhaps the owner might simply pass that
> information to you in an organized way. From what appears on this
> page, you can pull some information like ID, age, gender, phone, and
> province, and the rest (name, address) is hardly of any value for
> statistical processing.
>
> Best, Sergiy
>
> On Sat, Jun 8, 2013 at 9:53 PM, Li Chuntao (Tony) <[email protected]> wrote:
>> Dear Sergiy,
>>
>>     Thank you for your advice. Actually I need all of the
>> information in lines 2 to 9. Maybe Stata just cannot handle it because
>> of the Unicode problem. If you know of any package that can do it,
>> please advise.
>>
>> thanks again
>>
>> Tony
>>
>>
>> On Sun, Jun 9, 2013 at 12:09 AM, Sergiy Radyakin <[email protected]> wrote:
>>> Tony, after visiting the link I see Chinese characters in lines 2-9.
>>> Stata will not show you these characters because Stata does
>>> not work with Unicode. To see the file through Stata's eyes, open the
>>> link you posted in Firefox, then go to the menu View-->Character
>>> Encoding-->More Encodings-->West European-->Western (Windows-1252).
>>> This is what you can import into Stata and, yes, it does look messy.
>>> This is the best you can get with it. The good thing is that if you
>>> process your data in Stata and then output the same messy text, you
>>> will end up with perfectly readable text, though readable only
>>> elsewhere (e.g. in Notepad or a browser). To cut it short, if your
>>> analysis requires, e.g., searching for a substring in a text, you
>>> might do it by searching for byte sequences, and those sequences
>>> would not look intuitive at all. But if it is something more
>>> involved, then you might want to rethink the choice of package.
>>> Perhaps if you describe the broad goal of what you are doing, it
>>> would be easier to advise.
>>> Best, Sergiy
>>>
>>> On Sat, Jun 8, 2013 at 11:20 AM, Li Chuntao (Tony) <[email protected]> wrote:
>>>> Dear Prof. Nick,
>>>>
>>>>    Lines 2 to 9 are what I want, from the page
>>>> http://html2text.theinfo.org/?url=++http%3A%2F%2Fqq.ico.la%2Fqq459322464.html
>>>>
>>>> thanks
>>>>
>>>> Tony
>>>>
>>>>
>>>> On Sat, Jun 8, 2013 at 11:08 PM, Nick Cox <[email protected]> wrote:
>>>>> OK.
>>>>>
>>>>> Looking at the file in a text editor shows that alternate lines are
>>>>> blank. I don't know which lines are data for you.
>>>>> Nick
>>>>> [email protected]
>>>>>
>>>>>
>>>>> On 8 June 2013 16:04, Li Chuntao (Tony) <[email protected]> wrote:
>>>>>> Yes, of course I understand my own code. Here I just wanted to display
>>>>>> the first two lines to show that the output is messy, and to ask for
>>>>>> help.
>>>>>>
>>>>>> Thank you, Nick, for your always kind helpfulness.
>>>>>>
>>>>>> Tony
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, Jun 8, 2013 at 10:59 PM, Nick Cox <[email protected]> wrote:
>>>>>>> Your own code doesn't seem well matched to the input. In your first
>>>>>>> post you were looping over the lines of the file, reading them one by
>>>>>>> one and then processing them. You have abandoned that here. Do you
>>>>>>> understand what the original Mata code does?
>>>>>>> Nick
>>>>>>> [email protected]
>>>>>>>
>>>>>>>
>>>>>>> On 8 June 2013 15:51, Li Chuntao (Tony) <[email protected]> wrote:
>>>>>>>> Well, it still does not work, as can be seen from the output of the
>>>>>>>> following code.
>>>>>>>>
>>>>>>>> I mean, the output from http://html2text.theinfo.org seems quite
>>>>>>>> clean, but it turns into a mess when I try to read it into Stata,
>>>>>>>> whether by insheet using or by the Mata code that follows.
>>>>>>>>
>>>>>>>> Does anyone have experience with this?
>>>>>>>>
>>>>>>>>
>>>>>>>> thanks
>>>>>>>>
>>>>>>>> Chuntao
>>>>>>>>
>>>>>>>>
>>>>>>>> copy "http://html2text.theinfo.org/?url=++http%3A%2F%2Fqq.ico.la%2Fqq459322464.html" ///
>>>>>>>>         d:\temp.txt, replace
>>>>>>>> mata:
>>>>>>>>         fh = fopen("d:\temp.txt", "r")
>>>>>>>>         // read and display the first two lines of the file
>>>>>>>>         junk = fget(fh)
>>>>>>>>         junk
>>>>>>>>         junk = fget(fh)
>>>>>>>>         junk
>>>>>>>>         fclose(fh)
>>>>>>>> end
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jun 7, 2013 at 2:36 AM, Sergiy Radyakin <[email protected]> wrote:
>>>>>>>>> Chuntao,
>>>>>>>>>
>>>>>>>>> adding to Nick's comments, you don't have to parse HTML code yourself
>>>>>>>>> as this is a pretty standard task. For your purposes the following
>>>>>>>>> should yield a pretty clean file:
>>>>>>>>> http://html2text.theinfo.org/?url=http%3A%2F%2Fqq.ico.la%2Fqq459322466.html
>>>>>>>>>
>>>>>>>>> where you supply your URL as a parameter.
>>>>>>>>>
>>>>>>>>> Best, Sergiy Radyakin
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Jun 6, 2013 at 12:58 PM, Nick Cox <[email protected]> wrote:
>>>>>>>>>> If a file contains junk in lines 1 to 31, don't skip lines 1 to 34!
>>>>>>>>>>
>>>>>>>>>> A more fundamental point is that this is HTML:
>>>>>>>>>>
>>>>>>>>>> 1. So many if not all lines will necessarily include HTML markup
>>>>>>>>>> code. You will need to strip that too, or interpret it.
>>>>>>>>>>
>>>>>>>>>> 2. Mark-up code won't necessarily be interpretable if you ignore previous lines.
>>>>>>>>>>
>>>>>>>>>> In this particular case, there are many references to yet other files,
>>>>>>>>>> perhaps not of concern to you.
>>>>>>>>>>
>>>>>>>>>> I can't read Chinese, so that is as far as I go.
>>>>>>>>>>
>>>>>>>>>> Nick
>>>>>>>>>> [email protected]
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 6 June 2013 14:36, Li Chuntao (Tony) <[email protected]> wrote:
>>>>>>>>>>> Dear Listers,
>>>>>>>>>>>
>>>>>>>>>>>        I want to import the following HTML source file:
>>>>>>>>>>>
>>>>>>>>>>>         http://qq.ico.la/qq459322466.html
>>>>>>>>>>>
>>>>>>>>>>>         The source file contains some information in Chinese, which is
>>>>>>>>>>> located in lines 32 to 73.
>>>>>>>>>>>
>>>>>>>>>>>          I tried to import the information by using the following code:
>>>>>>>>>>>
>>>>>>>>>>> clear all
>>>>>>>>>>> set obs 500
>>>>>>>>>>> copy "http://qq.ico.la/qq459322466.html" d:\qq.txt, replace
>>>>>>>>>>>
>>>>>>>>>>> mata:
>>>>>>>>>>>         fh = fopen("d:\qq.txt", "r")
>>>>>>>>>>>         // skip the first 34 lines
>>>>>>>>>>>         for (i=1; i<=34; i++) {
>>>>>>>>>>>                 junk = fget(fh)
>>>>>>>>>>>         }
>>>>>>>>>>>         // read and display the next 20 lines
>>>>>>>>>>>         for (i=1; i<=20; i++) {
>>>>>>>>>>>                 junk = fget(fh)
>>>>>>>>>>>                 junk
>>>>>>>>>>>         }
>>>>>>>>>>>         fclose(fh)
>>>>>>>>>>> end
>>>>>>>>>>>
>>>>>>>>>>> but the resulting data in memory is just a mess.
>>>>>>>>>>>
>>>>>>>>>>> Similar code has worked for another webpage, thanks to Prof. Kit
>>>>>>>>>>> Baum, as can be seen in the following:
>>>>>>>>>>>
>>>>>>>>>>> clear all
>>>>>>>>>>> set obs 500
>>>>>>>>>>> local stkcd="000002"
>>>>>>>>>>> gen str20 date="2012.12.31"
>>>>>>>>>>> copy "http://stockdata.stock.hexun.com/2008/lr.aspx?stockid=`stkcd'&accountdate=2012.12.31" ///
>>>>>>>>>>>         d:\date.txt, replace
>>>>>>>>>>> mata:
>>>>>>>>>>>         fh = fopen("d:\date.txt", "r")
>>>>>>>>>>>         // skip the first 444 lines
>>>>>>>>>>>         for (i=1; i<=444; i++) {
>>>>>>>>>>>                 junk = fget(fh)
>>>>>>>>>>>         }
>>>>>>>>>>>         fclose(fh)
>>>>>>>>>>> end
>>>>>>>>>>>
>>>>>>>>>>> Can someone familiar with Chinese encoding give me some hints?
>>>>>>>>>>>
>>>>>>>>>>> Best
>>>>>>>>>>>
>>>>>>>>>>> Chuntao
>>>>>>>>>>> *
>>>>>>>>>>> *   For searches and help try:
>>>>>>>>>>> *   http://www.stata.com/help.cgi?search
>>>>>>>>>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>>>>>>>>>> *   http://www.ats.ucla.edu/stat/stata/
