
Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down at the end of May, and its replacement, statalist.org is already up and running.



Re: st: reading HTML source in Chinese but get a messy code


From   Nick Cox <njcoxstata@gmail.com>
To   "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject   Re: st: reading HTML source in Chinese but get a messy code
Date   Sat, 8 Jun 2013 16:08:42 +0100

OK.

Looking at the file in a text editor shows that alternate lines are
blank. I don't know which lines are data for you.
Nick
njcoxstata@gmail.com
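
Since alternate lines of the downloaded file are blank, one option is to read every line in a loop and keep only the non-blank ones. The following is a minimal Mata sketch, not code posted by anyone in this thread. It assumes the file is at d:\temp.txt, as in the quoted code below, and the names line and lines are introduced here purely for illustration.

```stata
mata:
// Read the file line by line until end of file.
// fget() returns J(0,0,"") once there is nothing left to read.
fh = fopen("d:\temp.txt", "r")
lines = J(0, 1, "")
while ((line = fget(fh)) != J(0, 0, "")) {
    // keep a line only if it contains something other than blanks
    if (strtrim(line) != "") lines = lines \ line
}
fclose(fh)
rows(lines)    // number of non-blank lines kept
end
```

This only drops empty lines. It does not change how the Chinese characters are displayed.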

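On the encoding question itself: Stata 14 and later store strings as UTF-8 and provide ustrfrom() for converting text from other encodings. The sketch below assumes such a version and, purely as a guess, that the file is encoded in GB18030. Neither assumption is confirmed anywhere in this thread, so the encoding name should be replaced by whatever the page actually uses. If the file is already UTF-8, no conversion should be needed in those versions.

```stata
mata:
// Assumption: Stata 14 or later, and a file encoded in GB18030.
fh = fopen("d:\temp.txt", "r")
while ((line = fget(fh)) != J(0, 0, "")) {
    // skip blank lines
    if (strtrim(line) == "") continue
    // convert to UTF-8 and display; mode 1 substitutes the Unicode
    // replacement character for any invalid byte sequence
    ustrfrom(line, "gb18030", 1)
}
fclose(fh)
end
```
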

On 8 June 2013 16:04, Li Chuntao (Tony) <leechtcn@gmail.com> wrote:
> Yes, of course I understand my own code. Here I just want to display
> the first two lines to show that the output is a mess, and to ask for
> help.
>
> Thank you, Nick, for your always kind help.
>
> Tony
>
>
>
> On Sat, Jun 8, 2013 at 10:59 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>> Your own code doesn't seem well matched to the input. In your first
>> post you were looping over the lines of the file, reading them one by
>> one and then processing them. You have abandoned that here. Do you
>> understand what the original Mata code does?
>> Nick
>> njcoxstata@gmail.com
>>
>>
>> On 8 June 2013 15:51, Li Chuntao (Tony) <leechtcn@gmail.com> wrote:
>>> Well, it still does not work, as can be seen from the output of the
>>> following code.
>>>
>>> I mean, the output from http://html2text.theinfo.org seems quite
>>> clean, but it turns into a mess when I try to read it into Stata,
>>> whether by insheet using or by the Mata code that follows.
>>>
>>> Does anyone have experience with this?
>>>
>>>
>>> thanks
>>>
>>> Chuntao
>>>
>>>
>>> copy "http://html2text.theinfo.org/?url=++http%3A%2F%2Fqq.ico.la%2Fqq459322464.html" d:\temp.txt, replace
>>> mata:
>>>         fh = fopen("d:\temp.txt", "r")
>>>         // read and display the first two lines only
>>>         junk=fget(fh)
>>>                 junk
>>>         junk=fget(fh)
>>>                 junk
>>>         fclose(fh)
>>> end
>>>
>>>
>>>
>>> On Fri, Jun 7, 2013 at 2:36 AM, Sergiy Radyakin <serjradyakin@gmail.com> wrote:
>>>> Chuntao,
>>>>
>>>> adding to Nick's comments, you don't have to parse HTML code yourself
>>>> as this is a pretty standard task. For your purposes the following
>>>> should yield a pretty clean file:
>>>> http://html2text.theinfo.org/?url=http%3A%2F%2Fqq.ico.la%2Fqq459322466.html
>>>>
>>>> where you supply your URL as a parameter.
>>>>
>>>> Best, Sergiy Radyakin
>>>>
>>>>
>>>> On Thu, Jun 6, 2013 at 12:58 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>>>>> If a file contains junk in lines 1 to 31, don't skip lines 1 to 34!
>>>>>
>>>>> A more fundamental point is that this is HTML:
>>>>>
>>>>> 1. Many if not all lines will necessarily include HTML markup. You
>>>>> will need to strip that too, or interpret it.
>>>>>
>>>>> 2. Markup won't necessarily be interpretable if you ignore previous lines.
>>>>>
>>>>> In this particular case, there are many references to yet other files,
>>>>> perhaps not of concern to you.
>>>>>
>>>>> I can't read Chinese, so that is as far as I go.
>>>>>
>>>>> Nick
>>>>> njcoxstata@gmail.com
>>>>>
>>>>>
>>>>> On 6 June 2013 14:36, Li Chuntao (Tony) <leechtcn@gmail.com> wrote:
>>>>>> Dear Listers,
>>>>>>
>>>>>>        I want to import the following HTML source file:
>>>>>>
>>>>>>         http://qq.ico.la/qq459322466.html
>>>>>>
>>>>>>         The source file contains some information in Chinese, which is
>>>>>> located in lines 32 to 73.
>>>>>>
>>>>>>          I tried to import the information by using the following code:
>>>>>>
>>>>>> clear all
>>>>>> set obs 500
>>>>>> copy  "http://qq.ico.la/qq459322466.html" d:\qq.txt, replace
>>>>>>
>>>>>> mata:
>>>>>>         fh = fopen("d:\qq.txt", "r")
>>>>>>         // skip the first 34 lines
>>>>>>         for(i=1; i<=34; i++) {
>>>>>>         junk=fget(fh)
>>>>>>         }
>>>>>>         // read and display the next 20 lines
>>>>>>         for(i=1; i<=20; i++) {
>>>>>>         junk=fget(fh)
>>>>>>         junk
>>>>>>         }
>>>>>>         fclose(fh)
>>>>>>
>>>>>> end
>>>>>>
>>>>>> but the result is just a mess.
>>>>>>
>>>>>> Similar code, thanks to Prof. Kit Baum, has been used for other
>>>>>> web pages, as can be seen below:
>>>>>>
>>>>>> clear all
>>>>>> set obs 500
>>>>>> local stkcd="000002"
>>>>>> gen str20 date="2012.12.31"
>>>>>> copy "http://stockdata.stock.hexun.com/2008/lr.aspx?stockid=`stkcd'&accountdate=2012.12.31" d:\date.txt, replace
>>>>>> mata:
>>>>>>         fh = fopen("d:\date.txt", "r")
>>>>>>         // skip the first 444 lines
>>>>>>         for(i=1; i<=444; i++) {
>>>>>>         junk=fget(fh)
>>>>>>         }
>>>>>> end
>>>>>>
>>>>>> Can someone familiar with Chinese encoding give me some hints?
>>>>>>
>>>>>> Best
>>>>>>
>>>>>> Chuntao
>>>>>> *
>>>>>> *   For searches and help try:
>>>>>> *   http://www.stata.com/help.cgi?search
>>>>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>>>>> *   http://www.ats.ucla.edu/stat/stata/

