
Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down at the end of May, and its replacement, statalist.org, is already up and running.



Re: st: reading HTML source in Chinese but get a messy code


From   Nick Cox <njcoxstata@gmail.com>
To   "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject   Re: st: reading HTML source in Chinese but get a messy code
Date   Sat, 8 Jun 2013 15:59:00 +0100

Your own code doesn't seem well matched to the input. In your first
post you were looping over the lines of the file, reading them one by
one and then processing them. You have abandoned that here. Do you
understand what the original Mata code does?
Nick
njcoxstata@gmail.com
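
A minimal sketch of the line-by-line pattern referred to above, offered as an illustration rather than as the code from the original post. It assumes the file has already been saved as d:\temp.txt (the path used in the quoted message); the names lines and line are made up for this example.

```stata
mata:
// Open the downloaded file for reading (path taken from the quoted post)
fh = fopen("d:\temp.txt", "r")

// Collect every line in a string column vector;
// fget() returns J(0, 0, "") at end of file, which ends the loop
lines = J(0, 1, "")
while ((line = fget(fh)) != J(0, 0, "")) {
        lines = lines \ line
}
fclose(fh)

// Show how many lines were read, then the lines themselves
rows(lines)
lines
end
```

Reading to the end of the file avoids hard-coding how many lines to skip or keep; unwanted rows can be dropped from lines afterwards. Mata's cat() function is an alternative that returns a file's lines as a string column vector in one call.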


On 8 June 2013 15:51, Li Chuntao (Tony) <leechtcn@gmail.com> wrote:
> Well, it still does not work, as can be seen from the output of the
> following code.
>
> I mean, the output from http://html2text.theinfo.org seems quite
> clean, but it turns into a mess when I try to read it into Stata,
> whether by insheet using or by the Mata code that follows.
>
> Does anyone have experience with this?
>
>
> thanks
>
> Chuntao
>
>
> copy "http://html2text.theinfo.org/?url=++http%3A%2F%2Fqq.ico.la%2Fqq459322464.html" d:\temp.txt, replace
> mata:
>         fh = fopen("d:\temp.txt", "r")
>         // read and display the first two lines only
>         junk = fget(fh)
>         junk
>         junk = fget(fh)
>         junk
>         fclose(fh)
> end
>
>
>
> On Fri, Jun 7, 2013 at 2:36 AM, Sergiy Radyakin <serjradyakin@gmail.com> wrote:
>> Chuntao,
>>
>> adding to Nick's comments, you don't have to parse HTML code yourself
>> as this is a pretty standard task. For your purposes the following
>> should yield a pretty clean file:
>> http://html2text.theinfo.org/?url=http%3A%2F%2Fqq.ico.la%2Fqq459322466.html
>>
>> where you supply your URL as a parameter.
>>
>> Best, Sergiy Radyakin
>>
>>
>> On Thu, Jun 6, 2013 at 12:58 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>>> If a file contains junk in lines 1 to 31, don't skip lines 1 to 34!
>>>
>>> A more fundamental point is that this is HTML:
>>>
>>> 1. So, lines will necessarily include HTML markup code in many if not
>>> all lines. You will need to strip those too, or interpret them.
>>>
>>> 2. Mark-up code won't necessarily be interpretable if you ignore previous lines.
>>>
>>> In this particular case, there are many references to yet other files,
>>> perhaps not of concern to you.
>>>
>>> I can't read Chinese, so that is as far as I go.
>>>
>>> Nick
>>> njcoxstata@gmail.com
>>>
>>>
>>> On 6 June 2013 14:36, Li Chuntao (Tony) <leechtcn@gmail.com> wrote:
>>>> Dear Listers,
>>>>
>>>>        I want to import the following HTML source file:
>>>>
>>>>         http://qq.ico.la/qq459322466.html
>>>>
>>>>         The source file contains some information in Chinese, which is
>>>> located in lines 32 to 73.
>>>>
>>>>          I tried to import the information by using the following code:
>>>>
>>>> clear all
>>>> set obs 500
>>>> copy "http://qq.ico.la/qq459322466.html" d:\qq.txt, replace
>>>>
>>>> mata:
>>>>         fh = fopen("d:\qq.txt", "r")
>>>>         for(i=1; i<=34; i++) {
>>>>                 junk=fget(fh)
>>>>         }
>>>>         // start value missing in the archived post; 1 assumed
>>>>         for(i=1; i<=20; i++) {
>>>>                 junk=fget(fh)
>>>>                 junk
>>>>         }
>>>>
>>>> end
>>>>
>>>> but the resulting data in memory are just a mess.
>>>>
>>>> Similar code has been used for another webpage, thanks to Prof. Kit
>>>> Baum, as can be seen in the following:
>>>>
>>>> clear all
>>>> set obs 500
>>>> local stkcd="000002"
>>>> gen str20 date="2012.12.31"
>>>> copy "http://stockdata.stock.hexun.com/2008/lr.aspx?stockid=`stkcd'&accountdate=2012.12.31" d:\date.txt, replace
>>>> mata:
>>>>         fh = fopen("d:\date.txt", "r")
>>>>         for(i=1; i<=444; i++) {
>>>>         junk=fget(fh)
>>>>         }
>>>>
>>>> Can someone familiar with Chinese encoding give me some hints?
>>>>
>>>> Best
>>>>
>>>> Chuntao
>>>> *
>>>> *   For searches and help try:
>>>> *   http://www.stata.com/help.cgi?search
>>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>>> *   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2014 StataCorp LP