
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: reading HTML source in Chinese but get a messy code

From	"Li Chuntao (Tony)" <[email protected]>
To	[email protected]
Subject	Re: st: reading HTML source in Chinese but get a messy code
Date	Sat, 8 Jun 2013 23:20:42 +0800

Dear Prof. Nick,

   Lines 2 to 9 are what I want, from the page at
http://html2text.theinfo.org/?url=++http%3A%2F%2Fqq.ico.la%2Fqq459322464.html

Thanks

Tony
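
A minimal Mata sketch of picking out lines 2 to 9. It assumes that the converted page has already been saved to d:\temp.txt by the -copy- command quoted further down, and that "lines 2 to 9" counts every physical line, blank or not. Both are assumptions, not something confirmed in this thread.

```stata
mata:
// Assumption: d:\temp.txt already exists (saved by the -copy- command).
fh = fopen("d:\temp.txt", "r")
i = 0
// fget() returns J(0,0,"") at end of file
while ((line = fget(fh)) != J(0, 0, "")) {
        i++
        if (i < 2) continue        // skip line 1
        if (i > 9) break           // stop once line 9 has been shown
        printf("%s\n", line)
}
fclose(fh)
end
```

If blank lines should not be counted, a check such as -if (strtrim(line) == "") continue- placed before the counter is incremented would skip them. This only selects lines; it does nothing about how the Chinese characters are displayed.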


On Sat, Jun 8, 2013 at 11:08 PM, Nick Cox <[email protected]> wrote:
> OK.
>
> Looking at the file in a text editor shows that alternate lines are
> blank. I don't know which lines are data for you.
> Nick
> [email protected]
>
>
> On 8 June 2013 16:04, Li Chuntao (Tony) <[email protected]> wrote:
>> Yes, of course I understand my own code. Here I just wanted to display
>> the first two lines to show that the output is messy, and to ask for
>> help.
>>
>> Thank you, Nick, for your consistently kind help.
>>
>> Tony
>>
>>
>>
>> On Sat, Jun 8, 2013 at 10:59 PM, Nick Cox <[email protected]> wrote:
>>> Your own code doesn't seem well matched to the input. In your first
>>> post you were looping over the lines of the file, reading them one by
>>> one and then processing them. You have abandoned that here. Do you
>>> understand what the original Mata code does?
>>> Nick
>>> [email protected]
>>>
>>>
>>> On 8 June 2013 15:51, Li Chuntao (Tony) <[email protected]> wrote:
>>>> Well, it still does not work, as can be seen from the output of the
>>>> following code.
>>>>
>>>> I mean, the output from http://html2text.theinfo.org seems quite
>>>> clean, but it turns into a mess when I try to read it into Stata,
>>>> whether by -insheet using- or by the Mata code that follows.
>>>>
>>>> Does anyone have experience with this?
>>>>
>>>>
>>>> thanks
>>>>
>>>> Chuntao
>>>>
>>>>
>>>> copy "http://html2text.theinfo.org/?url=++http%3A%2F%2Fqq.ico.la%2Fqq459322464.html" d:\temp.txt, replace
>>>> mata:
>>>>         fh = fopen("d:\temp.txt", "r")
>>>>         // read and display the first two lines
>>>>         junk = fget(fh)
>>>>         junk
>>>>         junk = fget(fh)
>>>>         junk
>>>>         fclose(fh)
>>>> end
>>>>
>>>>
>>>>
>>>> On Fri, Jun 7, 2013 at 2:36 AM, Sergiy Radyakin <[email protected]> wrote:
>>>>> Chuntao,
>>>>>
>>>>> adding to Nick's comments, you don't have to parse HTML code yourself
>>>>> as this is a pretty standard task. For your purposes the following
>>>>> should yield a pretty clean file:
>>>>> http://html2text.theinfo.org/?url=http%3A%2F%2Fqq.ico.la%2Fqq459322466.html
>>>>>
>>>>> where you supply your URL as a parameter.
>>>>>
>>>>> Best, Sergiy Radyakin
>>>>>
>>>>>
>>>>> On Thu, Jun 6, 2013 at 12:58 PM, Nick Cox <[email protected]> wrote:
>>>>>> If a file contains junk in lines 1 to 31, don't skip lines 1 to 34!
>>>>>>
>>>>>> A more fundamental point is that this is HTML:
>>>>>>
>>>>>> 1. So, lines will necessarily include HTML markup code in many if not
>>>>>> all lines. You will need to strip those too, or interpret them.
>>>>>>
>>>>>> 2. Mark-up code won't necessarily be interpretable if you ignore previous lines.
>>>>>>
>>>>>> In this particular case, there are many references to yet other files,
>>>>>> perhaps not of concern to you.
>>>>>>
>>>>>> I can't read Chinese, so that is as far as I go.
>>>>>>
>>>>>> Nick
>>>>>> [email protected]
>>>>>>
>>>>>>
>>>>>> On 6 June 2013 14:36, Li Chuntao (Tony) <[email protected]> wrote:
>>>>>>> Dear Listers,
>>>>>>>
>>>>>>>        I want to import the following HTML source file:
>>>>>>>
>>>>>>>         http://qq.ico.la/qq459322466.html
>>>>>>>
>>>>>>>         The source file contains some information in Chinese, which is
>>>>>>> located in lines 32 to 73.
>>>>>>>
>>>>>>>          I tried to import the information by using the following code:
>>>>>>>
>>>>>>> clear all
>>>>>>> set obs 500
>>>>>>> copy "http://qq.ico.la/qq459322466.html" d:\qq.txt, replace
>>>>>>>
>>>>>>> mata:
>>>>>>>         fh = fopen("d:\qq.txt", "r")
>>>>>>>         // skip the first 34 lines
>>>>>>>         for (i=1; i<=34; i++) {
>>>>>>>                 junk = fget(fh)
>>>>>>>         }
>>>>>>>         // read and display the next 20 lines
>>>>>>>         for (i=1; i<=20; i++) {
>>>>>>>                 junk = fget(fh)
>>>>>>>                 junk
>>>>>>>         }
>>>>>>>         fclose(fh)
>>>>>>> end
>>>>>>>
>>>>>>> but the result is only a mess.
>>>>>>>
>>>>>>> Similar code has been used for another web page, thanks to Prof. Kit
>>>>>>> Baum, as can be seen in the following:
>>>>>>>
>>>>>>> clear all
>>>>>>> set obs 500
>>>>>>> local stkcd "000002"
>>>>>>> gen str20 date = "2012.12.31"
>>>>>>> copy "http://stockdata.stock.hexun.com/2008/lr.aspx?stockid=`stkcd'&accountdate=2012.12.31" d:\date.txt, replace
>>>>>>> mata:
>>>>>>>         fh = fopen("d:\date.txt", "r")
>>>>>>>         // skip the first 444 lines
>>>>>>>         for (i=1; i<=444; i++) {
>>>>>>>                 junk = fget(fh)
>>>>>>>         }
>>>>>>>         fclose(fh)
>>>>>>> end
>>>>>>>
>>>>>>> Can someone familiar with Chinese encoding give me some hints?
>>>>>>>
>>>>>>> Best
>>>>>>>
>>>>>>> Chuntao
>>>>>>> *
>>>>>>> *   For searches and help try:
>>>>>>> *   http://www.stata.com/help.cgi?search
>>>>>>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>>>>>>> *   http://www.ats.ucla.edu/stat/stata/
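
On the encoding question asked in the first post of this thread, a hedged sketch follows. It assumes Stata 14 or later, which stores strings as UTF-8 and provides ustrfrom(); releases current when this thread was written had no such function, so this would not have run then. It also assumes that the saved page d:\qq.txt is encoded in GB18030 (a superset of GBK and GB2312). Nobody in the thread has checked the actual encoding, so treat that name as a guess to be replaced by whatever the page really uses.

```stata
mata:
// Assumptions: Stata 14 or later; d:\qq.txt was saved by -copy-;
// the file is GB18030-encoded (a guess, not verified in this thread).
fh = fopen("d:\qq.txt", "r")
while ((line = fget(fh)) != J(0, 0, "")) {
        // convert to UTF-8; mode 1 substitutes the Unicode replacement
        // character for any invalid byte sequence
        printf("%s\n", ustrfrom(line, "gb18030", 1))
}
fclose(fh)
end
```

If the file is already UTF-8, no conversion should be needed in those later releases, and the lines can be displayed as read. The HTML markup discussed above would still need to be stripped or interpreted separately.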
